Re: [PATCHES] variadic function support

2008-07-14 Thread Jeff Davis
On Sun, 2008-07-13 at 12:32 -0400, Tom Lane wrote:
> >> Also, it doesn't seem to allow calls to a variadic function with zero
> >> arguments, e.g. "mleast()". I think this should be allowed.
> 
> > It's not possible in all cases, because an empty array still has to
> > be a typed array. But for non-polymorphic variadic functions it's
> > probably possible - I would like to solve this question later - and
> > for now use overloading etc.
> 
> I don't really like the idea of a feature that would work except in the
> polymorphic case --- that just seems too non-orthogonal.  Also I tend
> to think that a pretty large fraction of variadic functions will be
> polymorphic, making the feature's usefulness even more dubious.

I think it has the potential for surprise both ways. I was surprised
when it didn't allow a zero-argument call.

> I concur with the idea that variadic functions should only match to
> calls that offer at least one value for the variadic array.  If you can
> actually define the behavior sensibly for the zero-element case, a
> separate function of the same name can cover that case.

Similarly, if zero-argument calls to variadic functions were allowed,
and you want the polymorphism to work as you suggest, you can just
define:

foo(int, variadic int[]) 
foo(text, variadic text[])

and that forces one argument to be provided, and the functions don't
conflict. If this is the common case, I can see why you wouldn't want to
require declaration of the extra parameter each time.

I don't have a strong opinion, but allowing zero-argument variadic
function calls -- and therefore causing foo(variadic int[]) and
foo(variadic text[]) to conflict -- makes more sense than requiring one
argument. It also seems a little more natural from a function author's
perspective.
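As a loose analogy (plain C, not PostgreSQL code; the function name and sentinel are made up for illustration), C's stdarg mechanism has the same wrinkle: a variadic callee learns nothing from an empty argument list unless a fixed parameter tells it something, which is roughly why the zero-argument polymorphic case is awkward:

```c
#include <stdarg.h>

/* Hypothetical mleast in C: the fixed parameter n says how many
 * values follow.  With n == 0 there is no argument to infer anything
 * from, so the function must fall back to a sentinel -- analogous to
 * the untyped empty array in the zero-argument SQL call. */
static int
mleast_int(int n, ...)
{
    va_list ap;
    int     min = 0;            /* sentinel for the empty call */

    va_start(ap, n);
    for (int i = 0; i < n; i++)
    {
        int v = va_arg(ap, int);

        if (i == 0 || v < min)
            min = v;
    }
    va_end(ap);
    return min;
}
```

Here the caller must always supply n, much as requiring at least one value for the variadic array forces the call site to carry the type information.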

Regards,
Jeff Davis


-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] variadic function support

2008-07-12 Thread Jeff Davis
On Tue, 2008-06-24 at 17:10 +0200, Pavel Stehule wrote:
> Hello
> 
> this version implements syntax based on argmodes.
> 
> 
> CREATE FUNCTION mleast(variadic numeric[]) RETURNS numeric AS $$
> SELECT min($1[i])
>   FROM generate_subscripts($1,1) g(i);
> $$ LANGUAGE SQL;
> 

I don't have a strong opinion about whether the variadic argument is
declared as an array or scalar, so I am posting my comments about this
version of the patch as well.

This version also has a problem when declaring two functions "foo(int)"
and "foo(variadic int[])". In this version, the declaration is allowed
but the backend crashes when the function is called.

The variable "transform_variadic" should have some kind of comment. It
seems to be there to distinguish between looking for a candidate
function for a function call and looking for one for, e.g., CREATE
FUNCTION. It's a little hard to follow, and is probably the cause of
the aforementioned crash.

Also, it doesn't seem to allow calls to a variadic function with zero
arguments, e.g. "mleast()". I think this should be allowed.

I suggest the following error message rewording:
"variadic argument isn't an array" should be something like: "variadic
argument must be an array".

Regards,
Jeff Davis




Re: [PATCHES] variadic function support

2008-07-12 Thread Jeff Davis
On Sat, 2008-07-12 at 23:06 -0700, Jeff Davis wrote:
> I don't have a strong opinion about whether the variadic argument is
> declared as an array or scalar, so I am posting my comments about this
> version of the patch as well.
> 
> This version also has a problem when declaring two functions "foo(int)"
> and "foo(variadic int[])". In this version, the declaration is allowed
> but the backend crashes when the function is called.
> 
> The variable "transform_variadic" should have some kind of comment. It
> seems to be there to distinguish between looking for a candidate
> function for a function call and looking for one for, e.g., CREATE
> FUNCTION. It's a little hard to follow, and is probably the cause of
> the aforementioned crash.
> 
> Also, it doesn't seem to allow calls to a variadic function with zero
> arguments, e.g. "mleast()". I think this should be allowed.
> 
> I suggest the following error message rewording:
> "variadic argument isn't an array" should be something like: "variadic
> argument must be an array".
> 

To be clear, these comments apply to v2.0.0 of the patch.

Regards,
Jeff Davis




Re: [PATCHES] variadic function support

2008-07-12 Thread Jeff Davis
On Sun, 2008-07-13 at 07:52 +0200, Pavel Stehule wrote:
> Did you check the second or third variant? There are two variants still.

Sorry for being unclear. Those comments regarded patch v2.2.1. The bug
is in pg_function_is_visible().

Additionally, I'd like to see support for calling variadic functions
with no arguments. I mentioned that in my other email, but it applies to
v2.2.1 as well.

And we should come to a consensus quickly about the declaration style,
e.g. "variadic int[]" or "variadic int".

Regards,
Jeff Davis




Re: [PATCHES] variadic function support

2008-07-12 Thread Jeff Davis
On Thu, 2008-06-26 at 17:03 +0200, Pavel Stehule wrote:
> this is the third variant, with the variadic argument as a scalar. But
> I still strongly prefer the second variant, where the variadic array is
> declared as such and an array variable is used.

This version allows you to declare two functions "foo(variadic numeric)"
and "foo(numeric)", and then if you do a "\df foo" the backend crashes. 

Also, you didn't update an error string:

"variadic argument isn't an array" should (in this version) be something
like: "can't find array type for variadic parameter type %s"

I suggest a slightly different wording for the following error messages:

"aggregate function has variadic argument" -> "variadic parameters not
supported for aggregate functions"

and

"variadic argument isn't last function's argument" -> "variadic
parameter must be the last parameter to the function"

Those are just suggested wordings.

Regards,
Jeff Davis




Re: [PATCHES] synchronized scan: reset state at end of scan

2008-05-31 Thread Jeff Davis
On Sat, 2008-05-31 at 22:01 -0400, Tom Lane wrote:
> I thought the end conclusion of that thread was to not do anything,
> on the grounds that
> (1) having new scans sometimes fail to join an existing syncscan
> herd would be a bad thing because of the resulting performance
> uncertainty;
> (2) partially masking the order-nondeterminism created by syncscans
> would be a bad thing because it would make it more likely for people
> to not notice the issue during testing.

Ok, I certainly am not pushing for this patch to be applied. I'll
consider it closed.

Regards,
Jeff Davis




[PATCHES] synchronized scan: reset state at end of scan

2008-05-31 Thread Jeff Davis
I was looking into supporting synchronized scans for VACUUM, and I
noticed that we currently don't remove the reported scan location as
this post suggests:

http://archives.postgresql.org/pgsql-patches/2007-06/msg00047.php

There was some debate about whether it should be done, but I thought
that the solution here seemed to satisfy most people's concerns:

http://archives.postgresql.org/pgsql-patches/2007-06/msg00052.php

I attached a patch that implements the above idea.

The benefit is that if you have a single scan, it will start at the
beginning of the heap and not at some arbitrary place.

The cost is that it's not 100% guaranteed that the location entry will
be removed. The backend that started the scan could abort, die, etc.
Also, in rare situations there is a small window created where a new
scan might not be synchronized with existing concurrent scans. This is
really only an issue when issuing queries with limits or issuing two
scans that progress at different rates. I think it's somewhat reasonable
to say that if you're doing either of those things, you shouldn't be too
surprised if it messes with synchronized scanning. I have more
information in the comments in the attached patch.

I do not have a strong opinion about whether this patch is applied or
not. I am submitting this just for the sake of completeness.

Regards,
Jeff Davis
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c1afff3..b5bf780 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1234,6 +1234,9 @@ heap_endscan(HeapScanDesc scan)
 
 	if (scan->rs_strategy != NULL)
 		FreeAccessStrategy(scan->rs_strategy);
+	
+	if (scan->rs_syncscan)
+		ss_reset_location(scan->rs_rd);
 
 	pfree(scan);
 }
diff --git a/src/backend/access/heap/syncscan.c b/src/backend/access/heap/syncscan.c
index dfc7265..5b2aa66 100644
--- a/src/backend/access/heap/syncscan.c
+++ b/src/backend/access/heap/syncscan.c
@@ -34,6 +34,8 @@
  * INTERFACE ROUTINES
  *		ss_get_location		- return current scan location of a relation
  *		ss_report_location	- update current scan location
+ *		ss_reset_location	- reset location to zero if started by this 
+ *			  process
  *
  *
  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
@@ -91,6 +93,7 @@ bool		trace_syncscan = false;
 typedef struct ss_scan_location_t
 {
 	RelFileNode relfilenode;	/* identity of a relation */
+	pid_t pid; 	/* PID of the process that set the scan location */
 	BlockNumber location;		/* last-reported location in the relation */
 } ss_scan_location_t;
 
@@ -161,6 +164,7 @@ SyncScanShmemInit(void)
 			item->location.relfilenode.spcNode = InvalidOid;
 			item->location.relfilenode.dbNode = InvalidOid;
 			item->location.relfilenode.relNode = InvalidOid;
+			item->location.pid = 0;
 			item->location.location = InvalidBlockNumber;
 
 			item->prev = (i > 0) ?
@@ -212,6 +216,13 @@ ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
 			else if (set)
 item->location.location = location;
 
+			/*
+			 * If we are starting a new scan, set the pid to that of the 
+			 * current process.
+			 */
+			if(!set)
+item->location.pid = MyProcPid;
+
 			/* Move the entry to the front of the LRU list */
 			if (item != scan_locations->head)
 			{
@@ -319,3 +330,53 @@ ss_report_location(Relation rel, BlockNumber location)
 #endif
 	}
 }
+
+/*
+ * ss_reset_location --- reset location to zero if started by this process
+ *
+ * When a scan finishes, it can remove itself from the list of 
+ * scan_locations. This means that when scans are no longer being run 
+ * concurrently, new scans will again be started at the beginning of the 
+ * heap. This is not required for correctness.
+ *
+ * The scan_location entry holds the pid of the most recently started scan,
+ * and when a scan finishes, it resets the entry to zero if and only if the
+ * pid in the entry matches that of the current process.
+ *
+ * When concurrent scans are active, it is unlikely that the most
+ * recently started scan will finish first, so the hint will usually not 
+ * be removed unless this is the only scan on that relation. If the scans 
+ * are merely started at nearly the same time, and the last one to start
+ * happens to finish first, there would be little benefit from
+ * synchronizing with a nearly-complete scan, anyway.
+ *
+ * In the rare case that the most recently started scan does finish 
+ * significantly before older concurrent scans (such as in the case of a 
+ * LIMIT clause), the hint will most likely be quickly filled by a location
+ * report from one of those older scans. If another scan begins during that
+ * narrow window, it will not have the benefit of being synchronized with 
+ * the older concurrent scans.
+ * 
+ * If we can't get the lock without waiting, then we do nothing.
+ */
+void ss_reset_location(Relatio
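The patch text is cut off above. Based on the comment block, the body presumably clears the entry only on a PID match. The reset rule can be sketched standalone like this (plain C, not the actual PostgreSQL code; the types and names below are simplified stand-ins, and locking is omitted):

```c
#include <sys/types.h>

typedef unsigned int BlockNumber;

/* Simplified stand-in for one ss_scan_location_t hint entry. */
typedef struct
{
    int         relid;          /* stand-in for RelFileNode */
    pid_t       pid;            /* backend that most recently started a scan */
    BlockNumber location;       /* last-reported block number */
} scan_location_t;

/*
 * Reset the hint iff the finishing backend is the one that most
 * recently started a scan on the relation.  An older concurrent scan
 * finishing later leaves the entry to be reset (or overwritten) by
 * whoever started last, matching the behavior described above.
 */
static void
reset_location(scan_location_t *entry, pid_t my_pid)
{
    if (entry->pid == my_pid)
    {
        entry->location = 0;    /* next scan starts at the heap's start */
        entry->pid = 0;
    }
}
```

A backend with a non-matching PID calling this is a no-op, which is what makes the scheme safe without reference counting.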

Re: [PATCHES] Doc patch: type modifiers

2008-05-27 Thread Jeff Davis
On Tue, 2008-05-27 at 12:08 -0400, Tom Lane wrote:
> Yeah, this text is a holdover from the original user-definable-modifiers
> patch, in which the modifiers indeed had to be numbers.  I don't quite
> like your suggestion of using "textual", though, because that makes it
> sound like the input and output functions are exact inverses, which they
> are not.  How about "... converts an array of modifier(s) for ..."?

Sounds good to me.

Regards,
Jeff Davis




[PATCHES] Doc patch: type modifiers

2008-05-27 Thread Jeff Davis
From
http://developer.postgresql.org/pgdocs/postgres/sql-createtype.html :

type_modifier_input_function

The name of a function that converts numeric modifier(s) for the
type into internal form. 

type_modifier_output_function

The name of a function that converts the internal form of the
type's modifier(s) to external textual form. 

But the paragraph above says:

"...take one or more simple constants or identifiers as modifiers..."

So in the description of the input function, "numeric" is wrong, and
should be something like "textual" (for the sake of symmetry).

Patch attached.

Regards,
Jeff Davis
diff --git a/doc/src/sgml/ref/create_type.sgml b/doc/src/sgml/ref/create_type.sgml
index 641c13e..d7d1a0b 100644
--- a/doc/src/sgml/ref/create_type.sgml
+++ b/doc/src/sgml/ref/create_type.sgml
@@ -433,7 +433,7 @@ CREATE TYPE name
 type_modifier_input_function
 
  
-  The name of a function that converts numeric modifier(s) for the type
+  The name of a function that converts textual modifier(s) for the type
   into internal form.
  
 



Re: [PATCHES] Proposed patch: synchronized_scanning GUC variable

2008-01-28 Thread Jeff Davis
On Sun, 2008-01-27 at 15:07 -0500, Tom Lane wrote:
> Per today's -hackers discussion, add a GUC variable to allow clients to
> disable the new synchronized-scanning behavior, and make pg_dump disable
> sync scans so that it will reliably preserve row ordering.  This is a
> pretty trivial patch, but seeing how late we are in the 8.3 release
> cycle, I thought I'd better post it for comment anyway.

I apologize for the late reply, but I have one comment I'd like to add.

> + if (g_fout->remoteVersion >= 80300)
> + do_sql_command(g_conn, "SET synchronized_scanning TO off");
> + 
> + /*
>* Start serializable transaction to dump consistent data.
>*/

I think that pg_dump is a reasonable use case for synchronized scans
when the table has not been clustered. It could potentially make pg_dump
have much less of a performance impact when run against an active
system.

I think it's worth considering enabling sync scans for non-clustered
tables if it would not interfere with the release. Of course, a painless
8.3 release is the top priority.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Synchronized scans

2007-06-10 Thread Jeff Davis
On Sat, 2007-06-09 at 09:58 -0400, Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> >  * For a large table, do lazy_scan_heap, scan_heap, and a sequential
> > scan usually progress at approximately the same rate?
> 
> scan_heap would probably be faster than a regular seqscan, since it
> isn't doing any where-clause-checking or data output.  Except if you've
> got vacuum-cost-limit enabled, which I think is likely to be true by
> default in future.  Another problem is that lazy_scan_heap stops every
> so often to make a pass over the table's indexes, which'd certainly
> cause it to fall out of sync with more typical seqscans.

I think that these problems are significant enough that I'm not sure
sync-scanning a VACUUM is the right way to approach the problem.

Maybe a better solution would be to try to get a sequential scan to do
some of the work required by a VACUUM. I don't think we can stop in the
middle of a sequential scan to vacuum the indexes, but perhaps we could
come up with some kind of scheme. It would be cheaper (perhaps) to spill
the list of deletable TIDs to disk than to rescan a big (mostly live)
table later. And if it was costly, we wouldn't need to do the scan part
of a VACUUM on every sequential scan.

I'm sure this has been brought up before, does someone have a pointer to
a discussion about doing VACUUM-like work in a sequential scan?

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 11:57 -0700, Jeff Davis wrote:
> On Fri, 2007-06-08 at 14:36 -0400, Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > Here's an update of the patch. I reverted the behavior at end of scan 
> > > back to the way it was in Jeff's original patch, and disabled reporting 
> > > the position when moving backwards.
> > 
> > Applied with minor editorializations --- notably, I got rid of the
> > HeapScanDesc dependency in syncscan.c's API, so that it could be used
> > in other contexts (VACUUM, anyone?).  There were a few glitches in the
> > heapam.c code too.
> 
> I think VACUUM would be an ideal place for it. I assume we don't want to

I have a few thoughts:

 * For a large table, do lazy_scan_heap, scan_heap, and a sequential
scan usually progress at approximately the same rate?

 * Are there any other parts of the vacuum process that may benefit?

 * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
very easy at first thought. Are there any complications that I'm
missing?

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 14:36 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > Here's an update of the patch. I reverted the behavior at end of scan 
> > back to the way it was in Jeff's original patch, and disabled reporting 
> > the position when moving backwards.
> 
> Applied with minor editorializations --- notably, I got rid of the
> HeapScanDesc dependency in syncscan.c's API, so that it could be used
> in other contexts (VACUUM, anyone?).  There were a few glitches in the
> heapam.c code too.

I think VACUUM would be an ideal place for it. I assume we don't want to
make a change like that before 8.3 though, right?

> This would be a trivial addition to the code-as-committed (clear
> rs_syncscan upon backing up by a page) but I didn't add it.  Any
> strong feelings one way or the other?  AFAIK the only case where
> it'd happen is if someone reads forwards in a large-seqscan cursor
> for awhile and then reads backwards.  You could argue that doing
> that is a good reason to drop them out of the syncscan pack ...
> 

I don't feel strongly about it at all. Mostly because it seems so rare
that it would matter.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 12:22 -0400, Tom Lane wrote:
> Now that I'm awake, it is reachable code, per this comment:
> 
>  * Note: when we fall off the end of the scan in either direction, we
>  * reset rs_inited.  This means that a further request with the same
>  * scan direction will restart the scan, which is a bit odd, but a

I'm confused about this part of the comment.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 11:05 +0100, Heikki Linnakangas wrote:
> BTW: Should we do the synchronization in the non-page-at-a-time mode? 
> It's not many lines of code to do so, but IIRC that codepath is only 
> used for catalog access. System tables really shouldn't grow that big, 
> and if they do we shouldn't be doing seq scans for them anyway. Does the 
> unstable ordering confuse any catalog accesses?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg01199.php

There is a very minor assumption there that scans on pg_class will
return in the same order. I'm not sure if that's even a problem.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-07 Thread Jeff Davis
On Thu, 2007-06-07 at 22:52 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I fixed a little off-by-one in "backward scan, not inited" branch, but I 
> > was unable to test it. It seems that code is actually never used because 
> > that case is optimized to a rewind in the executor. I marked those 
> > seemingly unreachable places in the code with a comment.
> 
> Actually it's not equivalent to a rewind, it's more like the startup
> condition for an Index Scan Backward: start at the far end of the
> relation and go backwards.  I suspect that the whole thing may be
> unreachable code because the planner knows that seqscans are unordered
> and backward-scan is only of interest for an ordered scan.  But be that
> as it may: do we even want a backwards-running scan to participate in a
> syncscan group?  Unless *all* the backends are doing the same thing,
> it will not help and instead will bollix the syncscan for everyone else.
> I'm inclined to disable use of syncscan.c altogether when the scan is

Just to be sure: a backwards-started scan is currently unreachable code,
correct? 

But as long as the code is there (reachable or not) it sounds good to
disable sync scan in that case.

> started backwards.  It also seems prudent to suppress ss_report_location
> calls when stepping backward in a generally-forwards scan.  Thoughts?

I agree that we should disable ss_report_location if the scan is moving
backwards.

I might go so far as to suggest that if the scan *ever* moves backwards,
we taint the scan such that it never reports.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-05 Thread Jeff Davis
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.
> 
> 

One difference between our patches is that, in my patch, the scan's
ending condition is checked after the hint is set back to the starting
position.

That means, in my patch, if you do:
SELECT * FROM bigtable;
SELECT * FROM bigtable;

with no concurrency at all, the returned order will be the same.

In your patch, each full table scan leaves the hint at 16 pages before
the position it started in, leading to different orders on the full
table scans.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-05 Thread Jeff Davis
On Mon, 2007-06-04 at 21:39 -0400, Tom Lane wrote:
> idea of deleting the hint.  But if we could change the hint behavior to
> say "start reading here", successive short LIMITed reads would all start
> reading from the same point, which fixes both my reproducibility concern
> and Heikki's original point about being able to re-use cached data.
> You'd only get into the irreproducible behavior if the LIMIT was larger
> than the amount of data scanned between hint updates.

That's how it works now. Small limit queries don't change the location
in the hint, so if you repeat them, the queries keep starting from the
same place, and fetching the same tuples.

Large LIMIT queries (larger than the number of pages between updates) do
change the location in the hint, and so that's really the case you're
worried about.
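The distinction between small and large LIMITs can be sketched with a toy model (plain C, illustrative only; the 16-page reporting interval is the figure mentioned elsewhere in the thread, and the function is not part of the patch):

```c
#define REPORT_INTERVAL 16      /* pages between hint updates */

/*
 * Toy model: a scan starting at the hint reads pages_read pages and
 * updates the shared hint only on every REPORT_INTERVAL-th page.  A
 * LIMITed scan that reads fewer pages than the interval therefore
 * leaves the hint exactly where it found it, so repeating the query
 * fetches the same tuples.
 */
static unsigned int
hint_after_scan(unsigned int hint, unsigned int pages_read)
{
    unsigned int pos = hint;

    for (unsigned int i = 1; i <= pages_read; i++)
        if (i % REPORT_INTERVAL == 0)
            pos = hint + i;     /* location report */
    return pos;
}
```

Only once pages_read crosses the interval does the hint move, which is exactly the "large LIMIT" case being discussed.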

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 18:25 -0400, Tom Lane wrote:
> But note that barring backend crash, once all the scans are done it is
> guaranteed that the hint will be removed --- somebody will be last to
> update the hint, and therefore will remove it when they do heap_endscan,
> even if others are not quite done.  This is good in the sense that
> later-starting backends won't be fooled into starting at what is
> guaranteed to be the most pessimal spot, but it's got a downside too,
> which is that there will be windows where seqscans are in process but
> a newly started scan won't see them.  Maybe that's a killer objection.

I don't think it would be a major objection. If there aren't other
sequential scans in progress, the point is moot, and if there are:
(a) the hint has a lower probability of being removed, since it may
contain the PID of one of those other scans.
(b) the hint is likely to be replaced quite quickly

The problem is, I think people would be more frustrated by 1 in 1000
queries starting the scan in the wrong place because a hint was deleted,
because that could cause a major difference in performance. I expect the
current patch would have more consistent performance for that reason.

To me, it seems to be a small benefit and a small cost. It's hard for me
to feel very strongly either way.

> When exactly is the hint updated?  I gathered from something Heikki said
> that it's set after processing X amount of data, but I think it might be
> better to set it *before* processing X amount of data.  That is, the
> hint means "I'm going to be scanning at least X blocks
> starting here", not "I have scanned X blocks ending here",
> which seems like the interpretation that's being used at the moment.
> What that would mean is that successive "LIMIT 1000" calls would in fact
> all start at the same place, barring interference from other backends.
> 

If I understand correctly, this is a one-page difference in the report
location, right? We can either report that we've just finished scanning
block 1023 (ending an X block chunk of reading) and another backend can
start scanning at 1023 (current behavior); or we could report that we're
about to scan an X block chunk of data starting with block 1024, and the
new scan can start at 1024. We don't want the new scan to jump in ahead
of the existing scan, because then we're introducing uncached blocks
between the two scans -- risking divergence. 

If the data occupies less than X data pages, the LIMIT queries will be
deterministic for single-scans anyway, because no reports will happen
(other than the starting location, which won't matter in this case).

If the data is more than that, then at least one report would have
happened. At this point, you're talking about rewinding the scan (how
far?), which I originally coded for with sync_seqscan_offset. That
feature didn't prove very useful (yet), so I removed it. 

Regards,
Jeff Davis







Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 22:57 +0100, Heikki Linnakangas wrote:
> > That's what I thought at first, and why I didn't do it. Right now I'm
> > thinking we could just add the PID to the hint, so that it would only
> > remove its own hint. Would that work?
> 
> Were you thinking of storing the PID of the backend that originally 
> created the hint, or updating the PID every time the hint is updated? In 
> any case, we still wouldn't know if there's other scanners still running.
> 

My thought was that every time the location was reported by a backend,
it would store 3 pieces of information, not 2:
 * relfilenode
 * the PID of the backend that last created or updated this particular hint
 * the location

Then, in heap_endscan() (if that's the right place), we find the hint
and, if the PID matches, remove it. If not, we do nothing.

This would only matter when there weren't other scans. When concurrent
scans were happening, chances are the PID wouldn't match anyway, and the
hint thus wouldn't be removed.

Regards,
Jeff Davis






Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote:
> > I think the real problem here is that the first scan is leaving state
> > behind that changes the behavior of the next scan.  Which can have no
> > positive benefit, since obviously the first scan is not still
> > proceeding; the best you can hope for is that it's a no-op and the worst
> > case is that it actively pessimizes things.  Why doesn't the patch
> > remove the shmem entry at scan termination?
> 
> Because there's no reason why it should, and it would require a lot more 
> bookkeeping. There can be many scanners on the same table, so we'd need 
> to implement some kind of reference counting, which means having to 
> reliably decrement the counter when a scan terminates.
> 

That's what I thought at first, and why I didn't do it. Right now I'm
thinking we could just add the PID to the hint, so that it would only
remove its own hint. Would that work?

It's still vulnerable to a backend being killed and the hint hanging
around. However, the next scan would get it back to normal.
Reference counting would cause weirdness if a backend died or something,
because it would never decrement to 0.

> In any case if there actually is a concurrent scan, you'd still see 
> different ordering. Removing the entry when a scan is over would just 
> make it harder to trigger.
> 

Agreed. I don't know for sure whether that's good or bad, but it would
make the nondeterminism less immediately visible.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 16:42 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I don't think anyone can reasonably expect to get the same ordering when 
> > the same query issued twice in general, but within the same transaction 
> > it wouldn't be that unreasonable. If we care about that, we could keep 
> > track of starting locations per transaction, only do the synchronization 
> > on the first scan in a transaction, and start subsequent scans from the 
> > same page as the first one.
> 
> I think the real problem here is that the first scan is leaving state
> behind that changes the behavior of the next scan.  Which can have no
> positive benefit, since obviously the first scan is not still
> proceeding; the best you can hope for is that it's a no-op and the worst
> case is that it actively pessimizes things.  Why doesn't the patch
> remove the shmem entry at scan termination?
> 

Sounds like a reasonable idea to me. We could add the PID to the data
structure so that it would only remove the hint if it's the one that set
the hint. I think we'd just need to call a function to do that from
heap_endscan(), correct?

However, we couldn't 100% guarantee that the state would be cleared. A
backend could be killed in the middle of a scan.

The case you're worried about seems very narrow to me, and I think
"actively pessimizes" is too strong. Unless I misunderstand, the "best
you can hope for" no-op happens in all cases except one rather bizarre
one: you're executing repeated identical LIMIT queries with no ORDER
BY; the tuples returned occupy more than 128K (16 pages is the
reporting period), yet few enough that caching them would pay off; and
the table in question is larger than the large-table threshold. I'm
just trying to add some perspective on what we're fixing here.

But it's fair to say that a scan should clear any state when it's done.

Regards,
Jeff Davis






Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.
> 

Great!

> For the record, this patch has a small negative impact on scans like 
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> each scan will start from roughly where previous one stopped, requiring 
> more pages to be read from disk each time. I don't think it's something 
> to worry about in practice, but I thought I'd mention it.
> 

No surprise here, as you and Bruce have already pointed out.

If we wanted to reduce the occurrence of this phenomenon, we could
perhaps "time out" the hints so that it's impossible to pick up a hint
from a scan that finished 5 minutes ago.

It doesn't seem helpful to further obscure the non-determinism of the
results, however.
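A hedged sketch of what such a timeout could look like (hypothetical field and constant names): each hint carries the time it was last reported, and a reader treats entries older than some cutoff as if no hint existed.

```c
#include <time.h>

/* Hypothetical cutoff: don't synchronize with a hint last reported
 * more than 5 minutes ago. */
#define SS_HINT_MAX_AGE 300

typedef struct
{
	unsigned int location;
	time_t       reported;	/* updated by each position report */
} ss_timed_hint_t;

/* A stale hint is ignored and the scan starts at page 0 instead. */
static int
ss_hint_is_stale(const ss_timed_hint_t *hint, time_t now)
{
	return (now - hint->reported) > SS_HINT_MAX_AGE;
}
```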

Regards,
Jeff Davis




Re: [PATCHES] Synchronized Scan WIP patch

2007-05-31 Thread Jeff Davis
On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
> Here's a work-in-progress update of this patch.
> 
> I haven't done any major changes, but a lot of little refactoring and 
> commenting, including:
> 
> * moved the sync scan stuff to a new file access/heapam/syncscan.c. 
> heapam.c is long enough already, and in theory the same mechanism could 
> be used for large bitmap heap scans in the future.

Good idea, I hadn't thought of that. It seems like the bitmaps in two
bitmap scans would have to match very closely, but that sounds
plausible.

This is similar to another idea I had considered (I forget who thought
of it) to try to have a bitmap of "tuples still needed" and then try to
optimize based on that information somehow (read the ones in cache
first, etc). Seems substantially more complex though, more like a
prefetch system at that point.

I expected the general refactoring. Hopefully my next patch is a little
closer to the code expectations and places less burden on the reviewers.

> Testing:
> * Multiple scans on different tables, causing movement in the LRU list
> * Measure the CPU overhead for a single scan
> * Measure the lock contention with multiple scanners
> 

Is there any way to measure the necessity of the hash table? I would
think the conditions for that would be a large number of tables being
actively scanned, causing enough LRU activity that the locks are held
too long.

I also think the optimization of reporting only when the block is not
found in cache would be worth testing, in case the lock contention
turns out to be a problem.

Regards,
Jeff Davis





Re: [PATCHES] Seq scans status update

2007-05-30 Thread Jeff Davis
On Wed, 2007-05-30 at 17:45 -0400, Tom Lane wrote:
> According to Heikki's explanation here
> http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
> each backend doing a heapscan would collect its own ring of buffers.
> You might have a few backends that are always followers, never leaders,
> and so never actually fetch any pages --- but for each backend that
> actually did any I/O there would be a separate ring.  In practice I'd
> expect the lead would "change hands" pretty often and so you'd see all
> the backends accumulating their own rings.
> 

Oh, I see what you mean. The rings will actually become sparsely
populated with many concurrent scans, and therefore, extend outside of
L2 cache.

Regards,
Jeff Davis




Re: [PATCHES] Seq scans status update

2007-05-30 Thread Jeff Davis
On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
> In
> the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings not just one.

I don't follow. Ideally, in the sync-scan case, the sets of buffers in
the ring of different scans on the same relation will overlap
completely, right?

We might not be at the ideal, but the sets of buffers in the rings
shouldn't be disjoint, should they?

Regards,
Jeff Davis




Re: [PATCHES] Seq scans status update

2007-05-30 Thread Jeff Davis
On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
> > Hmm.  But we probably don't want the same buffer in two different
> > backends' rings, either.  You *sure* the sync-scan patch has no
> > interaction with this one?
> > 
> 
> I will run some tests again tonight, I think the interaction needs more
> testing than I did originally. Also, I'm not sure that the hardware I
> have is sufficient to test those cases.
> 

I ran some sanity tests last night with CVS head, plus my
syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch.

It passed the sanity tests at least.

I did see that there was more interference with sync_seqscan_threshold=0
(always on) and scan_recycle_buffers=0 (off) than I had previously seen
with 8.2.4, so I will test again against 8.2.4 to see why that might be.
The interference that I saw was still quite small, a scan moving
concurrently with 9 other scans was about 10% slower than a scan running
alone -- which is still very good compared with plain cvs head and no
sync scan -- it's just not ideal. 

However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128
didn't have much effect. At 32 it appeared to be about 1% worse during
10 scans, but that may have been noise. The other values I tried didn't
have any difference that I could see.

This was really just a quick sanity test; I think more hard data would
be useful.

Regards,
Jeff Davis




Re: [PATCHES] Seq scans status update

2007-05-29 Thread Jeff Davis
On Mon, 2007-05-28 at 17:36 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > One idea is to keep track which pins are taken using the bulk strategy. 
> > It's a bit tricky when a buffer is pinned multiple times since we don't 
> > know which ReleaseBuffer corresponds which ReadBuffer, but perhaps we 
> > could get away with just a flag per pinned buffer. Set the flag when a 
> > buffer is pinned with bulk strategy and it wasn't pinned by us before, 
> > and clear it when it's pinned with another strategy. I'm thinking we 
> > steal one bit from PrivateRefCount for this.
> 
> Seems like a mess.  Why don't we just fix it so there's no need for
> different behavior at Unpin time?  The facts on the ground are that
> the current patch's change in UnpinBuffer is a no-op anyway, because
> of the tupletableslot interference.
> 
> The behavior I'm imagining is just that when we try to take a buffer
> from the ring, if its usage count exceeds 1 then drop it from the ring
> and get another buffer.  1 would be the expected case if no one had
> touched it since we last used it.
> 
> >> A heapscan would pin the buffer only once and hence bump its count at
> >> most once, so I don't see a big problem here.  Also, I'd argue that
> >> buffers that had a positive usage_count shouldn't get sucked into the
> >> ring to begin with.
> 
> > True, except that with the synchronized scans patch two synchronized 
> > scans will pin the buffer twice.
> 
> Hmm.  But we probably don't want the same buffer in two different
> backends' rings, either.  You *sure* the sync-scan patch has no
> interaction with this one?
> 

I will run some tests again tonight, I think the interaction needs more
testing than I did originally. Also, I'm not sure that the hardware I
have is sufficient to test those cases.

It looks like the case to worry about is when there are a large number
of scans on the same table and the I/O system is fast enough that it
causes lock contention on the buffers in the rings. Is this the case
you're worried about?

Also, keep in mind that I have added a SyncScanLock after I ran those
tests. That could have an effect.

Regards,
Jeff Davis





Re: [PATCHES] Synchronized Scan

2007-05-21 Thread Jeff Davis
On Mon, 2007-05-21 at 16:03 +0200, Peter Eisentraut wrote:
> Am Montag, 21. Mai 2007 00:01 schrieb Jeff Davis:
> > Here is the latest version of my patch that's revised according to my
> > discussions with Heikki and Simon:
> 
> This patch was apparently done against 8.2.4, but it doesn't apply to CVS 
> head.
> 

Thanks. Here's a version that applies cleanly to CVS head.

Regards,
Jeff Davis
diff -rc pgsql/src/backend/access/heap/heapam.c pgsql-ss/src/backend/access/heap/heapam.c
*** pgsql/src/backend/access/heap/heapam.c	2007-04-07 18:26:27.0 -0700
--- pgsql-ss/src/backend/access/heap/heapam.c	2007-05-21 07:49:17.0 -0700
***
*** 67,72 
--- 67,300 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static ss_hints_t  *ss_hints_init();
+ static BlockNumber ss_hints_search(ss_hints_t*,RelFileNode,BlockNumber,bool);
+ 
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ 
+ 
+ /*
+  * ss_hints_init:
+  *
+  * Creates and initializes hint structure of size
+  * nelem. If the structure already exists,
+  * it will simply be returned. 
+  *
+  * It is assumed that SYNC_SCAN_NELEM > 1.
+  *
+  * The hint structure is essentially a
+  * doubly-linked LRU with head and tail 
+  * pointer, but designed to hold a fixed maximum
+  * number of elements in fixed-size shared memory.
+  */
+ static ss_hints_t *ss_hints_init()
+ {
+ 	int i;
+ 	int size;
+ 	int nelem = SYNC_SCAN_NELEM;
+ 	bool found;
+ 	ss_hints_t *hints;
+ 	
+ 	Assert(nelem > 1);
+ 
+ 	size = sizeof(ss_hints_t) + sizeof(ss_lru_t)  + \
+ 		(sizeof(ss_lru_item_t) + sizeof(ss_lru_item_t*)) * nelem;
+ 
+ 	hints = (ss_hints_t*)ShmemInitStruct("Sync Scan Hint List",
+ 		size,&found);
+ 	
+ 	if(found)
+ 		return hints;
+ 
+ 	if(Trace_sync_seqscan)
+ 		elog(DEBUG2,"SYNC_SCAN: Created Hint List");
+ 
+ 	/*
+ 	 * Because it needs to fit in one fixed-size shared memory chunk, 
+ 	 * the code below uses some ugly offset math and typecasts.
+ 	 * 
+ 	 * Also, it maintains its own rudimentary freelist to manage the 
+ 	 * memory it has.
+ 	 */
+ 	hints->lru = (ss_lru_t*)((char*)hints + sizeof(ss_hints_t));
+ 	hints->lru->nelem = nelem;
+ 	hints->lru->head = NULL;
+ 	hints->lru->tail = NULL;
+ 	hints->lru->first = (ss_lru_item_t*)((char*)hints->lru + sizeof(ss_lru_t));
+ 	hints->lru->freelist = (ss_lru_item_t**)(hints->lru->first + nelem);
+ 	hints->lru->free = hints->lru->freelist;
+ 	for( i = nelem-1; i >= 0; i-- ) {
+ 		*hints->lru->free = hints->lru->first + i;
+ 		hints->lru->free++;
+ 	}
+ 	return hints;
+ }
+ 
+ /* 
+  * ss_hints_search:
+  *
+  * This searches the hint structure for a hint with the given 
+  * relfilenode.
+  *
+  * If "set" is true, it will set the location in the hint to 
+  * the given location, and return that location. If not, it 
+  * will return the pre-existing location without modification.
+  *
+  * If the hint is not found, it will be created
+  * at the head of the list with the given location, and that
+  * location will be returned.
+  * 
+  */
+ static BlockNumber ss_hints_search(ss_hints_t *hints, RelFileNode relfilenode, 
+ 	BlockNumber location, bool set)
+ {
+ 	ss_lru_item_t	*item;
+ 	ss_lru_t 	*lru = hints->lru;
+ 
+ 	item = lru->head;
+ 
+ 	while(item != NULL) {
+ 		if(RelFileNodeEquals(item->hint.relfilenode,relfilenode)) {
+ 			if(set)
+ item->hint.location = location;
+ 			if(item == lru->head)
+ return lru->head->hint.location;
+ 
+ 			if(item == lru->tail)
+ lru->tail = item->prev;
+ 			item->prev->next = item->next;
+ 			if(item->next)
+ item->next->prev = item->prev;
+ 
+ 			item->prev = NULL;
+ 			item->next = lru->head;
+ 			lru->head->prev = item;
+ 			lru->head = item;
+ 
+ 			return lru->head->hint.location;
+ 		}
+ 		item = item->next;
+ 	}
+ 
+ 	// if list is full, remove tail
+ 	if(lru->free == lru->freelist) {
+ 		*lru->free = lru->tail;
+ 		lru->free++;
+ 		Assert(lru->tail->prev != NULL);
+ 		lru->tail->prev->next = NULL;
+ 		lru->tail = lru->tail->prev;
+ 	}
+ 
+ 	lru->free--;
+ 	item = *lru->free;
+ 	item->hint.relfilenode = relfilenode;
+ 	item->hint.location = location;
+ 	if(lru->head != NULL)
+ 		lru->head->prev = item;
+ 	item->next = lru->head;
+ 	item->prev = NULL;
+ 	lru->head = item;
+ 	if(item->next == NULL)
+ 		lru->tail = item;
+ 
+ 	return lru->head->hint.location;
+ }
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan hint structure
+  * (creating it if it doesn't already exist) to 
+  * find a possible location 

[PATCHES] Synchronized Scan

2007-05-20 Thread Jeff Davis
Here is the latest version of my patch that's revised according to my
discussions with Heikki and Simon:

Changes:
  * uses LWLocks when accessing shared memory
  * removes the "sync_seqscan_offset" feature
  * uses the relfilenode as a key rather than relation OID
  * fixes regression test failure
  * uses a simple LRU (that stays in fixed-size shared memory) to track
the locations of other concurrent scans

For the LRU I used a doubly-linked list, which isn't strictly necessary.
However, we may decide to use a hash table for the lookup, in which case
the extra pointers will be useful. 
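The core of that LRU is the move-to-front step ss_hints_search performs on a match. A standalone sketch (plain pointers here; the patch itself uses offsets into a single fixed-size shared-memory chunk):

```c
#include <stddef.h>

typedef struct lru_item
{
	int              key;
	struct lru_item *prev;
	struct lru_item *next;
} lru_item_t;

typedef struct
{
	lru_item_t *head;
	lru_item_t *tail;
} lru_t;

/* Unlink item from wherever it is and splice it in at the head,
 * the way ss_hints_search promotes a matching relfilenode. */
static void
lru_move_to_front(lru_t *lru, lru_item_t *item)
{
	if (item == lru->head)
		return;
	if (item == lru->tail)
		lru->tail = item->prev;
	item->prev->next = item->next;
	if (item->next)
		item->next->prev = item->prev;
	item->prev = NULL;
	item->next = lru->head;
	lru->head->prev = item;
	lru->head = item;
}
```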

Regards,
Jeff Davis
diff -rc postgresql-8.2.4/src/backend/access/heap/heapam.c postgresql-8.2.4-ss/src/backend/access/heap/heapam.c
*** postgresql-8.2.4/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.4-ss/src/backend/access/heap/heapam.c	2007-05-20 14:26:43.0 -0700
***
*** 65,70 
--- 65,298 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static ss_hints_t  *ss_hints_init();
+ static BlockNumber ss_hints_search(ss_hints_t*,RelFileNode,BlockNumber,bool);
+ 
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ 
+ 
+ /*
+  * ss_hints_init:
+  *
+  * Creates and initializes hint structure of size
+  * nelem. If the structure already exists,
+  * it will simply be returned. 
+  *
+  * It is assumed that SYNC_SCAN_NELEM > 1.
+  *
+  * The hint structure is essentially a
+  * doubly-linked LRU with head and tail 
+  * pointer, but designed to hold a fixed maximum
+  * number of elements in fixed-size shared memory.
+  */
+ static ss_hints_t *ss_hints_init()
+ {
+ 	int i;
+ 	int size;
+ 	int nelem = SYNC_SCAN_NELEM;
+ 	bool found;
+ 	ss_hints_t *hints;
+ 	
+ 	Assert(nelem > 1);
+ 
+ 	size = sizeof(ss_hints_t) + sizeof(ss_lru_t)  + \
+ 		(sizeof(ss_lru_item_t) + sizeof(ss_lru_item_t*)) * nelem;
+ 
+ 	hints = (ss_hints_t*)ShmemInitStruct("Sync Scan Hint List",
+ 		size,&found);
+ 	
+ 	if(found)
+ 		return hints;
+ 
+ 	if(Trace_sync_seqscan)
+ 		elog(DEBUG2,"SYNC_SCAN: Created Hint List");
+ 
+ 	/*
+ 	 * Because it needs to fit in one fixed-size shared memory chunk, 
+ 	 * the code below uses some ugly offset math and typecasts.
+ 	 * 
+ 	 * Also, it maintains its own rudimentary freelist to manage the 
+ 	 * memory it has.
+ 	 */
+ 	hints->lru = (ss_lru_t*)((char*)hints + sizeof(ss_hints_t));
+ 	hints->lru->nelem = nelem;
+ 	hints->lru->head = NULL;
+ 	hints->lru->tail = NULL;
+ 	hints->lru->first = (ss_lru_item_t*)((char*)hints->lru + sizeof(ss_lru_t));
+ 	hints->lru->freelist = (ss_lru_item_t**)(hints->lru->first + nelem);
+ 	hints->lru->free = hints->lru->freelist;
+ 	for( i = nelem-1; i >= 0; i-- ) {
+ 		*hints->lru->free = hints->lru->first + i;
+ 		hints->lru->free++;
+ 	}
+ 	return hints;
+ }
+ 
+ /* 
+  * ss_hints_search:
+  *
+  * This searches the hint structure for a hint with the given 
+  * relfilenode.
+  *
+  * If "set" is true, it will set the location in the hint to 
+  * the given location, and return that location. If not, it 
+  * will return the pre-existing location without modification.
+  *
+  * If the hint is not found, it will be created
+  * at the head of the list with the given location, and that
+  * location will be returned.
+  * 
+  */
+ static BlockNumber ss_hints_search(ss_hints_t *hints, RelFileNode relfilenode, 
+ 	BlockNumber location, bool set)
+ {
+ 	ss_lru_item_t	*item;
+ 	ss_lru_t 	*lru = hints->lru;
+ 
+ 	item = lru->head;
+ 
+ 	while(item != NULL) {
+ 		if(RelFileNodeEquals(item->hint.relfilenode,relfilenode)) {
+ 			if(set)
+ item->hint.location = location;
+ 			if(item == lru->head)
+ return lru->head->hint.location;
+ 
+ 			if(item == lru->tail)
+ lru->tail = item->prev;
+ 			item->prev->next = item->next;
+ 			if(item->next)
+ item->next->prev = item->prev;
+ 
+ 			item->prev = NULL;
+ 			item->next = lru->head;
+ 			lru->head->prev = item;
+ 			lru->head = item;
+ 
+ 			return lru->head->hint.location;
+ 		}
+ 		item = item->next;
+ 	}
+ 
+ 	// if list is full, remove tail
+ 	if(lru->free == lru->freelist) {
+ 		*lru->free = lru->tail;
+ 		lru->free++;
+ 		Assert(lru->tail->prev != NULL);
+ 		lru->tail->prev->next = NULL;
+ 		lru->tail = lru->tail->prev;
+ 	}
+ 
+ 	lru->free--;
+ 	item = *lru->free;
+ 	item->hint.relfilenode = relfilenode;
+ 	item->hint.location = location;
+ 	if(lru->head != NULL)
+ 		lru->head->prev = item;
+ 	item->next = lru->head;
+ 	item->prev = NULL;
+ 	lru->head = item;
+ 	if(item->next == NU

Re: [PATCHES] Synchronized Scan WIP patch

2007-03-22 Thread Jeff Davis
On Thu, 2007-03-22 at 16:43 -0400, Bruce Momjian wrote:
> Will use '16' rather than '100'.
> 
> Your patch has been added to the PostgreSQL unapplied patches list at:
> 
>   http://momjian.postgresql.org/cgi-bin/pgpatches
> 
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.
> 

Here is the latest version, which includes the change to report every 16
pages. 

This patch has the following improvements:

 * reporting interval to 16 pages
 * rearranges the scan location tracing output to work regardless of the
reporting interval. Previously it did not trace the output correctly if
the logging interval was not an even multiple of the reporting interval
 * GUC trace_sync_seqscan=<true|false> now controls whether the DEBUG
output is generated or not. If this is true, a lot of logging output
will be generated at DEBUG3. 
 * You can set sync_seqscan_threshold=<-1.0 ... 100.0>. Positive values
are treated as a fraction of NBuffers. Negative values disable sync
scans.
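The threshold interpretation amounts to something like the following (hypothetical helper name; the patch does this inline in ss_init):

```c
/* Sketch of the sync_seqscan_threshold interpretation: a negative
 * value disables sync scans entirely; a positive value is multiplied
 * by NBuffers to get the minimum table size, in pages, at which a
 * scan will try to synchronize. */
static int
ss_threshold_pages(double sync_seqscan_threshold, int nbuffers)
{
	if (sync_seqscan_threshold < 0)
		return -1;			/* sync scans disabled */
	return (int) (sync_seqscan_threshold * nbuffers);
}
```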
 
Still TODO:

 * Publish my test results (I've collected much of the raw data already
on this version of the patch)
 * SGML documentation (after we stabilize the GUC names and meanings)
 * Possibly remove sync_seqscan_threshold= and instead use a
simple enable/disable boolean that sets the threshold at a constant
fraction of NBuffers (most likely the same fraction as Simon's recycle
buffers patch)

Regards,
Jeff Davis
diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-ss/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	Sun Feb  4 12:00:49 2007
--- postgresql-8.2.3-ss/src/backend/access/heap/heapam.c	Tue Mar 20 16:12:12 2007
***
*** 65,70 
--- 65,275 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan Hint Table 
+  * (creating it if it doesn't already exist) to 
+  * find a possible location for an already running 
+  * sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location
+  * of an already running scan, we improve the chance
+  * of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this
+  * function will subtract from the hint before
+  * starting the scan, in order to pick up pages that
+  * are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is 
+  * already properly set, and sets scan->rs_start_page
+  * to a value based on the hint found. Also, it sets
+  * scan->rs_hint to point to the location of the hint
+  * in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t *hint_table;
+ 	int table_offset;
+ 	bool found;
+ 	int threshold = sync_seqscan_threshold * NBuffers;
+ 	int offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large enough, or sync_scan_threshold 
+ 	 * is disabled (negative), don't Sync Scan.
+ 	 */
+ 	if(threshold < 0 || scan->rs_nblocks < threshold)
+ 	{
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table",
+ 		SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found);
+ 			
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if(!found) {
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG2,"SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means
+ 	 * we have not previously created a hint
+ 	 * at this location in the table.
+ 	 */
+ 	if(scan->rs_hint->relid == 0) {
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint,
+ 	 * we have a hash collision.
+ 	 */
+ 	if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG1,"SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number
+ 	 * for this relation, start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone
+ 	 * TRUNCATEd the table between when the hint 
+ 	 * was set and now.
+ 	 */
+ 	if(scan->rs_hint->location <

Re: [PATCHES] Synchronized Scan WIP patch

2007-03-15 Thread Jeff Davis
On Thu, 2007-03-15 at 08:27 +, Simon Riggs wrote:
> On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
> > SYNC_SCAN_REPORT_INTERVAL 100
> 
> Jeff,
> 
> This will stop SeqScans from working with buffer recycling, unless we
> put the recycle limit to more than 100. That was why I requested you set
> this to 16, so we can use a recycle buffer of 32.
> 

Yes, you're correct. If it's set at 100 it should still work by using
the OS buffer cache to catch up (and then be within a few buffers), but
I think you're right that 16 is the correct value.

I'll change my patch to be 16. I don't think I need to send along a
diff ;)
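The interval logic itself is tiny; a sketch of why 16 matters (assumed names): a follower that joins at the last reported block can be at most 16 pages behind the leader, so a recycle ring of 32 buffers still covers the gap.

```c
/* Report the scan's position once every 16 pages. */
#define SYNC_SCAN_REPORT_INTERVAL 16

static int
ss_should_report(unsigned int blocknum)
{
	return blocknum % SYNC_SCAN_REPORT_INTERVAL == 0;
}
```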

Thanks,
Jeff Davis




[PATCHES] Synchronized Scan WIP patch

2007-03-14 Thread Jeff Davis
This is my latest revision of the Sync Scan patch, and it implements the
observability as discussed with Simon.

Changes:
 * ss_report_loc() called once per hundred pages rather than once per
page
 * DEBUG messages are a little cleaner and easier to parse, for the sake
of analysis after the fact.
 * DEBUG2 reports a sync scan starting, the relation size in pages, and
the location at which the scan starts.
 * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
Numbers are aligned along 5k boundaries to make analysis easier.
 * GUCs:
   * sync_seqscan_threshold: fraction of NBuffers for the threshold
   * sync_seqscan_offset: fraction of NBuffers for the offset
   * trace_sync_seqscan: will be used in final version of patch to
control DEBUG output

Sync_scan_offset may be eliminated completely if it's not shown to be
useful enough in conjunction with Simon's patch. Sync Scans are still a
big win without sync_seqscan_offset.

Sync_seqscan_threshold=<fraction> may be turned into sync_seqscan=<boolean>
with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
The reason is that synchronized scans should activate at the same
threshold as Simon's scan_recycle_buffers feature. Should we make a
"#define BIG_SCAN_THRESHOLD NBuffers/2" to use for both sync_seqscan and
for scan_recycle_buffers?

Regards,
Jeff Davis
diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c	2007-03-13 23:21:27.0 -0700
***
*** 65,70 
--- 65,279 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan Hint Table 
+  * (creating it if it doesn't already exist) to 
+  * find a possible location for an already running 
+  * sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location
+  * of an already running scan, we improve the chance
+  * of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this
+  * function will subtract from the hint before
+  * starting the scan, in order to pick up pages that
+  * are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is 
+  * already properly set, and sets scan->rs_start_page
+  * to a value based on the hint found. Also, it sets
+  * scan->rs_hint to point to the location of the hint
+  * in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t *hint_table;
+ 	int table_offset;
+ 	bool found;
+ 	int threshold = sync_seqscan_threshold * NBuffers;
+ 	int offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large compared to effective_cache_size,
+ 	 * don't Sync Scan.
+ 	 */
+ 	if(scan->rs_nblocks < threshold)
+ 	{
+ 		elog(DEBUG2,"SYNC_SCAN: Table too small to sync scan");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table",
+ 		SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found);
+ 			
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if(!found) {
+ 		elog(DEBUG2,"SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means
+ 	 * we have not previously created a hint
+ 	 * at this location in the table.
+ 	 */
+ 	if(scan->rs_hint->relid == 0) {
+ 		elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint,
+ 	 * we have a hash collision.
+ 	 */
+ 	if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		elog(DEBUG1,"SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number
+ 	 * for this relation, start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone
+ 	 * TRUNCATEd the table between when the hint 
+ 	 * was set and now.
+ 	 */
+ 	if(scan->rs_hint->location < 0 || 
+ 		scan->rs_hint->location >= scan->rs_nblocks) 
+ 	{
+

Re: [PATCHES] scan_recycle_buffers

2007-03-09 Thread Jeff Davis
On Fri, 2007-03-09 at 20:08 +, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Patch to implement buffer cache recycling for scans, as being discussed
> > on pgsql-hackers.
> 
> A few questions come to mind:
> 
> How does it behave with Jeff's synchronized seq scans patch?
> 

I will test it and post my results. I would expect that the CPU usage
will increase, but it might not make a big difference in the overall
cache hit rate if you count OS buffer cache hits.

Regards,
Jeff Davis




Re: [PATCHES] scan_recycle_buffers

2007-03-09 Thread Jeff Davis
On Fri, 2007-03-09 at 20:37 +, Simon Riggs wrote:
> > I wonder if calling RelationGetNumberOfBlocks on every seq scan becomes 
> > a performance issue for tiny tables with for example just 1 page. It 
> > performs an lseek, which isn't free.
> 
> Jeff's patch does this also, for similar reasons.
> 

As Tom pointed out, the value is already in memory by the time it gets
to my code. My code just reads that value from memory.

Regards,
Jeff Davis




Re: [HACKERS] [PATCHES] Patch for UUID datatype (beta)

2006-09-18 Thread Jeff Davis
On Mon, 2006-09-18 at 16:00 -0500, Jim C. Nasby wrote:
> BTW, at a former company we used SHA1s to identify files that had been
> uploaded. We were wondering on the odds of 2 different files hashing to
> the same value and found some statistical comparisons of probabilities.
> I don't recall the details, but the odds of duplicating a SHA1 (1 in
> 2^160) are so insanely small that it's hard to find anything in the
> physical world that compares. To duplicate random 256^256 numbers you'd
> probably have to search until the heat-death of the universe.

That assumes you have good random data. Usually there is some kind of
tradeoff between randomness and performance. If you read /dev/random
each time, that rules out some applications that need to generate UUIDs
very quickly. If you use pseudorandom data, you are vulnerable if a
clock is set back or the sequence repeats.
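For illustration, a version-4 UUID built straight from /dev/urandom (the fast, non-blocking pool) looks roughly like this; the tradeoff above is exactly the choice between this and the blocking /dev/random:

```c
#include <stdio.h>

/* Fill out[] with a random RFC 4122 version-4 UUID.
 * Returns 0 on success, -1 on failure. */
static int
uuid_v4_from_urandom(unsigned char out[16])
{
	FILE   *f = fopen("/dev/urandom", "rb");
	size_t	n;

	if (f == NULL)
		return -1;
	n = fread(out, 1, 16, f);
	fclose(f);
	if (n != 16)
		return -1;
	out[6] = (out[6] & 0x0F) | 0x40;	/* version 4 */
	out[8] = (out[8] & 0x3F) | 0x80;	/* RFC 4122 variant */
	return 0;
}
```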

Regards,
Jeff Davis




[PATCHES] fix xact.c for bad pointer checking

2004-08-03 Thread Jeff Davis
I haven't posted a code patch before, so please be a little patient.

In the 2004-08-02 snapshot, the following sequence causes a crash
(thanks to Christopher Kings-Lynne for finding this bug):

test=# begin;
BEGIN
test=# savepoint "A";
SAVEPOINT
test=# rollback to a;
server closed the connection unexpectedly

--8<--snip--8<--

This seems to have been caused by bad pointer checking in xact.c. I have
attached a patch that fixes the crash. I do not see any side effects,
and it doesn't look like anything else needs to be cleaned up, since
this appears to be the same execution path the original author intended
for an invalid savepoint name anyway.

Regards,
Jeff Davis


--- src/backend/access/transam/xact.c.old	2004-08-03 03:18:12.0 -0700
+++ src/backend/access/transam/xact.c	2004-08-03 03:19:05.0 -0700
@@ -2529,7 +2529,7 @@
 		target = target->parent;
 
 	/* we don't cross savepoint level boundaries */
-	if (target->savepointLevel != s->savepointLevel)
+	if (PointerIsValid(target) && (target->savepointLevel != s->savepointLevel))
 		ereport(ERROR,
 				(errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
 				 errmsg("no such savepoint")));


