Re: [PATCHES] Synchronized Scan WIP patch

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:

On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
* moved the sync scan stuff to a new file access/heapam/syncscan.c. 
heapam.c is long enough already, and in theory the same mechanism could 
be used for large bitmap heap scans in the future.


Good idea, I hadn't thought of that. It seems like the bitmaps in two
bitmap scans would have to match very closely, but that sounds
plausible.


Yeah, it's a pretty narrow use case, but plausible in theory.


This is similar to another idea I had considered (I forget who thought
of it) to try to have a bitmap of tuples still needed and then try to
optimize based on that information somehow (read the ones in cache
first, etc). Seems substantially more complex though, more like a
prefetch system at that point.

I expected the general refactoring. Hopefully my next patch is a little
closer to the code expectations and places less burden on the reviewers.


No worries, that's the easy part.


Testing:
* Multiple scans on different tables, causing movement in the LRU list
* Measure the CPU overhead for a single scan
* Measure the lock contention with multiple scanners


Is there any way to measure the necessity of the hash table? I would
think the conditions for that would be a large number of tables being
actively scanned, causing a lot of LRU activity, such that the locks are
held too long.


Yep, and it's hard to imagine a system like that.


I also think the optimization of only reporting when the block is not
found in cache would be useful to test if the lock contention is a problem.


I tried to demonstrate lock contention by running 10 backends all 
repeatedly scanning different tables that are bigger than the sync scan 
threshold but small enough to all fit in OS cache. The total runtime of 
the tests was the same, ~45 s, with and without the patch. That's pretty 
much the worst case scenario I could think of, so it seems that 
contention on the SyncScanLock is not an issue. There were some missed 
updates of the scan location, due to the LWLockConditionalAcquire that 
I put there to reduce lock contention, but not enough to worry about.
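
For concreteness, a minimal sketch of the conditional-acquire behaviour
described above; SyncScanLock and ss_update_hint() are illustrative
names, not necessarily the identifiers used in the patch:

static void
ss_report_loc(HeapScanDesc scan, BlockNumber blocknum)
{
	/*
	 * Try the lock but never block the scan on it: if another backend
	 * holds SyncScanLock, skip this report and let the next one catch
	 * up.  This is what makes the "missed updates" above possible.
	 */
	if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
	{
		ss_update_hint(scan, blocknum);	/* store the new position */
		LWLockRelease(SyncScanLock);
	}
}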


I'll post an updated patch shortly.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized Scan WIP patch

2007-05-31 Thread Heikki Linnakangas

Here's a work-in-progress update of this patch.

I haven't done any major changes, but a lot of little refactoring and 
commenting, including:


* moved the sync scan stuff to a new file access/heapam/syncscan.c. 
heapam.c is long enough already, and in theory the same mechanism could 
be used for large bitmap heap scans in the future.
* initialization of the shared memory structures is done at postmaster 
startup, instead of lazily the first time they're needed
* removed the freelist from the shared mem structure. Instead, all 
entries are initialized with invalid values, so they will be consumed 
from the end of the LRU without any special handling (a sketch of this 
follows below).

* changed many variable and struct names to suit my personal taste.
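
For concreteness, a minimal sketch of the freelist-less initialization
just described; the struct and field names are illustrative, not the
patch's actual ones:

typedef struct ss_lru_entry
{
	Oid			relid;		/* InvalidOid means "never used" */
	BlockNumber	location;	/* last reported scan position */
} ss_lru_entry;

static void
ss_lru_init(ss_lru_entry *entries, int nentries)
{
	int			i;

	/*
	 * Every entry starts out invalid, so unused slots are evicted from
	 * the LRU tail like any stale entry, with no freelist needed.
	 */
	for (i = 0; i < nentries; i++)
	{
		entries[i].relid = InvalidOid;
		entries[i].location = InvalidBlockNumber;
	}
}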

TODO:
* Only report the new scan location when the page was not in buffer 
cache. This should reduce the congestion when all synchronized scanners 
try to update the location at the same time. Should probably test first 
to see if it's worth optimizing (see the sketch after this list).
* decide what to do with the threshold. Should use the same threshold as 
the buffer ring stuff.

* decide the size of the LRU list.
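
A minimal sketch of the report-on-cache-miss idea from the first TODO
item; ss_maybe_report() and the was_cache_hit flag are hypothetical,
since there is no cheap public "was this page cached?" test to quote:

static void
ss_maybe_report(HeapScanDesc scan, BlockNumber blocknum, bool was_cache_hit)
{
	/*
	 * A hit in shared buffers suggests we are trailing another scanner
	 * that is already reporting for the group; only a miss means we
	 * are at the head of the pack and worth following.
	 */
	if (!was_cache_hit)
		ss_report_loc(scan, blocknum);
}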

Testing:
* Multiple scans on different tables, causing movement in the LRU list
* Measure the CPU overhead for a single scan
* Measure the lock contention with multiple scanners

Any thoughts? Any additional test cases you'd like to see?

Jeff Davis wrote:

This is my latest revision of the Sync Scan patch, and it implements the
observability as discussed with Simon.

Changes:
 * ss_report_loc() called once per hundred pages rather than once per
page (a sketch of the throttling follows this list)
 * DEBUG messages are a little cleaner and easier to parse, for the sake
of analysis after the fact.
 * DEBUG2 reports a sync scan starting, the relation size in pages, and
the location at which the scan starts.
 * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
Numbers are aligned along 5k boundaries to make analysis easier.
 * GUCs:
   * sync_seqscan_threshold: fraction of NBuffers for the threshold
   * sync_seqscan_offset: fraction of NBuffers for the offset
   * trace_sync_seqscan: will be used in final version of patch to
control DEBUG output
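
A rough sketch of the once-per-hundred-pages reporting from the first
item above; ss_report_loc() is the name used in this mail, but the
modulo test is a guess at the mechanism, not code from the patch:

#define SYNC_SCAN_REPORT_INTERVAL 100	/* later changed to 16, see downthread */

	/* in the per-page scan loop: */
	if (blocknum % SYNC_SCAN_REPORT_INTERVAL == 0)
		ss_report_loc(scan, blocknum);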

Sync_scan_offset may be eliminated completely if it's not shown to be
useful enough in conjunction with Simon's patch. Sync Scans are still a
big win without sync_seqscan_offset.

Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean>
with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
The reason is that synchronized scans should activate at the same
threshold as Simon's scan_recycle_buffers feature. Should we make a
#define BIG_SCAN_THRESHOLD NBuffers/2 to use for both sync_seqscan and
for scan_recycle_buffers?
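
To make the fraction-of-NBuffers semantics concrete (the numbers are
only an example, not from the patch): with shared_buffers = 128MB,
NBuffers is 16384 8K buffers, so sync_seqscan_threshold = 0.5 gives a
threshold of 0.5 * 16384 = 8192 pages; only tables larger than 64MB
would be sync-scanned, which is the same cutoff as NBuffers/2.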

Regards,
Jeff Davis




diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c	2007-03-13 23:21:27.0 -0700
***
*** 65,70 
--- 65,279 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan Hint Table 
+  * (creating it if it doesn't already exist) to 
+  * find a possible location for an already running 
+  * sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location
+  * of an already running scan, we improve the chance
+  * of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this
+  * function will subtract from the hint before
+  * starting the scan, in order to pick up pages that
+  * are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is 
+  * already properly set, and sets scan->rs_start_page
+  * to a value based on the hint found. Also, it sets
+  * scan->rs_hint to point to the location of the hint
+  * in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t *hint_table;
+ 	int table_offset;
+ 	bool found;
+ 	int threshold = sync_seqscan_threshold * NBuffers;
+ 	int offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large compared to effective_cache_size,
+ 	 * don't Sync Scan.
+ 	 */
+ 	if(scan->rs_nblocks < threshold)
+ 	{
+ 		elog(DEBUG2,"SYNC_SCAN: Table too small to sync scan");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
Re: [PATCHES] Synchronized Scan WIP patch

2007-05-31 Thread Jeff Davis
On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
 Here's a work-in-progress update of this patch.
 
 I haven't done any major changes, but a lot of little refactoring and 
 commenting, including:
 
 * moved the sync scan stuff to a new file access/heapam/syncscan.c. 
 heapam.c is long enough already, and in theory the same mechanism could 
 be used for large bitmap heap scans in the future.

Good idea, I hadn't thought of that. It seems like the bitmaps in two
bitmap scans would have to match very closely, but that sounds
plausible.

This is similar to another idea I had considered (I forget who thought
of it) to try to have a bitmap of tuples still needed and then try to
optimize based on that information somehow (read the ones in cache
first, etc). Seems substantially more complex though, more like a
prefetch system at that point.

I expected the general refactoring. Hopefully my next patch is a little
closer to the code expectations and places less burden on the reviewers.

 Testing:
 * Multiple scans on different tables, causing movement in the LRU list
 * Measure the CPU overhead for a single scan
 * Measure the lock contention with multiple scanners
 

Is there any way to measure the necessity of the hash table? I would
think the conditions for that would be a large number of tables being
actively scanned, causing a lot of LRU activity, such that the locks are
held too long.

I also think the optimization of only reporting when the block is not
found in cache would be useful to test if the lock contention is a problem.

Regards,
Jeff Davis





Re: [PATCHES] Synchronized Scan WIP patch

2007-03-22 Thread Bruce Momjian

Will use '16' rather than '100'.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---


Jeff Davis wrote:
 This is my latest revision of the Sync Scan patch, and it implements the
 observability as discussed with Simon.
 
 Changes:
  * ss_report_loc() called once per hundred pages rather than once per
 page
  * DEBUG messages are a little cleaner and easier to parse, for the sake
 of analysis after the fact.
  * DEBUG2 reports a sync scan starting, the relation size in pages, and
 the location at which the scan starts.
  * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
 5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
 Numbers are aligned along 5k boundaries to make analysis easier.
  * GUCs:
* sync_seqscan_threshold: fraction of NBuffers for the threshold
* sync_seqscan_offset: fraction of NBuffers for the offset
* trace_sync_seqscan: will be used in final version of patch to
 control DEBUG output
 
 Sync_scan_offset may be eliminated completely if it's not shown to be
 useful enough in conjunction with Simon's patch. Sync Scans are still a
 big win without sync_seqscan_offset.
 
 Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean>
 with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
 The reason is that synchronized scans should activate at the same
 threshold as Simon's scan_recycle_buffers feature. Should we make a
 #define BIG_SCAN_THRESHOLD NBuffers/2 to use for both sync_seqscan and
 for scan_recycle_buffers?
 
 Regards,
   Jeff Davis

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Synchronized Scan WIP patch

2007-03-22 Thread Jeff Davis
On Thu, 2007-03-22 at 16:43 -0400, Bruce Momjian wrote:
 Will use '16' rather than '100'.
 
 Your patch has been added to the PostgreSQL unapplied patches list at:
 
   http://momjian.postgresql.org/cgi-bin/pgpatches
 
 It will be applied as soon as one of the PostgreSQL committers reviews
 and approves it.
 

Here is the latest version, which includes the change to report every 16
pages. 

This patch has the following improvements:

 * reporting interval to 16 pages
 * rearranges the scan location tracing output to work regardless of the
reporting interval. Previously it did not trace the output correctly if
the logging interval was not an even multiple of the reporting interval
 * GUC trace_sync_seqscan=<bool> now controls whether the DEBUG output
is generated or not. If this is true, a lot of logging output will be
generated at DEBUG3. 
 * You can set sync_seqscan_threshold=-1.0 ... 100.0. Positive values
are treated as a fraction of NBuffers. Negative values disable sync
scans.
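
For illustration, a hypothetical postgresql.conf fragment using this
revision's GUCs (the names are from this thread; the values are
arbitrary examples, not recommendations):

	# sync-scan tables larger than half of shared_buffers; -1 disables
	sync_seqscan_threshold = 0.5
	# start this fraction of NBuffers behind the reported location
	sync_seqscan_offset = 0.1
	# emit scan-position DEBUG output
	trace_sync_seqscan = on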
 
Still TODO:

 * Publish my test results (I've collected much of the raw data already
on this version of the patch)
 * SGML documentation (after we stabilize the GUC names and meanings)
 * Possibly remove sync_seqscan_threshold=real and instead use a
simple enable/disable boolean that sets the threshold at a constant
fraction of NBuffers (most likely the same fraction as Simon's recycle
buffers patch)

Regards,
Jeff Davis
diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-ss/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	Sun Feb  4 12:00:49 2007
--- postgresql-8.2.3-ss/src/backend/access/heap/heapam.c	Tue Mar 20 16:12:12 2007
***
*** 65,70 
--- 65,275 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan Hint Table 
+  * (creating it if it doesn't already exist) to 
+  * find a possible location for an already running 
+  * sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location
+  * of an already running scan, we improve the chance
+  * of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this
+  * function will subtract from the hint before
+  * starting the scan, in order to pick up pages that
+  * are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is 
+  * already properly set, and sets scan->rs_start_page
+  * to a value based on the hint found. Also, it sets
+  * scan->rs_hint to point to the location of the hint
+  * in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t *hint_table;
+ 	int table_offset;
+ 	bool found;
+ 	int threshold = sync_seqscan_threshold * NBuffers;
+ 	int offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large enough, or sync_scan_threshold 
+ 	 * is disabled (negative), don't Sync Scan.
+ 	 */
+ 	if(threshold < 0 || scan->rs_nblocks < threshold)
+ 	{
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table",
+ 		SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found);
+ 			
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if(!found) {
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG2,"SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means
+ 	 * we have not previously created a hint
+ 	 * at this location in the table.
+ 	 */
+ 	if(scan->rs_hint->relid == 0) {
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint,
+ 	 * we have a hash collision.
+ 	 */
+ 	if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG1,"SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number
+ 	 * for this relation, start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone
+ 	 * TRUNCATEd the table between when the hint 
+ 	 * was set and now.
+ 	 */
+ 	if(scan->rs_hint->location < 0 || 
+ 		scan->rs_hint->location >= scan->rs_nblocks) 
+ 	{
+ 		if(Trace_sync_seqscan)
+ 			elog(DEBUG2,"SYNC_SCAN: Hint %d out of range. \
+  Relation has %d pages.",
+ 				scan->rs_hint->location,scan->rs_nblocks);
+ 		

Re: [PATCHES] Synchronized Scan WIP patch

2007-03-15 Thread Simon Riggs
On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
 SYNC_SCAN_REPORT_INTERVAL 100

Jeff,

This will stop SeqScans from working with buffer recycling, unless we
raise the recycle limit to more than 100: a scan that joins at the last
reported position can lag the scan head by up to the report interval,
so the hint has to stay within the recycle ring. That was why I
requested you set this to 16, so we can use a recycle buffer of 32.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Synchronized Scan WIP patch

2007-03-15 Thread Jeff Davis
On Thu, 2007-03-15 at 08:27 +, Simon Riggs wrote:
 On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
  SYNC_SCAN_REPORT_INTERVAL 100
 
 Jeff,
 
 This will stop SeqScans from working with buffer recycling, unless we
 put the recycle limit to more than 100. That was why I requested you set
 this to 16, so we can use a recycle buffer of 32.
 

Yes, you're correct. If it's set at 100 it should still work by using
the OS buffer cache to catch up (and then be within a few buffers), but
I think you're right that 16 is the correct value.

I'll change my patch to be 16. I don't think I need to send along a
diff ;)

Thanks,
Jeff Davis




[PATCHES] Synchronized Scan WIP patch

2007-03-14 Thread Jeff Davis
This is my latest revision of the Sync Scan patch, and it implements the
observability as discussed with Simon.

Changes:
 * ss_report_loc() called once per hundred pages rather than once per
page
 * DEBUG messages are a little cleaner and easier to parse, for the sake
of analysis after the fact.
 * DEBUG2 reports a sync scan starting, the relation size in pages, and
the location at which the scan starts.
 * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
Numbers are aligned along 5k boundaries to make analysis easier.
 * GUCs:
   * sync_seqscan_threshold: fraction of NBuffers for the threshold
   * sync_seqscan_offset: fraction of NBuffers for the offset
   * trace_sync_seqscan: will be used in final version of patch to
control DEBUG output

Sync_scan_offset may be eliminated completely if it's not shown to be
useful enough in conjunction with Simon's patch. Sync Scans are still a
big win without sync_seqscan_offset.

Sync_scan_threshold=<real> may be turned into sync_seqscan=<boolean>
with a fixed activation threshold (NBuffers/2 per Simon's suggestion).
The reason is that synchronized scans should activate at the same
threshold as Simon's scan_recycle_buffers feature. Should we make a
#define BIG_SCAN_THRESHOLD NBuffers/2 to use for both sync_seqscan and
for scan_recycle_buffers?

Regards,
Jeff Davis
diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c	2007-03-13 23:21:27.0 -0700
***
*** 65,70 
--- 65,279 
   * 
   */
  
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc,BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init: 
+  *
+  * This function reads the Sync Scan Hint Table 
+  * (creating it if it doesn't already exist) to 
+  * find a possible location for an already running 
+  * sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location
+  * of an already running scan, we improve the chance
+  * of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this
+  * function will subtract from the hint before
+  * starting the scan, in order to pick up pages that
+  * are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is 
+  * already properly set, and sets scan->rs_start_page
+  * to a value based on the hint found. Also, it sets
+  * scan->rs_hint to point to the location of the hint
+  * in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t *hint_table;
+ 	int table_offset;
+ 	bool found;
+ 	int threshold = sync_seqscan_threshold * NBuffers;
+ 	int offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large compared to effective_cache_size,
+ 	 * don't Sync Scan.
+ 	 */
+ 	if(scan->rs_nblocks < threshold)
+ 	{
+ 		elog(DEBUG2,"SYNC_SCAN: Table too small to sync scan");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t*)ShmemInitStruct("Sync Scan Hint Table",
+ 		SYNC_SCAN_TABLE_SIZE*sizeof(ss_hint_t),&found);
+ 			
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if(!found) {
+ 		elog(DEBUG2,"SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table,0,sizeof(ss_hint_t)*SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means
+ 	 * we have not previously created a hint
+ 	 * at this location in the table.
+ 	 */
+ 	if(scan->rs_hint->relid == 0) {
+ 		elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint,
+ 	 * we have a hash collision.
+ 	 */
+ 	if(RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		elog(DEBUG1,"SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number
+ 	 * for this relation, start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone
+ 	 * TRUNCATEd the table between when the hint 
+ 	 * was set and now.
+ 	 */
+ 	if(scan->rs_hint->location < 0 || 
+ 		scan->rs_hint->location >= scan->rs_nblocks) 
+ 	{
+ 		elog(DEBUG2,"SYNC_SCAN: Hint %d out of range. \
+  Relation has %d pages.",
+ 			scan->rs_hint->location,scan->rs_nblocks);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	scan->rs_start_page =