Re: [PATCHES] Synchronized Scan WIP patch
Jeff Davis wrote:
> On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
> > * moved the sync scan stuff to a new file, access/heapam/syncscan.c.
> > heapam.c is long enough already, and in theory the same mechanism
> > could be used for large bitmap heap scans in the future.
>
> Good idea, I hadn't thought of that. It seems like the bitmaps in two
> bitmap scans would have to match very closely, but that sounds
> plausible.

Yeah, it's a pretty narrow use case, but plausible in theory.

> This is similar to another idea I had considered (I forget who thought
> of it): to keep a bitmap of tuples still needed and then optimize
> based on that information somehow (read the ones in cache first,
> etc.). It seems substantially more complex, though -- more like a
> prefetch system at that point.
>
> I expected the general refactoring. Hopefully my next patch will be a
> little closer to the code expectations and place less burden on the
> reviewers.

No worries, that's the easy part.

> > Testing:
> > * Multiple scans on different tables, causing movement in the LRU list
> > * Measure the CPU overhead for a single scan
> > * Measure the lock contention with multiple scanners
>
> Is there any way to measure the necessity of the hash table? I would
> think the conditions for that would be a large number of tables being
> actively scanned, causing so much LRU activity that the locks are held
> too long.

Yep, and it's hard to imagine a system like that.

> I also think the optimization of only reporting when the block is not
> found in cache would be useful to test whether the lock contention is
> a problem.

I tried to demonstrate lock contention by running 10 backends, all
repeatedly scanning different tables that are bigger than the sync scan
threshold but small enough to all fit in OS cache. The total runtime of
the tests was the same, ~45 s, with and without the patch. That's pretty
much the worst-case scenario I could think of, so it seems that
contention on the SyncScanLock is not an issue.

There were some missed updates of the scan location, due to the
LWLockConditionalAcquire that I put there to reduce lock contention, but
not too many to worry about. I'll post an updated patch shortly.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
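The lock-avoidance trick mentioned above can be pictured as a small C
sketch: LWLockConditionalAcquire() returns false immediately if the
lock is busy, so a contended report is simply skipped, which is exactly
the source of the "missed updates". A minimal sketch only; SyncScanLock
is named in the message, but ss_store_hint_shared is a hypothetical
helper, not the patch's actual code.

#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/rel.h"

/*
 * Sketch: report the scan position without ever blocking.  If another
 * backend holds SyncScanLock, skip this report; losing an occasional
 * update is harmless because reports are made repeatedly.
 */
static void
ss_report_location(Relation rel, BlockNumber location)
{
	if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
	{
		ss_store_hint_shared(rel, location);	/* hypothetical helper */
		LWLockRelease(SyncScanLock);
	}
	/* else: lock was busy; drop this report and keep scanning */
}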
Re: [PATCHES] Synchronized Scan WIP patch
Here's a work-in-progress update of this patch. I haven't made any major
changes, but a lot of little refactoring and commenting, including:

* moved the sync scan stuff to a new file, access/heapam/syncscan.c.
heapam.c is long enough already, and in theory the same mechanism could
be used for large bitmap heap scans in the future.
* initialization of the shared memory structures is done at postmaster
startup, instead of lazily the first time they're needed
* removed the freelist from the shared mem structure. Instead, all
entries are initialized with invalid values, so they will be consumed
from the end of the LRU without any special handling (sketched after
the quoted patch below).
* changed many variable and struct names to suit my personal taste.

TODO:
* Only report the new scan location when the page was not already in
the buffer cache. This should reduce the congestion when all
synchronized scanners try to update the location at the same time. We
should probably test whether it's worth optimizing at all first.
* decide what to do with the threshold. It should use the same
threshold as the buffer ring stuff.
* decide the size of the LRU list.

Testing:
* Multiple scans on different tables, causing movement in the LRU list
* Measure the CPU overhead for a single scan
* Measure the lock contention with multiple scanners

Any thoughts? Any additional test cases you'd like to see?

Jeff Davis wrote:
> This is my latest revision of the Sync Scan patch, and it implements
> the observability as discussed with Simon.
>
> Changes:
> * ss_report_loc() is called once per hundred pages, rather than once
>   per page
> * DEBUG messages are a little cleaner and easier to parse, for the
>   sake of analysis after the fact
> * DEBUG2 reports a sync scan starting, the relation size in pages,
>   and the location at which the scan starts
> * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
>   5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
>   Numbers are aligned along 5k boundaries to make analysis easier.
> * GUCs:
>   * sync_seqscan_threshold: fraction of NBuffers for the threshold
>   * sync_seqscan_offset: fraction of NBuffers for the offset
>   * trace_sync_seqscan: will be used in the final version of the
>     patch to control DEBUG output
>
> sync_seqscan_offset may be eliminated completely if it's not shown to
> be useful enough in conjunction with Simon's patch. Sync scans are
> still a big win without sync_seqscan_offset.
>
> sync_seqscan_threshold=<real> may be turned into
> sync_seqscan=<boolean> with a fixed activation threshold (NBuffers/2,
> per Simon's suggestion), because synchronized scans should activate
> at the same threshold as Simon's scan_recycle_buffers feature. Should
> we make a #define BIG_SCAN_THRESHOLD NBuffers/2 to use for both
> sync_seqscan and scan_recycle_buffers?
> Regards,
> 	Jeff Davis

diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c	2007-03-13 23:21:27.0 -0700
***************
*** 65,70 ****
--- 65,279 ----
   */
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc, BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init:
+  *
+  * This function reads the Sync Scan Hint Table (creating it if it
+  * doesn't already exist) to find a possible location for an already
+  * running sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location of an already
+  * running scan, we improve the chance of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this function will
+  * subtract from the hint before starting the scan, in order to pick
+  * up pages that are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is already properly
+  * set, and sets scan->rs_start_page to a value based on the hint
+  * found. Also, it sets scan->rs_hint to point to the location of the
+  * hint in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t  *hint_table;
+ 	int			table_offset;
+ 	bool		found;
+ 	int			threshold = sync_seqscan_threshold * NBuffers;
+ 	int			offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large compared to effective_cache_size,
+ 	 * don't Sync Scan.
+ 	 */
+ 	if (scan->rs_nblocks < threshold)
+ 	{
+ 		elog(DEBUG2, "SYNC_SCAN: Table too small to sync scan");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
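The freelist removal Heikki describes above (see the sketch reference
in his list) amounts to initializing every LRU slot to an invalid
value, so empty slots are consumed from the tail by the ordinary
replacement logic. A minimal C sketch under assumed names; the thread
does not show the actual shared-memory struct, so SS_NELEM and
ss_lru_item are illustrative, not the patch's definitions.

#include "postgres.h"
#include "storage/block.h"

#define SS_NELEM 100	/* assumed LRU size; the real value is a TODO above */

typedef struct ss_lru_item
{
	Oid			relid;		/* InvalidOid marks an unused slot */
	BlockNumber	location;	/* last reported scan position */
} ss_lru_item;

static void
ss_lru_init(ss_lru_item *items)
{
	int			i;

	/*
	 * No freelist needed: invalid entries never match a real scan, so
	 * they are naturally reused from the end of the LRU.
	 */
	for (i = 0; i < SS_NELEM; i++)
	{
		items[i].relid = InvalidOid;
		items[i].location = InvalidBlockNumber;
	}
}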
Re: [PATCHES] Synchronized Scan WIP patch
On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
> Here's a work-in-progress update of this patch. I haven't made any
> major changes, but a lot of little refactoring and commenting,
> including:
>
> * moved the sync scan stuff to a new file, access/heapam/syncscan.c.
> heapam.c is long enough already, and in theory the same mechanism
> could be used for large bitmap heap scans in the future.

Good idea, I hadn't thought of that. It seems like the bitmaps in two
bitmap scans would have to match very closely, but that sounds
plausible.

This is similar to another idea I had considered (I forget who thought
of it): to keep a bitmap of tuples still needed and then optimize based
on that information somehow (read the ones in cache first, etc.). It
seems substantially more complex, though -- more like a prefetch system
at that point.

I expected the general refactoring. Hopefully my next patch will be a
little closer to the code expectations and place less burden on the
reviewers.

> Testing:
> * Multiple scans on different tables, causing movement in the LRU list
> * Measure the CPU overhead for a single scan
> * Measure the lock contention with multiple scanners

Is there any way to measure the necessity of the hash table? I would
think the conditions for that would be a large number of tables being
actively scanned, causing so much LRU activity that the locks are held
too long.

I also think the optimization of only reporting when the block is not
found in cache would be useful to test whether the lock contention is a
problem.

Regards,
	Jeff Davis
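In code, the "report only on a cache miss" optimization Jeff proposes
would look something like the following. Purely illustrative:
PostgreSQL's ReadBuffer() does not expose whether the page was already
resident, so ReadBufferCheckHit() is a hypothetical variant invented
for this sketch, and ss_report_loc() is the patch's reporting call.

static Buffer
ss_read_and_report(HeapScanDesc scan, BlockNumber page)
{
	bool		hit;
	Buffer		buf;

	/* hypothetical ReadBuffer() variant with a hit/miss out-parameter */
	buf = ReadBufferCheckHit(scan->rs_rd, page, &hit);

	/*
	 * Only the backend that actually performed I/O updates the shared
	 * location, so cache-hit followers never touch the lock.
	 */
	if (!hit)
		ss_report_loc(scan, page);

	return buf;
}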
Re: [PATCHES] Synchronized Scan WIP patch
Will use '16' rather than '100'.

Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Jeff Davis wrote:
> This is my latest revision of the Sync Scan patch, and it implements
> the observability as discussed with Simon.
>
> Changes:
> * ss_report_loc() is called once per hundred pages, rather than once
>   per page
> * DEBUG messages are a little cleaner and easier to parse, for the
>   sake of analysis after the fact
> * DEBUG2 reports a sync scan starting, the relation size in pages,
>   and the location at which the scan starts
> * DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
>   5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
>   Numbers are aligned along 5k boundaries to make analysis easier.
> * GUCs:
>   * sync_seqscan_threshold: fraction of NBuffers for the threshold
>   * sync_seqscan_offset: fraction of NBuffers for the offset
>   * trace_sync_seqscan: will be used in the final version of the
>     patch to control DEBUG output
>
> sync_seqscan_offset may be eliminated completely if it's not shown to
> be useful enough in conjunction with Simon's patch. Sync scans are
> still a big win without sync_seqscan_offset.
>
> sync_seqscan_threshold=<real> may be turned into
> sync_seqscan=<boolean> with a fixed activation threshold (NBuffers/2,
> per Simon's suggestion), because synchronized scans should activate
> at the same threshold as Simon's scan_recycle_buffers feature. Should
> we make a #define BIG_SCAN_THRESHOLD NBuffers/2 to use for both
> sync_seqscan and scan_recycle_buffers?
>
> Regards,
> 	Jeff Davis

[ Attachment, skipping... ]

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  http://momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] Synchronized Scan WIP patch
On Thu, 2007-03-22 at 16:43 -0400, Bruce Momjian wrote:
> Will use '16' rather than '100'.
>
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
> 	http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers
> reviews and approves it.

Here is the latest version, which includes the change to report every
16 pages. This patch has the following improvements:

* reporting interval reduced to 16 pages
* rearranges the scan location tracing output to work regardless of the
reporting interval. Previously, the output was not traced correctly if
the logging interval was not an even multiple of the reporting interval.
* the GUC trace_sync_seqscan=<bool> now controls whether the DEBUG
output is generated. If it is true, a lot of logging output will be
generated at DEBUG3.
* sync_seqscan_threshold can be set to -1.0 ... 100.0. Positive values
are treated as a fraction of NBuffers; negative values disable sync
scans.

Still TODO:

* Publish my test results (I've already collected much of the raw data
on this version of the patch)
* SGML documentation (after we stabilize the GUC names and meanings)
* Possibly remove sync_seqscan_threshold=<real> and instead use a
simple enable/disable boolean that sets the threshold at a constant
fraction of NBuffers (most likely the same fraction as Simon's
recycle-buffers patch)

Regards,
	Jeff Davis

diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-ss/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	Sun Feb 4 12:00:49 2007
--- postgresql-8.2.3-ss/src/backend/access/heap/heapam.c	Tue Mar 20 16:12:12 2007
***************
*** 65,70 ****
--- 65,275 ----
   */
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc, BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init:
+  *
+  * This function reads the Sync Scan Hint Table (creating it if it
+  * doesn't already exist) to find a possible location for an already
+  * running sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location of an already
+  * running scan, we improve the chance of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this function will
+  * subtract from the hint before starting the scan, in order to pick
+  * up pages that are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is already properly
+  * set, and sets scan->rs_start_page to a value based on the hint
+  * found. Also, it sets scan->rs_hint to point to the location of the
+  * hint in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t  *hint_table;
+ 	int			table_offset;
+ 	bool		found;
+ 	int			threshold = sync_seqscan_threshold * NBuffers;
+ 	int			offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large enough, or sync_seqscan_threshold is
+ 	 * disabled (negative), don't Sync Scan.
+ 	 */
+ 	if (threshold < 0 || scan->rs_nblocks < threshold)
+ 	{
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t *) ShmemInitStruct("Sync Scan Hint Table",
+ 				SYNC_SCAN_TABLE_SIZE * sizeof(ss_hint_t), &found);
+ 
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if (!found)
+ 	{
+ 		if (Trace_sync_seqscan)
+ 			elog(DEBUG2, "SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table, 0, sizeof(ss_hint_t) * SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means we have not previously
+ 	 * created a hint at this location in the table.
+ 	 */
+ 	if (scan->rs_hint->relid == 0)
+ 	{
+ 		if (Trace_sync_seqscan)
+ 			elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint, we have a hash
+ 	 * collision.
+ 	 */
+ 	if (RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		if (Trace_sync_seqscan)
+ 			elog(DEBUG1, "SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number for this relation,
+ 	 * start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone TRUNCATEd the table
+ 	 * between when the hint was set and now.
+ 	 */
+ 	if (scan->rs_hint->location < 0 ||
+ 		scan->rs_hint->location >= scan->rs_nblocks)
+ 	{
+ 		if (Trace_sync_seqscan)
+ 			elog(DEBUG2, "SYNC_SCAN: Hint %d out of range. \
+ Relation has %d pages.",
+ 				scan->rs_hint->location, scan->rs_nblocks);
+ 
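ss_hash() is declared at the top of the diff, but its body falls
outside this excerpt. Given the hash-collision handling above, a
plausible minimal form is a plain modulo of the relation OID over the
hint-table size; this is a sketch, and the patch's real hash function
may differ.

/*
 * Sketch of ss_hash: map a relation to a slot in the fixed-size hint
 * table.  SYNC_SCAN_TABLE_SIZE comes from the patch; the modulo is an
 * assumption consistent with ss_init()'s collision check.
 */
static int
ss_hash(HeapScanDesc scan)
{
	return RelationGetRelid(scan->rs_rd) % SYNC_SCAN_TABLE_SIZE;
}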
Re: [PATCHES] Synchronized Scan WIP patch
On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
> SYNC_SCAN_REPORT_INTERVAL 100

Jeff,

This will stop SeqScans from working with buffer recycling unless we
put the recycle limit above 100. That was why I requested you set this
to 16, so we can use a recycle buffer of 32.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
Re: [PATCHES] Synchronized Scan WIP patch
On Thu, 2007-03-15 at 08:27 +0000, Simon Riggs wrote:
> On Wed, 2007-03-14 at 01:42 -0700, Jeff Davis wrote:
> > SYNC_SCAN_REPORT_INTERVAL 100
>
> Jeff,
>
> This will stop SeqScans from working with buffer recycling unless we
> put the recycle limit above 100. That was why I requested you set
> this to 16, so we can use a recycle buffer of 32.

Yes, you're correct. If it's set at 100, it should still work by using
the OS buffer cache to catch up (and then be within a few buffers), but
I think you're right that 16 is the correct value. I'll change my patch
to use 16. I don't think I need to send along a diff ;)

Thanks,
	Jeff Davis
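The cadence being agreed on here is simple to picture: report the scan
position only on every 16th page, so a recycle ring of 32 buffers
always spans the gap between the last report and the scan's true
position. A sketch; only the constant's value comes from this exchange,
while the ss_maybe_report wrapper is illustrative and ss_report_loc()
is the patch's reporting call.

#define SYNC_SCAN_REPORT_INTERVAL 16	/* was 100; changed per Simon */

/*
 * Illustrative wrapper: report at most once every 16 pages, keeping
 * reported positions within half a 32-buffer recycle ring of the
 * scan's actual position.
 */
static void
ss_maybe_report(HeapScanDesc scan, BlockNumber blkno)
{
	if (blkno % SYNC_SCAN_REPORT_INTERVAL == 0)
		ss_report_loc(scan, blkno);
}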
[PATCHES] Synchronized Scan WIP patch
This is my latest revision of the Sync Scan patch, and it implements
the observability as discussed with Simon.

Changes:

* ss_report_loc() is called once per hundred pages, rather than once
per page
* DEBUG messages are a little cleaner and easier to parse, for the sake
of analysis after the fact
* DEBUG2 reports a sync scan starting, the relation size in pages, and
the location at which the scan starts
* DEBUG2 reports the location of a scan every 50k pages, DEBUG3 every
5k pages (before it was 100k/10k at DEBUG3/DEBUG4, respectively).
Numbers are aligned along 5k boundaries to make analysis easier.
* GUCs:
  * sync_seqscan_threshold: fraction of NBuffers for the threshold
  * sync_seqscan_offset: fraction of NBuffers for the offset
  * trace_sync_seqscan: will be used in the final version of the patch
to control DEBUG output

sync_seqscan_offset may be eliminated completely if it's not shown to
be useful enough in conjunction with Simon's patch. Sync scans are
still a big win without sync_seqscan_offset.

sync_seqscan_threshold=<real> may be turned into sync_seqscan=<boolean>
with a fixed activation threshold (NBuffers/2, per Simon's suggestion),
because synchronized scans should activate at the same threshold as
Simon's scan_recycle_buffers feature. Should we make a #define
BIG_SCAN_THRESHOLD NBuffers/2 to use for both sync_seqscan and
scan_recycle_buffers?

Regards,
	Jeff Davis

diff -cr postgresql-8.2.3/src/backend/access/heap/heapam.c postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c
*** postgresql-8.2.3/src/backend/access/heap/heapam.c	2007-02-04 12:00:49.0 -0800
--- postgresql-8.2.3-syncscan/src/backend/access/heap/heapam.c	2007-03-13 23:21:27.0 -0700
***************
*** 65,70 ****
--- 65,279 ----
   */
+ static BlockNumber ss_init(HeapScanDesc);
+ static int ss_store_hint(HeapScanDesc, BlockNumber);
+ static int ss_hash(HeapScanDesc);
+ bool Trace_sync_seqscan = false;
+ double sync_seqscan_threshold = DEFAULT_SYNC_SCAN_THRESHOLD;
+ double sync_seqscan_offset = DEFAULT_SYNC_SCAN_OFFSET;
+ 
+ /*
+  * ss_init:
+  *
+  * This function reads the Sync Scan Hint Table (creating it if it
+  * doesn't already exist) to find a possible location for an already
+  * running sequential scan on this relation.
+  *
+  * By starting a sequential scan near the location of an already
+  * running scan, we improve the chance of finding pages in cache.
+  *
+  * Also, depending on SYNC_SCAN_START_OFFSET, this function will
+  * subtract from the hint before starting the scan, in order to pick
+  * up pages that are likely to already be in cache.
+  *
+  * This function assumes that scan->rs_nblocks is already properly
+  * set, and sets scan->rs_start_page to a value based on the hint
+  * found. Also, it sets scan->rs_hint to point to the location of the
+  * hint in the hint table.
+  */
+ static BlockNumber ss_init(HeapScanDesc scan)
+ {
+ 	ss_hint_t  *hint_table;
+ 	int			table_offset;
+ 	bool		found;
+ 	int			threshold = sync_seqscan_threshold * NBuffers;
+ 	int			offset = sync_seqscan_offset * NBuffers;
+ 
+ 	/*
+ 	 * If the table is not large compared to effective_cache_size,
+ 	 * don't Sync Scan.
+ 	 */
+ 	if (scan->rs_nblocks < threshold)
+ 	{
+ 		elog(DEBUG2, "SYNC_SCAN: Table too small to sync scan");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	table_offset = ss_hash(scan);
+ 	hint_table = (ss_hint_t *) ShmemInitStruct("Sync Scan Hint Table",
+ 				SYNC_SCAN_TABLE_SIZE * sizeof(ss_hint_t), &found);
+ 
+ 	scan->rs_hint = &hint_table[table_offset];
+ 
+ 	/*
+ 	 * If we just created the hint table for the first time,
+ 	 * initialize the table to zero and start the scan at page 0.
+ 	 */
+ 	if (!found)
+ 	{
+ 		elog(DEBUG2, "SYNC_SCAN: Created Hint Table");
+ 		memset(hint_table, 0, sizeof(ss_hint_t) * SYNC_SCAN_TABLE_SIZE);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint's relid is 0, that means we have not previously
+ 	 * created a hint at this location in the table.
+ 	 */
+ 	if (scan->rs_hint->relid == 0)
+ 	{
+ 		elog(DEBUG2, "SYNC_SCAN: Hint empty");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the relid doesn't match the one in the hint, we have a hash
+ 	 * collision.
+ 	 */
+ 	if (RelationGetRelid(scan->rs_rd) != scan->rs_hint->relid)
+ 	{
+ 		elog(DEBUG1, "SYNC_SCAN: Hash collision");
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * If the hint is not a valid block number for this relation,
+ 	 * start at 0.
+ 	 *
+ 	 * This can happen if, for instance, someone TRUNCATEd the table
+ 	 * between when the hint was set and now.
+ 	 */
+ 	if (scan->rs_hint->location < 0 ||
+ 		scan->rs_hint->location >= scan->rs_nblocks)
+ 	{
+ 		elog(DEBUG2, "SYNC_SCAN: Hint %d out of range. \
+ Relation has %d pages.",
+ 			scan->rs_hint->location, scan->rs_nblocks);
+ 		scan->rs_start_page = 0;
+ 		return 0;
+ 	}
+ 
+ 	scan->rs_start_page =
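The excerpt cuts off mid-assignment above, and ss_store_hint()
(declared at the top of the diff) is also outside what survives. Based
on how ss_init() uses the hint-table slot, a minimal sketch of what it
plausibly does follows; the real body, including any locking around the
shared-memory write, is not shown in this thread.

/*
 * Sketch of ss_store_hint: publish the scan's current position in
 * this relation's hint-table slot so that later scans can start
 * nearby.  Field usage follows ss_init() above; the rest is an
 * assumption.
 */
static int
ss_store_hint(HeapScanDesc scan, BlockNumber location)
{
	scan->rs_hint->relid = RelationGetRelid(scan->rs_rd);
	scan->rs_hint->location = location;
	return 0;
}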