Re: [HACKERS] Bug: Buffer cache is not scan resistant
This is a test version, but I am putting it in the queue so we can track it there. Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews and approves it.

---

Simon Riggs wrote:
> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> > > With the default value of scan_recycle_buffers(=0), VACUUM seems to
> > > use all of the buffers in the pool, just like existing sequential
> > > scans. Is this intended?
> >
> > Yes, but it's not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't
> change existing VACUUM behaviour.
>
> --
> Simon Riggs
> EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

--
Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon, is this patch ready to be added to the patch queue? I assume not.

---

Simon Riggs wrote:
> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> > > With the default value of scan_recycle_buffers(=0), VACUUM seems to
> > > use all of the buffers in the pool, just like existing sequential
> > > scans. Is this intended?
> >
> > Yes, but it's not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't
> change existing VACUUM behaviour.
>
> --
> Simon Riggs
> EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

--
Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

> You may know we've built something similar and have seen similar gains.

Cool.

> We're planning a modification that I think you should consider: when
> there is a sequential scan of a table larger than the size of
> shared_buffers, we are allowing the scan to write through the
> shared_buffers cache.

Write? For which operations?

I was thinking to do this for bulk writes also, but it would require changes to bgwriter's cleaning sequence. Are you saying to write say ~32 buffers then fsync them, rather than letting bgwriter do that? Then allow those buffers to be reused?

> The hypothesis is that if a relation is of a size equal to or less
> than the size of shared_buffers, it is cacheable and should use the
> standard LRU approach to provide for reuse.

Sounds reasonable. Please say more.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-13 at 13:40 +0900, ITAGAKI Takahiro wrote:

> Simon Riggs [EMAIL PROTECTED] wrote:
> > > With the default value of scan_recycle_buffers(=0), VACUUM seems to
> > > use all of the buffers in the pool, just like existing sequential
> > > scans. Is this intended?
> >
> > New test version enclosed, where scan_recycle_buffers = 0 doesn't
> > change existing VACUUM behaviour.
>
> This is a result with scan_recycle_buffers.v3.patch. I used normal
> VACUUM with background load using a slowed-down pgbench in this
> instance. I believe the patch is useful in normal cases, not only for
> VACUUM FREEZE.
>
>    N | time   | WAL flush(*)
> -----+--------+-------------
>    0 | 112.8s | 44.3%
>    1 | 148.9s | 52.1%
>    8 | 105.1s | 17.6%
>   16 |  96.9s |  8.7%
>   32 | 103.9s |  6.3%
>   64 |  89.4s |  6.6%
>  128 |  80.0s |  3.8%

Looks good. Not sure what value of N to pick for normal use.

The objectives are:
i) don't let VACUUMs spoil the cache
ii) speed up standalone VACUUMs
iii) don't let VACUUM cause others to repeatedly flush WAL

I'm thinking N=16 meets all 3 objectives. We could make VACUUM go faster still, but only by knocking more blocks out of cache that someone doing real work might need. That will slow them down almost as much as forcing them to flush WAL, so I'd want to be conservative with VACUUM.

Does anybody think we need a new parameter for this, or are we happy at 16 buffers in the recycle loop for VACUUM?

At this point I should note something I haven't mentioned before. VACUUMs force other backends to flush out WAL only when we have an I/O bound workload. If the database already fits in memory then BufferAlloc never needs to run and therefore we don't need to flush WAL. So I can understand that the effect of WAL flushing may not have been noticed by many testers.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon,

On 3/13/07 2:37 AM, Simon Riggs [EMAIL PROTECTED] wrote:

> > We're planning a modification that I think you should consider: when
> > there is a sequential scan of a table larger than the size of
> > shared_buffers, we are allowing the scan to write through the
> > shared_buffers cache.
>
> Write? For which operations?

I'm actually just referring to the sequential scan writing into the shared buffers cache, sorry for the "write through" :-)

> I was thinking to do this for bulk writes also, but it would require
> changes to bgwriter's cleaning sequence. Are you saying to write say
> ~32 buffers then fsync them, rather than letting bgwriter do that?
> Then allow those buffers to be reused?

Off topic, but we think we just found the reason(s) for the abysmal heap insert performance of pgsql and are working on a fix to that as well. It involves two main things: the ping-ponging seeks used to extend a relfile and the bgwriter not flushing aggressively enough. We're hoping to move the net heap insert rate from 12 MB/s for a single stream to something more like 100 MB/s per stream, but it may take a week to get some early results and find out if we're on the right track. We've been wrong on this before ;-)

- Luke
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon Riggs [EMAIL PROTECTED] wrote:

> I've implemented buffer recycling, as previously described, patch
> being posted now to -patches as scan_recycle_buffers.
>
> - for VACUUMs of any size, with the objective of reducing WAL
>   thrashing whilst keeping VACUUM's behaviour of not spoiling the
>   buffer cache (as originally suggested by Itagaki-san, just with a
>   different implementation).

I tested your patch with VACUUM FREEZE. The performance was improved when I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to increase WAL traffic, but this patch should be useful for normal VACUUMs with background jobs!

   N | time  | WAL flush(*)
-----+-------+-------------
   0 | 58.7s |  0.01%
   1 | 80.3s | 81.76%
   8 | 73.4s | 16.73%
  16 | 64.2s |  9.24%
  32 | 59.0s |  4.88%
  64 | 56.7s |  2.63%
 128 | 55.1s |  1.41%

(*) WAL flush is the ratio of the need of fsync to buffer recycle.

# SET scan_recycle_buffers = 0;
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# CHECKPOINT;
# SET scan_recycle_buffers = N;
# VACUUM FREEZE accounts;

BTW, does the patch change the default usage of buffers in vacuum? From what I've seen, scan_recycle_buffers = 1 is the same as before. With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of the buffers in the pool, just like existing sequential scans. Is this intended?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:

> Simon Riggs [EMAIL PROTECTED] wrote:
> > I've implemented buffer recycling, as previously described, patch
> > being posted now to -patches as scan_recycle_buffers.
> >
> > - for VACUUMs of any size, with the objective of reducing WAL
> >   thrashing whilst keeping VACUUM's behaviour of not spoiling the
> >   buffer cache (as originally suggested by Itagaki-san, just with a
> >   different implementation).
>
> I tested your patch with VACUUM FREEZE. The performance was improved
> when I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to
> increase WAL traffic, but this patch should be useful for normal
> VACUUMs with background jobs!

Thanks.

>    N | time  | WAL flush(*)
> -----+-------+-------------
>    0 | 58.7s |  0.01%
>    1 | 80.3s | 81.76%
>    8 | 73.4s | 16.73%
>   16 | 64.2s |  9.24%
>   32 | 59.0s |  4.88%
>   64 | 56.7s |  2.63%
>  128 | 55.1s |  1.41%
>
> (*) WAL flush is the ratio of the need of fsync to buffer recycle.

Do you have the same measurement without the patch applied? I'd be interested in the current state also (the N=0 path is modified as well for VACUUM, in this patch).

> # SET scan_recycle_buffers = 0;
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # CHECKPOINT;
> # SET scan_recycle_buffers = N;
> # VACUUM FREEZE accounts;

Very good results, thanks. I'd be interested in the same thing for just VACUUM and for varying ratios of dirty/clean blocks during vacuum.

> BTW, does the patch change the default usage of buffers in vacuum?
> From what I've seen, scan_recycle_buffers = 1 is the same as before.

That was the intention.

> With the default value of scan_recycle_buffers(=0), VACUUM seems to
> use all of the buffers in the pool, just like existing sequential
> scans. Is this intended?

Yes, but it's not very useful for testing to have done that. I'll do another version within the hour that sets N=0 (only) back to current behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping up other backends. I'm confident that it will improve that situation for OLTP workloads running regular concurrent VACUUMs, but we will need to wait a couple of days to get those results in also.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:

> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> > With the default value of scan_recycle_buffers(=0), VACUUM seems to
> > use all of the buffers in the pool, just like existing sequential
> > scans. Is this intended?
>
> Yes, but it's not very useful for testing to have done that. I'll do
> another version within the hour that sets N=0 (only) back to current
> behaviour for VACUUM.

New test version enclosed, where scan_recycle_buffers = 0 doesn't change existing VACUUM behaviour.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]
Re: [HACKERS] Bug: Buffer cache is not scan resistant
ITAGAKI Takahiro [EMAIL PROTECTED] writes:

> I tested your patch with VACUUM FREEZE. The performance was improved
> when I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to
> increase WAL traffic, but this patch should be useful for normal
> VACUUMs with background jobs!

Proving that you can see a difference in a worst-case scenario is not the same as proving that the patch is useful in normal cases.

			regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:

> ITAGAKI Takahiro [EMAIL PROTECTED] writes:
> > I tested your patch with VACUUM FREEZE. The performance was improved
> > when I set scan_recycle_buffers >= 32. I used VACUUM FREEZE to
> > increase WAL traffic, but this patch should be useful for normal
> > VACUUMs with background jobs!
>
> Proving that you can see a difference in a worst-case scenario is not
> the same as proving that the patch is useful in normal cases.

I agree, but I think that this VACUUM FREEZE test does actually represent the normal case; here's why:

The poor buffer manager behaviour occurs if the block is dirty as a result of WAL-logged changes. It only takes the removal of a single row for us to have WAL-logged this and dirtied the block. If we invoke VACUUM from autovacuum, we do this by default when 20% of the rows have been updated, which means with many distributions of UPDATEs we'll have touched a very large proportion of blocks before we VACUUM. That isn't true for *all* distributions of UPDATEs, but it will soon be: Dead Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large VACUUM on a real production system with default autovacuum enabled, as will be the case for 8.3. It is possible that we run VACUUM when fewer dirty blocks are generated, but this won't be the common situation and not something we should optimise for.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon Riggs [EMAIL PROTECTED] wrote:

> > With the default value of scan_recycle_buffers(=0), VACUUM seems to
> > use all of the buffers in the pool, just like existing sequential
> > scans. Is this intended?
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't
> change existing VACUUM behaviour.

This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM with background load using a slowed-down pgbench in this instance. I believe the patch is useful in normal cases, not only for VACUUM FREEZE.

   N | time   | WAL flush(*)
-----+--------+-------------
   0 | 112.8s | 44.3%
   1 | 148.9s | 52.1%
   8 | 105.1s | 17.6%
  16 |  96.9s |  8.7%
  32 | 103.9s |  6.3%
  64 |  89.4s |  6.6%
 128 |  80.0s |  3.8%

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon,

You may know we've built something similar and have seen similar gains. We're planning a modification that I think you should consider: when there is a sequential scan of a table larger than the size of shared_buffers, we are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the size of shared_buffers, it is cacheable and should use the standard LRU approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, Simon Riggs [EMAIL PROTECTED] wrote:

> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> > > With the default value of scan_recycle_buffers(=0), VACUUM seems
> > > to use all of the buffers in the pool, just like existing
> > > sequential scans. Is this intended?
> >
> > Yes, but it's not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't
> change existing VACUUM behaviour.
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Cool!

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Simon Riggs [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 09, 2007 02:32 PM Eastern Standard Time
To: Luke Lonergan; ITAGAKI Takahiro
Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:

> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
>
> My conclusion: Fixing the OS is not sufficient to alleviate the
> issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher
> data rates due to this effect.

I've implemented buffer recycling, as previously described, patch being posted now to -patches as scan_recycle_buffers. This version includes buffer recycling:

- for SeqScans larger than shared buffers, with the objective of improving L2 cache efficiency *and* reducing the effects of shared buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as originally suggested by Itagaki-san, just with a different implementation).

Behaviour is not activated by default in this patch. To request buffer recycling, set the USERSET GUC:

    SET scan_recycle_buffers = N

tested with 1, 4, 8, 16, but only 8 seems sensible, IMHO.

The patch affects StrategyGetBuffer, so it only affects the disk-cache path. The idea is that if it's already in shared buffer cache then we get substantial benefit already and nothing else is needed. So for the general case, the patch adds a single "if" test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM (currently). Any change mid-scan will be ignored.

IMHO it's possible to do this and to allow Synch Scans at the same time, with some thought. There is no need for us to rely on cache spoiling behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this objectively.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
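To make the recycling idea concrete: a scan known to be cache-unfriendly cycles through a small, fixed ring of buffers instead of consuming the whole pool. The sketch below illustrates only that idea, not the scan_recycle_buffers patch itself; the ring size, block count, and the stand-in "buffer pool" are invented for the example.

    /* Minimal sketch of a recycle ring: the scan reuses N buffers in
     * rotation, so its working set stays small (L2-friendly) and the
     * rest of the shared buffer cache is left unspoiled. */
    #include <stdio.h>

    #define N       8    /* ring size: buffers this scan may reuse */
    #define BLOCKS  24   /* blocks in the (pretend) relation scanned */

    int main(void)
    {
        int ring[N];     /* buffer ids owned by this scan */
        int next = 0;    /* rotation position within the ring */

        /* Claim N buffers once, up front; in a real system these would
         * come from the buffer pool's free list or clock sweep. */
        for (int i = 0; i < N; i++)
            ring[i] = i;

        for (int blk = 0; blk < BLOCKS; blk++)
        {
            /* Reuse the ring slot: the block previously held here is
             * evicted (written back first if dirty), so the scan never
             * touches more than N buffers. */
            printf("block %2d -> buffer %d\n", blk, ring[next]);
            next = (next + 1) % N;
        }
        return 0;
    }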
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Simon,

> and what you haven't said - all of this is orthogonal to the issue of
> buffer cache spoiling in PostgreSQL itself. That issue does still
> exist as a non-OS issue, but we've been discussing in detail the
> specific case of L2 cache effects with specific kernel calls. All of
> the test results have been stand-alone, so we've not done any
> measurements in that area. I say this because you make the point that
> reducing the working set size of write workloads has no effect on the
> L2 cache issue, but ISTM it's still potentially a cache spoiling
> issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On 3/7/07, Hannu Krosing [EMAIL PROTECTED] wrote:

> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems?

Supposedly you could mmap() a file and then do mincore() on the area to see which pages are cached. But you were talking about the postgres cache before; there it should be easily implementable.

--
marko
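For reference, the mmap()/mincore() combination marko describes can be sketched as follows. This is a minimal illustration, not tested PostgreSQL code: the file path is an assumption, the residency flag is the low bit of each vector byte on Linux (the vector's element type varies slightly across *nix systems), and error handling is minimal.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *path = "/tmp/testfile";   /* hypothetical file to probe */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the file without reading it; mincore() then reports which
         * pages of the mapping are resident in the OS page cache. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long pagesize = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + pagesize - 1) / pagesize;
        unsigned char *vec = malloc(npages);

        if (mincore(map, st.st_size, vec) == 0)
        {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                if (vec[i] & 1)     /* low bit set => page is in cache */
                    resident++;
            printf("%zu of %zu pages resident\n", resident, npages);
        }
        else
            perror("mincore");

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }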
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:

> Simon Riggs wrote:
> > But it would break the idea of letting a second seqscan follow in
> > the first's hot cache trail, no?
>
> No, but it would make it somewhat harder to achieve without direct
> synchronization between scans.

It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the buffer strategy though. We really want both, not just one or the other.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom,

Sorry about the delay. I have been away from computers all day.

In the current Solaris release in development (code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA), which bypasses L2 for most writes, and for reads larger than copyout_max_cached (patchable, defaults to 128K). The block size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB to trigger NTA for reads, the access times with a 16KB buffer or a 128MB buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached). tread is a 4-socket dual-core Opteron box.

81 tread ./readtest -h
Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
        -v:             Verbose mode
        -N:             Normalize results by number of reads
        -s size:        Working set size (may specify K,M,G suffix)
        -n iter:        Number of test iterations
        -f filename:    Name of the file to read from
        -d [+|-]delta:  Distance between subsequent reads
        -c count:       Number of reads
        -h:             Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

82 tread ./readtest -s 16k -f boot_archive
46445262
83 tread ./readtest -s 128M -f boot_archive
118294230
84 tread ./readtest -s 16k -f boot_archive -n 100
4230210856
85 tread ./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

89 tread ./readtest -s 16k -f boot_archive
43606882
90 tread ./readtest -s 128M -f boot_archive
100547909
91 tread ./readtest -s 16k -f boot_archive -n 100
4251823995
92 tread ./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference between using a 16K buffer and a 128M buffer is much bigger for both copyout_max_cached sizes, mostly due to the cost of TLB misses. When the iteration count is bigger, most of the page tables would be in the Page Descriptor Cache for the later page accesses, so the overhead of TLB misses becomes smaller. As you can see, when we do bypass L2, the performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads. Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the target buffer of read operations is typically accessed again shortly after the read, while write targets are usually not. Therefore, the default operation mode is to bypass L2 for writes, but not for reads.

- The Opteron's L1 cache size is 64K. If a read is larger than 128KB, it would have displacement-flushed itself anyway, so for large reads I also bypass L2. I am working on dynamically setting copyout_max_cached based on the L1 D-cache size on the system.

The above heuristic should have worked well in Luke's test case. However, due to the fact that the read was done as 16,000 8K reads rather than one 128MB read, the NTA code was not triggered. Since the OS code has to be general enough to handle most workloads, we have to pick some defaults that might not work best for some specific operations. It is a calculated balance.

Thanks,
Sherry

On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> Luke Lonergan [EMAIL PROTECTED] writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool. Maybe Sherry can comment on the question whether it's possible
> for a large-scale-memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
> 			regards, tom lane

--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

[ Attachment: readtest.c, skipping (truncated in the archive)... ]
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:

> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > Another approach I proposed back in December is to not have a
> > variable like that at all, but scan the buffer cache for pages
> > belonging to the table you're scanning to initialize the scan.
> > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > operation, but it might be ok given that we're talking about large
> > I/O bound sequential scans. It would require no DBA tuning and
> > would work more robustly in varying conditions. I'm not sure where
> > you would continue after scanning the in-cache pages. At the
> > highest in-cache block number, perhaps.
>
> If there was some way to do that, it'd be what I'd vote for.

I still don't know how to make this take advantage of the OS buffer cache. However, "no DBA tuning" is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages already read is the way to go. Can you guys give me some pointers about how to walk through the shared buffers, reading the pages that I need, while being sure not to read a page that's been evicted, and also not potentially causing a performance regression somewhere else?

> Given the partitioning of the buffer lock that Tom did it might not
> be that horrible for many cases, either, since you'd only need to
> scan through one partition.
>
> We also don't need an exact count, either. Perhaps there's some way
> we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
	Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis [EMAIL PROTECTED] writes:

> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.)

			regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:

> Jeff Davis [EMAIL PROTECTED] writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

I agree that it's a good idea to stick with the current implementation, which is, as far as I can see, meeting all of my performance goals.

Regards,
	Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis wrote:

> On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > > Another approach I proposed back in December is to not have a
> > > variable like that at all, but scan the buffer cache for pages
> > > belonging to the table you're scanning to initialize the scan.
> > > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > > operation, but it might be ok given that we're talking about
> > > large I/O bound sequential scans. It would require no DBA tuning
> > > and would work more robustly in varying conditions. I'm not sure
> > > where you would continue after scanning the in-cache pages. At
> > > the highest in-cache block number, perhaps.
> >
> > If there was some way to do that, it'd be what I'd vote for.
>
> I still don't know how to make this take advantage of the OS buffer
> cache.

Yep, I don't see any way to do that. I think we could live with that, though. If we went with the sync_scan_offset approach, you'd have to leave a lot of safety margin in that as well.

> However, "no DBA tuning" is a huge advantage, I agree with that.
>
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go. Can you guys give me some pointers
> about how to walk through the shared buffers, reading the pages that
> I need, while being sure not to read a page that's been evicted, and
> also not potentially causing a performance regression somewhere else?

You could take a look at BufferSync, for example. It walks through the buffer cache, syncing all dirty buffers.

FWIW, I've attached a function I wrote some time ago when I was playing with the same idea for vacuums. A call to the new function loops through the buffer cache and returns the next buffer that belongs to a certain relation. I'm not sure that it's correct and safe, and there aren't many comments, but it should work if you want to play with it...

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	5 Jan 2007 22:19:37 -0000	1.214
--- src/backend/storage/buffer/bufmgr.c	22 Jan 2007 16:38:37 -0000
***************
*** 97,102 ****
--- 97,134 ----
  static void AtProcExit_Buffers(int code, Datum arg);
  
+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+ 	static int	last_buf_id = 0;
+ 	int			new_buf_id;
+ 	volatile BufferDesc *bufHdr;
+ 
+ 	/* Make sure we will have room to remember the buffer pin */
+ 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+ 
+ 	new_buf_id = last_buf_id;
+ 	do
+ 	{
+ 		if (++new_buf_id >= NBuffers)
+ 			new_buf_id = 0;
+ 
+ 		bufHdr = &BufferDescriptors[new_buf_id];
+ 		LockBufHdr(bufHdr);
+ 
+ 		if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+ 		{
+ 			PinBuffer_Locked(bufHdr);
+ 			last_buf_id = new_buf_id;
+ 			return BufferDescriptorGetBuffer(bufHdr);
+ 		}
+ 		UnlockBufHdr(bufHdr);
+ 	} while (new_buf_id != last_buf_id);
+ 	last_buf_id = new_buf_id;
+ 	return InvalidBuffer;
+ }
+ 
+ 
  /*
   * ReadBuffer -- returns a buffer containing the requested
   * block of the requested relation. If the blknum
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote:

> Jeff Davis [EMAIL PROTECTED] writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

The scenario that I'm worried about is that you have a table that's slightly larger than RAM. If you issue many seqscans on that table, one at a time, every seqscan will have to read the whole table from disk, even though say 90% of it is in cache when the scan starts.

This can be alleviated by using a large enough sync_scan_offset, but a single setting like that is tricky to tune, especially if your workload is not completely constant. Tune it too low, and you don't get much benefit; tune it too high and your scans diverge and you lose all benefit.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:

> - Based on a lot of the benchmarks and workloads I traced, the target
>   buffer of read operations is typically accessed again shortly after
>   the read, while write targets are usually not. Therefore, the
>   default operation mode is to bypass L2 for writes, but not for
>   reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in the OS kernel is correct and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't currently do. If we were to request 16 blocks at a time, we would get this benefit on Solaris, at least. The copyout_max_cached parameter can be patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very soon afterwards when reading sequentially (not writes). Reducing the working set size is an effective technique in improving performance if we don't have a kernel that does NTA or we don't read in big enough chunks (we need both to get NTA to kick in).

and what you haven't said:

- all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test results have been stand-alone, so we've not done any measurements in that area. I say this because you make the point that reducing the working set size of write workloads has no effect on the L2 cache issue, but ISTM it's still potentially a cache spoiling issue.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:

> Tom Lane wrote:
> > Jeff Davis [EMAIL PROTECTED] writes:
> > > If I were to implement this idea, I think Heikki's bitmap of
> > > pages already read is the way to go.
> >
> > I think that's a good way to guarantee that you'll not finish in
> > time for 8.3. Heikki's idea is just at the handwaving stage at this
> > point, and I'm not even convinced that it will offer any win.
> > (Pages in cache will be picked up by a seqscan already.)
>
> The scenario that I'm worried about is that you have a table that's
> slightly larger than RAM. If you issue many seqscans on that table,
> one at a time, every seqscan will have to read the whole table from
> disk, even though say 90% of it is in cache when the scan starts.

If you're issuing sequential scans one at a time, that 90% of the table that was cached is probably not cached any more, unless the scans are close together in time without overlapping (serial sequential scans). And the problem you describe is no worse than current behavior, where you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but
> a single setting like that is tricky to tune, especially if your
> workload is not completely constant. Tune it too low, and you don't
> get much benefit; tune it too high and your scans diverge and you
> lose all benefit.

I see why you don't want to manually tune this setting, however it's really not that tricky. You can be quite conservative and still use a good fraction of your physical memory. I will come up with some numbers and see how much we have to gain.

Regards,
	Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:

> Jim Nasby [EMAIL PROTECTED] writes:
> > An idea I've been thinking about would be to have the bgwriter or
> > some other background process actually try and keep the free list
> > populated,
>
> The bgwriter already tries to keep pages just in front of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the clock sweep. AFAICT that's something that backends will serialize on (due to BufFreelistLock), so it would be best to make StrategyGetBuffer as fast as possible. It certainly seems like grabbing a buffer off the free list is going to be a lot faster than running the clock sweep. That's why I think it'd be better to have the bgwriter run the clock sweep and put enough buffers on the free list to try and keep up with demand.

--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
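For readers unfamiliar with the clock sweep under discussion: each allocation advances a shared "hand" over the buffer array, decaying usage counts until it finds an unpinned, zero-count victim. The sketch below is a simplified single-threaded illustration (usage_count, pinned, and NBUFFERS are stand-ins for PostgreSQL's real structures); because every backend must advance the same hand under one lock (BufFreelistLock in the real code), allocations serialize there, which is Jim's point.

    #include <stdio.h>

    #define NBUFFERS 8

    /* Simplified buffer descriptor: a usage count and a pin flag. */
    typedef struct
    {
        int usage_count;   /* bumped on access, decayed by the sweep */
        int pinned;        /* in use by some backend */
    } BufDesc;

    static BufDesc buffers[NBUFFERS];
    static int     hand = 0;   /* the shared clock hand */

    /* Sweep until we find a victim: an unpinned buffer whose usage
     * count has decayed to zero. Every caller advances the same hand,
     * which is why concurrent backends serialize here. */
    static int
    clock_sweep(void)
    {
        for (;;)
        {
            BufDesc *buf = &buffers[hand];
            hand = (hand + 1) % NBUFFERS;

            if (buf->pinned)
                continue;           /* can't evict a pinned buffer */
            if (buf->usage_count > 0)
                buf->usage_count--; /* give it another trip around */
            else
                return (int) (buf - buffers);   /* victim found */
        }
    }

    int main(void)
    {
        buffers[0].usage_count = 2;   /* a popular buffer */
        buffers[1].pinned = 1;        /* a buffer in use */

        /* Each allocation may pass over popular buffers several times. */
        for (int i = 0; i < 4; i++)
            printf("victim: buffer %d\n", clock_sweep());
        return 0;
    }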
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:

> > We also don't need an exact count, either. Perhaps there's some way
> > we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being if you see there's 10k pages in cache, you can likely start 9k pages behind the current scan point and still pick everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only look at it if it's impossible to make the bitmaps work. And like others have said, that should wait until there's at least a first-generation patch that's going to make it into 8.3.

--
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB   http://enterprisedb.com   512.569.9461 (cell)
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:

> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> > > We also don't need an exact count, either. Perhaps there's some
> > > way we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
>
> Yes. The idea being if you see there's 10k pages in cache, you can
> likely start 9k pages behind the current scan point and still pick
> everything up.
>
> But this is nowhere near as useful as the bitmap idea, so I'd only
> look at it if it's impossible to make the bitmaps work. And like
> others have said, that should wait until there's at least a
> first-generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache with the bitmap idea. What makes you think that my current implementation is nowhere near as useful as the bitmap idea? My current implementation is making use of OS buffers + shared memory; the bitmap idea can only make use of shared memory, and is likely throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that there's a contiguous cache trail in the shared buffers alone. Any deviation from that -- which could be caused by PG's page replacement algorithm, especially in combination with a varied load pattern -- would negate any benefit from the bitmap idea.

I feel much more confident that there will exist a trail of pages that are cached in *either* the PG shared buffers *or* the OS buffer cache. There may be holes/gaps in either one, but it's much more likely that they combine into a contiguous series of cached pages. Do you have an idea how I might test this claim?

Regards,
	Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:

> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > > On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > > > Another approach I proposed back in December is to not have a
> > > > variable like that at all, but scan the buffer cache for pages
> > > > belonging to the table you're scanning to initialize the scan.
> > > > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > > > operation, but it might be ok given that we're talking about
> > > > large I/O bound sequential scans. It would require no DBA
> > > > tuning and would work more robustly in varying conditions. I'm
> > > > not sure where you would continue after scanning the in-cache
> > > > pages. At the highest in-cache block number, perhaps.
> > >
> > > If there was some way to do that, it'd be what I'd vote for.
> >
> > I still don't know how to make this take advantage of the OS buffer
> > cache.
>
> Yep, I don't see any way to do that. I think we could live with that,
> though. If we went with the sync_scan_offset approach, you'd have to
> leave a lot of safety margin in that as well.

Right, there would certainly have to be a safety margin with sync_scan_offset. However, your plan only works when the shared buffers are dominated by this sequential scan. Let's say you have 40% of physical memory for shared buffers, and say that 50% are being used for hot pages in other parts of the database. That means you have access to only 20% of physical memory to optimize for this sequential scan, and 20% of the physical memory is basically unavailable (being used for other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0 (meaning 1.0 x shared_buffers), giving you 40% of physical memory that would be used for starting this sequential scan. In this case, that should be a good margin of error, considering that as much as 80% of the physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just extrapolating some numbers that look vaguely reasonable to me.

Regards,
	Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain as with reducing the working set size.

My conclusion: fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Sherry Moore [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said - all of this is orthogonal to the issue of
> buffer cache spoiling in PostgreSQL itself. That issue does still
> exist as a non-OS issue, but we've been discussing in detail the
> specific case of L2 cache effects with specific kernel calls. All of
> the test results have been stand-alone, so we've not done any
> measurements in that area. I say this because you make the point that
> reducing the working set size of write workloads has no effect on the
> L2 cache issue, but ISTM it's still potentially a cache spoiling
> issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:

> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > > > On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > > > > Another approach I proposed back in December is to not have a
> > > > > variable like that at all, but scan the buffer cache for
> > > > > pages belonging to the table you're scanning to initialize
> > > > > the scan. Scanning all the BufferDescs is a fairly CPU and
> > > > > lock heavy operation, but it might be ok given that we're
> > > > > talking about large I/O bound sequential scans. It would
> > > > > require no DBA tuning and would work more robustly in varying
> > > > > conditions. I'm not sure where you would continue after
> > > > > scanning the in-cache pages. At the highest in-cache block
> > > > > number, perhaps.
> > > >
> > > > If there was some way to do that, it'd be what I'd vote for.
> > >
> > > I still don't know how to make this take advantage of the OS
> > > buffer cache.

Maybe it should not? Mostly there can be use of the OS cache only if it is much bigger than the shared buffer cache. It may make sense to forget about the OS cache and just tell those who can make use of sync scans to set most of memory aside for shared buffers. Then we can do better predictions/lookups of how much of a table is actually in memory. Dual caching is usually not very beneficial anyway, not to mention the difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with
> > that, though. If we went with the sync_scan_offset approach, you'd
> > have to leave a lot of safety margin in that as well.
>
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared
> buffers are dominated by this sequential scan. Let's say you have 40%
> of physical memory for shared buffers, and say that 50% are being
> used for hot pages in other parts of the database. That means you
> have access to only 20% of physical memory to optimize for this
> sequential scan, and 20% of the physical memory is basically
> unavailable (being used for other parts of the database).

The simplest thing, in case the table is much bigger than the buffer cache usable for it, is to start the second scan at the point the first scan is traversing *now*, and hope that the scans will stay together. Or start at some fixed lag, which makes the first scan always the one issuing reads while the second just freerides on buffers already in cache. It may even be a good idea to throttle the second scan to stay N pages behind if the OS readahead gets confused when the same file is read from multiple processes.

If the table is smaller than the cache, then just scan it without syncing.

Trying to read buffers in the same order, starting from near the point where pages are still in the shared buffer cache, seems good mostly for the case where the table is as big as or just a little larger than the cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory
> that would be used for starting this sequential scan. In this case,
> that should be a good margin of error, considering that as much as
> 80% of the physical memory might actually be in cache (OS or PG
> cache).
>
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.
If there is an easy way to tell PG "give me this page only if it is in shared cache already", then a good approach might be to start the 2nd scan at the point where the 1st is now, and move in both directions simultaneously, like this:

First scan is at page N. Second scan:

    M = N - 1
    WHILE NOT ALL PAGES ARE READ:
        IF PAGE N IS IN CACHE:                  -- FOLLOW FIRST READER
            READ PAGE N
            N++
        ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- READ OLDER CACHED PAGES
            READ PAGE M
            M--
        ELSE IF FIRST READER STILL GOING:       -- NO OLDER PAGES, WAIT FOR 1st
            WAIT FOR PAGE N TO BECOME AVAILABLE
            READ PAGE N
            N++
        ELSE:                                   -- BECOME 1st READER
            READ PAGE N
            N++
        PROCESS PAGE
        IF N > PAGES_IN_TABLE: N = 0
        IF M < 0: M = PAGES_IN_TABLE

This should work reasonably well for LRU caches, and it may be made to work with a clock sweep scheme if the sweep arranges pages to purge in file order.

If we could make the "IF PAGE x IS IN CACHE" part also know about the OS cache, this could also make use of the OS cache.

Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix systems?

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Grzegorz Jaskiewicz [EMAIL PROTECTED] writes:

> On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
> > I'm also less than convinced that it'd be helpful for a big
> > seqscan: won't reading a new disk page into memory via DMA cause
> > that memory to get flushed from the processor cache anyway?
>
> Nope. DMA is writing directly into main memory. If the area was in
> the L2/L1 cache, it will get invalidated. But if it isn't there, it
> is okay.

So either way, it isn't in processor cache after the read. So how can there be any performance benefit?

			regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
> So either way, it isn't in processor cache after the read. So how can
> there be any performance benefit?

It's the copy from kernel IO cache to the buffer cache that is L2 sensitive. When the shared buffer cache is polluted, it thrashes the L2 cache. When the number of pages being written to in the kernel-to-user-space writes fits in L2, then the L2 lines are written through (see the link below, page 264, on the write-combining features of the Opteron, for example) and the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

- Luke
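Luke's claim can be probed with a small experiment: copy the same total volume of data through either a small (L2-resident) working set or a large one, and compare timings. The sketch below is an illustration along the lines of Sherry's readtest, not her program; the buffer sizes, total volume, and timing method are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Copy 'total' bytes through a working set of 'bufsize' bytes and
     * report the elapsed time. A small bufsize stays resident in L2,
     * so destination cache lines are reused; a large bufsize forces
     * the copies to keep touching cold lines. */
    static double
    copy_through(size_t bufsize, size_t total)
    {
        char   *src = malloc(bufsize);
        char   *dst = malloc(bufsize);
        struct timespec t0, t1;

        memset(src, 'x', bufsize);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t done = 0; done < total; done += bufsize)
            memcpy(dst, src, bufsize);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        free(src);
        free(dst);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t total = (size_t) 1 << 30;    /* copy 1GB in total */

        /* 128KB: fits comfortably in a typical L2 cache */
        printf("128KB working set: %.3fs\n", copy_through(128 * 1024, total));

        /* 128MB: far larger than L2; every line is a miss */
        printf("128MB working set: %.3fs\n", copy_through(128UL << 20, total));
        return 0;
    }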
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom,

> Now this may only prove that the disk subsystem on this machine is
> too cheap to let the system show any CPU-related issues.

Try it with a warm IO cache. As I posted before, we see double the performance of a VACUUM of a table in IO cache when the shared buffer cache isn't being polluted. The speed with a large buffer cache should be about 450 MB/s and the speed with a buffer cache smaller than L2 should be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when trying to reach very high IO speeds. The issue is that we're seeing the buffer cache pollution in the first place. When we instrument the blocks selected by the buffer page selection algorithm, we see that they iterate sequentially, filling the shared buffer cache. That's the source of the problem here.

Do we have a regression test somewhere for this?

- Luke
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:

> ... into account. I'm also less than convinced that it'd be helpful
> for a big seqscan: won't reading a new disk page into memory via DMA
> cause that memory to get flushed from the processor cache anyway?

Nope. DMA is writing directly into main memory. If the area was in the L2/L1 cache, it will get invalidated. But if it isn't there, it is okay.

--
Grzegorz Jaskiewicz
[EMAIL PROTECTED]
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes:

> > So either way, it isn't in processor cache after the read. So how
> > can there be any performance benefit?
>
> It's the copy from kernel IO cache to the buffer cache that is L2
> sensitive. When the shared buffer cache is polluted, it thrashes the
> L2 cache. When the number of pages being written to in the
> kernel-to-user-space writes fits in L2, then the L2 lines are written
> through (see the link below, page 264, on the write-combining
> features of the Opteron, for example) and the writes to main memory
> are deferred.

That makes absolutely zero sense. The data coming from the disk was certainly not in processor cache to start with, and I hope you're not suggesting that it matters whether the *target* page of a memcpy was already in processor cache. If the latter, it is not our bug to fix.

> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture.

			regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom,

Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan, the blocks are filling the entire shared buffer cache. This should be fixed.

My proposal for a fix: ensure that when relations larger (much larger?) than the buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

- Luke
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan wrote:

The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

It's not that simple. Using the whole buffer cache for a single seqscan is OK if there's currently no better use for the buffer cache. Running a single select will indeed use the whole cache, but if you run any other smaller queries, the pages they need should stay in cache and the seqscan will loop through the other buffers.

In fact, the pages that are left in the cache after the seqscan finishes would be useful for the next seqscan of the same table, if we were smart enough to read those pages first. That'd make a big difference for seqscanning a table that's, say, 1.5x your RAM size. Hmm, I wonder if Jeff's sync seqscan patch addresses that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 03:51, Luke Lonergan wrote:

Hi Tom,

Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

How will this approach play together with the synchronized scan patches? Or should synchronized scans rely on the system's cache only?

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes:

I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed.

No, this is not a bug; it is operating as designed. The point of the current bufmgr algorithm is to replace the page least recently used, and that's what it's doing. If you want to lobby for changing the algorithm, then you need to explain why one test case on one platform justifies de-optimizing for a lot of other cases. In almost any concurrent-access situation I think that what you are suggesting would be a dead loss --- for instance we might as well forget about Jeff Davis' synchronized-scan work.

In any case, I'm still not convinced that you've identified the problem correctly, because your explanation makes no sense to me. How can the processor's L2 cache improve access to data that it hasn't got yet?

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
* Tom Lane:

That makes absolutely zero sense. The data coming from the disk was certainly not in processor cache to start with, and I hope you're not suggesting that it matters whether the *target* page of a memcpy was already in processor cache. If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache lines to make room for the data, so I'd expect an indirect performance hit. I could be mistaken, though.

--
Florian Weimer [EMAIL PROTECTED]
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 04:15, Tom Lane wrote:

Luke Lonergan [EMAIL PROTECTED] writes: I think you're missing my/our point: The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed.

No, this is not a bug; it is operating as designed.

Maybe he means that there is an oversight (aka bug) in the design ;)

The point of the current bufmgr algorithm is to replace the page least recently used, and that's what it's doing. If you want to lobby for changing the algorithm, then you need to explain why one test case on one platform justifies de-optimizing for a lot of other cases.

If you know beforehand that you will definitely overflow the cache and not reuse it anytime soon, then it seems quite reasonable to not even start polluting the cache. Especially if you get a noticeable boost in performance while doing so.

In almost any concurrent-access situation I think that what you are suggesting would be a dead loss

Only if the concurrent access pattern is over data mostly fitting in the buffer cache. If we can avoid polluting the buffer cache with data we know we will use only once, more useful data will be available.

--- for instance we might as well forget about Jeff Davis' synchronized-scan work.

Depends on the ratio of system cache to shared buffer cache. I don't think Jeff's patch is anywhere near the point where it needs to start worrying about data swapping between the system cache and shared buffers, or about L2 cache usage.

In any case, I'm still not convinced that you've identified the problem correctly, because your explanation makes no sense to me. How can the processor's L2 cache improve access to data that it hasn't got yet?

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed.

No, this is not a bug; it is operating as designed. The point of the current bufmgr algorithm is to replace the page least recently used, and that's what it's doing.

At least we've established that for certain.

If you want to lobby for changing the algorithm, then you need to explain why one test case on one platform justifies de-optimizing for a lot of other cases. In almost any concurrent-access situation I think that what you are suggesting would be a dead loss --- for instance we might as well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.

In any case, I'm still not convinced that you've identified the problem correctly, because your explanation makes no sense to me. How can the processor's L2 cache improve access to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an L2-related effect. The actual data shows a dramatic reduction in main memory writing when the destination of the written data fits in the L2 cache. I'll try to fit a hypothesis to explain it.

Assume you've got a warm IO cache in the OS. The heapscan algorithm now works like this:

0) select a destination user buffer
1) uiomove-kcopy memory from the IO cache to the user buffer
   1A) step 1: read from kernel space
   1B) step 2: write to user space
2) the user buffer is accessed many times by the executor nodes above

Repeat.

There are two situations we are evaluating: one where the addresses of the user buffer are scattered over a space larger than the size of L2 (caseA) and one where they are confined to the size of L2 (caseB). Note that we could also consider a third situation where the addresses are scattered over a space smaller than the TLB entries mapped by the L2 cache (512 max) and larger than the size of L2, but we've tried that and it proved uninteresting.

For both cases step 1A is the same: each block (8KB) write from (1) will read from the IO cache into 128 L2 lines (64B each), evicting the previous data there.

In step 1B for caseA, the destination for the writes is mostly an address not currently mapped into the L2 cache, so 128 victim L2 lines are found (LRU), stored into, and the writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2 already. The 128 L2 lines are stored into, and the write to main memory is delayed under the assumption that these lines are hot, as they were already in L2.

I don't know enough to be sure this is the right answer, but it does fit the experimental data.

- Luke
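For anyone who wants to poke at this outside the server, here is a minimal stand-alone sketch of the experiment Luke's hypothesis suggests; the sizes and names are illustrative assumptions, not code from this thread. It times copying 8KB "pages" into a small destination ring (caseB, fits in a typical L2) versus a large one (caseA); error handling is omitted for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ   8192
#define N_COPIES  (1 << 18)           /* ~2GB copied in total */

static double run(size_t ring_bytes, const char *src, size_t src_bytes)
{
    char   *ring = malloc(ring_bytes);
    size_t  n_slots = ring_bytes / PAGE_SZ;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N_COPIES; i++)
        memcpy(ring + (i % n_slots) * PAGE_SZ,              /* destination ring */
               src + (i * PAGE_SZ) % (src_bytes - PAGE_SZ), /* wrap over source */
               PAGE_SZ);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(ring);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    size_t  src_bytes = (size_t) 512 * 1024 * 1024;
    char   *src = malloc(src_bytes);

    memset(src, 'x', src_bytes);      /* fault the source in, like a warm IO cache */

    /* 256KB ring fits in a 512KB-2MB L2 (caseB); a 256MB ring does not (caseA) */
    printf("small ring: %.2f s\n", run(256 * 1024, src, src_bytes));
    printf("large ring: %.2f s\n", run((size_t) 256 * 1024 * 1024, src, src_bytes));
    free(src);
    return 0;
}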
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Gavin Sherry wrote:

On Mon, 5 Mar 2007, Mark Kirkwood wrote: To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW: (snipped) When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having less of them to search through will help, but having less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer.

Yeah - good point:

PIII 1.26 GHz, 512KB L2 cache, 2G RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s
2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can be pretty much deduced via interpolation). Note that the system will do about 220 MB/s with the now (in)famous dd test, so we have a bit of headroom (not too bad for a PIII).

Cheers
Mark
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Mark,

lineitem has 1535724 pages (11997 MB)

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s
2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can be pretty much deduced via interpolation). Note that the system will do about 220 MB/s with the now (in)famous dd test, so we have a bit of headroom (not too bad for a PIII).

What's really interesting: try this with a table that fits into the I/O cache (say half your system memory), and run VACUUM on the table. That way the effect will stand out more dramatically.

- Luke
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes:

The evidence seems to clearly indicate reduced memory writing due to an L2 related effect.

You might try using valgrind's cachegrind tool, which I understand can actually emulate various processors' caches to show how efficiently code uses them. I haven't done much with it though, so I don't know how applicable it would be to a large-scale effect like this.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Mark Kirkwood [EMAIL PROTECTED] writes:

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s
2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

Hm, that seems to blow the it's-an-L2-cache-effect theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff; in fact the middle part of the curve is darn near a straight line on a log scale ...

So I'm back to asking what we're really measuring here. Buffer manager inefficiency of some sort, but what? Have you tried oprofile?

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote:

So I'm back to asking what we're really measuring here. Buffer manager inefficiency of some sort, but what? Have you tried oprofile?

Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as the replacement victim.

There is no real advantage to having a larger shared buffer pool in this particular test. A heap buffer is hardly accessed again once the seqscan passes over it. Can we try with a smaller value for BM_MAX_USAGE_COUNT and see if that has any positive impact for a large shared pool in this case?

Thanks,
Pavan
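For readers following along, the loop Pavan is describing looks roughly like this -- a simplified sketch from memory of src/backend/storage/buffer/freelist.c, not a verbatim copy; the freelist check and most locking details are omitted:

static volatile BufferDesc *
clock_sweep(void)
{
    volatile BufferDesc *buf;
    int         trycounter = NBuffers;

    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;                 /* victim found */

        if (buf->usage_count > 0)
        {
            buf->usage_count--;         /* passed over: survives this sweep */
            trycounter = NBuffers;
        }
        else if (--trycounter == 0)
            elog(ERROR, "no unpinned buffers available");
        UnlockBufHdr(buf);
    }
}

The point of Pavan's observation: a buffer left with usage_count above zero cannot be returned on the first visit, so the sweep may have to pass over the whole pool before any buffer falls to zero.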
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

Hm, that seems to blow the it's an L2 cache effect theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff, in fact the middle part of the curve is darn near a straight line on a log scale ... So I'm back to asking what we're really measuring here. Buffer manager inefficiency of some sort, but what? Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:

cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all four). Below are the traces of L2 fills. In the no-pollution case we fill 27 million lines; in the pollution case we fill 44 million lines.

VACUUM orders (no buffer pollution):

51.006 1 tick 2754293
51.006 2 tick 3159565
51.006 3 tick 2971929
51.007 0 tick 3577487
52.006 1 tick 4214179
52.006 3 tick 3650193
52.006 2 tick 3905828
52.007 0 tick 3465261
53.006 1 tick 1818766
53.006 3 tick 1546018
53.006 2 tick 1709385
53.007 0 tick 1483371

And here is the case with buffer pollution:

VACUUM orders (with buffer pollution):

22.006 0 tick 1576114
22.006 1 tick 1542604
22.006 2 tick 1987366
22.006 3 tick 1784567
23.006 3 tick 2706059
23.006 2 tick 2362048
23.006 0 tick 2190719
23.006 1 tick 2088827
24.006 0 tick 2247473
24.006 1 tick 2153850
24.006 2 tick 2422730
24.006 3 tick 2758795
25.006 0 tick 2419436
25.006 1 tick 2229602
25.006 2 tick 2619333
25.006 3 tick 2712332
26.006 1 tick 1827923
26.006 2 tick 1886556
26.006 3 tick 2909746
26.006 0 tick 1467164
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Pavan Deolasee [EMAIL PROTECTED] writes:

Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as replacement victim.

I read that and thought you were onto something, but it's not acting quite the way I expect. I made a quick hack in StrategyGetBuffer() to count the number of buffers it looks at before finding a victim. Running it with just 32 buffers on a large count(*), the behavior after the initial startup transient is quite odd:

got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries

(The same 32-buffer pattern then repeats verbatim for as long as the trace runs.)
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom,

Yes, autovacuum is off, and bgwriter shouldn't have anything useful to do either, so I'm a bit at a loss what's going on --- but in any case, it doesn't look like we are cycling through the entire buffer space for each fetch.

I'd be happy to DTrace it, but I'm a little lost as to where to look in the kernel. I'll see if I can find someone who knows more about memory management than me (that ought to be easy).

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom,

On 3/5/07 8:53 AM, Tom Lane [EMAIL PROTECTED] wrote:

Hm, that seems to blow the it's an L2 cache effect theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff, in fact the middle part of the curve is darn near a straight line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers   Select Count   Vacuum
     (KB)            (s)          (s)
======================================
      248           5.52         2.46
      368           4.77         2.40
      552           5.82         2.40
      824           6.20         2.43
     1232           5.60         3.59
     1848           6.02         3.14
     2768           5.53         4.56

All of these were run three times and the *lowest* time reported. Also, the change in VACUUM-after-SELECT behavior begins abruptly at 1232KB of shared_buffers. These are Opterons with 2MB of L2 cache shared between two cores.

- Luke
Re: [HACKERS] Bug: Buffer cache is not scan resistant
I wrote:

Pavan Deolasee [EMAIL PROTECTED] writes: Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as replacement victim.

I read that and thought you were onto something, but it's not acting quite the way I expect. I made a quick hack in StrategyGetBuffer() to count the number of buffers it looks at before finding a victim. ... Yes, autovacuum is off, and bgwriter shouldn't have anything useful to do either, so I'm a bit at a loss what's going on --- but in any case, it doesn't look like we are cycling through the entire buffer space for each fetch.

Nope, Pavan's nailed it: the problem is that after using a buffer, the seqscan leaves it with usage_count = 1, which means it has to be passed over once by the clock sweep before it can be re-used. I was misled in the 32-buffer case because catalog accesses during startup had left the buffer state pretty confused, so that there was no long stretch before hitting something available. With a large number of buffers, the behavior is that the seqscan fills all of shared memory with buffers having usage_count = 1. Once the clock sweep returns to the first of these buffers, it will have to pass over all of them, reducing all of their counts to 0, before it returns to the first one and finds it now usable. Subsequent tries find a buffer immediately, of course, until we have again filled shared_buffers with usage_count = 1 everywhere.

So the problem is not so much the clock sweep overhead as that it's paid in a very nonuniform fashion: with N buffers you pay O(N) once every N reads and O(1) the rest of the time. This is no doubt slowing things down enough to delay that one read, instead of leaving it nicely I/O bound all the time. Mark, can you detect hiccups in the read rate using your setup?

I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea.

Another issue nearby to this is whether to avoid selecting buffers that are dirty --- IIRC someone brought that up again recently. Maybe predecrement for clean buffers, postdecrement for dirty ones would be a cute compromise.

regards, tom lane
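To make the nonuniform cost concrete, here is a toy stand-alone simulation of the behaviour Tom describes -- my illustration, not PostgreSQL code. Every allocated buffer is immediately marked used once, so one allocation per cycle pays a full O(N) trip around the pool while the rest take one try:

#include <stdio.h>

#define NBUFFERS 32

static int usage_count[NBUFFERS];
static int hand;

static int get_victim(void)
{
    int tries = 1;

    for (;;)
    {
        int b = hand;

        hand = (hand + 1) % NBUFFERS;
        if (usage_count[b] == 0)
        {
            printf("got buffer %d after %d tries\n", b, tries);
            return b;
        }
        usage_count[b]--;       /* pass it over; it gets one more chance */
        tries++;
    }
}

int main(void)
{
    /* simulate a seqscan: each buffer is used once and never revisited */
    for (int i = 0; i < 3 * NBUFFERS; i++)
        usage_count[get_victim()] = 1;
    return 0;
}

Run it and the first NBUFFERS allocations report 1 try each; the next one reports NBUFFERS+1 tries (the full sweep that zeroes everything), then another run of 1-try allocations, and so on -- exactly the O(N)-every-N pattern.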
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Here are four more points on the curve - I'd use a Dirac delta function for your curve fit ;-)

Shared_buffers   Select Count   Vacuum
     (KB)            (s)          (s)
======================================
      248           5.52         2.46
      368           4.77         2.40
      552           5.82         2.40
      824           6.20         2.43
     1232           5.60         3.59
     1848           6.02         3.14
     2768           5.53         4.56
     5536           6.05         3.95
     8304           5.80         4.37
    12456           5.86         4.12
    18680           5.83         4.10
    28016           6.11         4.46

WRT what you found on the selection algorithm, it might also explain the L2 effects, I think. I'm also still of the opinion that polluting the shared buffer cache for a seq scan does not make sense.

- Luke

On 3/5/07 10:21 AM, Luke Lonergan [EMAIL PROTECTED] wrote:

Here's that cliff you were looking for: [...]
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom,

I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different clock-sweep numbers, including ones with higher usage_count for indexes and starting numbers of 0 for large seq scans as well as vacuums. However, we didn't have any way to prove that any of these complex algorithms would result in higher performance, so we went with the simplest formula, with the idea of tinkering with it when we had more data. So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP benchmarks, so we'd want to be careful not to gain performance in one area at the expense of another.

In TPC-E testing, we've been able to increase shared_buffers to 10GB with beneficial performance effect (numbers posted when I have them) and even found that taking over RAM with the shared_buffers (a la Oracle) gave us equivalent performance to using the FS cache. (Yes, this means that with a little I/O management engineering we could contemplate discarding use of the FS cache for a net performance gain. Maybe for 8.4.)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote:

Nope, Pavan's nailed it: the problem is that after using a buffer, the seqscan leaves it with usage_count = 1, which means it has to be passed over once by the clock sweep before it can be re-used. [...] So the problem is not so much the clock sweep overhead as that it's paid in a very nonuniform fashion: with N buffers you pay O(N) once every N reads and O(1) the rest of the time. [...] Mark, can you detect hiccups in the read rate using your setup?

Cool. You posted the same analysis before I could hit the send button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a higher value. usage_count is incremented while unpinning the buffer. Even if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned every time the seqscan returns a tuple? If that's the case, the overhead would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea.

How about a smaller value for BM_MAX_USAGE_COUNT?

Another issue nearby to this is whether to avoid selecting buffers that are dirty --- IIRC someone brought that up again recently. Maybe predecrement for clean buffers, postdecrement for dirty ones would be a cute compromise.

Can we use a 2-bit counter where the higher bit is set if the buffer is dirty and the lower bit is set whenever the buffer is used? The clock sweep then decrements this counter and chooses a victim with counter value 0.

ISTM that we should optimize for the large shared buffer pool case, because that will be more common in the coming days. RAM is getting cheaper every day.

Thanks,
Pavan
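A tiny stand-alone sketch of the 2-bit scheme Pavan is proposing -- my illustration only; the field and macro names are made up, and the real buffer headers look nothing like this:

#include <stdio.h>

#define BUF_USED  0x01      /* low bit: set whenever the buffer is accessed */
#define BUF_DIRTY 0x02      /* high bit: set whenever the buffer is written */

typedef struct { int refcount; unsigned sweep_bits; } BufDesc;

/* One clock-sweep visit: returns 1 if the buffer can be evicted now. */
static int sweep_visit(BufDesc *buf)
{
    if (buf->refcount == 0 && buf->sweep_bits == 0)
        return 1;
    if (buf->sweep_bits > 0)
        buf->sweep_bits--;      /* dirty+used counts down 3 -> 2 -> 1 -> 0 */
    return 0;
}

int main(void)
{
    BufDesc clean_used = {0, BUF_USED};
    BufDesc dirty_used = {0, BUF_USED | BUF_DIRTY};
    int     visits;

    for (visits = 1; !sweep_visit(&clean_used); visits++)
        ;
    printf("clean, used buffer evictable on visit %d\n", visits);  /* 2 */

    for (visits = 1; !sweep_visit(&dirty_used); visits++)
        ;
    printf("dirty, used buffer evictable on visit %d\n", visits);  /* 4 */
    return 0;
}

The effect is that dirty buffers survive more sweeps than clean ones, which is one way to read the "predecrement for clean, postdecrement for dirty" compromise Tom floats above.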
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Pavan Deolasee [EMAIL PROTECTED] writes:

I am wondering whether a seqscan would set the usage_count to 1 or to a higher value. usage_count is incremented while unpinning the buffer. Even if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned every time the seqscan returns a tuple? If that's the case, the overhead would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page. There's a good deal of PrivateRefCount thrashing that goes on while examining the individual tuples, but the shared state only changes when we leave the page, because we hold the pin continuously on the current page of a seqscan. If you don't believe me, insert some debug printouts for yourself.

How about a smaller value for BM_MAX_USAGE_COUNT?

This is not relevant to the problem: we are concerned about usage count 1 versus 0, not the other end of the range.

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane [EMAIL PROTECTED] writes:

I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single bit. Either a buffer was touched recently or it wasn't. I also vaguely recall a refinement involving keeping a bitmask and shifting it right each time the clock hand comes around, so you have a precise history of which recent clock sweeps the buffer was used in and which it wasn't.

I think the coarseness of not caring how heavily a buffer was used is a key part of the algorithm. By not caring whether it was lightly or heavily used during the clock sweep, just that it was used at least once, it avoids sticking with incorrect deductions about things like sequential scans, even if multiple sequential scans pass by. As soon as they stop touching the buffer, it immediately reacts and discards it.

I would check out my OS book from school but it's on another continent :(

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
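The refinement Greg is thinking of is usually called "aging" in the OS literature. A minimal sketch, assuming an 8-bit history field (my illustration, not anyone's patch):

#include <stdio.h>

typedef struct { unsigned char history; } Buf;

static void on_access(Buf *b) { b->history |= 0x80; }  /* mark: used this period */
static void on_sweep(Buf *b)  { b->history >>= 1;  }   /* age: shift history right */

int main(void)
{
    Buf b = {0};
    int sweeps = 0;

    on_access(&b);                  /* touched once, then never again */
    while (b.history != 0)
    {
        on_sweep(&b);
        sweeps++;
    }
    printf("buffer aged out after %d sweeps\n", sweeps);   /* prints 8 */
    return 0;
}

A buffer touched on every sweep keeps its high bits set and so sorts as "hot"; one touched once decays to zero after a fixed number of sweeps, regardless of how heavily it was hit in that one period -- which is the coarseness Greg argues is a feature.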
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:

Note, though, that the current algorithm is working very, very well for OLTP benchmarks, so we'd want to be careful not to gain performance in one area at the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs when a uniform workload is present, like SeqScan, VACUUM or COPY. When you have lots of indexed access, the scan workloads don't have as much effect on the cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count == 1 then we assume that we can reuse that buffer. This avoids cache swamping for read and write workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it with a hint StrategyHintCyclicBufferReuse() in a similar manner to the hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping data within the L2 cache for DSS and bulk workloads.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
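A rough sketch of the ring Simon describes, as I read the proposal -- not actual patch code; GetUsageCount() and StrategyGetVictimBuffer() are hypothetical helper names standing in for whatever the real implementation would use:

#define SCAN_RING_SIZE 32

typedef struct
{
    Buffer  bufs[SCAN_RING_SIZE];   /* buffers this scan has filled */
    int     current;                /* clock position within the ring */
} ScanRing;

/* Pick the next buffer for a cache-looping scan. */
static Buffer
ring_next_buffer(ScanRing *ring)
{
    Buffer  buf = ring->bufs[ring->current];

    /*
     * usage_count == 1 means nobody but us touched the buffer since we
     * filled it, so we can recycle it instead of evicting someone
     * else's page from the main pool.
     */
    if (buf == InvalidBuffer || GetUsageCount(buf) != 1)
        buf = StrategyGetVictimBuffer();    /* fall back to the shared clock sweep */

    ring->bufs[ring->current] = buf;
    ring->current = (ring->current + 1) % SCAN_RING_SIZE;
    return buf;
}

Because the scan keeps cycling through the same ~32 buffers, its working set stays small enough to sit in L2, and the rest of shared_buffers is left alone for everyone else.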
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon Riggs [EMAIL PROTECTED] writes:

Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count == 1 then we assume that we can reuse that buffer. This avoids cache swamping for read and write workloads, plus avoids too-frequent WAL writing for VACUUM. This would maintain the beneficial behaviour for OLTP,

Justify that claim. It sounds to me like this would act very nearly the same as having shared_buffers == 32 ...

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
This sounds like a good idea.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Simon Riggs [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2007 02:37 PM Eastern Standard Time
To: Josh Berkus; Tom Lane; Pavan Deolasee; Mark Kirkwood; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

[ Quoted message identical to Simon's post above, skipping... ]
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes: Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count == 1 then we assume that we can reuse that buffer. This avoids cache swamping for read and write workloads, plus avoids too-frequent WAL writing for VACUUM. This would maintain the beneficial behaviour for OLTP,

Justify that claim. It sounds to me like this would act very nearly the same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for SeqScans, VACUUM and COPY. So OLTP-only workloads would be entirely unaffected.

In the presence of a mixed workload the scan tasks would have only a limited effect on the cache, maintaining performance for the response-time-critical tasks. So it's an OLTP benefit because of cache protection and WAL-flush reduction during VACUUM. As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count = 1 seems valid. If another user had requested the block, then the usage_count would be > 1, unless the buffer has been used, unpinned and then a cycle of the buffer cache had spun round, all within the time taken to process 32 blocks sequentially. We do have to reuse one of the buffers, so cyclical reuse seems like a better bet most of the time than more arbitrary block reuse, as we see with a larger cache.

Best way is to prove it though. Seems like not too much work to have a private ring data structure when the hint is enabled. The extra bookkeeping is easily going to be outweighed by the reduction in mem-to-L2 cache fetches. I'll do it tomorrow, if no other volunteers.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:

The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be fixed. My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

I don't see why we should strictly limit sequential scans like that. I assume you mean one buffer per scan, but that raises two questions:

(1) What happens when there are more seq scans than cold buffers available?
(2) What happens when two sequential scans need the same page; do we double up?

Also, the first time we access any heap page of a big table, we are very unsure whether we will access it again, regardless of whether it's part of a seq scan or not.

In our current system of 4 LRU lists (depending on how many times a buffer has been referenced), we could start more-likely pages (e.g. system catalogs, index pages) in a higher list, and heap reads from big tables in the lowest possible list. Assuming, of course, that has any benefit (frequently accessed cache pages are likely to move up in the lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big tables altogether. A few buffers might be polluted during the scan, but most of the time they will be replacing other low-priority pages (perhaps from another seq scan) and will probably be replaced again quickly. If that's not happening, and it's polluting frequently-accessed pages, I agree that's a bug.

Regards,
Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:

My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

How will this approach play together with the synchronized scan patches?

Thanks for considering my patch in this discussion. I will test by turning shared_buffers down as low as I can, and see if that makes a big difference.

Or should synchronized scans rely on the system's cache only?

I don't know what the performance impact of that would be; still good compared to reading from disk, but I assume much more CPU time.

Regards,
Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon Riggs [EMAIL PROTECTED] writes:

Best way is to prove it though. Seems like not too much work to have a private ring data structure when the hint is enabled. The extra bookkeeping is easily going to be outweighed by the reduction in mem-to-L2 cache fetches. I'll do it tomorrow, if no other volunteers.

[ shrug... ] No one has yet proven to my satisfaction that L2 cache has anything to do with this. The notion that you can read a new disk page into a shared buffer and have that buffer still be live in the processor cache is so obviously bogus that I think there must be some other effect at work.

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 09:09, Heikki Linnakangas wrote:

In fact, the pages that are left in the cache after the seqscan finishes would be useful for the next seqscan of the same table if we were smart enough to read those pages first. That'd make a big difference for seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if Jeff's sync seqscan patch addresses that.

Absolutely. I've got a parameter in my patch, sync_scan_offset, that starts a seq scan N pages before the position of the last seq scan running on that table (or a current seq scan if there's still a scan going).

If the last scan is not still in progress, the pages are less likely to be in cache. If the pages are in cache, great; if not, it doesn't matter where we start anyway. If the last scan is still in progress, those recently-read pages are very likely to be in cache (shared buffers or OS buffer cache).

Regards,
Jeff Davis
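In code terms, the start-position logic Jeff describes might look roughly like this -- a sketch under my own naming assumptions (ScanHint, last_reported_block and choose_start_block are illustrative, not the patch's actual identifiers):

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    BlockNumber last_reported_block;    /* updated in shared memory by running scans */
} ScanHint;

static BlockNumber
choose_start_block(const ScanHint *hint, int sync_scan_offset)
{
    BlockNumber last = hint->last_reported_block;

    if (last == InvalidBlockNumber)
        return 0;                       /* no scan seen on this table yet */

    /* start N pages behind the other scan so we land on its cache trail */
    if (last >= (BlockNumber) sync_scan_offset)
        return last - sync_scan_offset;
    return 0;
}

Starting behind the reported position rather than at it is the whole trick: the trailing scan reads pages the leading scan has just pulled into cache, instead of racing ahead of it onto cold disk.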
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis [EMAIL PROTECTED] writes:

Absolutely. I've got a parameter in my patch sync_scan_offset that starts a seq scan N pages before the position of the last seq scan running on that table (or a current seq scan if there's still a scan going).

Strikes me that expressing that parameter as a percentage of shared_buffers might make it less in need of manual tuning ...

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:

Strikes me that expressing that parameter as a percentage of shared_buffers might make it less in need of manual tuning ...

The original patch was a percentage of effective_cache_size, because in theory it may be helpful to have this parameter larger than shared buffers: Synchronized Scanning can take advantage of the OS buffer cache as well. Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different options:

(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size

Regards,
Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis [EMAIL PROTECTED] writes:

On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: Strikes me that expressing that parameter as a percentage of shared_buffers might make it less in need of manual tuning ...

The original patch was a percentage of effective_cache_size, because in theory it may be helpful to have this parameter larger than shared buffers. Synchronized Scanning can take advantage of OS buffer cache as well.

I didn't say you couldn't allow it to be more than 100% ;-). But basing it on effective_cache_size strikes me as a bad idea because that parameter is seldom better than a wild guess. shared_buffers at least means something.

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jeff Davis wrote:

The original patch was a percentage of effective_cache_size, because in theory it may be helpful to have this parameter larger than shared buffers. Synchronized Scanning can take advantage of OS buffer cache as well. Someone convinced me to change it to be an independent variable. I don't have a strong opinion, but now I have three different options: (1) Independent parameter (2) Percentage of shared_buffers (3) Percentage of effective_cache_size

Another approach I proposed back in December is to not have a variable like that at all, but to scan the buffer cache for pages belonging to the table you're scanning in order to initialize the scan. Scanning all the BufferDescs is a fairly CPU- and lock-heavy operation, but it might be OK given that we're talking about large I/O-bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
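For concreteness, the initialization step Heikki describes could look roughly like this -- a sketch from memory of the 8.2-era buffer headers; the field names may not match exactly, and the locking is simplified:

/*
 * Collect the block numbers of the target relation's pages that are
 * currently in shared buffers, so a new scan can read them first.
 */
static int
collect_cached_blocks(RelFileNode rnode, BlockNumber *blocks, int max)
{
    int     n = 0;

    for (int i = 0; i < NBuffers && n < max; i++)
    {
        volatile BufferDesc *buf = &BufferDescriptors[i];

        LockBufHdr(buf);
        if ((buf->flags & BM_VALID) &&
            RelFileNodeEquals(buf->tag.rnode, rnode))
            blocks[n++] = buf->tag.blockNum;
        UnlockBufHdr(buf);
    }
    return n;       /* caller sorts the list and scans those pages first */
}

The single pass over NBuffers headers is the CPU- and lock-heavy part Heikki mentions, but it happens once per scan rather than once per page.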
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote:

So the problem is not so much the clock sweep overhead as that it's paid in a very nonuniform fashion: with N buffers you pay O(N) once every N reads and O(1) the rest of the time. This is no doubt slowing things down enough to delay that one read, instead of leaving it nicely I/O bound all the time. Mark, can you detect hiccups in the read rate using your setup?

I think so, here's the vmstat output for 400MB of shared_buffers during the scan:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 0764 51772 0 199068800 120422 2 1546 1755 16 37 46 1
 1 0764 53640 0 198879200 120422 2 1544 1446 14 40 46 1
 1 0788 54900 0 198756400 11674615 1470 3067 15 39 44 2
 1 0788 52800 0 198955200 11919920 1488 2216 14 37 47 1
 1 0788 52372 0 19900 122880 7 1532 1203 15 39 45 1
 1 0788 54592 0 198787205 124928 5 1557 1058 17 38 46 0
 2 0788 54052 0 198783600 118787 0 1500 2469 16 36 47 1
 1 0788 52552 0 198989200 120419 0 1506 2531 15 36 48 1
 1 0788 53452 0 198935600 119195 2 1501 1698 15 37 47 1
 1 0788 52680 0 198979600 120424 2 1521 1610 16 37 47 1

Cheers
Mark
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Mark Kirkwood [EMAIL PROTECTED] writes:

Tom Lane wrote: Mark, can you detect hiccups in the read rate using your setup?

I think so, here's the vmstat output for 400MB of shared_buffers during the scan:

Hm, not really a smoking gun there. But just for grins, would you try this patch and see if the numbers change?

regards, tom lane

*** src/backend/storage/buffer/freelist.c.orig	Fri Jan  5 18:02:12 2007
--- src/backend/storage/buffer/freelist.c	Mon Mar  5 16:33:11 2007
***************
*** 104,116 ****
  		 * it; decrement the usage_count and keep scanning.
  		 */
  		LockBufHdr(buf);
- 		if (buf->refcount == 0 && buf->usage_count == 0)
- 			return buf;
  		if (buf->usage_count > 0)
  		{
  			buf->usage_count--;
! 			trycounter = NBuffers;
  		}
  		else if (--trycounter == 0)
  		{
  			/*
--- 104,116 ----
  		 * it; decrement the usage_count and keep scanning.
  		 */
  		LockBufHdr(buf);
  		if (buf->usage_count > 0)
  		{
  			buf->usage_count--;
! 			trycounter = NBuffers + 1;
  		}
+ 		if (buf->refcount == 0 && buf->usage_count == 0)
+ 			return buf;
  		else if (--trycounter == 0)
  		{
  			/*
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote:

Hm, not really a smoking gun there. But just for grins, would you try this patch and see if the numbers change?

Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change that I can see:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 0764 55300 0 198803200 11879732 1532 1775 15 39 44 2
 1 0764 54476 0 198972000 115507 9 1456 3970 15 39 45 1
 1 0788 54540 0 198959200 121651 0 1508 3221 16 37 47 0
 1 0788 52808 0 199132000 124109 0 1532 1236 15 38 46 0
 1 0788 52504 0 199178400 124518 0 1547 1005 16 39 45 0
 2 0788 54544 0 198974005 117965 5 1491 2012 15 36 47 2
 1 0788 53596 0 199118400 120424 0 1504 1910 16 37 46 1

Elapsed time is exactly the same (101 s). Is it expected that HEAD would behave differently?

Cheers
Mark
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Mark Kirkwood [EMAIL PROTECTED] writes:

Elapsed time is exactly the same (101 s). Is it expected that HEAD would behave differently?

Offhand I don't think so. But what I wanted to see was the curve of elapsed time vs shared_buffers?

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 2007-03-05 at 21:03, Heikki Linnakangas wrote:

Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps.

I assume you're referring to this: each backend keeps a bitmap of pages it has processed during the scan, and reads the pages in the order they're available in cache. I think that's a great idea. However, I was unable to devise a good answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O layer?
* My current implementation takes advantage of the OS buffer cache; how could we maintain that advantage from PostgreSQL-specific cache logic?
* How do I test whether it actually helps in a realistic scenario? It seems like it would help the most when scans are progressing at different rates, but how often do people have CPU-bound queries on tables that don't fit into physical memory (and how long would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current implementation is more guesswork. I like the analytical approach, but I don't know that we have enough information to pull it off, because we're missing what's in the OS buffer cache. The OS buffer cache is crucial to Synchronized Scanning, because shared buffers are evicted based on a more complex set of circumstances, whereas the OS buffer cache is usually LRU and forms a nicer cache trail (upon which Synchronized Scanning is largely based).

If you have some tests you'd like me to run, I'm planning to do some benchmarks this week and next. I can see if my current patch holds up under the scenarios you're worried about.

Regards,
Jeff Davis
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Simon Riggs wrote:

On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count == 1 then we assume that we can reuse that buffer. This avoids cache swamping for read and write workloads, plus avoids too-frequent WAL writing for VACUUM. This would maintain the beneficial behaviour for OLTP,

Justify that claim. It sounds to me like this would act very nearly the same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for SeqScans, VACUUM and COPY. So OLTP-only workloads would be entirely unaffected. In the presence of a mixed workload the scan tasks would have only a limited effect on the cache, maintaining performance for the response time critical tasks. So its an OLTP benefit because of cache protection and WAL-flush reduction during VACUUM. As we've seen, the scan tasks look like they'll go faster with this.

But it would break the idea of letting a second seqscan follow in the first one's hot cache trail, no?

greetings, Florian Pflug
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote: But what I wanted to see was the curve of elapsed time vs shared_buffers?

Of course! (Let's just write that off to me being pre-coffee...) With the patch applied:

    Shared Buffers   Elapsed   vmstat IO rate
    --------------   -------   --------------
    400MB            101 s     122 MB/s
    2MB              101 s
    1MB               97 s
    768KB             94 s
    512KB             84 s
    256KB             79 s
    128KB             75 s     166 MB/s

Looks *very* similar.

Cheers, Mark
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: But what I wanted to see was the curve of elapsed time vs shared_buffers? ... Looks *very* similar.

Yup, thanks for checking. I've been poking into this myself. I find that I can reproduce the behavior to some extent even with a slow disk drive (this machine is a dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test shows the disk read speed as 43MB/sec or so). Test case is a several-gig table, no indexes, fully vacuumed so that neither VACUUM nor COUNT(*) have to do anything but seqscan as fast as they can. Given a *freshly started* postmaster, I see

    regression=# show shared_buffers;
     shared_buffers
    ----------------
     128MB
    (1 row)

    regression=# \timing
    Timing is on.
    regression=# vacuum lineitem;
    VACUUM
    Time: 63652.333 ms
    regression=# vacuum lineitem;
    VACUUM
    Time: 63562.303 ms
    regression=# select count(*) from lineitem;
     count
    -------
      1024
    (1 row)

    Time: 63142.174 ms
    regression=# vacuum lineitem;
    VACUUM
    Time: 61638.421 ms
    regression=# vacuum lineitem;
    VACUUM
    Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before the SELECT, and its times are stable; and you can repeat all you want after the SELECT, and the times are stable but a couple of seconds lower. Restart the postmaster and it goes back to the slower behavior. (I'm keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the user-space CPU consumption is indistinguishable in both states:

    CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
    Counted GLOBAL_POWER_EVENTS events (time during which processor is not
    stopped) with a unit mask of 0x01 (mandatory) count 24

    GLOBAL_POWER_E...  samples        %
    ------------------------------------
                        141065  73.8193  /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
                         26368  13.7984  /home/tgl/testversion/bin/postgres
                         12765   6.6799  /libata
                          2238   1.1711  /lib64/libc-2.4.so
                          1112   0.5819  /dm_mod

    GLOBAL_POWER_E...  samples        %
    ------------------------------------
                        113177  70.2169  /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
                         26284  16.3070  /home/tgl/testversion/bin/postgres
                         12004   7.4475  /libata
                          2093   1.2985  /lib64/libc-2.4.so
                           996   0.6179  /dm_mod

Inside the kernel, there's only one routine that's significantly different:

    samples        %  symbol name
      57779  40.9591  copy_user_generic
      18175  12.8841  __delay
       3994   2.8313  _raw_spin_lock
       2388   1.6928  put_page
       2184   1.5482  mwait_idle
       2083   1.4766  _raw_write_unlock
       1909   1.3533  _raw_write_lock

    samples        %  symbol name
      37437  33.0783  copy_user_generic
      17891  15.8080  __delay
       3372   2.9794  _raw_spin_lock
       2218   1.9598  mwait_idle
       2067   1.8263  _raw_write_unlock
       1837   1.6231  _raw_write_lock
       1531   1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in such a way that it's faster to copy into memory that's already in processor cache. This strikes me as something that probably could/should be fixed in the kernel; I don't see any good reason why overwriting a whole cache line oughtn't be the same speed either way.
The other thing that was bothering me is why does the SELECT change VACUUM's behavior? A debugging printout added to ReadBuffer gave the answer: after postmaster start, we see things like

    read block 353094 into buffer 11386
    read block 353095 into buffer 11387
    read block 353096 into buffer 11388
    read block 353097 into buffer 11389
    read block 353098 into buffer 11390
    read block 353099 into buffer 11391
    read block 353100 into buffer 11392
    read block 353101 into buffer 11393
    read block 353102 into buffer 11394
    read block 353103 into buffer 11395

and after the SELECT it behaves like

    read block 336761 into buffer 9403
    read block 336762 into buffer 9402
    read block 336763 into buffer 9403
    read block 336764 into buffer 9402
    read block 336765 into buffer 9403
    read block 336766 into buffer 9402
    read block 336767 into buffer 9403
    read block 336768 into buffer 9402
    read block 336769 into buffer 9403
    read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on the tail of the freelist. In the post-SELECT state, there are just two buffers cycling through the freelist (not sure why not only one, but it doesn't matter), and so the cache footprint stays small enough to remain in the processor cache.
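A toy model of the freelist behavior described above — VACUUM returning each finished buffer to the tail while ReadBuffer pops from the head — assuming a plain singly linked list; the names are invented and PostgreSQL's real freelist code differs.

    #include <stddef.h>

    typedef struct FreeListNode
    {
        int                  buf_id;
        struct FreeListNode *next;
    } FreeListNode;

    typedef struct FreeList
    {
        FreeListNode *head;
        FreeListNode *tail;
    } FreeList;

    /* VACUUM-style: return a finished buffer to the tail of the list. */
    static void
    freelist_push_tail(FreeList *fl, FreeListNode *node)
    {
        node->next = NULL;
        if (fl->tail)
            fl->tail->next = node;
        else
            fl->head = node;
        fl->tail = node;
    }

    /* ReadBuffer-style: take the next victim from the head of the list. */
    static FreeListNode *
    freelist_pop_head(FreeList *fl)
    {
        FreeListNode *node = fl->head;

        if (node)
        {
            fl->head = node->next;
            if (fl->head == NULL)
                fl->tail = NULL;
        }
        return node;
    }

Once the list has drained down to a couple of entries, each push_tail is immediately followed by a pop_head of the same buffers, producing the two-buffer cycle seen in the trace.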
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane [EMAIL PROTECTED] writes: I don't see any good reason why overwriting a whole cache line oughtn't be the same speed either way.

I can think of a couple of theories, but I don't know if they're reasonable. The one that comes to mind is the inter-processor cache coherency protocol: when writing to a cache line the processor already owns, maybe it can skip having to check for other processors owning that cache line?

What happens if VACUUM comes across buffers that *are* already in the buffer cache? Does it throw those on the freelist too? That seems like it would be dangerous if they were in the buffer cache for a reason.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Hi Tom,

Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).

- Luke

Msg is shrt cuz m on ma treo
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Gregory Stark [EMAIL PROTECTED] writes: What happens if VACUUM comes across buffers that *are* already in the buffer cache? Does it throw those on the freelist too?

Not unless they have usage_count 0, in which case they'd be subject to recycling by the next clock sweep anyway.

regards, tom lane
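For reference, a simplified sketch of the clock-sweep recycling Tom refers to: the hand decrements usage_count as it passes, so a buffer already at zero is taken on the hand's next visit whether or not it was ever put on the freelist. BufferDescLite and clock_sweep_victim are invented names; pinning and locking are omitted.

    #include <stdint.h>

    typedef struct BufferDescLite
    {
        int      buf_id;
        uint32_t usage_count;   /* bumped on access, capped at a maximum */
    } BufferDescLite;

    static int clock_hand = 0;

    /*
     * Sweep until we find a buffer whose usage_count has reached zero;
     * decrement the count of every buffer the hand passes, so an
     * unreferenced buffer becomes a victim within a few full sweeps.
     */
    static int
    clock_sweep_victim(BufferDescLite *buffers, int nbuffers)
    {
        for (;;)
        {
            BufferDescLite *b = &buffers[clock_hand];

            clock_hand = (clock_hand + 1) % nbuffers;
            if (b->usage_count == 0)
                return b->buf_id;   /* found a victim */
            b->usage_count--;       /* give it one more lap */
        }
    }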
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes: Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).

Cool. Maybe Sherry can comment on the question of whether it's possible for a large-scale memcpy to not take a hit on filling a cache line that wasn't previously in cache? I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler, which is something I've never studied :-(.

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom,

On 3/5/07 7:58 PM, Tom Lane [EMAIL PROTECTED] wrote: I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler which is something I've never studied :-(.

Here's the C wrapper routine in Solaris:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/move.c

Here's the x86 assembler routine for Solaris:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s

The actual uiomove routine is a simple wrapper that calls the assembler kcopy or xcopyout routines. There are two versions (for Opteron): one uses the NTA instructions, which bypass the L2 cache on writing to avoid L2 cache pollution, and the other writes normally, through the L2 cache. Which one is used depends on a global parameter based on the size of the I/O. It is tuned to identify operations that might pollute the L2 cache (sound familiar?).

I think what we're seeing is a generic artifact of the write-through behavior of the cache. I wouldn't expect this to get any better with DIRECTIO to the shared_buffers in pgsql - if we iterate over a large number of user-space buffers we'll still hit the increased L2 thrashing.

I think we're best off with a hybrid approach: when we detect a seq scan larger (much larger?) than the buffer cache, we can switch into the cache-bypass behavior, much like the above code uses the NTA instruction when appropriate. We can handle syncscan using a small buffer space.

- Luke
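The NTA-style bypass Luke describes can be illustrated with SSE2 non-temporal stores; this is only a sketch assuming 16-byte-aligned pointers and a copy size that is a multiple of 16, not the Solaris uiomove code itself.

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128 and friends */
    #include <stdint.h>
    #include <stddef.h>

    /*
     * Copy using non-temporal (streaming) stores: movntdq writes head
     * straight toward memory rather than allocating L2 cache lines, so
     * a huge copy does not evict the working set of the rest of the
     * executor path.
     */
    static void
    copy_nontemporal(void *dst, const void *src, size_t n)
    {
        __m128i       *d = (__m128i *) dst;
        const __m128i *s = (const __m128i *) src;

        for (size_t i = 0; i < n / sizeof(__m128i); i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();            /* make the streamed stores globally visible */
    }

The trade-off matches the Solaris tuning Luke mentions: for small copies the data is likely wanted again soon, so a normal cached copy wins; only copies large enough to thrash the cache should take this path.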
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes: Here's the x86 assembler routine for Solaris: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s The actual uiomove routine is a simple wrapper that calls the assembler kcopy or xcopyout routines. There are two versions (for Opteron), one that uses the NTA instructions that bypass the L2 cache on writing to avoid L2 cache pollution, and the second writes normally - through the L2 cache.

Hm. If it were bypassing the L2 cache, then the hit would be taken when PG later tries to read the data. This is clearly not what's happening in the Linux system, but I've not seen any oprofile-equivalent data for Solaris?

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:

Tom, I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different clock-sweep numbers, including ones with higher usage_count for indexes and starting numbers of 0 for large seq scans as well as vacuums. However, we didn't have any way to prove that any of these complex algorithms would result in higher performance, so we went with the simplest formula, with the idea of tinkering with it when we had more data. So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP benchmarks, so we'd want to be careful not to gain performance in one area at the expense of another. In TPC-E testing, we've been able to increase shared_buffers to 10GB with beneficial performance effect (numbers posted when I have them) and even found that taking over RAM with shared_buffers (a la Oracle) gave us performance equivalent to using the FS cache. (Yes, this means that with a little I/O-management engineering we could contemplate discarding use of the FS cache for a net performance gain. Maybe for 8.4.)

An idea I've been thinking about would be to have the bgwriter or some other background process actually try to keep the free list populated, so that backends needing to grab a page would be much more likely to find one there (and not have to wait to scan through the entire buffer pool, perhaps multiple times). My thought is to keep track of how many page requests occurred during a given interval, and use that value (probably averaged over time) to determine how many pages we'd like to see on the free list. The background process would then run through the buffers decrementing usage counts until it found enough for the free list. Before putting a buffer on the free list, it would write the buffer out; I'm not sure whether it would make sense to de-associate the buffer from whatever it had been storing, though. If we don't do that, we could pull pages back off the free list if we wanted to. That would be helpful if the background process got a bit over-zealous.

-- Jim Nasby  [EMAIL PROTECTED]  EnterpriseDB http://enterprisedb.com  512.569.9461 (cell)
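A rough sketch of the background freelist-populating loop Jim outlines, assuming a demand estimate smoothed over intervals; every function name here is an invented stand-in for bufmgr internals, not an existing PostgreSQL API.

    /* Assumed helpers; stand-ins for bufmgr internals. */
    extern int  freelist_length(void);
    extern int  clock_sweep_victim_buffer(void);
    extern int  buffer_is_dirty(int buf_id);
    extern void write_buffer(int buf_id);
    extern void freelist_push_tail_buffer(int buf_id);

    /*
     * Called once per interval by the background process: estimate demand
     * from recent buffer requests and top up the freelist by sweeping
     * usage counts down, flushing dirty victims before listing them.
     */
    static void
    maintain_freelist(int requests_this_interval)
    {
        static double avg_demand = 0.0;
        int target, shortfall;

        /* exponential moving average of per-interval demand */
        avg_demand = 0.9 * avg_demand + 0.1 * requests_this_interval;
        target = (int) (avg_demand * 2);     /* keep some headroom */

        shortfall = target - freelist_length();
        while (shortfall-- > 0)
        {
            int victim = clock_sweep_victim_buffer();

            if (buffer_is_dirty(victim))
                write_buffer(victim);        /* flush before listing */
            freelist_push_tail_buffer(victim);
        }
    }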
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote: Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU- and lock-heavy operation, but it might be OK given that we're talking about large I/O-bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for. Given the partitioning of the buffer lock that Tom did, it might not be that horrible for many cases, either, since you'd only need to scan through one partition. We don't need an exact count, either. Perhaps there's some way we could keep a counter or something...

-- Jim Nasby  [EMAIL PROTECTED]  EnterpriseDB http://enterprisedb.com  512.569.9461 (cell)
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Jim Nasby [EMAIL PROTECTED] writes: An idea I've been thinking about would be to have the bgwriter or some other background process actually try and keep the free list populated,

The bgwriter already tries to keep pages just in front of the clock sweep pointer clean.

regards, tom lane
[HACKERS] Bug: Buffer cache is not scan resistant
I'm putting this out there before we publish a fix so that we can discuss how best to fix it.

Doug and Sherry recently found the source of an important performance issue with the Postgres shared buffer cache. The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as advertised. A sequential scan of a table larger than cache will pollute the buffer cache in almost all circumstances.

Here is the performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500 (thumper-3) with 4 cores, where bigtable is a table 2x the size of RAM and memtable is a table that fits into the I/O cache. With our default setting of shared_buffers (16MB):

    Operation          memtable     bigtable
    ---------------    ---------    ---------
    SELECT COUNT(*)    1221 MB/s     973 MB/s
    VACUUM             1709 MB/s    1206 MB/s

We had observed that VACUUM would perform better when done right after a SELECT. In the above example, the faster rate from disk was 1608 MB/s, compared to the normal rate of 1206 MB/s. We verified this behavior on Postgres 8.2 as well. The buffer selection algorithm is choosing buffer pages scattered throughout the buffer cache in almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2 cache. Doug found that we weren't using the Postgres buffer cache the way we expected; instead we were loading the scanned data from disk into the cache even though there was no possibility of reusing it. In addition to pushing other, possibly useful pages from the cache, this has the additional effect of invalidating the L2 cache for the remainder of the executor path that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the shared buffer size to fit into L2 cache (1MB per Opteron core), i.e. shared_buffers = 680KB, and this is what we saw:

    Operation          memtable     bigtable
    ---------------    ---------    ---------
    SELECT COUNT(*)    1320 MB/s    1059 MB/s
    VACUUM             3033 MB/s    1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally, one page would be used in the buffer cache in circumstances where the table to be scanned is (significantly?) larger than the size of the buffer cache.

- Luke
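One conceivable shape for the fix being asked about, sketched under the assumption that the decision is made at scan start by comparing relation size to shared_buffers; the names and the 1x threshold are placeholders, not a proposed patch.

    #include <stdint.h>

    typedef enum ScanStrategy
    {
        SCAN_USE_SHARED_CACHE,   /* normal clock-sweep behavior */
        SCAN_USE_SINGLE_BUFFER   /* reuse one page; don't pollute the cache */
    } ScanStrategy;

    /*
     * Pick a buffer-usage strategy when a sequential scan begins: if the
     * relation cannot possibly fit in the buffer pool, caching its pages
     * can't pay off, so confine the scan to a single reusable buffer.
     */
    static ScanStrategy
    choose_scan_strategy(uint64_t table_bytes, uint64_t shared_buffer_bytes)
    {
        if (table_bytes > shared_buffer_bytes)
            return SCAN_USE_SINGLE_BUFFER;
        return SCAN_USE_SHARED_CACHE;
    }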
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Luke Lonergan [EMAIL PROTECTED] writes: The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as advertised.

Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache; which is hardly surprising considering it doesn't know the size of L2 cache. That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan: won't reading a new disk page into memory via DMA cause that memory to get flushed from the processor cache anyway? I wonder whether your numbers are explained by some other consideration than you think.

regards, tom lane
Re: [HACKERS] Bug: Buffer cache is not scan resistant
When we instrument the page selections made within the buffer cache, they are sequential and span the entire address space of the cache.

With respect to whether it's L2: it's a conclusion based on the experimental results. It's not the TLB, as we also tested for the 512 entries for each L2.

One thing I left out of the previous post: the difference between fast and slow behavior was that in the fast case, the buffer selection alternated between two buffer pages. This was the case only when the preceding statement was a SELECT and the statement was VACUUM.

- Luke

Msg is shrt cuz m on ma treo
Re: [HACKERS] Bug: Buffer cache is not scan resistant
One more thing: the L2 is invalidated when re-written from the kernel I/O cache, but the pages addressed in L2 retain their values when 'written thru', which allows the new data to be re-used up the executor chain.

- Luke

Msg is shrt cuz m on ma treo
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Tom Lane wrote: Luke Lonergan [EMAIL PROTECTED] writes: The issue is summarized like this: the buffer cache in PGSQL is not scan resistant as advertised. Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache; which is hardly surprising considering it doesn't know the size of L2 cache. That's not a consideration that we've ever taken into account.

To add a little to this - forgetting the scan-resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW. E.g. on a PIII 1.26GHz with 512KB L2 cache and 2GB RAM, SELECT count(*) FROM lineitem (which is about 11GB) performs as follows:

    Shared_buffers   Elapsed
    --------------   -------
    400MB            101 s
    128KB             74 s

When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having fewer of them to search through will help, but having less than an L2-cache-sized worth of them seems to help a whole lot!

Cheers, Mark
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 5 Mar 2007, Mark Kirkwood wrote: When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having fewer of them to search through will help, but having less than an L2-cache-sized worth of them seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers sizes from 512KB up to, say, 2MB? The two numbers you give might just have to do with managing a large buffer.

Thanks, Gavin
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Gavin, Mark,

Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer.

I suggest two experiments that we've already done:

1) Increase shared_buffers to double the L2 cache size. You should see that the behavior reverts to the slow performance and is constant at larger sizes.

2) Instrument the calls to BufferGetPage() (a macro) and note that the buffer block numbers returned increase sequentially during scans of tables larger than the buffer size.

- Luke
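Experiment 2 might be instrumented along these lines; BufferGetPage is the real macro name, but the tracing wrapper below is only an assumed sketch that would need to be compiled inside the backend with bufmgr.h in scope.

    /*
     * Sketch: a logging wrapper around the real BufferGetPage macro.
     * Each call prints the buffer id handed back, so a scan's walk
     * across the buffer cache can be observed in the log.
     */
    #include <stdio.h>

    #define BufferGetPageTraced(buf) \
        (fprintf(stderr, "BufferGetPage: buffer %d\n", (int) (buf)), \
         BufferGetPage(buf))

Grepping the resulting log for monotonically increasing buffer numbers during a large scan would confirm (or refute) the sequential sweep described above.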
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Gavin Sherry [EMAIL PROTECTED] writes: Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer.

Using PG CVS HEAD on a 64-bit Intel Xeon (1MB L2 cache), Fedora Core 5, I don't measure any noticeable difference in seqscan speed for shared_buffers set to 32MB or 256kB. I note that the code would not let me choose the latter setting without a large decrease in max_connections, which might be expected to cause some performance changes in itself.

Now this may only prove that the disk subsystem on this machine is too cheap to let the system show any CPU-related issues. I'm seeing a scan rate of about 43MB/sec for both count(*) and plain ol' wc, which is a factor of 4 or so less than Mark's numbers suggest... but top shows CPU usage of less than 5%, so even with a 4x faster disk I'd not really expect that CPU speed would become interesting. (This is indeed a milestone, btw, because it wasn't so long ago that count(*) was nowhere near disk speed.)

regards, tom lane