Hi Peter,

I took a look at this patch and the benchmarks you've furnished:
https://github.com/petergeoghegan/commit_delay_benchmarks
http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html

On Wed, Nov 14, 2012 at 08:44:26PM +0000, Peter Geoghegan wrote:
> Attached is a doc-patch that makes recommendations that are consistent
> with my observations about what works best here. I'd like to see us
> making *some* recommendation - for sympathetic cases, setting
> commit_delay appropriately can make a very large difference to
> transaction throughput. Such sympathetic cases - many small write
> transactions - are something that tends to be seen relatively
> frequently with web applications, that disproportionately use cloud
> hosting. It isn't at all uncommon for these cases to be highly bound
> by their commit rate, and so it is compelling to try to amortize the
> cost of a flush as effectively as possible there. It would be
> unfortunate if no one was even aware that commit_delay is now useful
> for these cases, since the setting allows cloud hosting providers to
> help these cases quite a bit, without having to do something like
> compromise durability, which in general isn't acceptable.

Your fast-fsync (SSD, BBWC) benchmarks show a small loss up to 8 clients and a 10-20% improvement at 32 clients. That's on a 4-core/8-thread CPU, assuming HT was left enabled. Your slow-fsync (laptop) benchmarks show a 40-100% improvement in the 16-64 client range.

I did a few more benchmarks along the spectrum. First, I used a Mac, also 4-core/8-thread, with fsync_writethrough; half of fsync time gave commit_delay = 35000. I used pgbench, scale factor 100, 4-minute runs, three trials each:

-clients-  -tps@commit_delay=0-  -tps@commit_delay=35000-
    8        51,55,63              82,84,86
   16        98,100,107            130,134,143
   32        137,148,157           192,200,201
   64        199,201,214           249,256,258

So that's a nice 27-53% improvement, fairly similar to the pattern for your laptop pgbench numbers.

Next, based on your comment about the possible value for cloud-hosted applications, I tried a cc2.8xlarge (16-core, 32-thread) GNU/Linux EC2 instance with the data directory on a standard EBS volume, ext4 filesystem. Several 15s pg_test_fsync runs could not agree on an fsync time; I saw results from 694us to 1904us. Ultimately I settled on trying commit_delay=500, scale factor 300:

-clients-  -tps@commit_delay=0-  -tps@commit_delay=500-
   32        1224,1391,1584        1175,1229,1394
   64        1553,1647,1673        1544,1546,1632
  128        1717,1833,1900        1621,1720,1951
  256        1664,1717,1918        1734,1832,1918

The numbers are all over the place, but there's more loss than gain. Amit Kapila also measured small losses in tps@-c16:
http://www.postgresql.org/message-id/000701cd6ff0$013a6210$03af2630$@kap...@huawei.com

I was curious about the cost of the MinimumActiveBackends() call when relying on commit_siblings to skip the delay. I ran a similar test with an extra 500 idle backends, clients=8, commit_siblings=20 (so the delay would never be used), and either a zero or nonzero commit_delay. There was no appreciable performance advantage from setting commit_delay=0 as opposed to relying on commit_siblings to suppress the delay. That's good news.

On the GNU/Linux VM, pg_sleep() achieves precision on the order of 10us. However, the sleep was consistently around 70us longer than requested. A 300us request yielded a 370us sleep, and a 3000us request gave a 3080us sleep. Mac OS X was similarly precise for short sleeps, but it could oversleep a full 1000us on a 35000us sleep.
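In case anyone wants to repeat that check on their own platform, something along these lines gives a rough figure. It is only a sketch: it goes through PL/pgSQL and the SQL-level pg_sleep(), so the extra time it reports slightly overstates what the commit_delay sleep itself would see.

psql -X -q <<'SQL'
DO $$
DECLARE
  t0  timestamptz;
  req float8 := 0.0003;  -- requested sleep in seconds; also try 0.003 and 0.035
BEGIN
  t0 := clock_timestamp();
  PERFORM pg_sleep(req);
  RAISE NOTICE 'requested % us, slept % us',
    (req * 1e6)::int,
    (extract(epoch FROM clock_timestamp() - t0) * 1e6)::int;
END
$$;
SQL

Running it a few times at several request lengths shows both the constant overshoot and whether the platform silently rounds short requests up to a coarser sleep granularity.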
> diff doc/src/sgml/wal.sgml
> index fc5c3b2..92619dd
> *** a/doc/src/sgml/wal.sgml
> --- b/doc/src/sgml/wal.sgml
> ***************
> *** 375,382 ****
>       just before a synchronous commit attempts to flush
>       <acronym>WAL</acronym> to disk, in the hope that a single flush
>       executed by one such transaction can also serve other transactions
> !     committing at about the same time.  Setting <varname>commit_delay</varname>
> !     can only help when there are many concurrently committing transactions.
>     </para>
>
>   </sect1>
> --- 375,397 ----

The beginning of this paragraph still says "commit_delay causes a delay just before a synchronous commit attempts to flush WAL to disk". Since it now applies to every WAL flush, that should be updated.

> ***************
> *** 560,570 ****
>       is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
>       other sessions are currently in active transactions; this avoids
>       sleeping when it's unlikely that any other session will commit soon.
> !     Note that on most platforms, the resolution of a sleep request is
>       ten milliseconds, so that any nonzero <varname>commit_delay</varname>
>       setting between 1 and 10000 microseconds would have the same effect.
> !     Good values for these parameters are not yet clear; experimentation
> !     is encouraged.
>     </para>
>
>     <para>
> --- 575,607 ----

There's a similar problem at the beginning of this paragraph; it says specifically, "The commit_delay parameter defines for how many microseconds the server process will sleep after writing a commit record to the log with LogInsert but before performing a LogFlush."

>       is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
>       other sessions are currently in active transactions; this avoids
>       sleeping when it's unlikely that any other session will commit soon.
> !     Note that on some platforms, the resolution of a sleep request is
>       ten milliseconds, so that any nonzero <varname>commit_delay</varname>
>       setting between 1 and 10000 microseconds would have the same effect.
> !   </para>

As a side note, if we're ever going to recommend a fire-and-forget method for setting commit_delay, it may be worth detecting whether the host sleep granularity is limited like this. Setting commit_delay = 20 for your SSD and silently getting commit_delay = 10000 would make for an unpleasant surprise.

> !
> !   <para>
> !     Since the purpose of <varname>commit_delay</varname> is to allow
> !     the cost of each flush operation to be more effectively amortized
> !     across concurrently committing transactions (potentially at the
> !     expense of transaction latency), it is necessary to quantify that
> !     cost when altering the setting. The higher that cost is, the more
> !     effective <varname>commit_delay</varname> is expected to be in
> !     increasing transaction throughput. The

That's true for spinning disks, but I suspect it does not hold for storage with internal parallelism, notably virtualized storage. Consider an iSCSI configuration with high bandwidth and high latency. When network latency is the limiting factor, will sending larger requests less often still help?
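To put a rough number on the flush cost itself without a full benchmark, the quickest probe I know of is to run pg_test_fsync against the WAL volume. The sketch below assumes a recent pg_test_fsync (check --help on your build for the exact flags), the path is only a placeholder, and the reading of the batched-write numbers is my own rule of thumb, not something the patch text promises.

# Run against the volume that holds (or will hold) pg_xlog; path is illustrative.
pg_test_fsync -s 15 -f /wal_volume/pg_test_fsync.tmp

# In the output, the "one 8kB write" section gives the per-flush cost for each
# wal_sync_method (usecs/op); half of the figure for your method is the
# starting point the patch suggests for commit_delay.  Comparing it with the
# "two 8kB writes" section is a crude indication of how cheaply the device
# absorbs a batched flush: if the batched flush costs little more than a
# single one, the amortization argument has more room to work.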
> !     <xref linkend="pgtestfsync"> module can be used to measure the
> !     average time in microseconds that a single WAL flush operation
> !     takes. A value of half of the average time the module reports it
> !     takes to flush after a single 8kB write operation is often the most
> !     effective setting for <varname>commit_delay</varname>. The benefit
> !     of tuning <varname>commit_delay</varname> can even be pronounced on
> !     storage media with very fast sync times, such as solid-state drives
> !     or RAID arrays with a battery-backed write cache. However, higher
> !     values of <varname>commit_siblings</varname> should be used is such

Typo: "is such" -> "in such"

> !     cases, whereas smaller <varname>commit_siblings</varname> values
> !     can be helpful on higher latency media. Note that it is quite
> !     possible that a setting of <varname>commit_delay</varname> that is
> !     too high can increase transaction latency by so much that total
> !     transaction throughput suffers.
>     </para>

One would be foolish to run a performance-sensitive workload like those in question, including the choice to have synchronous_commit=on, on spinning disks with no battery-backed write cache. A cloud environment is more credible, but my benchmark showed no gain there.

Overall, I still won't personally recommend changing commit_delay without measuring the performance change for one's particular workload and storage environment. commit_delay can now bring some impressive gains in the right situations, but I doubt those are common enough for a fire-and-forget setting to do more good than harm.

I suggest having the documentation recommend half of the fsync time as a starting point for benchmarking different commit_delay settings against your own workload. Indicate that it's more likely to help for direct use of spinning disks than for BBWC/solid state/virtualized storage. Not sure what else can be credibly given as general advice for PostgreSQL DBAs.

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers