[HACKERS] wal_buffers = -1
Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
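For readers outside the thread: with wal_buffers = -1 the server picks the value itself. A rough sketch of that auto-tuning rule as commonly described (1/32 of shared_buffers, clamped between 64kB and one 16MB WAL segment); the constants below are assumptions for a default build, not a copy of the server source:

```python
# Illustrative sketch of the wal_buffers = -1 auto-tuning rule being
# discussed: 1/32 of shared_buffers, clamped between 64kB and one WAL
# segment.  Constants assume a default build (8kB WAL blocks, 16MB segments).

XLOG_BLCKSZ = 8192                    # WAL block size (default build)
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # one WAL segment (default 16MB)

def autotuned_wal_buffers(shared_buffers_bytes):
    candidate = shared_buffers_bytes // 32
    floor = 8 * XLOG_BLCKSZ           # 64kB minimum
    return max(floor, min(candidate, WAL_SEGMENT_SIZE))

# With shared_buffers = 8GB (as in the benchmarks later in this thread),
# the one-segment cap kicks in: the autotuned value is 16MB.
print(autotuned_wal_buffers(8 * 1024**3) // (1024 * 1024))
```

That 16MB cap is exactly what the later replies question, since manual settings above it measurably help.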
Re: [HACKERS] wal_buffers = -1
On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wal_buffers = -1
Hi, On 2014-01-17 14:01:56 +0100, Magnus Hagander wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? I have seen improvements from setting it larger than the maximum value that -1 will choose. Also, for some workloads (low latency) it can be beneficial to have a small shared_buffers but still have a larger wal_buffers setting. So -1 for removing it from me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
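Andres's low-latency case (small shared_buffers, larger wal_buffers) would look something like this in postgresql.conf; the values are illustrative assumptions, not recommendations from this thread:

```
shared_buffers = 512MB      # kept small to bound checkpoint write bursts
wal_buffers = 64MB          # larger than the 16MB cap that -1 would pick
```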
Re: [HACKERS] wal_buffers = -1
On Fri, Jan 17, 2014 at 2:07 PM, Thom Brown t...@linux.com wrote: On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] wal_buffers = -1
On 17 January 2014 13:20, Magnus Hagander mag...@hagander.net wrote: On Fri, Jan 17, 2014 at 2:07 PM, Thom Brown t...@linux.com wrote: On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :) Well, that's the question. Do we have a heuristic sweet-spot that folk would agree on? -- Thom
Re: [HACKERS] wal_buffers = -1
On Fri, Jan 17, 2014 at 8:20 AM, Magnus Hagander mag...@hagander.net wrote: Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :)

I'm in favor of keeping the setting; I think that the auto-tuning has largely eliminated the pain in this area for the majority of users, but that doesn't mean we should deny someone who really wants to squeeze the last drop of performance out of their system the opportunity to poke at it manually. I doubt it's the least useful setting we have. The test above shows 32MB beating 16MB, but I think I did other tests where 16MB and 64MB came out the same.

Back when I was working heavily on performance, I did a number of tests to try to understand what events cause latency spikes. Many of those events are related to write-ahead logging. It turns out that writing a page from PostgreSQL's WAL buffers to the OS cache is usually pretty fast, unless the OS thinks we're dirtying data too quickly and decides to slam on the brakes. Calling fsync() to get the data out to disk, though, is very slow. However, both of those operations are protected by the same lock (WALWriteLock), so it's frequently the case that no more WAL buffer space can be freed up by calling write() because the guy who has the lock is busy waiting for an fsync(). That sucks, because there's no intrinsic reason why we can't have one backend doing a write() while another backend is doing an fsync().

On a related note, there's no real reason why the poor bastard who writes the WAL record that fills a segment should be forced to synchronously flush the segment instead of letting it be done in the background, but right now he is.
I think if we fix these problems, the optimal value for wal_buffers is likely to change; however, I'm not certain we'll be able to auto-tune it perfectly on day one. Having a setting makes it easier for people to experiment with different values, and I think that's good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers
On Wed, Aug 29, 2012 at 10:25 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: On 19 February 2012 05:24, Robert Haas robertmh...@gmail.com wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. Incidentally, I wondered if we could further improve group commit performance by implementing commit_delay with a WaitLatch call, and setting the latch in the event of WAL buffers wraparound (or rather, a queued wraparound request - a segment switch needs WALWriteLock, which the group commit leader holds for a relatively long time during the delay). I'm not really sure how significant a win this might be, though. There could be other types of contention, which could be considerably more significant. I'll try and take a look at it next week. I have a feeling that one of the big bottlenecks here is that we force an immediate fsync when we reach the end of a segment. I think it was originally done that way to keep the code simple, and it does accomplish that, but it's not so hot for performance. More generally, I think we really need to split WALWriteLock into two locks, one to protect the write position and the other to protect the flush position. I think we're often ending up with a write (which is usually fast) waiting for a flush (which is often much slower) when in fact those things ought to be able to happen in parallel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers
On Thursday, August 30, 2012 7:14 PM Robert Haas wrote: On Wed, Aug 29, 2012 at 10:25 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: On 19 February 2012 05:24, Robert Haas robertmh...@gmail.com wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. Incidentally, I wondered if we could further improve group commit performance by implementing commit_delay with a WaitLatch call, and setting the latch in the event of WAL buffers wraparound (or rather, a queued wraparound request - a segment switch needs WALWriteLock, which the group commit leader holds for a relatively long time during the delay). I'm not really sure how significant a win this might be, though. There could be other types of contention, which could be considerably more significant. I'll try and take a look at it next week. I have a feeling that one of the big bottlenecks here is that we force an immediate fsync when we reach the end of a segment. I think it was originally done that way to keep the code simple, and it does accomplish that, but it's not so hot for performance. More generally, I think we really need to split WALWriteLock into two locks, one to protect the write position and the other to protect the flush position. I think we're often ending up with a write (which is usually fast) waiting for a flush (which is often much slower) when in fact those things ought to be able to happen in parallel. This is a really good idea, splitting WALWriteLock into two locks, but in that case do we need separate handling for the OPEN_SYNC method, where write and flush happen together? And more about WAL: do you have any suggestions regarding the idea of triggering the WALWriter when the XLOG buffers are nearly full?
With Regards, Amit Kapila.
Re: [HACKERS] wal_buffers
On Tuesday, August 28, 2012 9:33 PM Bruce Momjian wrote: On Tue, Aug 28, 2012 at 09:40:33AM +0530, Amit Kapila wrote: From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Bruce Momjian Added to TODO: Allow reporting of stalls due to wal_buffer wrap-around http://archives.postgresql.org/pgsql-hackers/2012-02/msg00826.php Doesn't this indicate that XLOG writing needs some tuning, such that when some threshold of buffers (say 2/3) are full, the LogWriter is triggered? I assumed the LogWriter was already working as fast as it could, but couldn't keep up.

I have a doubt that in some cases it might not, for example:
1. Assume 16MB of XLOG buffers.
2. 6MB or less of them are filled.
3. The background writer decides how much to flush and starts writing, and at the same time backends start filling the remaining 10MB of the buffers.
4. The background writer goes to sleep after flushing 6MB.
5. Backends have filled all 16MB; in this case it may so happen that some backends need to do I/O.

Some other cases where I think it can be useful to wake the LogWriter:

Case-1
---
1. The log writer delay is the default 200ms, or set to some higher value by the user.
2. All the configured buffers get filled; backends might need to do I/O.

Case-2
---
The Case-1 scenario can also happen even if the user has set wal_buffers = -1 (auto-tuning of WAL buffers), because that reserves XLOG buffers equivalent to one segment file, and under high load, if that gets filled, backends might need to do I/O.

--- On Sun, Feb 19, 2012 at 12:24:12AM -0500, Robert Haas wrote: Just for kicks, I ran two 30-minute pgbench tests at scale factor 300 tonight on Nate Boley's machine, with -n -l -c 32 -j 32. The configurations were identical, except that on one of them, I set wal_buffers=64MB.
It seemed to make quite a lot of difference:

wal_buffers not set (thus, 16MB): tps = 3162.594605 (including connections establishing)
wal_buffers=64MB: tps = 6164.194625 (including connections establishing)

Rest of config: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms

I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. This is not very surprising, when you think about it: it's clear that the peak tps rate approaches 18k/s on these tests; right after a checkpoint, every update will force a full page write - that is, a WAL record of more than 8kB. So we'll fill up a 16MB WAL segment in about a tenth of a second. That doesn't leave much breathing room. I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + With Regards, Amit Kapila.
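Robert's back-of-the-envelope arithmetic checks out. Using the figures stated in the message (peak ~18k tps, at least 8kB of WAL per update just after a checkpoint, 16MB segments):

```python
# Worked version of the arithmetic in the message above: how long does one
# 16MB WAL segment last right after a checkpoint?
tps = 18_000                 # peak transactions/sec observed in the test
wal_per_txn = 8 * 1024       # >= 8kB full-page-write record per update
segment = 16 * 1024 * 1024   # one 16MB WAL segment

seconds_to_fill = segment / (tps * wal_per_txn)
print(round(seconds_to_fill, 3))  # ~0.114: a segment fills in about a tenth of a second
```

Since 8kB is a lower bound on the record size, a tenth of a second is itself an upper bound, which is why there is so little breathing room.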
Re: [HACKERS] wal_buffers
On 19 February 2012 05:24, Robert Haas robertmh...@gmail.com wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. Incidentally, I wondered if we could further improve group commit performance by implementing commit_delay with a WaitLatch call, and setting the latch in the event of WAL buffers wraparound (or rather, a queued wraparound request - a segment switch needs WALWriteLock, which the group commit leader holds for a relatively long time during the delay). I'm not really sure how significant a win this might be, though. There could be other types of contention, which could be considerably more significant. I'll try and take a look at it next week. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] wal_buffers
On Tue, Aug 28, 2012 at 09:40:33AM +0530, Amit Kapila wrote: From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Bruce Momjian Added to TODO: Allow reporting of stalls due to wal_buffer wrap-around http://archives.postgresql.org/pgsql-hackers/2012-02/msg00826.php Doesn't this indicate that XLOG writing needs some tuning, such that when some threshold of buffers (say 2/3) are full, the LogWriter is triggered? I assumed the LogWriter was already working as fast as it could, but couldn't keep up. --- On Sun, Feb 19, 2012 at 12:24:12AM -0500, Robert Haas wrote: Just for kicks, I ran two 30-minute pgbench tests at scale factor 300 tonight on Nate Boley's machine, with -n -l -c 32 -j 32. The configurations were identical, except that on one of them, I set wal_buffers=64MB.

It seemed to make quite a lot of difference:

wal_buffers not set (thus, 16MB): tps = 3162.594605 (including connections establishing)
wal_buffers=64MB: tps = 6164.194625 (including connections establishing)

Rest of config: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms

I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. This is not very surprising, when you think about it: it's clear that the peak tps rate approaches 18k/s on these tests; right after a checkpoint, every update will force a full page write - that is, a WAL record of more than 8kB. So we'll fill up a 16MB WAL segment in about a tenth of a second. That doesn't leave much breathing room.
I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] wal_buffers
Added to TODO: Allow reporting of stalls due to wal_buffer wrap-around http://archives.postgresql.org/pgsql-hackers/2012-02/msg00826.php --- On Sun, Feb 19, 2012 at 12:24:12AM -0500, Robert Haas wrote: Just for kicks, I ran two 30-minute pgbench tests at scale factor 300 tonight on Nate Boley's machine, with -n -l -c 32 -j 32. The configurations were identical, except that on one of them, I set wal_buffers=64MB.

It seemed to make quite a lot of difference:

wal_buffers not set (thus, 16MB): tps = 3162.594605 (including connections establishing)
wal_buffers=64MB: tps = 6164.194625 (including connections establishing)

Rest of config: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms

I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. This is not very surprising, when you think about it: it's clear that the peak tps rate approaches 18k/s on these tests; right after a checkpoint, every update will force a full page write - that is, a WAL record of more than 8kB. So we'll fill up a 16MB WAL segment in about a tenth of a second. That doesn't leave much breathing room. I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] wal_buffers
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Bruce Momjian Added to TODO: Allow reporting of stalls due to wal_buffer wrap-around http://archives.postgresql.org/pgsql-hackers/2012-02/msg00826.php Doesn't this indicate that XLOG writing needs some tuning, such that when some threshold of buffers (say 2/3) are full, the LogWriter is triggered? --- On Sun, Feb 19, 2012 at 12:24:12AM -0500, Robert Haas wrote: Just for kicks, I ran two 30-minute pgbench tests at scale factor 300 tonight on Nate Boley's machine, with -n -l -c 32 -j 32. The configurations were identical, except that on one of them, I set wal_buffers=64MB.

It seemed to make quite a lot of difference:

wal_buffers not set (thus, 16MB): tps = 3162.594605 (including connections establishing)
wal_buffers=64MB: tps = 6164.194625 (including connections establishing)

Rest of config: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms

I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. This is not very surprising, when you think about it: it's clear that the peak tps rate approaches 18k/s on these tests; right after a checkpoint, every update will force a full page write - that is, a WAL record of more than 8kB. So we'll fill up a 16MB WAL segment in about a tenth of a second. That doesn't leave much breathing room. I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 7:16 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Mar 11, 2012 at 11:51 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Mar 11, 2012 at 12:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. Currently the max of wal_buffers is 16MB (i.e., the size of one WAL file) when it's set to -1. Thanks to your result, we should increase the max to 32MB? I think that might be a good idea, although I'm not entirely convinced that we understand why increasing wal_buffers is helping as much as it is. I stuck an elog() into AdvanceXLInsertBuffer() to complain in the case that we were writing buffers while holding the insert lock. Then I reran 30-minute tests with 32 clients, one with wal_buffers=16MB and the other with wal_buffers=32MB. On the 16MB test, the elog() fired 15 times in a single second shortly after the start of the test, and then 9 more times over the rest of the test. On the 32MB test, the elog() fired a total of 6 times over the course of the test. The first test got 14320 tps, while the second got 15026 tps.
I find that quite surprising, because although WAL buffer wraparound is certainly bad (on this test, it probably brings the system completely to a halt until fsync() finishes) it really shouldn't lock up the system for multiple seconds at a time. And yet that's what it would need to be doing to account for the tps discrepancy on this test, considering how rarely it occurs. Where did you put the elog? If you put it near TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START(), it is probably too late in the code. If someone else (like the background wal writer) is blocked on an fsync, then AdvanceXLInsertBuffer will block on getting the WALWriteLock. Once it gets the lock, it will likely find it no longer needs to do the write (because it was done by the thing that just blocked and then released it), but at that point the damage has already been done. The damage is not that it has to do a write, but that it had to block (indirectly) on an fsync which it didn't really care about. One possibility is that the writer should update xlogctl->LogwrtResult.Write between finishing the write and starting the fsync. That way an AdvanceXLInsertBuffer that arrives during the fsync could see that the data it needs to overwrite has already been written, even if not yet fsynced, and that would be enough to allow it to proceed. That would be ugly, because an AdvanceXLInsertBuffer that arrives during the write itself would check xlogctl and then block until the fsync finished, with no way to wake up as soon as the write part finished. So making it work cleanly would require a redesign of the whole locking scheme. Which the Scaling patch is already doing. Studying AdvanceXLInsertBuffer() a bit more, I'm wondering if the problem isn't so much - or isn't only - that it's expensive to write buffers while also holding WALInsertLock.
Maybe it's too expensive even to acquire WALWriteLock in the first place - that is, the real problem isn't so much the wraparound condition itself, but the expense of testing whether a possible wraparound has actually occurred. A quick test suggests that we acquire WALWriteLock here much more often than we actually write anything while holding it, and that we get a big burst of WALWriteLock acquisitions here immediately after a checkpoint. I don't have any proof that this is what's causing the tps drop with smaller wal_buffers, but I think there has to be something other than an actual wraparound condition causing problems here, because that just doesn't seem to happen frequently enough to be an issue. I think my analysis is pretty much a re-wording of yours, but I'd emphasize that getting the WALWriteLock is bad not just because they fight over the lock, but because someone else (probably the background wal writer) is camping out on the lock while doing an fsync. Cheers, Jeff
Re: [HACKERS] wal_buffers, redux
On Tue, Mar 13, 2012 at 11:18 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Mar 13, 2012 at 6:44 PM, Josh Berkus j...@agliodbs.com wrote: That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. Hmm. Do you have any ability to test on XFS? It seems I do.

XFS, with fsync = on: tps = 14746.687499 (including connections establishing)
XFS, with fsync = off: tps = 25121.876560 (including connections establishing)

No real dramatic difference there, maybe a bit slower. On further thought, it may be that this is just a simple case of too many checkpoints. With fsync=off, we don't have to actually write all that dirty data back to disk. I'm going to try cranking up checkpoint_segments and see what happens. OK, this is bizarre. I wiped out my XFS filesystem and put ext4 back, and look at this:

tps = 19105.740878 (including connections establishing)
tps = 19687.674409 (including connections establishing)

That's a jump of nearly a third from before. I'm not sure what's different. Nothing, AFAIK. I drop and recreate the database after every test run, so I don't see why this should be so much better, unless ext4 degrades over time (even though the FS is nearly empty, and I'm dropping the whole database after each test run). Then I tried it with checkpoint_segments=3000 rather than 300: tps = 26750.190469 (including connections establishing). Hmm, what happens with checkpoint_segments=3000 and fsync=off? tps = 30395.583366 (including connections establishing). Hmm, and what if I set checkpoint_segments=300 and fsync=off? tps = 26029.160919 (including connections establishing). Not sure what to make of all this, yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers, redux
On Wed, Mar 14, 2012 at 3:29 PM, Jeff Janes jeff.ja...@gmail.com wrote: I think my analysis is pretty much a re-wording of yours, but I'd emphasize that getting the WALWriteLock is bad not just because they fight over the lock, but because someone else (probably background wal writer) is camping out on the lock while doing an fsync. Yeah, good point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers, redux
On Tuesday, March 13, 2012 03:26:34 AM Robert Haas wrote: Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7 machine. 32 clients, 1800 seconds, scale factor 300, synchronous commit off. That graph makes me cringe because it's pretty representative of what I have seen in practice. Any chance we can get a ~10s average in there? I find it pretty hard to read anything more than "spikey" from it. What's the filesystem + options on this? Andres
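The ~10s smoothing Andres asks for can be produced from a pgbench per-transaction log (the -l option used in these runs) by bucketing transactions into 10-second windows. A minimal sketch, with the caveat that the log column layout varies across pgbench versions; this assumes the epoch-seconds timestamp is the fifth whitespace-separated field, as in the 9.x-era format:

```python
# Bucket a pgbench -l transaction log into 10-second windows and report the
# average tps per window.  Field positions are an assumption about the log
# format, not something specified in this thread.
from collections import Counter

def tps_by_window(lines, window=10, ts_field=4):
    buckets = Counter()
    for line in lines:
        sec = int(line.split()[ts_field])    # epoch-seconds column
        buckets[sec - sec % window] += 1     # snap to window start
    # average tps per window, in time order
    return [(start, n / window) for start, n in sorted(buckets.items())]

# Hypothetical log lines: client_id txn_no latency file_no epoch_sec usec
sample = ["0 1 1200 0 1331600000 123456",
          "0 2 1300 0 1331600003 654321",
          "1 1 1100 0 1331600011 000001"]
print(tps_by_window(sample))  # [(1331600000, 0.2), (1331600010, 0.1)]
```

Plotting these window averages instead of raw per-transaction points makes sustained dips (like wraparound stalls) visible through the spikes.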
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. If it increases the TPS of Nate@16MB, but doesn't change the other 3 situations much, then that suggests the IO system is driving it. Basically moving up to 32MB is partially inoculating against slow fsyncs upon log switch on that machine. I had the idea of running pg_test_fsync on each machine. Here, for what it's worth, are the numbers.

Nate Boley's box:

2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     116.604 ops/sec
        fdatasync                         108.654 ops/sec
        fsync                              20.234 ops/sec
        fsync_writethrough                            n/a
        open_sync                          17.979 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      39.833 ops/sec
        fdatasync                          58.072 ops/sec
        fsync                              19.756 ops/sec
        fsync_writethrough                            n/a
        open_sync                           8.425 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write open_sync sizes.)
         1 * 16kB open_sync write          17.408 ops/sec
         2 *  8kB open_sync writes          9.376 ops/sec
         4 *  4kB open_sync writes          4.912 ops/sec
         8 *  2kB open_sync writes          2.477 ops/sec
        16 *  1kB open_sync writes          1.244 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        write, fsync, close                19.818 ops/sec
        write, close, fsync                19.086 ops/sec

Non-Sync'ed 8kB writes:
        write                          101176.089 ops/sec

IBM POWER7 server:

2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 167.009 ops/sec fdatasync 167.091 ops/sec fsync 47.321 ops/sec fsync_writethroughn/a open_sync 53.735 ops/sec Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 68.891 ops/sec fdatasync 164.597 ops/sec fsync 50.334 ops/sec fsync_writethroughn/a open_sync 25.240 ops/sec Compare open_sync with different write sizes: (This is designed to compare the cost of writing 16kB in different write open_sync sizes.) 1 * 16kB open_sync write 53.305 ops/sec 2 * 8kB open_sync writes 23.863 ops/sec 4 * 4kB open_sync writes 12.500 ops/sec 8 * 2kB open_sync writes 4.734 ops/sec 16 * 1kB open_sync writes 3.063 ops/sec Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close48.730 ops/sec write, close, fsync48.463 ops/sec Non-Sync'ed 8kB writes: write 186465.321 ops/sec I can't rerun any more serious benchmarks on Nate Boley's box right now due to other activity on the box. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
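For a sense of scale, the ops/sec figures above invert directly into per-flush latency, which is the stall an inserter eats whenever WAL must be flushed before a buffer page can be reused. A quick conversion (a sketch using the numbers reported above):

```python
# Convert pg_test_fsync ops/sec into per-flush latency in milliseconds.
def flush_latency_ms(ops_per_sec: float) -> float:
    return 1000.0 / ops_per_sec

# Nate Boley's box: fdatasync vs fsync
print(round(flush_latency_ms(108.654), 1))  # ~9.2 ms per fdatasync
print(round(flush_latency_ms(20.234), 1))   # ~49.4 ms per fsync

# IBM POWER7 server: fdatasync
print(round(flush_latency_ms(167.091), 1))  # ~6.0 ms
```

A ~49 ms pause per fsync on the slower box makes it easy to see how a log-switch flush can stall every inserter behind it.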
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. I reran the 32-client benchmark on the IBM machine with fsync=off and got this: 32MB: tps = 26809.442903 (including connections establishing) 16MB: tps = 26651.320145 (including connections establishing) That's a speedup of nearly a factor of two over the earlier fsync=on runs, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers, redux
On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. I reran the 32-client benchmark on the IBM machine with fsync=off and got this: 32MB: tps = 26809.442903 (including connections establishing) 16MB: tps = 26651.320145 (including connections establishing) That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. And here's a tps plot with wal_buffers = 16MB, fsync = off. The performance still bounces up and down, so there's obviously some other factor contributing to latency spikes, but equally clearly, needing to wait for fsyncs makes it a lot worse. I bet if we could understand why that happens, we could improve things here a good deal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: tps-16MB.no-fsync.png
Re: [HACKERS] wal_buffers, redux
On Tue, Mar 13, 2012 at 4:55 AM, Andres Freund and...@anarazel.de wrote: On Tuesday, March 13, 2012 03:26:34 AM Robert Haas wrote: Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7 machine. 32 clients, 1800 seconds, scale factor 300, synchronous commit off. That graph makes me cringe because it's pretty representative of what I have seen in practice. Any chance we can get a ~10s average in there? I find it pretty hard to read anything more than "spiky" out of it. I tried that but don't find it an improvement; however I'm attaching an example for your consideration. What's the filesystem + options on this? ext4, default options. Fedora 16, 3.2.6-3.fc16.ppc64. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: tps-16MB-10s.png
Re: [HACKERS] wal_buffers, redux
That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. Hmm. Do you have any ability to test on XFS? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] wal_buffers, redux
On Wed, Mar 14, 2012 at 7:20 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. I reran the 32-client benchmark on the IBM machine with fsync=off and got this: 32MB: tps = 26809.442903 (including connections establishing) 16MB: tps = 26651.320145 (including connections establishing) That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. And here's a tps plot with wal_buffers = 16MB, fsync = off. The performance still bounces up and down, so there's obviously some other factor contributing to latency spikes Initialization of WAL files? Do the latency spikes disappear if you start the benchmark after preparing lots of recycled WAL files? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] wal_buffers, redux
On Tue, Mar 13, 2012 at 10:02 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Mar 14, 2012 at 7:20 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. I reran the 32-client benchmark on the IBM machine with fsync=off and got this: 32MB: tps = 26809.442903 (including connections establishing) 16MB: tps = 26651.320145 (including connections establishing) That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. And here's a tps plot with wal_buffers = 16MB, fsync = off. The performance still bounces up and down, so there's obviously some other factor contributing to latency spikes Initialization of WAL file? Do the latency spikes disappear if you start benchmark after you prepare lots of the recycled WAL files? The latency spikes seem to correspond to checkpoints, so I don't think that's it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers, redux
On Tue, Mar 13, 2012 at 6:44 PM, Josh Berkus j...@agliodbs.com wrote: That's a speedup of nearly a factor of two, so clearly fsync-related stalls are a big problem here, even with wal_buffers cranked up through the ceiling. Hmm. Do you have any ability to test on XFS? It seems I do. XFS, with fsync = on: tps = 14746.687499 (including connections establishing) XFS, with fsync = off: tps = 25121.876560 (including connections establishing) No real dramatic difference there, maybe a bit slower. On further thought, it may be that this is just a simple case of too many checkpoints. With fsync=off, we don't have to actually write all that dirty data back to disk. I'm going to try cranking up checkpoint_segments and see what happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
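Pushing checkpoints out past the benchmark window, as discussed here, would look roughly like this in postgresql.conf. These values are illustrative only (the thread's baseline was checkpoint_segments = 300 and checkpoint_timeout = 15min); the idea is just that both limits exceed a 30-minute run's WAL volume and duration:

```
checkpoint_segments = 1000          # more than a 30-minute run can fill
checkpoint_timeout = 60min          # longer than the benchmark itself
checkpoint_completion_target = 0.9
```

With both triggers out of reach, no checkpoint fires mid-run, isolating wal_buffers effects from checkpoint-induced write storms.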
Re: [HACKERS] wal_buffers, redux
On Sun, Mar 11, 2012 at 11:51 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Mar 11, 2012 at 12:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 wal_writer_delay = 20ms I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. Currently the max of wal_buffers is 16MB (i.e., the size of one WAL file) when it's set to -1. Thanks to your result, we should increase the max to 32MB? I think that might be a good idea, although I'm not entirely convinced that we understand why increasing wal_buffers is helping as much as it is. I stuck an elog() into AdvanceXLInsertBuffer() to complain in the case that we were writing buffers while holding the insert lock. Then, I reran 30-minute tests with 32 clients, one with wal_buffers=16MB and the other with wal_buffers=32MB. On the 16MB test, the elog() fired 15 times in a single second shortly after the start of the test, and then 9 more times over the rest of the test. On the 32MB test, the elog() fired a total of 6 times over the course of the test. The first test got 14320 tps, while the second got 15026 tps. I find that quite surprising, because although WAL buffer wraparound is certainly bad (on this test, it probably brings the system completely to a halt until fsync() finishes) it really shouldn't lock up the system for multiple seconds at a time. And yet that's what it would need to be doing to account for the tps discrepancy on this test, considering how rarely it occurs. Studying AdvanceXLInsertBuffer() a bit more, I'm wondering if the problem isn't so much - or isn't only - that it's expensive to write buffers while also holding WALInsertLock. Maybe it's too expensive even to acquire WALWriteLock in the first place - that is, the real problem isn't so much the wraparound condition itself, but the expense of testing whether a possible wraparound has actually occurred. A quick test suggests that we acquire WALWriteLock here much more often than we actually write anything while holding it, and that we get a big burst of WALWriteLock acquisitions here immediately after a checkpoint. I don't have any proof that this is what's causing the tps drop with smaller wal_buffers, but I think there has to be something other than an actual wraparound condition causing problems here, because that just doesn't seem to happen frequently enough to be an issue. Anyway, maybe none of that matters at the moment. Perhaps it's enough to know that wal_buffers > 16MB can help, and just bump up the maximum auto-tuned value a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
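The wraparound dynamic under discussion can be illustrated with a toy model: inserters fill pages of a ring buffer, a background writer normally keeps ahead, but an occasional slow fsync delays a flush. A larger ring absorbs the pause; a small one forces inserters to stall on a still-dirty page. This is an illustration only, not the actual AdvanceXLInsertBuffer() logic, and the flush cadence and fsync frequency are invented parameters:

```python
# Toy model of WAL-buffer wraparound.  Pages are 8kB, so buf_pages=2048 is a
# 16MB ring and buf_pages=4096 a 32MB one.  Every slow_nth background flush
# is stuck behind a slow fsync and clears nothing that cycle.
def simulate(buf_pages, total=100_000, flush_every=1536, slow_nth=4):
    pending = 0      # pages inserted but not yet written out
    flushes = 0
    stalls = 0       # times an inserter hit a still-dirty page and had to wait
    for i in range(1, total + 1):
        pending += 1
        if pending > buf_pages:          # wraparound: forced write, inserter waits
            stalls += 1
            pending = 0
        elif i % flush_every == 0:
            flushes += 1
            if flushes % slow_nth != 0:  # every slow_nth flush is delayed by fsync
                pending = 0
    return stalls

print(simulate(buf_pages=2048))   # 16MB ring: every slow fsync causes a stall
print(simulate(buf_pages=4096))   # 32MB ring: the same pauses are absorbed
```

The point of the model is only that doubling the ring removes the stalls caused by flush pauses shorter than the extra headroom, which matches the observed drop in elog() firings from 16MB to 32MB.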
Re: [HACKERS] wal_buffers, redux
On Sat, Mar 10, 2012 at 7:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 wal_writer_delay = 20ms I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Do you think the difference is in the CPU architecture, or the IO subsystem? Also, do you have the latency numbers? Cheers, Jeff
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 12:32 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Oh, right. I had forgotten how dramatic the changes were in those test runs. I guess I should be happy that the absolute numbers on this machine were as high as they were. This machine seems to be beating that one on every metric. Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Also, do you have the latency numbers? Not at the moment, but I'll generate them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Mar 12, 2012 at 12:32 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Nate Boley's machine, the difference was ~100% increase rather than ~10%. Oh, right. I had forgotten how dramatic the changes were in those test runs. I guess I should be happy that the absolute numbers on this machine were as high as they were. This machine seems to be beating that one on every metric. Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. On which machine was that happening? This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. If it increases the TPS of Nate@16MB, but doesn't change the other 3 situations much, then that suggests the IO system is driving it. Basically moving up to 32MB is partially inoculating against slow fsyncs upon log switch on that machine. Does the POWER7 have a nonvolatile cache? What happened with synchronous_commit=on? Also, since all data fits in shared_buffers, making checkpoint_segments and checkpoint_timeout be larger than the benchmark period should remove the only other source of writing from the system. With no checkpoints, no evictions, and no fsyncs, it is unlikely for the remaining IO to be the bottleneck. Cheers, Jeff
Re: [HACKERS] wal_buffers, redux
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote: Do you think the difference is in the CPU architecture, or the IO subsystem? That is an excellent question. I tried looking at vmstat output, but a funny thing kept happening: periodically, the iowait column would show a gigantic negative number instead of a number between 0 and 100. On which machine was that happening? The IBM server. This makes me a little chary of believing any of it. Even if I did, I'm not sure that would fully answer the question. So I guess the short answer is that I don't know, and I'm not even sure how I might go about figuring it out. Any ideas? Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both machines) with fsync=off (as well as synchronous_commit=off still) might help clarify things. If it increases the TPS of Nate@16MB, but doesn't change the other 3 situations much, then that suggests the IO system is driving it. Basically moving up to 32MB is partially inoculating against slow fsyncs upon log switch on that machine. Mmm, yeah. Although, I think it might have been 64MB rather than 32MB that I tested on that machine. Does the POWER7 have a nonvolatile cache? What happened with synchronous_commit=on? Haven't tried that yet. Also, since all data fits in shared_buffers, making checkpoint_segments and checkpoint_timeout be larger than the benchmark period should remove the only other source of writing from the system. With no checkpoints, no evictions, and no fsyncs, it is unlikely for the remaining IO to be the bottleneck. Another thing to test. Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7 machine. 32 clients, 1800 seconds, scale factor 300, synchronous commit off. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: tps-16MB.png attachment: tps-32MB.png
Re: [HACKERS] wal_buffers, redux
On Sun, Mar 11, 2012 at 12:55 PM, Robert Haas robertmh...@gmail.com wrote: I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 wal_writer_delay = 20ms I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below. Currently the max of wal_buffers is 16MB (i.e., the size of one WAL file) when it's set to -1. Thanks to your result, we should increase the max to 32MB? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
[HACKERS] wal_buffers, redux
I've finally been able to run some more tests of the effect of adjusting wal_buffers to values higher than 16MB. I ran the test on the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual configuration settings:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms

I ran three 30-minute tests at scale factor 300 with wal_buffers set at various values from 16MB up to 160MB, in multiples of 16MB, using pgbench with 32 clients and 32 threads in each case. The short version is that 32MB seems to be significantly better than 16MB, by about 1000 tps, and after that it gets murky; full results are below.

16MB tps = 13069.420658 (including connections establishing)
16MB tps = 14370.293228 (including connections establishing)
16MB tps = 13787.219680 (including connections establishing)
32MB tps = 14916.068968 (including connections establishing)
32MB tps = 14746.448728 (including connections establishing)
32MB tps = 15110.676467 (including connections establishing)
48MB tps = 15111.981870 (including connections establishing)
48MB tps = 12824.628192 (including connections establishing)
48MB tps = 15090.280081 (including connections establishing)
64MB tps = 15382.740815 (including connections establishing)
64MB tps = 12367.942312 (including connections establishing)
64MB tps = 15195.405382 (including connections establishing)
80MB tps = 15346.080326 (including connections establishing)
80MB tps = 12791.192216 (including connections establishing)
80MB tps = 14780.804054 (including connections establishing)
96MB tps = 15476.229392 (including connections establishing)
96MB tps = 15426.012162 (including connections establishing)
96MB tps = 15548.671849 (including connections establishing)
112MB tps = 15400.669675 (including connections establishing)
112MB tps = 15676.416378 (including connections establishing)
112MB tps = 15016.651083 (including connections establishing)
128MB tps = 15811.463522 (including connections establishing)
128MB tps = 15590.343669 (including connections establishing)
128MB tps = 15256.867665 (including connections establishing)
144MB tps = 15842.131696 (including connections establishing)
144MB tps = 15669.880640 (including connections establishing)
144MB tps = 15753.460908 (including connections establishing)
160MB tps = 15658.726637 (including connections establishing)
160MB tps = 15599.600752 (including connections establishing)
160MB tps = 15311.198480 (including connections establishing)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers
On Mon, Feb 20, 2012 at 4:10 AM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Feb 20, 2012 at 3:08 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Feb 19, 2012 at 9:46 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: On 19-02-2012 02:24, Robert Haas wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. I was thinking about that. I think that what might be more useful than a log message is a counter somewhere in shared memory. Logging imposes a lot of overhead, which is exactly what we don't want here, and the volume might be quite high on a system that is bumping up against this problem. Of course then the question is... how would we expose the counter value? There is no existing statistics view suitable to include such information. What about defining pg_stat_xlog or something? Perhaps pg_stat_perf so we don't need to find a new home every time. Thinking about it, I think renaming pg_stat_bgwriter would make more sense. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] wal_buffers
On Mon, Feb 20, 2012 at 3:59 AM, Simon Riggs si...@2ndquadrant.com wrote: There is no existing statistics view suitable to include such information. What about defining pg_stat_xlog or something? Perhaps pg_stat_perf so we don't need to find a new home every time. Thinking about it, I think renaming pg_stat_bgwriter would make more sense. When we created pg_stat_reset_shared(text), we seemed to be contemplating the idea of multiple sets of shared counters identified by names -- bgwriter for the background writer, and maybe other things for other subsystems. So we'd have to think about how to adjust that. I do agree with you that it seems a shame to invent a whole new view for one counter... Another thought is that I'm not sure it makes sense to run this through the stats system at all. We could regard it as a shared memory counter protected by one of the LWLocks involved, which would probably be quite a bit cheaper - just one machine instruction to increment it at need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers
On 19-02-2012 02:24, Robert Haas wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Re: [HACKERS] wal_buffers
On Sun, Feb 19, 2012 at 9:46 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: On 19-02-2012 02:24, Robert Haas wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. I was thinking about that. I think that what might be more useful than a log message is a counter somewhere in shared memory. Logging imposes a lot of overhead, which is exactly what we don't want here, and the volume might be quite high on a system that is bumping up against this problem. Of course then the question is... how would we expose the counter value? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] wal_buffers
Robert Haas robertmh...@gmail.com writes: On Sun, Feb 19, 2012 at 9:46 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. I was thinking about that. I think that what might be more useful than a log message is a counter somewhere in shared memory. Logging imposes a lot of overhead, which is exactly what we don't want here, and the volume might be quite high on a system that is bumping up against this problem. Of course then the question is... how would we expose the counter value? Why do you need a counter, other than the current LSN? Surely the number of WAL buffer ring cycles can be deduced directly from that. regards, tom lane
Re: [HACKERS] wal_buffers
On Sun, Feb 19, 2012 at 6:33 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Feb 19, 2012 at 9:46 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. I was thinking about that. I think that what might be more useful than a log message is a counter somewhere in shared memory. Logging imposes a lot of overhead, which is exactly what we don't want here, and the volume might be quite high on a system that is bumping up against this problem. Of course then the question is... how would we expose the counter value? Why do you need a counter, other than the current LSN? Surely the number of WAL buffer ring cycles can be deduced directly from that. The problem isn't how many times it's cycled; the issue is whether there was a wait induced by needing to flush wal buffers because of too many writes. You can't count those waits in the way you suggest; you can calculate an upper limit on them, but that's not very useful. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
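The two positions here can be put in arithmetic. The LSN tells you how many times the ring has been lapped, but that is only an upper bound on inserter waits, since a lap where the WAL writer stayed ahead costs nothing. A sketch (the 10 GB figure is an invented example, not from the thread):

```python
# Laps of the WAL buffer ring implied by LSN advance: an upper bound on
# the number of times an inserter could have waited, not a count of waits.
def ring_laps(wal_bytes_written: int, wal_buffers_bytes: int) -> int:
    return wal_bytes_written // wal_buffers_bytes

# Example: 10 GB of WAL through a 16MB ring.
laps = ring_laps(10 * 1024**3, 16 * 1024**2)
print(laps)  # 640 laps; actual waits could be anywhere from 0 to 640
```

That gap between the bound and the actual wait count is why a dedicated counter, bumped only when an inserter really blocks, carries information the LSN cannot.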
Re: [HACKERS] wal_buffers
On Mon, Feb 20, 2012 at 3:08 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Feb 19, 2012 at 9:46 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: On 19-02-2012 02:24, Robert Haas wrote: I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity Isn't it useful to print some messages on the log when we have wrap around? In this case, we have an idea that wal_buffers needs to be increased. I was thinking about that. I think that what might be more useful than a log message is a counter somewhere in shared memory. Logging imposes a lot of overhead, which is exactly what we don't want here, and the volume might be quite high on a system that is bumping up against this problem. Of course then the question is... how would we expose the counter value? There is no existing statistics view suitable to include such information. What about defining pg_stat_xlog or something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] wal_buffers
On 02/19/2012 12:24 AM, Robert Haas wrote: I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions. That's an easy enough idea to throw into my testing queue. The 16MB auto-tuning upper bound was just the easiest number to suggest that was obviously useful and unlikely to be wasteful. One of the reasons wal_buffers remains a user-visible parameter was that no one ever really did an analysis of what its useful upper bound was--and that number might move up as other bottlenecks are smashed too. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
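The auto-tuning formula being questioned, as I understand it from the thread, takes 1/32 of shared_buffers and clamps it to at most one 16MB WAL segment (with a small floor, 64kB in this sketch). This is an illustration of the rule under discussion, not the actual server code:

```python
# Sketch of the wal_buffers = -1 auto-tune rule: shared_buffers/32,
# clamped between a 64kB floor and one 16MB WAL segment.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

def autotuned_wal_buffers(shared_buffers_bytes: int) -> int:
    return max(64 * 1024, min(shared_buffers_bytes // 32, WAL_SEGMENT_SIZE))

# With the thread's shared_buffers = 8GB, the ratio alone would ask for
# 256MB, but the cap -- the upper bound being debated -- holds it at 16MB.
print(autotuned_wal_buffers(8 * 1024**3) // (1024 * 1024))  # 16
```

On every configuration used in these benchmarks, then, the auto-tuned value is pinned at the cap, which is exactly why raising the cap is the knob under discussion.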
[HACKERS] wal_buffers
Just for kicks, I ran two 30-minute pgbench tests at scale factor 300 tonight on Nate Boley's machine, with -n -l -c 32 -j 32. The configurations were identical, except that on one of them, I set wal_buffers=64MB. It seemed to make quite a lot of difference:

wal_buffers not set (thus, 16MB): tps = 3162.594605 (including connections establishing)
wal_buffers=64MB: tps = 6164.194625 (including connections establishing)

Rest of config: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms

I have attached tps scatterplots. The obvious conclusion appears to be that, with only 16MB of wal_buffers, the buffer wraps around with some regularity: we can't insert more WAL because the buffer we need to use still contains WAL that hasn't yet been fsync'd, leading to long stalls. More buffer space ameliorates the problem. This is not very surprising, when you think about it: it's clear that the peak tps rate approaches 18k/s on these tests; right after a checkpoint, every update will force a full page write - that is, a WAL record > 8kB. So we'll fill up a 16MB WAL segment in about a tenth of a second. That doesn't leave much breathing room. I think we might want to consider adjusting our auto-tuning formula for wal_buffers to allow for a higher cap, although this is obviously not enough data to draw any firm conclusions.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

attachment: tps-master-64mb.png
attachment: tps-master.png
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
Robert Haas wrote: On Sun, Apr 3, 2011 at 1:26 PM, Tom Lane wrote: It would probably take less than a day to flesh out this idea and make it happen, but it does seem like a rather large change for late alpha.

what we're trying to avoid is committing new stuff that may require additional cleanup, not cleaning up the stuff we already did commit. Once we get to beta I'll be less enthusiastic about making changes like this

+1 for fixing it, with full agreement with Robert's project management perspective on the issue. Having worked in this area a bit I definitely see the need in general, and for auto-tuning we pretty much have to do this to get it right. I think we should be edging into more auto-tuning capabilities as we figure them out, making this all the more important.

-Kevin
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
I wrote: IMO the real problem is essentially that GUC assign hooks have two functions, checking and canonicalization of the value-to-be-stored versus executing secondary actions when an assignment is made; and there's no way to get at just the first one. So we cannot canonicalize the value first and then see if it's equal to the current setting. I think the only real fix is to change the API for assign hooks. This is a bit annoying but it's not like we haven't ever done that before. I'm thinking about splitting assign hooks into two functions, along the lines of

bool check_hook (datatype *new_value, GucSource source)
bool assign_hook (datatype new_value, GucSource source)

After perusing the existing assign_hook functions, I have some further thoughts about this proposal.

* Many of the existing assign hooks do a nontrivial amount of computation (such as catalog lookups) to derive internal values from the presented string; examples include assign_temp_tablespaces and assign_timezone. A naive implementation of the above design would require the assign_hook to repeat this computation after the check_hook already did it, which of course is undesirable.

* Assign hooks that do catalog lookups need special pushups to avoid having to do such lookups while restoring a previous GUC setting during transaction abort (since lookups can't safely be performed in an aborted transaction). Up to now we've used ad-hoc techniques for each such variable, as seen for instance in assign_session_authorization. The usual idea is to modify the original string to store additional data, which then requires a custom show_hook to ensure only the original part of the string is shown to users. The messiest aspect of this is that it must be possible to reliably tell a modified string from original user input.

I think that we can avoid the first problem and clean up the second problem if we do this:

1. Code guc.c so that the check_hook is only applied to original user input. When restoring a previous setting during abort (which necessarily will have been passed through the check_hook at an earlier time), only call the assign_hook.

2. Guarantee that the string pointer passed to the assign_hook is exactly what was returned by the check_hook, ie, guc.c is not allowed to duplicate or copy that string.

Given these rules, a check_hook and assign_hook could cooperate to store additional data in what guc.c thinks is just a pointer to a string value, ie, there can be more data after the terminating \0. The assign_hook could work off just this additional data without ever doing a catalog lookup. No special show_hook is needed.

Of course this only works for string GUC variables, but I'm finding it hard to visualize a case where a bool, int, float, or enum GUC could need a catalog lookup to interpret. We could possibly legislate that all of these are separately malloc'd to allow the same type of trick to be applied across the board, but I think that's overkill. We can just tell people they must use a string GUC if they need hidden data. This technique would need documentation of course, but at least it *would* be documented rather than ad-hoc for each variable.

Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

Another thing I was reminded of while perusing the code is the comment for GUC_complaint_elevel:

* At some point it'd be nice to replace this with a mechanism that allows
* the custom message to become the DETAIL line of guc.c's generic message.

The reason we invented GUC_complaint_elevel in the first place was to avoid a change in the signatures of assign hooks. If we are making such a change, now would be the time to fix it, because we're never gonna fix it otherwise. I see a few ways we could do it:

1. Add a char **errdetail parameter to assign_hooks, which guc.c would initialize to NULL before call. If the hook stores a non-null pointer there, guc.c would include that string as errdetail. This is the least effort from guc.c's viewpoint, but will be relatively painful to use from the hook's standpoint, because it'd generally have to palloc some space, or maybe even set up a StringInfo buffer to contain the generated message.

2. Add a char *errdetail parameter to assign_hooks, which points at a local-variable buffer in the calling routine, of a well-known size (think GUC_ERRDETAIL_BUFSIZE macro in guc.h). Then hooks do something like snprintf(errdetail, GUC_ERRDETAIL_BUFSIZE, _(format), ...); to return an error detail string.

3. Create a function in guc.c for hooks to call, along the
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
On Mon, Apr 4, 2011 at 2:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:

I wrote: IMO the real problem is essentially that GUC assign hooks have two functions, checking and canonicalization of the value-to-be-stored versus executing secondary actions when an assignment is made; and there's no way to get at just the first one. So we cannot canonicalize the value first and then see if it's equal to the current setting. I think the only real fix is to change the API for assign hooks. This is a bit annoying but it's not like we haven't ever done that before. I'm thinking about splitting assign hooks into two functions, along the lines of

bool check_hook (datatype *new_value, GucSource source)
bool assign_hook (datatype new_value, GucSource source)

After perusing the existing assign_hook functions, I have some further thoughts about this proposal.

* Many of the existing assign hooks do a nontrivial amount of computation (such as catalog lookups) to derive internal values from the presented string; examples include assign_temp_tablespaces and assign_timezone. A naive implementation of the above design would require the assign_hook to repeat this computation after the check_hook already did it, which of course is undesirable.

* Assign hooks that do catalog lookups need special pushups to avoid having to do such lookups while restoring a previous GUC setting during transaction abort (since lookups can't safely be performed in an aborted transaction). Up to now we've used ad-hoc techniques for each such variable, as seen for instance in assign_session_authorization. The usual idea is to modify the original string to store additional data, which then requires a custom show_hook to ensure only the original part of the string is shown to users. The messiest aspect of this is that it must be possible to reliably tell a modified string from original user input.

I think that we can avoid the first problem and clean up the second problem if we do this:

1. Code guc.c so that the check_hook is only applied to original user input. When restoring a previous setting during abort (which necessarily will have been passed through the check_hook at an earlier time), only call the assign_hook.

2. Guarantee that the string pointer passed to the assign_hook is exactly what was returned by the check_hook, ie, guc.c is not allowed to duplicate or copy that string.

Given these rules, a check_hook and assign_hook could cooperate to store additional data in what guc.c thinks is just a pointer to a string value, ie, there can be more data after the terminating \0. The assign_hook could work off just this additional data without ever doing a catalog lookup. No special show_hook is needed.

The only thing this proposal has to recommend it is that the current coding is even worse.

Of course this only works for string GUC variables, but I'm finding it hard to visualize a case where a bool, int, float, or enum GUC could need a catalog lookup to interpret. We could possibly legislate that all of these are separately malloc'd to allow the same type of trick to be applied across the board, but I think that's overkill. We can just tell people they must use a string GUC if they need hidden data. This technique would need documentation of course, but at least it *would* be documented rather than ad-hoc for each variable.

Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

Another thing I was reminded of while perusing the code is the comment for GUC_complaint_elevel:

* At some point it'd be nice to replace this with a mechanism that allows
* the custom message to become the DETAIL line of guc.c's generic message.

The reason we invented GUC_complaint_elevel in the first place was to avoid a change in the signatures of assign hooks. If we are making such a change, now would be the time to fix it, because we're never gonna fix it otherwise. I see a few ways we could do it:

1. Add a char **errdetail parameter to assign_hooks, which guc.c would initialize to NULL before call. If the hook stores a non-null pointer there, guc.c would include that string as errdetail. This is the least effort from guc.c's viewpoint, but will be relatively painful to use from the hook's standpoint, because it'd generally have to palloc some space, or maybe even set up a StringInfo buffer to contain the generated message.

2. Add a char *errdetail parameter to assign_hooks, which points at a local-variable buffer in the calling routine, of a well-known size (think
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
Robert Haas robertmh...@gmail.com writes:

On Mon, Apr 4, 2011 at 2:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Given these rules, a check_hook and assign_hook could cooperate to store additional data in what guc.c thinks is just a pointer to a string value, ie, there can be more data after the terminating \0. The assign_hook could work off just this additional data without ever doing a catalog lookup. No special show_hook is needed.

The only thing this proposal has to recommend it is that the current coding is even worse.

Well, if you don't like that, do you like this one?

Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

If not, what do you suggest?

regards, tom lane
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
On Mon, Apr 4, 2011 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes: On Mon, Apr 4, 2011 at 2:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Given these rules, a check_hook and assign_hook could cooperate to store additional data in what guc.c thinks is just a pointer to a string value, ie, there can be more data after the terminating \0. The assign_hook could work off just this additional data without ever doing a catalog lookup. No special show_hook is needed.

The only thing this proposal has to recommend it is that the current coding is even worse.

Well, if you don't like that, do you like this one?

To be clear, it's certainly an improvement over what we have now.

Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

I haven't really got the mental energy to think through all of this right now in detail, but I think that might be better. I think there's more kludgery here than we're going to fix in one pass, so as long as we're making improvements, I'm happy. Is there any case for using a Datum rather than a void * so people can pack a short quantity in directly without allocating memory, or are we expecting this to always be (say) a struct pointer?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
Robert Haas robertmh...@gmail.com writes:

On Mon, Apr 4, 2011 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

I haven't really got the mental energy to think through all of this right now in detail, but I think that might be better. I think there's more kludgery here than we're going to fix in one pass, so as long as we're making improvements, I'm happy. Is there any case for using a Datum rather than a void * so people can pack a short quantity in directly without allocating memory, or are we expecting this to always be (say) a struct pointer?

Well, I was intending to insist that the void* parameter point to a single malloc'd block, so that guc.c could release it when the value was no longer of interest by doing free(). If we don't say that, then we are going to need a free hook for those objects, which is surely way more notational overhead than is likely to be repaid for the occasional cases where a single OID or whatever would be sufficient info.

regards, tom lane
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
On Mon, Apr 4, 2011 at 3:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes: On Mon, Apr 4, 2011 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another variant would be to allow the check_hook to pass back a separate void * value that could be passed on to the assign_hook, containing any necessary derived data. This is logically a bit cleaner, and would work for all types of GUC variables; but it would make things messier in guc.c since there would be an additional value to pass around. I'm not convinced it's worth that, but could be talked into it if anyone feels strongly about it.

I haven't really got the mental energy to think through all of this right now in detail, but I think that might be better. I think there's more kludgery here than we're going to fix in one pass, so as long as we're making improvements, I'm happy. Is there any case for using a Datum rather than a void * so people can pack a short quantity in directly without allocating memory, or are we expecting this to always be (say) a struct pointer?

Well, I was intending to insist that the void* parameter point to a single malloc'd block, so that guc.c could release it when the value was no longer of interest by doing free(). If we don't say that, then we are going to need a free hook for those objects, which is surely way more notational overhead than is likely to be repaid for the occasional cases where a single OID or whatever would be sufficient info.

OK. Please comment the crap out of whatever you do, or maybe even add a README. This stuff is just a bit arcane, and guideposts help a lot.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
Robert Haas robertmh...@gmail.com writes: OK. Please comment the crap out of whatever you do, or maybe even add a README. This stuff is just a bit arcane, and guideposts help a lot.

We already have a README for that ;-). PFA, a patch to src/backend/utils/misc/README describing the proposed revised API. If nobody has any objections, I'll get on with making this happen.

regards, tom lane

diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README
index 881862a30b182a94cd71c6ed1a98dd6e6d5e51e0..221f595f8b72eb1cf48d318beb4aa8d3cb2b3411 100644
*** a/src/backend/utils/misc/README
--- b/src/backend/utils/misc/README
*************** determining which setting is used.
*** 12,68 ****
  Per-Variable Hooks
  ------------------

! Each variable known to GUC can optionally have an assign_hook and/or
! a show_hook to provide customized behavior. Assign hooks are used to
! perform validity checking on variable values (above and beyond what
! GUC can do). They are also used to update any derived state that needs
! to change when a GUC variable is set. Show hooks are used to modify
! the default SHOW display for a variable.

  If an assign_hook is provided, it points to a function of the signature
! 	bool assign_hook(newvalue, bool doit, GucSource source)
! where the type of newvalue matches the kind of variable. This function
! is called immediately before actually setting the variable's value (so it
! can look at the actual variable to determine the old value). If the
! function returns true then the assignment is completed; if it returns
! false then newvalue is considered invalid and the assignment is not
! performed. If doit is false then the function should simply check
! validity of newvalue and not change any derived state. The source parameter
! indicates where the new value came from. If it is >= PGC_S_INTERACTIVE,
! then we are performing an interactive assignment (e.g., a SET command), and
! ereport(ERROR) is safe to do. But when source < PGC_S_INTERACTIVE, we are
! reading a non-interactive option source, such as postgresql.conf. In this
! case the assign_hook should *not* ereport but should just return false if it
! doesn't like the newvalue.
!
! If an assign_hook returns false then guc.c will report a generic "invalid
! value for option FOO" error message. If you feel the need to provide a more
! specific error message, ereport() it using GUC_complaint_elevel(source)
! as the error level. Note that this might return either ERROR or a lower level
! such as LOG, so the ereport call might or might not return. If it does
! return, return false out of the assign_hook.
!
! For string variables, the signature for assign hooks is a bit different:
! 	const char *assign_hook(const char *newvalue,
! 	                        bool doit,
! 	                        GucSource source)
! The meanings of the parameters are the same as for the other types of GUC
! variables, but the return value is handled differently:
! 	NULL --- assignment fails (like returning false for other datatypes)
! 	newvalue --- assignment succeeds, assign the newvalue as-is
! 	malloc'd (not palloc'd!!!) string --- assign that value instead
! The third choice is allowed in case the assign_hook wants to return a
! canonical version of the new value. For example, the assign_hook for
! datestyle always returns a string that includes both output and input
! datestyle options, although the input might have specified only one.

- Note that a string variable's assign_hook will NEVER be called with a NULL
- value for newvalue, since there would be no way to distinguish success
- and failure returns. If the boot_val or reset_val for a string variable
- is NULL, it will just be assigned without calling the assign_hook.
- Therefore, a NULL boot_val should never be used in combination with an
- assign_hook that has side-effects, as the side-effects wouldn't happen
- during a RESET that re-institutes the boot-time setting.

  If a show_hook is provided, it points to a function of the signature
  	const char *show_hook(void)
--- 12,108 ----
  Per-Variable Hooks
  ------------------

! Each variable known to GUC can optionally have a check_hook, an
! assign_hook, and/or a show_hook to provide customized behavior.
! Check hooks are used to perform validity checking on variable values
! (above and beyond what GUC can do), to compute derived settings when
! nontrivial work is needed to do that, and optionally to canonicalize
! user-supplied values. Assign hooks are used to update any derived state
! that needs to change when a GUC variable is set. Show hooks are used to
! modify the default SHOW display for a variable.
!
!
! If a check_hook is provided, it points to a function of the signature
! 	bool check_hook(datatype *newvalue, void **extra, GucSource source)
! The newvalue argument is of type bool *, int *, double *, or char **
! for bool, int/enum, float, or string variables respectively. The check
! function should validate the
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
On Mon, Apr 4, 2011 at 4:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes: OK. Please comment the crap out of whatever you do, or maybe even add a README. This stuff is just a bit arcane, and guideposts help a lot.

We already have a README for that ;-). PFA, a patch to src/backend/utils/misc/README describing the proposed revised API. If nobody has any objections, I'll get on with making this happen.

Looks reasonable to me.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
I wrote: Robert Haas robertmh...@gmail.com writes: I had intended to commit Greg's patch with a show hook and an assign hook and the calculated value stored separately from the GUC variable, which I believe would avoid this problem, but Tom thought this way was better. Unfortunately, my knowledge of the GUC machinery is insufficient to think of a way to avoid it, other than the one Tom already rejected.

Mph ... I think this shows the error of my thinking :-(. We at least need an assign hook here. Will fix, since it was my oversight to begin with.

After thinking about this for awhile, I think the fundamental problem is in the is_newvalue_equal() function, which as its comment states is pretty cheesy:

/*
 * Attempt (badly) to detect if a proposed new GUC setting is the same
 * as the current value.
 *
 * XXX this does not really work because it doesn't account for the
 * effects of canonicalization of string values by assign_hooks.
 */

If you hold your head at a funny angle you can see replacement of -1 with a suitable default value as a form of canonicalization, so the behavior Bernd complained of is exactly what the comment is talking about. We've not had complaints previously because is_newvalue_equal() is only used for PGC_POSTMASTER variables, and few of those have assign hooks that do any canonicalization. But it is possible to trigger the problem with unix_socket_directory, for example: try setting it to something like '/tmp/bogus/..', and you'll see that pg_ctl reload triggers a log message:

LOG: parameter "unix_socket_directory" cannot be changed without restarting the server

Robert had suggested fixing this by kluging up wal_buffers' assign and show hooks, but IIRC he never actually got that to work; I doubt it's possible to make it work in the EXEC_BACKEND case without some additional hacks to get the state to be properly replicated into child processes. Even if we could make it work, wal_buffers is not likely to be the last variable that we'll want to allow auto-tuning of. So I think we ought to address the underlying problem instead of trying to work around it in the variable-specific code for wal_buffers.

IMO the real problem is essentially that GUC assign hooks have two functions, checking and canonicalization of the value-to-be-stored versus executing secondary actions when an assignment is made; and there's no way to get at just the first one. So we cannot canonicalize the value first and then see if it's equal to the current setting. I think the only real fix is to change the API for assign hooks. This is a bit annoying but it's not like we haven't ever done that before. I'm thinking about splitting assign hooks into two functions, along the lines of

bool check_hook (datatype *new_value, GucSource source)
bool assign_hook (datatype new_value, GucSource source)

check_hook would validate the new value, and possibly change it (hence the pass-by-reference parameter). assign_hook would only be responsible for executing secondary actions needed when an assignment is done. The doit flag can go away since we'd not call the assign_hook at all unless we actually want the assignment to occur. I think most of the existing uses might only need one or the other of these hooks, but haven't gone through them yet. It might be appropriate to change the signatures some more while we're at it, in particular pass the desired elog message level explicitly instead of making hooks infer it from the source parameter.

It would probably take less than a day to flesh out this idea and make it happen, but it does seem like a rather large change for late alpha. If people don't want to do this now, I suggest that we just live with the problem for 9.1. It's purely cosmetic, and easy enough to work around (just don't uncomment the default value).

regards, tom lane
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
On Sun, Apr 3, 2011 at 1:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: IMO the real problem is essentially that GUC assign hooks have two functions, checking and canonicalization of the value-to-be-stored versus executing secondary actions when an assignment is made; and there's no way to get at just the first one.

Yes, I think that's right. A related point is that the API for assign hooks is not consistent across all data types: string assign hooks can return a replacement value, whereas everyone else can only succeed or fail.

It would probably take less than a day to flesh out this idea and make it happen, but it does seem like a rather large change for late alpha. If people don't want to do this now, I suggest that we just live with the problem for 9.1. It's purely cosmetic, and easy enough to work around (just don't uncomment the default value).

I think it's a pretty ugly wart, so I'm inclined to say go ahead and fix it. I'm not sure what alpha is for if it's not cleaning up the detritus of all the stuff we've committed in the last 9 months. AIUI, what we're trying to avoid is committing new stuff that may require additional cleanup, not cleaning up the stuff we already did commit. Once we get to beta I'll be less enthusiastic about making changes like this, but I think most of the testing that will get done is still in front of us.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: GUC assign hooks (was Re: [HACKERS] wal_buffers = -1 and SIGHUP)
Robert Haas robertmh...@gmail.com writes:

On Sun, Apr 3, 2011 at 1:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: IMO the real problem is essentially that GUC assign hooks have two functions, checking and canonicalization of the value-to-be-stored versus executing secondary actions when an assignment is made; and there's no way to get at just the first one.

Yes, I think that's right. A related point is that the API for assign hooks is not consistent across all data types: string assign hooks can return a replacement value, whereas everyone else can only succeed or fail.

Right. In the original design we only foresaw the need to canonicalize string values, so that's why it's like that. This change will make it more consistent.

regards, tom lane
Re: [HACKERS] wal_buffers = -1 and SIGHUP
Robert Haas robertmh...@gmail.com writes:

On Thu, Mar 31, 2011 at 8:38 AM, Bernd Helmle maili...@oopsware.de wrote: This might be nitpicking (or i'm currently missing something), but i recognized that setting wal_buffers = -1 always triggers the following on reload, even if nothing about wal_buffers had changed:

$ pg_ctl reload
LOG: received SIGHUP, reloading configuration files
LOG: parameter "wal_buffers" cannot be changed without restarting the server

This only happens when you have wal_buffers set to -1.

This is a bug. The root cause is that, on startup, we do this:

if (xbuffers != XLOGbuffers)
{
    snprintf(buf, sizeof(buf), "%d", xbuffers);
    SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
}

I had intended to commit Greg's patch with a show hook and an assign hook and the calculated value stored separately from the GUC variable, which I believe would avoid this problem, but Tom thought this way was better. Unfortunately, my knowledge of the GUC machinery is insufficient to think of a way to avoid it, other than the one Tom already rejected.

Mph ... I think this shows the error of my thinking :-(. We at least need an assign hook here. Will fix, since it was my oversight to begin with.

regards, tom lane
[HACKERS] wal_buffers = -1 and SIGHUP
This might be nitpicking (or i'm currently missing something), but i recognized that setting wal_buffers = -1 always triggers the following on reload, even if nothing about wal_buffers had changed:

$ pg_ctl reload
LOG: received SIGHUP, reloading configuration files
LOG: parameter "wal_buffers" cannot be changed without restarting the server

This only happens when you have wal_buffers set to -1.

-- Thanks Bernd
Re: [HACKERS] wal_buffers = -1 and SIGHUP
On Thu, Mar 31, 2011 at 8:38 AM, Bernd Helmle maili...@oopsware.de wrote: This might be nitpicking (or i'm currently missing something), but i recognized that setting wal_buffers = -1 always triggers the following on reload, even if nothing about wal_buffers had changed:

$ pg_ctl reload
LOG: received SIGHUP, reloading configuration files
LOG: parameter "wal_buffers" cannot be changed without restarting the server

This only happens when you have wal_buffers set to -1.

This is a bug. The root cause is that, on startup, we do this:

if (xbuffers != XLOGbuffers)
{
    snprintf(buf, sizeof(buf), "%d", xbuffers);
    SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
}

I had intended to commit Greg's patch with a show hook and an assign hook and the calculated value stored separately from the GUC variable, which I believe would avoid this problem, but Tom thought this way was better. Unfortunately, my knowledge of the GUC machinery is insufficient to think of a way to avoid it, other than the one Tom already rejected.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company