Hi, Le 15 oct. 2014 22:25, "Guillaume Lelarge" <guilla...@lelarge.info> a écrit : > > 2014-10-15 22:11 GMT+02:00 Jeff Janes <jeff.ja...@gmail.com>: >> >> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge < guilla...@lelarge.info> wrote: >>> >>> Hi, >>> >>> As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0. >>> >>> We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula: >>> >>> greatest( >>> (2 + checkpoint_completion_target) * checkpoint_segments + 1, >>> checkpoint_segments + wal_keep_segments + 1 >>> ) >> >> >> I think the first bug is even having this formula in the documentation to start with, and in trying to use it. >> > > I agree. But we have customers asking how to compute the right size for their WAL file system partitions. Right size is usually a euphemism for smallest size, and they usually tend to get it wrong, leading to huge issues. And I'm not even speaking of monitoring, and alerting. > > A way to avoid this issue is probably to erase the formula from the documentation, and find a new way to explain them how to size their partitions for WALs. > > Monitoring is another matter, and I don't really think a monitoring solution should count the WAL files. What actually really matters is the database availability, and that is covered with having enough disk space in the WALs partition. > >> "and will normally not be more than..." >> >> This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past. >> > > Agreed. > >> >> So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct. >> > > Agreed too. > >> >> I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer. >> > > Thanks. I look forward reading what you found. > > What seems clear to me right now is that no one has a sane explanation of the formula. Though yours definitely made sense, it didn't seem to be what the code does. >
Did you find time to work on this? Any news? Thanks.