Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

Guillaume Lelarge Sun, 02 Nov 2014 23:01:28 -0800

Hi,

On 15 Oct 2014 at 22:25, "Guillaume Lelarge" <[email protected]> wrote:
>
> 2014-10-15 22:11 GMT+02:00 Jeff Janes <[email protected]>:
>>
>> On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <
[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> As part of our monitoring work for our customers, we stumbled upon an
issue with our customers' servers that have a wal_keep_segments setting
higher than 0.
>>>
>>> We have a monitoring script that checks the number of WAL files in the
pg_xlog directory, according to the setting of three parameters
(checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
We usually add a percentage to the usual formula:
>>>
>>> greatest(
>>>   (2 + checkpoint_completion_target) * checkpoint_segments + 1,
>>>   checkpoint_segments + wal_keep_segments + 1
>>> )
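
For anyone who wants to play with the numbers, here is a minimal Python sketch of the formula quoted just above. The names max_wal_files, wal_bytes and margin_pct are hypothetical and not part of PostgreSQL. The sketch only transcribes the documented formula, which, as discussed in this thread, may not match what the server actually does. The 16 MB segment size is the default build setting and is an assumption here.

```python
import math

# Default WAL segment size; assumed here, since it can be changed at build time.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def max_wal_files(checkpoint_segments, checkpoint_completion_target,
                  wal_keep_segments, margin_pct=0.0):
    """Estimate the WAL file count from the documented formula.

    Hypothetical helper: it transcribes the formula only, not the server
    code. margin_pct is the extra percentage added on top of the estimate.
    """
    checkpoint_bound = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    keep_bound = checkpoint_segments + wal_keep_segments + 1
    estimate = max(checkpoint_bound, keep_bound)
    # Round up: a fractional file still occupies a whole segment on disk.
    return math.ceil(estimate * (1 + margin_pct / 100))


def wal_bytes(n_files, segment_bytes=WAL_SEGMENT_BYTES):
    """Disk space taken by n_files WAL segments of segment_bytes each."""
    return n_files * segment_bytes


if __name__ == "__main__":
    n = max_wal_files(checkpoint_segments=3,
                      checkpoint_completion_target=0.5,
                      wal_keep_segments=0)
    print(n, "files,", wal_bytes(n) // (1024 * 1024), "MB")
```

With checkpoint_segments = 3, checkpoint_completion_target = 0.5 and wal_keep_segments = 0, the first term wins (8.5, rounded up to 9 files, or 144 MB). Once wal_keep_segments is large, the second term takes over.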
>>
>>
>> I think the first bug is even having this formula in the documentation
to start with, and in trying to use it.
>>
>
> I agree. But we have customers asking how to compute the right size for
their WAL file system partitions. Right size is usually a euphemism for
smallest size, and they usually tend to get it wrong, leading to huge
issues. And I'm not even speaking of monitoring, and alerting.
>
> A way to avoid this issue is probably to remove the formula from the
documentation, and find a new way to explain to them how to size their
partitions for WALs.
>
> Monitoring is another matter, and I don't really think a monitoring
solution should count the WAL files. What really matters is the
database availability, and that is covered by having enough disk space in
the WAL partition.
>
>> "and will normally not be more than..."
>>
>> This may be "normal" for a toy system.  I think that the normal state
for any system worth monitoring is that it has had load spikes at some
point in the past.
>>
>
> Agreed.
>
>>
>> So it is the next part of the doc, which describes how many segments it
climbs back down to upon recovering from a spike, which is the important
one.  And that doesn't mention wal_keep_segments at all, which surely
cannot be correct.
>>
>
> Agreed too.
>
>>
>> I will try to independently derive the correct formula from the code, as
you did, without looking too much at your derivation first, and see if we
get the same answer.
>>
>
> Thanks. I look forward to reading what you find.
>
> What seems clear to me right now is that no one has a sane explanation of
the formula. Though yours definitely made sense, it didn't seem to be what
the code does.
>


Did you find time to work on this? Any news?

Thanks.
