On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> > David Pacheco wrote:
> >> tl;dr: We've found that under many conditions, PostgreSQL's re-use of
> >> old WAL files appears to significantly degrade query latency on ZFS.
> >> The reason is complicated and I have details below.  Has it been
> >> considered to make this behavior tunable, to cause PostgreSQL to always
> >> create new WAL files instead of re-using old ones?
>
> > I don't think this has ever been proposed, because there was no use case
> > for it.  Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure.  Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems.  I have no idea what such a fix would
> look like :-(

I think I do understand, but as you've observed, recycling WAL files to
avoid allocation relies on implementation details of the filesystem --
an assumption I'd expect to fail on any copy-on-write filesystem.  On
such systems, there may be no way to avoid ENOSPC inside these critical
sections.  (And that's not necessarily such a big deal -- to paraphrase
a colleague, ensuring that the system doesn't run out of space doesn't
seem like a particularly surprising or heavy burden to place on the
operator.  It's great that PostgreSQL can survive this event better on
some systems, but the associated tradeoffs may not be worthwhile for
everybody.)  Given that, it seems worthwhile to offer operators an
option where they take on the risk that the database might crash if it
runs out of space (assuming the result isn't data corruption) in
exchange for a potentially tremendous improvement in tail latency and
overall throughput.
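
To make that concrete, here's a rough sketch in plain C of the "always
create a new segment" path -- not PostgreSQL's actual code; the function
name and constants are mine.  The file is created and zero-filled up
front, so any ENOSPC is reported during preallocation rather than inside
a commit-critical WAL write.  On a COW filesystem this gives up nothing,
since overwriting a recycled segment allocates fresh blocks anyway:

    /*
     * Hypothetical sketch: provision a brand-new WAL segment instead of
     * recycling an old one.  Zero-filling the file up front means any
     * ENOSPC is reported here, outside the critical section where a
     * failed WAL write would force a PANIC.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define WAL_SEG_SIZE  (16 * 1024 * 1024)    /* default segment size */
    #define ZERO_BUF_SIZE (128 * 1024)

    int
    create_wal_segment(const char *path)
    {
        size_t  off;
        char   *zeros;
        int     fd;

        fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;

        zeros = calloc(1, ZERO_BUF_SIZE);
        if (zeros == NULL)
            goto fail;

        /* Zero-fill the whole segment; ENOSPC (if any) surfaces here. */
        for (off = 0; off < WAL_SEG_SIZE; off += ZERO_BUF_SIZE)
        {
            if (write(fd, zeros, ZERO_BUF_SIZE) != (ssize_t) ZERO_BUF_SIZE)
            {
                free(zeros);
                goto fail;
            }
        }
        free(zeros);

        /* Make the zero fill durable before the segment is put to use. */
        if (fsync(fd) != 0)
            goto fail;

        return close(fd);

    fail:
        close(fd);
        unlink(path);
        return -1;
    }

    int
    main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s new-segment-path\n", argv[0]);
            return 1;
        }
        return create_wal_segment(argv[1]) == 0 ? 0 : 1;
    }

A real patch would of course go through PostgreSQL's own file APIs and a
GUC rather than raw syscalls; the point is only where the failure is
allowed to happen.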

To quantify this: in a recent incident, transaction latency on the primary
was degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with
outliers exceeding 1s).  Over 95% of the outliers above 1s spent over 90%
of their time blocked on synchronous replication (based on tracing with
DTrace).  On the synchronous standby, almost 10% of the WAL receiver's wall
clock time was spent blocked on disk reads in this read-modify-write path.
The rest of the time was essentially idle -- there was plenty of headroom
in other dimensions (CPU, synchronous write performance).
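
For the curious, the measurement was along these lines -- a simplified
sketch, not the script we actually ran, and the full analysis also had
to attribute in-kernel reads, which the syscall provider alone won't
show.  This just sums the wall clock time one process spends blocked in
read(2)/write(2):

    /*
     * blocked.d: per-syscall blocked time for one process.
     * Usage: dtrace -s blocked.d -p <pid of WAL receiver>
     */
    syscall::read:entry,
    syscall::write:entry
    /pid == $target/
    {
        self->ts = timestamp;
    }

    syscall::read:return,
    syscall::write:return
    /self->ts/
    {
        @blocked_ns[probefunc] = sum(timestamp - self->ts);
        self->ts = 0;
    }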

Thanks,
Dave
