On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Ouch.  If a process exits -- say, because the user typed \q into psql
> -- then you're talking about potentially calling fsync() on a really
> large number of file descriptors, flushing many gigabytes of data to
> disk.  And it may well be that you never actually wrote any data to
> any of those file descriptors -- those writes could have come from
> other backends.  Or you may have written a little bit of data through
> those FDs, but there could be lots of other data that you end up
> flushing incidentally.  Perfectly innocuous things like starting up a
> backend, running a few short queries, and then having that backend
> exit suddenly turn into something that could have a massive
> system-wide performance impact.
>
> Also, if a backend ever manages to exit without running through this
> code, or writes any dirty blocks afterward, then this still fails to
> fix the problem completely.  I guess that's probably avoidable -- we
> can put this late in the shutdown sequence and PANIC if it fails.
>
> I have a really tough time believing this is the right way to solve
> the problem.  We suffered for years because of ext3's desire to flush
> the entire page cache whenever any single file was fsync()'d, which
> was terrible.  Eventually ext4 became the norm, and the problem went
> away.  Now we're going to deliberately insert logic to do a very
> similar kind of terrible thing because the kernel developers have
> decided that fsync() doesn't have to do what it says on the tin?  I
> grant that there doesn't seem to be a better option, but I bet we're
> going to have a lot of really unhappy users if we do this.

What about the bug we fixed in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e
?  Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres.  Craig's proposal wouldn't cover this
case, because there was no opportunity to run fsync() after the first
crash, and there's now no way to go back and fsync() any stuff we
didn't fsync() before, because the kernel may have already thrown away
the error state, or may lie to us and tell us everything is fine
(because our new fd wasn't opened early enough).  I can't find the
original discussion that led to that commit right now, so I'm not
exactly sure what scenarios we were thinking about.  But I think it
would at least be a problem if full_page_writes=off or if you had
previously started the server with fsync=off and now wish to switch to
fsync=on after completing a bulk load or similar.  Recovery can read a
page, see that it looks OK, and continue, and then a later fsync()
failure can revert that page to an earlier state and now your database
is corrupted -- and there's absolutely no way to detect this because
a later read() gives you back the new page contents, fsync() doesn't feel
obliged to tell you about the error because your fd wasn't opened
early enough, and eventually the write can be discarded and you'll
revert back to the old page version with no errors ever being reported
anywhere.
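
To make that sequence concrete, here's a minimal sketch in C of the
call pattern I'm describing.  The file name is made up, the two roles
are collapsed into one program for brevity, and the writeback failure
itself can't be forced from ordinary userspace, so this shows only the
ordering of calls, not an actual reproduction:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Writer (think: a backend) dirties the page cache and exits,
     * or is killed, without ever calling fsync(). */
    int wfd = open("datafile", O_WRONLY | O_CREAT, 0600);

    if (wfd < 0 || write(wfd, "new page contents", 17) != 17)
        return 1;
    close(wfd);

    /*
     * Suppose the kernel's background writeback of that page now fails
     * (bad block, thin-provisioned ENOSPC, ...) and the error state is
     * dropped before anyone holding an open fd asks about it.
     */

    /* Syncer (think: checkpointer, or recovery) opens a fresh fd
     * afterwards.  On the kernels under discussion this fsync() can
     * return 0 even though the earlier write was never made durable. */
    int sfd = open("datafile", O_WRONLY);

    if (sfd >= 0 && fsync(sfd) == 0)
        printf("fsync reports success; the write may still be lost\n");
    close(sfd);
    return 0;
}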

Another consequence of this behavior is that initdb -S is never reliable,
so pg_rewind's use of it doesn't actually fix the problem it was
intended to solve.  It also means that initdb itself isn't crash-safe,
since the data file changes are made by the backend but initdb itself
is doing the fsyncs, and initdb has no way of knowing what files the
backend is going to create and therefore can't -- even theoretically
-- open them first.
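
For reference, what initdb -S effectively has to do is something like
the sketch below (not the real initdb code; the data directory is
taken as an assumed command-line argument): walk the tree and
open+fsync every file after the fact.  Each of those fds is opened
long after the writes it is supposed to make durable, which is exactly
the case where the error can have been thrown away already.

#define _XOPEN_SOURCE 500       /* for nftw() */
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open and fsync one regular file; errors can only be reported, which
 * is the best this after-the-fact approach can do. */
static int
sync_one(const char *path, const struct stat *sb, int typeflag,
         struct FTW *ftwbuf)
{
    (void) sb;
    (void) ftwbuf;

    if (typeflag != FTW_F)
        return 0;

    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;
    if (fsync(fd) != 0)
        fprintf(stderr, "fsync failed for %s\n", path);
    close(fd);
    return 0;
}

int
main(int argc, char **argv)
{
    /* argv[1] is assumed to be the data directory, e.g. $PGDATA */
    return nftw(argc > 1 ? argv[1] : ".", sync_one, 64, FTW_PHYS);
}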

What's being presented to us as the API contract that we should expect
from buffered I/O is that if you open a file and read() from it, call
fsync(), and get no error, the kernel may nevertheless decide that
some previous write that it never managed to flush can't be flushed,
and then revert the page to the contents it had at some point in the
past.  That's more or less equivalent to letting a malicious
adversary randomly overwrite database pages with plausible-looking but
incorrect contents without notice and hoping you can still build a
reliable system.  You can avoid the problem if you can always open an
fd for every file you want to modify before it's written and hold on
to it until after it's fsync'd, but that's pretty hard to guarantee in
the face of kill -9.
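
Spelled out, the only pattern that contract actually blesses looks
something like this sketch (file name made up): the fd that eventually
gets fsync()'d is opened before the data is written and stays open
until fsync() has returned success, with any failure treated as fatal.
That's precisely the guarantee we can't provide across kill -9 or
across writes issued by other backends.

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    /* Open the fd before any of the writes it is responsible for... */
    int fd = open("datafile", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        exit(1);

    /* ...do all the writes during the lifetime of that descriptor... */
    if (write(fd, "new page contents", 17) != 17)
        exit(1);

    /* ...and keep it open until fsync() reports success, treating any
     * failure as fatal (PANIC, in our terms), because retrying the
     * fsync() tells you nothing once the error has been consumed. */
    if (fsync(fd) != 0)
        exit(1);

    close(fd);
    return 0;
}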

I think the simplest technological solution to this problem is to
rewrite the entire backend and all supporting processes to use
O_DIRECT everywhere.  To maintain adequate performance, we'll have to
write a complete I/O scheduling system inside PostgreSQL.  Also, since
we'll now have to make shared_buffers much larger -- since we'll no
longer be benefiting from the OS cache -- we'll need to replace the
use of malloc() with an allocator that pulls from shared_buffers.
Plus, as noted, we'll need to totally rearchitect several of our
critical frontend tools.  Let's freeze all other development for the
next year while we work on that, and put out a notice that Linux is no
longer a supported platform for any existing release.  Before we do
that, we might want to check whether fsync() actually writes the data
to disk in a usable way even with O_DIRECT.  If not, we should just
de-support Linux entirely as a hopelessly broken and unsupportable
platform.
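
(At the syscall level, "O_DIRECT everywhere" means roughly the sketch
below -- aligned buffer, aligned length, page cache bypassed.  The
4096-byte alignment and 8kB block are assumptions; the real
requirements depend on the filesystem and device, and you still need a
flush for metadata and volatile drive caches.)

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0600);

    if (fd < 0)
        return 1;

    /* O_DIRECT requires an aligned buffer; 4096 is an assumption. */
    void *buf;

    if (posix_memalign(&buf, 4096, 8192) != 0)
        return 1;
    memset(buf, 0, 8192);

    /* The write bypasses the page cache, so there is no dirty cached
     * page for a later fsync() to silently lose -- but a flush is
     * still needed for metadata and volatile drive caches. */
    if (write(fd, buf, 8192) != 8192)
        return 1;
    if (fdatasync(fd) != 0)
        return 1;

    free(buf);
    close(fd);
    return 0;
}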

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
