On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas <robertmh...@gmail.com> wrote: > Ouch. If a process exits -- say, because the user typed \q into psql > -- then you're talking about potentially calling fsync() on a really > large number of file descriptor flushing many gigabytes of data to > disk. And it may well be that you never actually wrote any data to > any of those file descriptors -- those writes could have come from > other backends. Or you may have written a little bit of data through > those FDs, but there could be lots of other data that you end up > flushing incidentally. Perfectly innocuous things like starting up a > backend, running a few short queries, and then having that backend > exit suddenly turn into something that could have a massive > system-wide performance impact. > > Also, if a backend ever manages to exit without running through this > code, or writes any dirty blocks afterward, then this still fails to > fix the problem completely. I guess that's probably avoidable -- we > can put this late in the shutdown sequence and PANIC if it fails. > > I have a really tough time believing this is the right way to solve > the problem. We suffered for years because of ext3's desire to flush > the entire page cache whenever any single file was fsync()'d, which > was terrible. Eventually ext4 became the norm, and the problem went > away. Now we're going to deliberately insert logic to do a very > similar kind of terrible thing because the kernel developers have > decided that fsync() doesn't have to do what it says on the tin? I > grant that there doesn't seem to be a better option, but I bet we're > going to have a lot of really unhappy users if we do this.
What about the bug we fixed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e ? Say somebody does something along the lines of: ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9 ...and then restarts postgres. Craig's proposal wouldn't cover this case, because there was no opportunity to run fsync() after the first crash, and there's now no way to go back and fsync() any stuff we didn't fsync() before, because the kernel may have already thrown away the error state, or may lie to us and tell us everything is fine (because our new fd wasn't opened early enough). I can't find the original discussion that led to that commit right now, so I'm not exactly sure what scenarios we were thinking about. But I think it would at least be a problem if full_page_writes=off or if you had previously started the server with fsync=off and now wish to switch to fsync=on after completing a bulk load or similar. Recovery can read a page, see that it looks OK, and continue, and then a later fsync() failure can revert that page to an earlier state and now your database is corrupted -- and there's absolute no way to detect this because write() gives you the new page contents later, fsync() doesn't feel obliged to tell you about the error because your fd wasn't opened early enough, and eventually the write can be discarded and you'll revert back to the old page version with no errors ever being reported anywhere. Another consequence of this behavior that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first. What's being presented to us as the API contract that we should expect from buffered I/O is that if you open a file and read() from it, call fsync(), and get no error, the kernel may nevertheless decide that some previous write that it never managed to flush can't be flushed, and then revert the page to the contents it had at some point in the past. That's mostly or less equivalent to letting a malicious adversary randomly overwrite database pages plausible-looking but incorrect contents without notice and hoping you can still build a reliable system. You can avoid the problem if you can always open an fd for every file you want to modify before it's written and hold on to it until after it's fsync'd, but that's pretty hard to guarantee in the face of kill -9. I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company