On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ail...@altatus.com> wrote:
>> Like other people here, I think this is 100% unreasonable, starting
>> with "the dirty pages which cannot be written out are practically
>> thrown away". Who decided that was OK, and on the basis of what
>> wording in what specification? I think it's always unreasonable to
>
> If you insist on strict conformance to POSIX, indeed the linux
> glibc configuration and associated manpage are probably wrong in
> stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
> matches that of the flexibility allowed by not supporting SIO.
> There's a long history of brokenness between linux and posix,
> and I think there was never an intention of conforming to the
> standard.
Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD,
POSIX.1-2001, which on the first system I tested, it did. Also, the
summary should be changed from the current "fsync, fdatasync -
synchronize a file's in-core state with storage device" by adding ",
possibly by randomly undoing some of the changes you think you made to
the file".

> I believe (as tried to explain earlier) there is a certain assumption
> being made that the writer and original owner of data is responsible
> for dealing with potential errors in order to avoid data loss (which
> should be only of interest to the original writer anyway). It would
> be very questionable for the interface to persist the error while
> subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all. fsync() doesn't take any argument
saying which part of the file you care about, so the kernel is entirely
not entitled to assume it knows to which writes a given fsync() call
was intended to apply.

> Another process may need to write into the file and fsync, while being
> unaware of those newly introduced semantics, and is now faced with EIO
> because some unrelated previous process failed some earlier writes
> and did not bother to clear the error for those writes. In a similar
> scenario where the second process is aware of the new semantics, it would
> naturally go ahead and clear the global error in order to proceed
> with its own write()+fsync(), which would essentially amount to the
> same problematic semantics you have now.

I don't deny that it's possible that somebody could have an application
which is utterly indifferent to the fact that earlier modifications to
a file failed due to I/O errors, but is A-OK with that as long as later
modifications can be flushed to disk, but I don't think that's a normal
thing to want.

>> Also, this really does make it impossible to write reliable programs.
>> Imagine that, while the server is running, somebody runs a program
>> which opens a file in the data directory, calls fsync() on it, and
>> closes it. If the fsync() fails, postgres is now borked and has no
>> way of being aware of the problem. If we knew, we could PANIC, but
>> we'll never find out, because the unrelated process ate the error.
>> This is exactly the sort of ill-considered behavior that makes fcntl()
>> locking nearly useless.
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*.

Well, in PostgreSQL, we have a background process called the
checkpointer which is the process that normally does all of the fsync()
calls but only a subset of the write() calls. The checkpointer does
not, however, necessarily have every file open all the time, so these
fixes aren't sufficient to make sure that the checkpointer ever sees an
fsync() failure. What you have (or someone has) basically done here is
made an undocumented assumption about which file descriptors might care
about a particular error, but it just so happens that PostgreSQL has
never conformed to that assumption. You can keep on saying the problem
is with our assumptions, but it doesn't seem like a very good guess to
me to suppose that we're the only program that has ever made them. The
documentation for fsync() gives zero indication that it's
edge-triggered, and so complaining that people wouldn't like it if it
became level-triggered seems like an ex post facto justification for a
poorly-chosen behavior: they probably think (as we did prior to a week
ago) that it already is.

> Not sure I understand this case. The application may indeed re-write
> a bunch of pages that have failed and proceed with fsync(). The kernel
> will deal with combining the writeback of all the re-written pages.
> But further the necessity of combining for performance really depends
> on the exact storage medium. At the point you start caring about
> write-combining, the kernel community will naturally redirect you to
> use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with say 8GB of
shared_buffers even if the system memory is, say, 200GB. As pages are
evicted from our relatively small cache to the operating system, we
track which files need to be fsync()'d at checkpoint time, but we don't
hold onto the blocks. Until checkpoint time, the operating system is
left to decide whether it's better to keep caching the dirty blocks
(thus leaving less memory for other things, but possibly allowing
write-combining if the blocks are written again) or whether it should
clean them to make room for other things. This means that only a small
portion of the operating system memory is directly managed by
PostgreSQL, while allowing the effective size of our cache to balloon
to some very large number if the system isn't under heavy memory
pressure.

Now, I hear the DIRECT_IO thing and I assume we're eventually going to
have to go that way: Linux kernel developers seem to think that "real
men use O_DIRECT" and so if other forms of I/O don't provide useful
guarantees, well, that's our fault for not using O_DIRECT. That's a
political reason, not a technical reason, but it's a reason all the
same. Unfortunately, that is going to add a huge amount of complexity,
because if we ran with shared_buffers set to a large percentage of
system memory, we couldn't allocate large chunks of memory for sorts
and hash tables from the operating system any more. We'd have to
allocate it from our own shared_buffers because that's basically all
the memory there is, and using substantially more might run the system
out entirely. So it's a huge, huge architectural change.
And even once it's done it is in some ways inferior to what we are
doing today -- true, it gives us superior control over writeback
timing, but it also makes PostgreSQL play less nicely with other things
running on the same machine, because now PostgreSQL has a dedicated
chunk of whatever size it has, rather than using some portion of the OS
buffer cache that can grow and shrink according to the memory needs
both of other parts of PostgreSQL and of other applications on the
system.

> I suppose that if the check-and-clear semantics are problematic for
> Pg, one could suggest a kernel patch that opts-in to a level-triggered
> reporting of fsync() on a per-descriptor basis, which seems to be
> non-intrusive and probably sufficient to cover your expected use-case.

That would certainly be better than nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company