On 3 April 2018 at 10:54, Robert Haas <robertmh...@gmail.com> wrote:
> I think it's always unreasonable to throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded. I
think that's perfectly reasonable. Though we also promise an all-or-nothing
effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can be
interpreted kind of similarly. I don't *agree* with it, but I don't think
it's as wholly unreasonable as all that. I think leaving it undocumented is
absolutely gobsmacking, and it's dubious at best, but it's not totally
insane.

> If the writes are going to fail, then let them keep on failing every time.

Like we do, where we require an explicit rollback. But POSIX may pose issues
there; it doesn't really define any interface for that AFAIK, unless you
expect the app to close() and re-open() the file. Replacing one nonstandard
issue with another may not be a win.

> *That* wouldn't cause any data loss, because we'd never be able to
> checkpoint, and eventually the user would have to kill the server
> uncleanly, and that would trigger recovery.

Yep. That's what I expected to happen on unrecoverable I/O errors. Because,
y'know, unrecoverable. I was stunned to learn it's not so.

And I'm even more amazed to learn that ext4's errors=remount-ro apparently
doesn't concern itself with mere user data, and may exhibit the same
behaviour - I need to rerun my test case on it tomorrow.

> Also, this really does make it impossible to write reliable programs.

In the presence of multiple apps interacting on the same file, yes. I think
that's a little bit of a stretch, though. For a single app, you can recover
by remembering and redoing all the writes you did.

Sucks if your app wants to have multiple processes working together on a
file without some kind of journal or WAL, relying on fsync() alone, mind
you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.
> Even leaving that aside, a PANIC means a prolonged outage on a production
> system - it could easily take tens of minutes or longer to run recovery.
> So saying "oh, just do that" is not really an answer. Sure, we can do it,
> but it's like trying to lose weight by intentionally eating a tapeworm.
> Now, it's possible to shorten the checkpoint_timeout so that recovery runs
> faster, but then performance drops because data has to be fsync()'d more
> often instead of getting buffered in the OS cache for the maximum possible
> time.

It's also spikier. Users have more issues with latency with short, frequent
checkpoints.

> We could also dodge this issue in another way: suppose that when we write
> a page out, we don't consider it really written until fsync() succeeds.
> Then we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page. Unfortunately, this would also be terrible for performance, for
> pretty much the same reasons: letting the OS cache absorb lots of dirty
> blocks and do write-combining is *necessary* for good performance.

Our double-caching is already plenty bad enough as it is.

(Ideally I want to be able to swap buffers between shared_buffers and the
OS buffer-cache, almost like a 2nd level of buffer pinning. When we write
out a block, we *transfer* ownership to the OS. Yeah, I'm dreaming. But
we'd sure need to be able to trust the OS not to just forget the block
then!)

>> The error reporting is thus consistent with the intended semantics (which
>> are sadly not properly documented). Repeated calls to fsync() simply do
>> not imply that the kernel will retry to writeback the previously-failed
>> pages, so the application needs to be aware of that. Persisting the error
>> at the fsync() level would essentially mean moving application policy
>> into the kernel.
> I might accept this argument if I accepted that it was OK to decide that
> an fsync() failure means you can forget that the write() ever happened in
> the first place, but it's hard to imagine an application that wants that
> behavior. If the application didn't care about whether the bytes really
> got to disk or not, it would not have called fsync() in the first place.
> If it does care, reporting the error only once is never an improvement.

Many RDBMSes do just that; it's hardly behaviour unique to the kernel. They
report an ERROR on a statement in a txn, then go on with life, merrily
forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit
rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable
state. This is good. But we're the odd one out there. Almost everyone else
does much like what fsync() does on Linux: report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into
all kernels, so we're stuck working around it. I'll do some testing with
ENOSPC tomorrow, propose a patch, and report back.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services