On Tue, 12 Feb 2019 at 15:07, Rowan Worth <[email protected]> wrote: > Huh, fascinating stuff. I'm not an sqlite developer but I can shed light > on some of your questions. > > On Tue, 12 Feb 2019 at 09:54, Edwin Török <[email protected]> wrote: > >> A very conservative interpretation of various fsync bugs in various OS >> kernels [2][5] would suggest that: >> >> #1. the list of known OS issues [3] should be updated with an entry >> similar to: "Linux kernels <4.13 do not guarantee to report errors >> encountered during writeback on next fsync, therefore data corruption >> can occur without SQLite knowing about it.", see [4]: >> > > I haven't read all the links yet so forgive me if this is answered there, > but when you refer to versions <4.13 is that referring to eg. the ext3/ext4 > implementation in those versions, or is it a wider problem which affects > _all_ filesystems? (I'm particularly interested in whether lustre is > affected) >
OK having now watched the talk and read up a bit, I don't know that sqlite needs any attention. The main problem, it seems, is that postgres relied on certain semantics from fsync. I've extracted the expectations from the presentation: > Expectation #1 > > If there's an error during the fsync, the next fsync call will try to flush the data from page cache again. > > > Expectation #2 > > There may be multiple file descriptors per file, possibly from multiple processes. If the fsync fails in one process, the failure will be reported in other processes too. Dr H and/or Dan Kennedy may want to correct me, but I'm pretty sure sqlite doesn't hold either of these expectations. For one it's architecture is distributed rather than client/server, which means it doesn't really have the luxury of leaving the page cache dirty for extended periods of time. I'm only familiar with rollback journal (not WAL), but in this mode sqlite's write cycle goes: 1. Changes are made in RAM, and the original contents of any modified pages are written to the rollback journal 2. Once the transaction is ready to COMMIT, the rollback journal is fsync()d 3. An EXCLUSIVE lock is acquired on the database (which blocks waiting for active readers complete) 4. The changes held in RAM are written to the main database (via write()) 5. The main database is fsync()d 6. The rollback journal is deleted 7. The EXCLUSIVE lock is relinquished Further details are here: https://www.sqlite.org/atomiccommit.html An error from fsync() at at step (2) or step (5) will, I think, fail and roll back the transaction. I don't know what happens if another error is encountered during rollback though! It would make sense if rollback-on-error was implemented using the same hot-journal recovery mechanism used to recover from power loss, in which case it might be a separate process which encounters the rollback error and the hot-journal would remain there until someone finally succeeds in rolling it back. But I don't know if that's what actually happens. sqlite's expectations are also spelled out in section 2 of the atomic commit doc. In particular: > SQLite assumes that the operating system will buffer writes and that a write request will return before data has actually been stored in the mass storage device. SQLite further assumes that write operations will be reordered by the operating system. For this reason, SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. It goes on to say that "some fsync primitives are broken" which could cause "database corruption following power loss." It doesn't currently acknowledge the possibility of silent corruption, ie. there's an unspoken expectation here from sqlite that flush/fsync _will_ return an error if the data did not hit disk. I haven't quite nailed down the circumstances in which linux <4.13 doesn't report this. The lwn article (https://lwn.net/Articles/752613/) mentions one scenario, which I understand as: 1. write(3, ...) // update page cache for fd 3 2. // ... time passes. the writeback for step (1) fails with an error 3. // due to memory pressure, kernel structures for fd 3 are evicted (swapped-out?) from RAM 4. fsync(3) // the kernel structures for fd 3 are paged back in, but the error info is lost, and fsync() succeeds If this is what we're looking at it seems pretty unlikely to happen in sqlite's usage. The window for eviction small because fsync() comes immediately after the write()s. ie. we're likely blocked in fsync() while the write-backs are in progress, which seems an ideal scenario for the kernel to pass on errors. -Rowan _______________________________________________ sqlite-users mailing list [email protected] http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

