On Tue, 12 Feb 2019 at 15:07, Rowan Worth <[email protected]> wrote:

> Huh, fascinating stuff. I'm not an sqlite developer but I can shed light
> on some of your questions.
>
> On Tue, 12 Feb 2019 at 09:54, Edwin Török <[email protected]> wrote:
>
>> A very conservative interpretation of various fsync bugs in various OS
>> kernels [2][5] would suggest that:
>>
>> #1. the list of known OS issues [3] should be updated with an entry
>> similar to: "Linux kernels <4.13 do not guarantee to report errors
>> encountered during writeback on next fsync, therefore data corruption
>> can occur without SQLite knowing about it.", see [4]:
>>
>
> I haven't read all the links yet so forgive me if this is answered there,
> but when you refer to versions <4.13 is that referring to eg. the ext3/ext4
> implementation in those versions, or is it a wider problem which affects
> _all_ filesystems? (I'm particularly interested in whether lustre is
> affected)
>

OK having now watched the talk and read up a bit, I don't know that sqlite
needs any attention. The main problem, it seems, is that postgres relied on
certain semantics from fsync. I've extracted the expectations from the
presentation:

> Expectation #1
>
> If there's an error during the fsync, the next fsync call will try to
flush the data from page cache again.
>
>
> Expectation #2
>
> There may be multiple file descriptors per file, possibly from multiple
processes. If the fsync fails in one process, the failure will be reported
in other processes too.

Dr H and/or Dan Kennedy may want to correct me, but I'm pretty sure sqlite
doesn't hold either of these expectations. For one it's architecture is
distributed rather than client/server, which means it doesn't really have
the luxury of leaving the page cache dirty for extended periods of time.
I'm only familiar with rollback journal (not WAL), but in this mode
sqlite's write cycle goes:

1. Changes are made in RAM, and the original contents of any modified pages
are written to the rollback journal
2. Once the transaction is ready to COMMIT, the rollback journal is fsync()d
3. An EXCLUSIVE lock is acquired on the database (which blocks waiting for
active readers complete)
4. The changes held in RAM are written to the main database (via write())
5. The main database is fsync()d
6. The rollback journal is deleted
7. The EXCLUSIVE lock is relinquished

Further details are here: https://www.sqlite.org/atomiccommit.html

An error from fsync() at at step (2) or step (5) will, I think, fail and
roll back the transaction. I don't know what happens if another error is
encountered during rollback though! It would make sense if
rollback-on-error was implemented using the same hot-journal recovery
mechanism used to recover from power loss, in which case it might be a
separate process which encounters the rollback error and the hot-journal
would remain there until someone finally succeeds in rolling it back. But I
don't know if that's what actually happens.

sqlite's expectations are also spelled out in section 2 of the atomic
commit doc. In particular:

> SQLite assumes that the operating system will buffer writes and that a
write request will return before data has actually been stored in the mass
storage device. SQLite further assumes that write operations will be
reordered by the operating system. For this reason, SQLite does a "flush"
or "fsync" operation at key points. SQLite assumes that the flush or fsync
will not return until all pending write operations for the file that is
being flushed have completed.

It goes on to say that "some fsync primitives are broken" which could cause
"database corruption following power loss." It doesn't currently
acknowledge the possibility of silent corruption, ie. there's an unspoken
expectation here from sqlite that flush/fsync _will_ return an error if the
data did not hit disk.

I haven't quite nailed down the circumstances in which linux <4.13 doesn't
report this. The lwn article (https://lwn.net/Articles/752613/) mentions
one scenario, which I understand as:

1. write(3, ...)  // update page cache for fd 3
2. // ... time passes. the writeback for step (1) fails with an error
3. // due to memory pressure, kernel structures for fd 3 are evicted
(swapped-out?) from RAM
4. fsync(3) // the kernel structures for fd 3 are paged back in, but the
error info is lost, and fsync() succeeds

If this is what we're looking at it seems pretty unlikely to happen in
sqlite's usage. The window for eviction small because fsync() comes
immediately after the write()s. ie. we're likely blocked in fsync() while
the write-backs are in progress, which seems an ideal scenario for the
kernel to pass on errors.

-Rowan
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to