On Wed, Oct 10, 2012 at 1:17 PM, Andi Kleen <[email protected]> wrote:
> Richard Hipp writes:
> >
> > We would really, really love to have some kind of write-barrier that is
> > lighter than fsync().  If there is some method other than fsync() for
> > forcing a write-barrier on Linux that we don't know about, please
> > enlighten us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>

For SQLite, the write barrier needs to involve two separate inodes. The
requirement is this:

After rebooting from a power loss or hard reset, one or the other of the
following statements must be true of any reader process that examines the
two inodes associated with the write barrier:

  (1) it can see the complete results of every write operation (and
      unlink) that occurred before the write barrier, or

  (2) it can see no results from any write operation (or unlink) that
      occurred after the write barrier.

In the case of SQLite, the write barrier never needs to involve more than
two inodes: the original database file and the transaction journal (which
might be either a rollback journal or a write-ahead log, depending on how
SQLite is configured). But I would suppose that a general-purpose
write-barrier mechanism should involve an arbitrary number of inodes.

Fsync() is a very close approximation to a write barrier since (when it
works as advertised) all pending I/O reaches persistent storage before the
fsync() returns. And since no subsequent I/Os are issued until after the
fsync() returns, the requirements above are clearly satisfied. But it
really isn't necessary to actually wait for content to reach persistent
storage, as long as we know that content will not reach persistent storage
out of order.

Note also that when fsync() works as advertised, SQLite transactions are
ACID. But when fsync() is reduced to a write barrier, we lose the D
(durable) and transactions are only ACI. In our experience, nobody really
cares very much about durability across a power loss. People are mainly
interested in Atomic, Consistent, and Isolated. If you take a power loss
and then after reboot you find that the 10 seconds of work prior to the
power loss is missing, nobody much cares about that as long as all of the
prior work is still present and consistent.

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
>

--
D. Richard Hipp
[email protected]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
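
To make the ordering requirement in the message above concrete, here is a minimal Python sketch of a rollback-journal commit that uses fsync() as its write barrier between the two inodes (database and journal). It is an illustration only, not SQLite's actual code: the names `write_and_sync`, `commit`, and `PAGE_SIZE`, the one-page database, and the file names are all hypothetical, and details such as journal headers are left out.

```python
import os
import tempfile

PAGE_SIZE = 16  # hypothetical, deliberately tiny "page" for illustration


def write_and_sync(path, data, mode):
    """Write data to path, then fsync() before returning.

    The fsync() stands in for the write barrier: when it works as
    advertised, `data` is on persistent storage before any later write
    is even issued.
    """
    with open(path, mode) as f:
        f.write(data)
        f.flush()             # move Python's buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to move it to storage


def commit(db_path, journal_path, new_page):
    """Replace the first page of the database, journaling the old page first."""
    with open(db_path, "rb") as f:
        old_page = f.read(PAGE_SIZE)

    # Step 1: save the original content in the rollback journal.
    write_and_sync(journal_path, old_page, "wb")
    # Barrier: the journal must not reach storage after the database change.

    # Step 2: overwrite the database page in place ("r+b" starts at offset 0).
    write_and_sync(db_path, new_page, "r+b")
    # Barrier: the database change must not reach storage after the unlink.

    # Step 3: removing the journal is the commit point.
    os.unlink(journal_path)


# Usage: create a one-page database, then commit a change to it.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "test.db")
    write_and_sync(db, b"A" * PAGE_SIZE, "wb")
    commit(db, db + "-journal", b"B" * PAGE_SIZE)
```

Each fsync() here waits for the data to reach storage, which is more than the ordering guarantee requires. A lighter barrier of the kind requested would only need to keep step 1 ahead of step 2 and step 2 ahead of step 3, without blocking the caller.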

