On Wed, Oct 10, 2012 at 1:17 PM, Andi Kleen <[email protected]> wrote:

> Richard Hipp writes:
> >
> > We would really, really love to have some kind of write-barrier that is
> > lighter than fsync().  If there is some method other than fsync() for
> > forcing a write-barrier on Linux that we don't know about, please
> > enlighten us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>

For SQLite, the write barrier needs to involve two separate inodes.  The
requirement is this:

After rebooting from a power loss or hard-reset, one or the other of the
following statements must be true of any reader process that examines the
two inodes associated with the write barrier:  (1) it can see the complete
results of every write operation (and unlink) that occurred before the write
barrier or (2) it can see no results from any write operation (or unlink)
that occurred after the write barrier.
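
As a rough illustration only, the rule can be written as a predicate over the
set of operations that survived the crash.  The function name and the
operation labels below are hypothetical; this models the requirement and is
not anything SQLite or the kernel actually provides.

```python
def crash_state_ok(pre_barrier, post_barrier, persisted):
    """Check a post-crash state against the write-barrier rule.

    pre_barrier  -- operations issued before the barrier
    post_barrier -- operations issued after the barrier
    persisted    -- set of operations whose results are visible after reboot
    """
    # (1) every operation issued before the barrier is visible
    all_pre_visible = all(op in persisted for op in pre_barrier)
    # (2) no operation issued after the barrier is visible
    no_post_visible = not any(op in persisted for op in post_barrier)
    return all_pre_visible or no_post_visible


pre = ["journal_write_1", "journal_write_2"]   # hypothetical labels
post = ["db_write_1"]

# Lost some pre-barrier work, but nothing after the barrier leaked through.
ok_partial = crash_state_ok(pre, post, {"journal_write_1"})
# Everything before the barrier landed, so later writes may be visible too.
ok_full = crash_state_ok(pre, post, set(pre) | {"db_write_1"})
# A post-barrier write landed while a pre-barrier write is missing: reordered.
bad_reorder = crash_state_ok(pre, post, {"journal_write_1", "db_write_1"})
```

The only forbidden outcome is the last one: a write from after the barrier
reached storage ahead of a write from before it.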

In the case of SQLite, the write-barrier never needs to involve more than
two inodes:  the original database file and the transaction journal (which
might be either a rollback journal or a write-ahead log, depending on how
SQLite is configured.)  But I would suppose that a general-purpose write
barrier mechanism should involve an arbitrary number of inodes.

Fsync() is a very close approximation to a write barrier since (when it
works as advertised) all pending I/O reaches persistent storage before the
fsync() returns.  And since no subsequent I/Os are issued until after the
fsync() returns, the requirements above are clearly satisfied.  But it really
isn't necessary to actually wait for content to reach persistent storage as
long as we know that content will not reach persistent storage out-of-order.
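
To make the ordering concrete, here is a minimal sketch of fsync() standing
in for the barrier between a journal write and a database write.  The helper
names, file names, and payloads are made up for illustration, and this is a
simplification, not SQLite's actual commit sequence.

```python
import os
import tempfile


def write_all(fd, data):
    """Write every byte of data to fd, retrying on short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_with_barrier(journal_path, db_path, journal_bytes, db_bytes):
    """Write the journal, fsync() it, and only then write the database."""
    jfd = os.open(journal_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        write_all(jfd, journal_bytes)
        # fsync() used as the barrier: it does not return until the journal
        # content has been handed to persistent storage.
        os.fsync(jfd)
    finally:
        os.close(jfd)

    # No database write is issued until after the fsync() above returns,
    # so the database write cannot be persisted ahead of the journal.
    dfd = os.open(db_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        write_all(dfd, db_bytes)
    finally:
        os.close(dfd)


def demo():
    """Run the sequence in a temporary directory and read both files back."""
    with tempfile.TemporaryDirectory() as tmp:
        journal = os.path.join(tmp, "demo.db-journal")  # hypothetical name
        db = os.path.join(tmp, "demo.db")               # hypothetical name
        write_with_barrier(journal, db, b"old page image", b"new page image")
        with open(journal, "rb") as jf, open(db, "rb") as df:
            return jf.read(), df.read()
```

The fsync() call here waits for the journal to reach storage, which is more
than the ordering guarantee strictly requires; a lighter barrier would keep
the ordering without the wait.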

Note also that when fsync() works as advertised, SQLite transactions are
ACID.  But when fsync() is reduced to a write-barrier, we lose the D
(durable) and transactions are only ACI.  In our experience, nobody really
cares very much about durability across a power loss.  People are mainly
interested in Atomic, Consistent, and Isolated.  If you take a power loss
and then after reboot you find the 10 seconds of work prior to the power
loss is missing, nobody much cares about that as long as all of the prior
work is still present and consistent.



>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
>



-- 
D. Richard Hipp
[email protected]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
