Re: [sqlite] Apple announces new File System with better ACID support

James K. Lowden Tue, 14 Jun 2016 06:49:08 -0700

On Tue, 14 Jun 2016 10:49:05 +0900
?????????? <nat...@zenlok.com> wrote:

> > On 13 Jun 2016, at 10:13pm, Richard Hipp <d...@sqlite.org> wrote:
> >
> > The rename-is-atomic assumption is so wide-spread in the Linux
> > world, that the linux kernel was modified to make renames closer to
> > being atomic on common filesystems such as EXT4.
> 
> http://man7.org/linux/man-pages/man2/rename.2.html

rename(2) *is* atomic.  That doesn't mean it's synchronous with respect
to external storage.  It only means that no two processes will ever see
the file "in flight" in two places.  If process A calls rename(N,M), at
no point will process B have access to both N and M.  Once M is
available, N is extinguished.  

That's a useful property for a process that succeeds, and for which the
OS successfully flushes the data to disk.  
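
To make that property concrete, here is a minimal sketch in Python,
assuming a POSIX system where os.rename() maps onto rename(2).  The
helper name replace_by_rename is mine, purely for illustration.  A
reader that opens the target sees either the old contents or the new,
never a half-written file.  Note that nothing here forces anything to
disk.

```python
import os
import tempfile


def replace_by_rename(path, data):
    """Replace `path` with `data` so readers see old or new, never a mix.

    Hypothetical helper for illustration. Atomic with respect to other
    processes only; it makes no durability promise.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise rename(2) fails with EXDEV.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # On POSIX this replaces an existing target in one step.
        os.rename(tmp, path)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

After the call returns, the temporary name is gone and only the target
name exists, in the kernel's view.  Whether the disk agrees yet is a
separate question.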

When Richard says rename isn't atomic, he means that it's not
synchronous with respect to the disk.  It makes no guarantee that the
directory entries were updated on disk.  The rename happens in the
kernel's filesystem memory structures, which *eventually* are persisted
to disk.  I have heard that that time lag may be measured in seconds.  

> I am interested to know what it would take to make linux renames
> fully atomic. Reading it as is it feels like the action of rename
> would be the most important piece to making rename atomic.  The docs
> claim this is atomic.  What other aspects would be necessary?

To make Linux rename fully synchronous is technically infeasible and
politically impossible.  

On the political side, the preference in Linux is invariably for
performance, often bought with ever-finer divisions of responsibility.  As an
example, Unix fsync(2) traditionally updated both the file and its
metadata; Linux divided those into fsync and fdatasync, and added the
requirement to call fsync on the directory. What was once a single call
became 2 or 3.  
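
For illustration, the "2 or 3 calls" look roughly like this in Python,
assuming Linux-like semantics where a directory can be opened read-only
and passed to fsync.  The names fsync_dir and durable_write are
hypothetical.  This is only a sketch: as argued in this message, whether
the data really reaches stable storage still depends on the filesystem
and the hardware.

```python
import os
import tempfile


def fsync_dir(directory):
    """Ask the kernel to flush the directory entries themselves."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_write(path, data):
    """Write `data` to `path` via a temp file, syncing at each step.

    Hypothetical helper; assumes the temp file and the target share a
    directory, so one directory fsync covers the rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # Python buffer -> kernel
            os.fsync(f.fileno())   # call 1: file data and metadata
        os.rename(tmp, path)       # call 2: atomic, but only in memory
        fsync_dir(directory)       # call 3: persist the directory entry
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Where only the file's data matters, os.fdatasync() can stand in for the
first fsync on systems that provide it.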

As a technical matter, it's really infeasible because there are too
many moving parts: kernel, filesystem driver, and hardware.  It is
possible for a human being to know what kind of disk is installed and
how configured, and to know the semantics of a given filesystem.  It is
not possible for the kernel to patrol all those things, and hence the
kernel cannot make any guarantees about them.  (To take an extreme
example: NFS.)  

By the way, every DBMS I know anything about (and SQLite is no
exception) tends to eschew OS services except at the most minimal
level.  The internals of a DBMS carry a lot of state information
unavailable to the kernel that the DBMS uses to prioritize how memory
is used and when and where I/O is required.  That's why every DBMS has
its own logging mechanism, and some bypass the filesystem altogether.

--jkl

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
