The SQLite FAQ says [0]: "But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations." Based on reading the archives of this mailing list, this statement seems to have caused some confusion as to how SQLite works on NFS. I did some research on this for my own transaction log system, which I think will be helpful to SQLite users as well. I am by no means an expert on NFS, so don't take anything I say as the final word on the matter. I provide references to substantiate most claims, though some of those sources are questionable.
When I say something is safe or sure to work, I generally mean that the safety depends on bug-free operation of the NFS implementation, that the administrator hasn't explicitly made things unsafe, and that any correctness issues that apply to NFS also apply to local storage. For example, with NFS it would not be surprising for fsync() to successfully persist all the data and then return an error, because of a network partition that occurs after the server commits the data to stable storage but before the client receives confirmation of the commit. However, even with local storage, fsync() can return an error despite having successfully committed the data to disk. So there is no real difference between remote and local storage here, except that such situations are much more likely to occur with remote storage.

Please keep in mind that many NFS clients have had a lot of bugs affecting cache coherency and locking. So, every time you see an asterisk (*), keep in mind the phrase "in theory, but probably not in practice."

If the database is static (no system anywhere is updating it), then it is safe* to access it from multiple computers, via NFS or CIFS or basically any other filesystem.

Prior to NFSv4, the NFS specification defined no cache coherency requirements. However, all NFS clients ended up implementing "close-to-open" file consistency [3][4], and NFSv4 officially mandated it. But beware of the evil "nocto" mount option [4]. Consequently, it should be safe* to share a writable database between multiple clients, as long as every client re-opens the file whenever any *other* system (another client or the server) commits to it. Closing and re-opening files is the surest way to ensure* NFS client cache coherency, except when "nocto" is used. However, the widespread use of "nocto" means that re-opening files needs to be combined with the additional file locking described below.

By far the most common requirement for SQLite over NFS seems to be a single NFS client exclusively reading and writing a SQLite database stored on an NFS server, with no other clients accessing the same database. This is safe* because it is a special case of close-to-open: the application doesn't need to keep closing and re-opening the database, because no other system is committing to it, and no additional byte-range locks are needed, because SQLite's locking convention works fine among processes on the same client.

NFSv4 also requires data cache consistency for fcntl-based locking [5]. If the NFS client implements these cache consistency requirements correctly, the application uses fcntl locking in an NFS-compatible way, and the NFS client updates its cached file size when an fcntl lock is acquired, then it should be safe* to share a file between multiple clients without having to constantly close and re-open the file. NFSv4-capable clients seem to implement these consistency guarantees* even when they speak earlier versions of the protocol on the wire.
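To make that concrete, here is a minimal sketch of the combined pattern: re-open the file to get close-to-open consistency, then take an fcntl() read lock so that a conforming client also revalidates its data cache for the locked region. This is my illustration, not SQLite code; the function name read_coherent is made up and the error handling is abbreviated.

    /* A minimal sketch, not SQLite code.  The fresh open() buys
     * close-to-open consistency; the F_RDLCK makes a conforming NFSv4
     * client revalidate its data cache for the locked region. */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    ssize_t read_coherent(const char *path, void *buf, size_t n, off_t off)
    {
        int fd = open(path, O_RDONLY);        /* re-open: close-to-open */
        if (fd < 0) return -1;

        struct flock fl = {0};
        fl.l_type   = F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = off;
        fl.l_len    = (off_t)n;               /* lock what we will read */
        if (fcntl(fd, F_SETLKW, &fl) == -1) { close(fd); return -1; }

        ssize_t got = pread(fd, buf, n, off); /* read under the lock    */

        fl.l_type = F_UNLCK;
        (void)fcntl(fd, F_SETLK, &fl);
        close(fd);
        return got;
    }

Note that under POSIX semantics, closing the descriptor would drop the lock anyway; the explicit unlock just makes the intent clear.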
The statement quoted from the SQLite FAQ says that SQLite-NFS problems are caused by NFS clients that do not implement locking correctly. While that is definitely true, SQLite is also to blame, because it uses fcntl() locks in a way that is incompatible with NFS locking and cache coherency semantics. Section 9.3.2 of the NFSv4 specification [6] describes SQLite's technique, and the problem with it, well:

"For those applications that choose to use file locking instead of share reservations to exclude inconsistent file access, there is an analogous set of constraints that apply to client side data caching. These rules are effective only if the file locking is used in a way that matches in an equivalent way the actual READ and WRITE operations executed. This is as opposed to file locking that is based on pure convention. For example, it is possible to manipulate a two-megabyte file by dividing the file into two one-megabyte regions and protecting access to the two regions by file locks on bytes zero and one. A lock for write on byte zero of the file would represent the right to do READ and WRITE operations on the first region. A lock for write on byte one of the file would represent the right to do READ and WRITE operations on the second region. As long as all applications manipulating the file obey this convention, they will work on a local filesystem. However, they may not work with the NFS version 4 protocol unless clients refrain from data caching."

SQLite uses exactly this type of "locking that is based on pure convention," so it is never safe to use SQLite on NFS in a read-write situation. Future versions of SQLite could alleviate this problem by ensuring that SQLite never reads from any section of *any* file (database files or log files) that is not protected by an F_RDLCK, and that it never writes to a file or fsyncs it without the modified sections being wrapped in an F_WRLCK. Because a transaction may append data to the file, these locks have to extend past the end of the file. Adding these locks will ensure* that SQLite never reads stale data. (A sketch of what this might look like appears below.)

Note that opening with O_DIRECT generally has a similar effect with regard to data cache coherency. However, O_DIRECT is unusable in practice, because SQLite's performance depends greatly on the operating system's file caching. Adding the additional byte-range locks will also probably degrade performance, but only over NFS, not for local access.

Even if these additional locks are added, there is still a problem for some NFS clients. SQLite may resize the file, the NFS client caches the file size separately from the data, and fcntl locking ensures cache coherency only for the data cache, not for the cached file size. Fortunately, some platforms, notably Linux and Solaris, revalidate the file size when a lock is acquired [5], even though the NFS specification doesn't require that. On other platforms, some other technique to flush the attribute cache must be used [7], but personally I wouldn't depend on it.
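Here is the promised sketch of that locking discipline. Again, this is my illustration, not anything SQLite ships: hold an F_WRLCK over the modified region, extended through and past the end of the file (an l_len of zero means "to the largest possible offset," which covers appends), and release it only after the fsync() completes.

    /* A sketch of "lock what you actually touch," my illustration only:
     * take an F_WRLCK from the first modified byte past EOF (l_len == 0
     * covers any growth), write, fsync, then unlock. */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int write_and_sync_locked(int fd, const void *buf, size_t n, off_t off)
    {
        struct flock fl = {0};
        fl.l_type   = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = off;
        fl.l_len    = 0;              /* from off past EOF: covers appends */
        if (fcntl(fd, F_SETLKW, &fl) == -1) return -1;

        int rc = 0;
        if (pwrite(fd, buf, n, off) != (ssize_t)n || fsync(fd) != 0)
            rc = -1;

        fl.l_type = F_UNLCK;          /* unlock only after the fsync */
        (void)fcntl(fd, F_SETLK, &fl);
        return rc;
    }

On Linux and Solaris, the fcntl() call above would also happen to revalidate the cached file size [5]; on other platforms a separate attribute-cache flush would still be needed, as noted above.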
[0] http://www.sqlite.org/faq.html#q5
[1] http://www.kernel.org/pub/linux/kernel/v2.6/
[2] http://www.apps.ietf.org/rfc/rfc1813.html#sec-4.11
[3] http://www.netapp.com/library/tr/tr2002.pdf
[4] http://nfs.sourceforge.net/#faq_a8
[6] http://www.apps.ietf.org/rfc/rfc3530.html#sec-9.3.2
[7] http://www.unixcoding.org/NFSCoding#head-a092c1477f04fa51abffc99d62cf649296aa4c4c

Cheers,
Brian