On Wed, Apr 24, 2013 at 11:21 AM, Sašo Kiselkov <skiselkov...@gmail.com> wrote:
> ZFS has been the filesystem of choice for SunOS-based systems for about
> the last 5 years now, is becoming that for FreeBSD as we speak, and is

More like 8 years :)

> quickly gaining ground on Linux. The absence of support for
> posix_fallocate() on ZFS kind of makes sense, since copy-on-write
> filesystems cannot keep the posix_fallocate promise:

Agreed.

> http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
> "If posix_fallocate() returns successfully, subsequent writes to the
> specified file data shall not fail due to the lack of free space on the
> file system storage media."

Pre-allocation should be a per-filesystem feature, discoverable via
pathconf(3). What would it take to add such a pathconf? (I should
know this, but I don't.)

In the meantime:

> As such, I would suggest one of:
>
> 1) Introduce a configure option which allows SQLite users to explicitly
>    disable posix_fallocate support, if they expect to be running on
>    file systems without support for it. Merely switching by OS may not
>    be reliable enough, since for instance UFS on SunOS implements it and
>    there is no simple way for libc to guess what file system a
>    particular file sits on.
>
> 2) Implement some sort of automatic fallback method which detects the
>    EINVAL condition and attempts to fall back to using the
>    truncate-and-write method.

EINVAL seems like a lousy error code to return here though. ENOTSUP
seems much better. EINVAL should be fatal here, but ENOTSUP should
cause SQLite3 to shrug and continue.

> If method #2 is acceptable for the SQLite project, I can attempt to

I would think that it should be, but I think the errno that triggers
fallback should be ENOTSUP.

> implement it. I could also implement support for posix_fallocate into
> ZFS, but that will take a lot of time to get widely deployed (at least
> several years), and even then the best ZFS could do is lie to the
> applications (due to the aforementioned COW design).

Let's expand a bit on why. ZFS could save the DVAs of fallocated
blocks in the file's dnode for use later when the file is either
deleted (last unlink) or written to.
Admittedly it'd be tricky: the pre-allocated blocks would have to
include blocks for writing metadata all the way up to the root, and
the block sizes would have to be just right. That would effectively
mean pre-allocating the largest possible block sizes: ZFS has
variable block sizes, and although all the data blocks of any given
file are the same size, that size can change when the file is one
block long and grows (this applies to a bunch of metadata as well).
That'd be rather painful.

For a SQLite3 DB/WAL in a dedicated ZFS dataset you could use
reservations to roughly equivalent effect to posix_fallocate(). But
that's not a solution.

Nico
--
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
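
For illustration only, here is a minimal C sketch of the fallback
argued for above: try posix_fallocate(), treat ENOTSUP (or EOPNOTSUPP)
as "shrug and continue" by falling back to truncate-and-write, and
treat every other error, including EINVAL, as fatal. The names
prealloc_with_fallback() and grow_by_truncate_and_write() are
hypothetical and are not taken from SQLite's source. Writing a single
byte at the end of the file is just one assumed reading of
"truncate-and-write".

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fallback: extend fd to `size` bytes with ftruncate() and
 * write the final byte. Unlike posix_fallocate(), this gives no guarantee
 * that later writes cannot fail for lack of space.
 * Returns 0 on success or an errno value on failure. */
static int grow_by_truncate_and_write(int fd, off_t size)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= size)
        return 0;                       /* never shrink the file */
    if (ftruncate(fd, size) != 0)
        return errno;

    /* "" is a one-byte array holding '\0'; write it at the last offset. */
    ssize_t n = pwrite(fd, "", 1, size - 1);
    if (n != 1)
        return n < 0 ? errno : EIO;
    return 0;
}

/* Hypothetical wrapper: prefer posix_fallocate(), fall back only when the
 * filesystem says the operation is unsupported.
 * Returns 0 on success or an errno value on failure. */
int prealloc_with_fallback(int fd, off_t size)
{
    if (size <= 0)
        return EINVAL;

    /* posix_fallocate() returns the error number; it does not set errno. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc == 0)
        return 0;

    /* ENOTSUP and EOPNOTSUPP share a value on some systems, so compare
     * with ==, not in a switch. */
    if (rc == ENOTSUP || rc == EOPNOTSUPP)
        return grow_by_truncate_and_write(fd, size);

    return rc;                          /* EINVAL and the rest stay fatal */
}
```

A caller would do something like rc = prealloc_with_fallback(fd, hint)
and treat a nonzero rc as an errno value. Whether a given filesystem
actually reports ENOTSUP rather than EINVAL is exactly the open question
in this thread, so the set of error codes that trigger the fallback is
an assumption.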