On 04/24/2013 05:17 PM, Sašo Kiselkov wrote:
> On 04/24/2013 04:44 PM, Sašo Kiselkov wrote:
>> On 04/24/2013 04:41 PM, Richard Hipp wrote:
>>> On Wed, Apr 24, 2013 at 10:28 AM, Sašo Kiselkov
>>> <skiselkov...@gmail.com> wrote:
>>>
>>>> On 04/24/2013 03:57 PM, Richard Hipp wrote:
>>>>> On Wed, Apr 24, 2013 at 8:28 AM, Sašo Kiselkov
>>>>> <skiselkov...@gmail.com> wrote:
>>>>>
>>>>>> I'm running into I/O errors when trying to access a sqlite3 database
>>>>>> which is using WAL from my app. While using journal_mode=delete,
>>>>>> everything is fine, but as soon as I switch over to journal_mode=wal, I
>>>>>> just get a load of I/O errors on any query, regardless if it is a SELECT
>>>>>> or UPDATE/INSERT.
>>>>>
>>>>> Can you please turn on error logging (as described at
>>>>> http://www.sqlite.org/draft/errlog.html) and let us know more details
>>>>> about the I/O error you are seeing?
>>>>
>>>> Here's my error log:
>>>>
>>>> #4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) -
>>>> Invalid argument
>>>
>>> So apparently, the call to fallocate() on the file
>>> /root/test/idx/block.db-shm is failing with errno==22. Do you have any
>>> idea why that might be?
>>>
>>> Can you tell me exactly which version of SQLite you are using so that I can
>>> figure out what line 27116 says? Or maybe look at line 27116 of sqlite3.c
>>> yourself and let us know which line of code the error is occurring on?
>>
>> I'm running sqlite-autoconf-3071602, here's the relevant bits of code
>> from sqlite3.c:
>>
>>   if( sStat.st_size<nByte ){
>>     /* The requested memory region does not exist. If bExtend is set to
>>     ** false, exit early. *pp will be set to NULL and SQLITE_OK returned.
>>     **
>>     ** Alternatively, if bExtend is true, use ftruncate() to allocate
>>     ** the requested memory region.
>>     */
>>     if( !bExtend ) goto shmpage_out;
>> #if defined(HAVE_POSIX_FALLOCATE) && HAVE_POSIX_FALLOCATE
>>     if( osFallocate(pShmNode->h, sStat.st_size, nByte)!=0 ){
>>       rc = unixLogError(SQLITE_IOERR_SHMSIZE, "fallocate",
>>                         pShmNode->zFilename);
>>       goto shmpage_out;
>>     }
>> #else
>>     if( robust_ftruncate(pShmNode->h, nByte) ){
>>       rc = unixLogError(SQLITE_IOERR_SHMSIZE, "ftruncate",
>>                         pShmNode->zFilename);
>>       goto shmpage_out;
>>     }
>> #endif
>>   }
>
> I think I've found it. Dtracing around in the system, this is the ZFS
> kernel code that's being called:
>
>   6  -> zfs_space
>   6    -> rrw_enter
>   6      -> rrw_enter_read
>   6      <- rrw_enter_read
>   6    <- rrw_enter
>   6    -> rrw_exit
>   6    <- rrw_exit
>   6  <- zfs_space
>
> Looking at the implementation of zfs_space, I can see this tidbit:
>
> /*
>  * ...
>  * Currently, this function only supports the `F_FREESP' command.
>  * ...
>  */
> static int
> zfs_space(vnode_t *vp, int cmd, flock64_t *bfp, int flag,
>     offset_t offset, cred_t *cr, caller_context_t *ct)
> {
>         ...
>         if (cmd != F_FREESP) {
>                 ZFS_EXIT(zfsvfs);
>                 return (SET_ERROR(EINVAL));
>         }
>         ...
> }
>
> So it appears that F_ALLOCSP isn't supported on ZFS. This appears to be
> the case for all platforms where ZFS is available, not just SunOS. For
> instance, ZFS on Linux has this problem as well:
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_vnops.c#L4225-L4228
>
> Is there some way to work around posix_fallocate and still have WAL
> support in SQLite?

Just as a quick follow-up on this: when I manually undefine
HAVE_POSIX_FALLOCATE, which makes SQLite fall back to the
truncate-and-write implementation, everything works fine.

ZFS has been the filesystem of choice for SunOS-based systems for about
the last 5 years now, is becoming that for FreeBSD as we speak, and is
quickly gaining ground on Linux. The absence of support for
posix_fallocate() on ZFS kind of makes sense, since copy-on-write
filesystems cannot keep the posix_fallocate promise:

http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html

"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media."

COW filesystems never overwrite data in place and instead always
allocate new blocks. Even if the file being written to already has data
blocks allocated, and the application thinks it is just overwriting the
existing blocks, under the hood the filesystem allocates new data
blocks, writes the data to them, and then *might* choose to discard the
original data (modulo snapshots, clones and a myriad of other mechanisms
by which data can be retained).

As such, I would suggest one of:

1) Introduce a configure option which allows SQLite users to explicitly
   disable posix_fallocate support if they expect to be running on file
   systems without support for it. Merely switching by OS may not be
   reliable enough, since, for instance, UFS on SunOS implements it and
   there is no simple way for libc to guess what file system a
   particular file sits on.

2) Implement some sort of automatic fallback method which detects the
   EINVAL condition and attempts to fall back to using the
   truncate-and-write method.

If method #2 is acceptable for the SQLite project, I can attempt to
implement it.
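
For what it's worth, option 2 could look roughly like the sketch below.
This is not a patch against os_unix.c. The names extend_region(),
truncate_and_write(), always_einval() and demo_size() are made up for
illustration, the 4096-byte block size is an assumption, and
always_einval() only simulates a filesystem that rejects preallocation
the way zfs_space() does. Note that posix_fallocate() returns the error
number directly instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLOCK 4096   /* assumed block size, for illustration only */

/* Same signature as posix_fallocate(), so a failure can be simulated. */
typedef int (*alloc_fn)(int fd, off_t offset, off_t len);

/* Hypothetical stand-in for a filesystem that refuses preallocation. */
int always_einval(int fd, off_t offset, off_t len){
  (void)fd; (void)offset; (void)len;
  return EINVAL;
}

/* Fallback: grow the file with ftruncate(), then write one zero byte
** into every block of the newly added region. The region is already
** zero-filled by ftruncate(), so the writes do not change its contents. */
int truncate_and_write(int fd, off_t cur_size, off_t new_size){
  const char zero = 0;
  off_t pos = cur_size;
  if( ftruncate(fd, new_size)!=0 ) return errno;
  while( pos<new_size ){
    ssize_t n = pwrite(fd, &zero, 1, pos);
    if( n<0 ) return errno;
    if( n!=1 ) return EIO;
    pos = (pos/SKETCH_BLOCK + 1)*SKETCH_BLOCK;  /* start of next block */
  }
  return 0;
}

/* Try the preallocation call first; only on EINVAL use the fallback.
** Returns 0 on success or an errno-style error number. */
int extend_region(int fd, off_t cur_size, off_t new_size, alloc_fn alloc){
  int rc;
  assert( fd>=0 && alloc!=0 );
  if( new_size<=cur_size ) return 0;
  rc = alloc(fd, cur_size, new_size - cur_size);
  if( rc==EINVAL ){
    rc = truncate_and_write(fd, cur_size, new_size);
  }
  return rc;
}

/* Extends a scratch file from 0 to `want` bytes using `alloc` and
** returns the resulting file size, or -1 on any failure. */
off_t demo_size(off_t want, alloc_fn alloc){
  char path[] = "/tmp/shm_sketch_XXXXXX";
  struct stat st;
  off_t result = -1;
  int fd = mkstemp(path);
  if( fd<0 ) return -1;
  if( extend_region(fd, 0, want, alloc)==0 && fstat(fd, &st)==0 ){
    result = st.st_size;
  }
  close(fd);
  unlink(path);
  return result;
}
```

Calling demo_size(32768, always_einval) exercises the fallback path,
while passing posix_fallocate itself as the last argument exercises the
normal path. In SQLite the equivalent check would presumably sit where
osFallocate() fails in the code quoted above, retrying with
robust_ftruncate() before reporting SQLITE_IOERR_SHMSIZE.
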
I could also implement support for posix_fallocate in ZFS, but that
would take a long time to get widely deployed (at least several years),
and even then the best ZFS could do is lie to applications (due to the
aforementioned COW design).

Cheers,
--
Saso