Re: [sqlite] I/O errors with WAL on ZFS

Sašo Kiselkov Wed, 24 Apr 2013 09:22:07 -0700

On 04/24/2013 05:17 PM, Sašo Kiselkov wrote:
> On 04/24/2013 04:44 PM, Sašo Kiselkov wrote:
>> On 04/24/2013 04:41 PM, Richard Hipp wrote:
>>> On Wed, Apr 24, 2013 at 10:28 AM, Sašo Kiselkov 
>>> <skiselkov...@gmail.com> wrote:
>>>
>>>> On 04/24/2013 03:57 PM, Richard Hipp wrote:
>>>>> On Wed, Apr 24, 2013 at 8:28 AM, Sašo Kiselkov <skiselkov...@gmail.com
>>>>> wrote:
>>>>>
>>>>>> I'm running into I/O errors when trying to access a sqlite3 database
>>>>>> which is using WAL from my app. While using journal_mode=delete,
>>>>>> everything is fine, but as soon as I switch over to journal_mode=wal, I
>>>>>> just get a load of I/O errors on any query, regardless if it is a SELECT
>>>>>> or UPDATE/INSERT.
>>>>>>
>>>>>
>>>>>
>>>>> Can you please turn on error logging (as described at
>>>>> http://www.sqlite.org/draft/errlog.html) and let us know more details
>>>>> about the I/O error you are seeing?
>>>>
>>>> Here's my error log:
>>>>
>>>> #4874: os_unix.c:27116: (22) fallocate(/root/test/idx/block.db-shm) - Invalid argument
>>>>
>>>
>>> So apparently, the call to fallocate() on the file
>>> /root/test/idx/block.db-shm is failing with errno==22.  Do you have any
>>> idea why that might be?
>>>
>>> Can you tell me exactly which version of SQLite you are using so that I can
>>> figure out what line 27116 says?  Or maybe look at line 27116 of sqlite3.c
>>> yourself and let us know which line of code the error is occurring on?
>>
>> I'm running sqlite-autoconf-3071602, here's the relevant bits of code
>> from sqlite3.c:
>>
>> if( sStat.st_size<nByte ){
>>   /* The requested memory region does not exist. If bExtend is set to
>>    ** false, exit early. *pp will be set to NULL and SQLITE_OK returned.
>>    **
>>    ** Alternatively, if bExtend is true, use ftruncate() to allocate
>>    ** the requested memory region.
>>    */
>>   if( !bExtend ) goto shmpage_out;
>> #if defined(HAVE_POSIX_FALLOCATE) && HAVE_POSIX_FALLOCATE
>>   if( osFallocate(pShmNode->h, sStat.st_size, nByte)!=0 ){
>>     rc = unixLogError(SQLITE_IOERR_SHMSIZE, "fallocate",
>>                       pShmNode->zFilename);
>>     goto shmpage_out;
>>   }
>> #else
>>   if( robust_ftruncate(pShmNode->h, nByte) ){
>>     rc = unixLogError(SQLITE_IOERR_SHMSIZE, "ftruncate",
>>                       pShmNode->zFilename);
>>     goto shmpage_out;
>>   }
>> #endif
>> }
> 
> I think I've found it. Dtracing around in the system, this is the ZFS
> kernel code that's being called:
> 
>   6    -> zfs_space
>   6      -> rrw_enter
>   6        -> rrw_enter_read
>   6        <- rrw_enter_read
>   6      <- rrw_enter
>   6      -> rrw_exit
>   6      <- rrw_exit
>   6    <- zfs_space
> 
> Looking at the implementation of zfs_space, I can see this tidbit:
> 
> /*
>  * ...
>  * Currently, this function only supports the `F_FREESP' command.
>  * ...
>  */
> static int
> zfs_space(vnode_t *vp, int cmd, flock64_t *bfp, int flag,
>     offset_t offset, cred_t *cr, caller_context_t *ct)
> {
>       ...
>       if (cmd != F_FREESP) {
>               ZFS_EXIT(zfsvfs);
>               return (SET_ERROR(EINVAL));
>       }
>       ...
> }
> 
> So it appears that F_ALLOCSP isn't supported on ZFS. This appears to be
> the case for all platforms where ZFS is available, not just SunOS. For
> instance, ZFS on Linux has this problem as well:
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_vnops.c#L4225-L4228
> 
> Is there some way to work around posix_fallocate and still have WAL
> support in SQLite?


Just as a quick follow-up on this, when I manually undefine
HAVE_POSIX_FALLOCATE, which makes SQLite fall back to the
truncate-and-write implementation, everything works fine.

ZFS has been the filesystem of choice for SunOS-based systems for about
the last 5 years now, is becoming that for FreeBSD as we speak, and is
quickly gaining ground on Linux. The absence of support for
posix_fallocate() on ZFS kind of makes sense, since copy-on-write
filesystems cannot keep the posix_fallocate promise:

http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
"If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media."

COW filesystems never overwrite data in place and instead always
allocate new blocks, meaning even if the file being written to has data
blocks allocated, and the application thinks it's just overwriting the
existing blocks, under the hood the filesystem allocates new data
blocks, writes the data to them and then it *might* choose to discard
the original data (modulo snapshots, clones and a myriad of other
mechanisms in which data can be retained).

As such, I would suggest one of:

 1) Introduce a configure option which allows SQLite users to explicitly
    disable posix_fallocate support, if they expect to be running on
    file systems without support for it. Merely switching by OS may not
    be reliable enough, since for instance UFS on SunOS implements it and
    there is no simple way for libc to guess what file system a
    particular file sits on.

 2) Implement some sort of automatic fallback method which detects the
    EINVAL condition and attempts to fall back to using the
    truncate-and-write method.
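
A minimal sketch of what option 2 could look like, written as a
standalone helper rather than a patch against sqlite3.c. The name
extend_file() and its pgsz parameter are hypothetical and not part of
SQLite; treating EOPNOTSUPP the same as EINVAL is also an assumption.
Note that posix_fallocate() returns the error number directly instead
of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
** Hypothetical helper (not SQLite code): grow the file open on fd to at
** least nByte bytes. Try posix_fallocate() first; if the file system
** rejects it, fall back to truncate-and-write.
** Returns 0 on success or an errno value on failure.
*/
static int extend_file(int fd, off_t nByte, off_t pgsz){
  struct stat st;
  off_t off;
  int rc;

  if( fstat(fd, &st)!=0 ) return errno;
  if( st.st_size>=nByte ) return 0;          /* already large enough */

  /* posix_fallocate() reports failure via its return value, not errno */
  rc = posix_fallocate(fd, st.st_size, nByte - st.st_size);
  if( rc==0 ) return 0;
  if( rc!=EINVAL && rc!=EOPNOTSUPP ) return rc;   /* a real error */

  /* Fallback: extend the file, then write one zero byte at the end of
  ** each page past the old end-of-file so the blocks get allocated. */
  if( ftruncate(fd, nByte)!=0 ) return errno;
  for(off=(st.st_size/pgsz)*pgsz + pgsz-1; off<nByte; off+=pgsz){
    if( pwrite(fd, "", 1, off)!=1 ) return errno ? errno : EIO;
  }
  return 0;
}
```

The first write offset is never below the old file size, so the
fallback path does not overwrite existing data. On a COW file system
the writes still cannot guarantee that later overwrites will succeed,
but they do surface an out-of-space condition at extension time.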

If method #2 is acceptable for the SQLite project, I can attempt to
implement it. I could also implement support for posix_fallocate in
ZFS, but that will take a lot of time to get widely deployed (at least
several years), and even then the best ZFS could do is lie to the
applications (due to the aforementioned COW design).

Cheers,
--
Saso
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
