Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-28 Thread Simon Slavin
TL;DR: If you want ACID at the OS and storage firmware level, expect to buy 
expensive server-rated hardware and expect it to be slow.

On 28 Jan 2013, at 12:30pm, Phil Schwan  wrote:

> Arguably more importantly, there's an OS page cache that sits between your
> application (sqlite) and the file system.  Unless you disable the cache --
> the equivalent of doing an fdatasync() after every operation anyway -- or
> you have an exceptionally clever file system, the OS will combine separate
> writes to the same page before they hit the disk.

On 28 Jan 2013, at 12:29pm, Richard Hipp  wrote:

> Furthermore, I'm pretty sure every modern unix-like system will usually
> reorder the sequence of write()s so that data reaches oxide in a different
> order from how data was written by the application.

Worse still, your hard disk subsystem (you noticed your hard disk has a green 
circuit board on it, didn't you?) probably reorders writes too, because this 
results in faster overall operation.  And since it's happening at the storage 
module level, the OS doesn't know about it and can't stop it.

You can buy expensive server-rated hard disks which have jumper settings that 
turn this off.  They're slow.  You wouldn't want to boot from one.  But I have 
tested more than one cheap hard disk which does have the jumper settings 
(according to the documentation) but, as far as I could see, ignored them.  At 
least, sending it a large set of intentionally badly-ordered writes took no 
longer (on statistical average) when I moved the jumpers.

Amusingly, SSDs seem to do less optimization of this kind in storage 
firmware, presumably because they're actually faster when writes are 
distributed between the different banks of storage.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-28 Thread Phil Schwan
I'm not even sure why I'm wading into this; glutton for punishment, I guess.


TL;DR: the assumption that a data-journaled file system guarantees the
atomicity of individual write()s is, in my experience, not a valid one.



Unfortunately this isn't really a topic about which one can draw general
conclusions.  In practice, every file system -- especially network file
systems -- provides subtly different semantics.

But if I *had* to make a general statement, I'd say that even with full
data journaling you cannot trust that writes will necessarily be replayed
in full or in order without explicit fdatasync()ing as sqlite does.  There
are many ways in which the ordering may be disrupted.

Most file systems guarantee atomicity of writes only up to a relatively
small block size (4k is popular).  If you write multiple blocks, even in a
single system call, it's usually possible for one to succeed and another to
fail.

Arguably more importantly, there's an OS page cache that sits between your
application (sqlite) and the file system.  Unless you disable the cache --
the equivalent of doing an fdatasync() after every operation anyway -- or
you have an exceptionally clever file system, the OS will combine separate
writes to the same page before they hit the disk.
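That page-cache behaviour can be sketched in a few lines (a minimal illustration, assuming a POSIX system; the file name is made up): two writes to the same page without an intervening sync may be combined in the cache, so only the final contents ever reach the disk.

```python
import os
import tempfile

# Illustrative sketch: two writes to the same 4 KiB page. Without an
# intervening fdatasync(), the kernel is free to merge them in the page
# cache and issue one combined write, so the intermediate state may
# never reach the disk at all.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

os.pwrite(fd, b"A" * 4096, 0)   # first version of page 0
os.pwrite(fd, b"B" * 4096, 0)   # overwrites it while still cached

# Only this barrier forces the current contents toward stable storage;
# everything before it may have been combined or reordered.
os.fdatasync(fd)
os.close(fd)

with open(path, "rb") as f:
    data = f.read(4096)
print(data[:1])  # b'B' -- the intermediate b'A' page was never made durable
```

This is exactly why SQLite issues its own fdatasync() calls at transaction boundaries rather than trusting the cache.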

The purpose of data journaling for most file systems isn't to provide
strict atomicity OR ordering of application writes.  It's to prevent things
like an unallocated data block (containing whatever random rubbish was
already at that point on the disk platter) appearing in your file if a
power loss or system crash occurs shortly after a write().

Consider the case where:

- sqlite transaction A modifies blocks 1 and 2 of the file
- sqlite transaction B modifies blocks 2 and 3 of the file
- you kick out the power cable

Without an intermediate fdatasync(), it's entirely possible with most file
systems that:

- blocks 1 and 2 are written, containing the changes from both
transactions, but not block 3

- blocks 2 and 3 are written, containing the changes from both
transactions, but not block 1

- blocks 1 and 3 are written, but not block 2

- all three blocks are written, but block 2 is missing the
modification from transaction B

Any of those scenarios results in a corrupted database.
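The scenario above can be enumerated mechanically (a toy model, not a real file system): transaction A rewrites blocks 1 and 2, transaction B then rewrites blocks 2 and 3, and any subset of the dirty blocks may be on the platter when power is cut.

```python
from itertools import combinations

# Toy model of the crash scenario described above. Note that because the
# page cache combines A's and B's changes to block 2, the "A committed,
# B not yet" state is not even reachable on disk.
initial = {1: "old", 2: "old", 3: "old"}
final   = {1: "A",   2: "AB",  3: "B"}   # what a full flush would leave

consistent = [initial,                       # neither transaction committed
              {1: "A", 2: "A", 3: "old"},    # A only
              final]                         # A then B

survivors = 0
for n in range(4):
    for flushed in combinations([1, 2, 3], n):
        state = dict(initial)
        for blk in flushed:
            state[blk] = final[blk]   # this block made it to disk
        if state in consistent:
            survivors += 1

print(survivors, "of 8 crash states leave a consistent database")
```

Only the "nothing flushed" and "everything flushed" states are consistent; the other six are corrupted databases, which is the point of the intermediate fdatasync().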

Maybe you're using a file system that protects against these things and
guarantees strict ordering.  Are you sure it does that even in the face of
a power outage?  Of a disk failure?  Of a disk failure that occurs when
restoring from a power failure?  Of the failure of one of the DIMMs in its
cache?

For 6 years I worked with clients who had a nearly-unlimited budget to
throw at hardware, and none of those systems provided those guarantees
unless you disabled the caches (at which point you might as well let sqlite
call fsync).  Are you willing to bet that yours does?

Cheers,

-p

On 28 January 2013 18:57, Shuki Sasson  wrote:

>
> A *physical journal* logs an advance copy of every block that will later be
> written to the main file system. If there is a crash when the main file
> system is being written to, the write can simply be replayed to completion
> when the file system is next mounted. If there is a crash when the write is
> being logged to the journal, the partial write will have a missing or
> mismatched checksum and can be ignored at next mount.
>
> Physical journals impose a significant performance penalty because every
> changed block must be committed *twice* to storage, but may be acceptable
> when *absolute fault protection is
> required.*


Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-28 Thread Richard Hipp
On Sun, Jan 27, 2013 at 12:21 PM, Shuki Sasson wrote:

> No confusion here, the atomicity of the FS journal guarantees that the
> fwrite will happen in full or not happen at all...
>

First off, SQLite uses write(), not fwrite().

Secondly, I don't think any modern unix-like system guarantees atomicity of
write() operations.  You might get an atomic write under special
circumstances (such as a write of exactly one sector) but not in general.
Furthermore, I'm pretty sure every modern unix-like system will usually
reorder the sequence of write()s so that data reaches oxide in a different
order from how data was written by the application.  Together, these
observations go a long way to ensuring that your database files will go
corrupt if you are in the middle of a transaction with PRAGMA
synchronous=OFF when the power goes out, even if you are on the latest
trendy journalling file system.

But you are welcome to try to prove me wrong by running the experiment
yourself.
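A minimal harness for that experiment might look like this (a sketch; the file name and churn loop are made up). The interesting part cannot be automated from inside the script: you cut power, or kill the VM, mid-loop, then reopen the file and check it.

```python
import sqlite3

# With synchronous=OFF SQLite never calls fsync, so a crash mid-transaction
# can leave the database file corrupt. This script only sets the scene;
# the crash has to come from a real power cut or a hard VM kill.
con = sqlite3.connect("crashtest.db")
con.execute("PRAGMA synchronous=OFF")
con.execute("CREATE TABLE IF NOT EXISTS t(id INTEGER PRIMARY KEY, v)")

# Churn the database; pull the plug at a random moment during this loop,
# then reopen the file and run PRAGMA integrity_check.
with con:
    for i in range(1000):
        con.execute("INSERT INTO t(v) VALUES (?)", (i,))

print(con.execute("PRAGMA integrity_check").fetchone()[0])  # 'ok' absent a crash
con.close()
```

Run it repeatedly with real power cuts; on a journalling file system you should still, per the argument above, eventually see integrity_check report corruption.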



> The atomicity of SQLITE is guaranteed by SQLITE mechanisms that are
> completely independent of that.
> However, the synchronization pragma got everything to do with the File
> System fwrite mechansim and nothing to do with the SQLITE mechanism for
> atomicity.
>
> I hope that clears things up here too.. :-)
>
> Shuki
>
> On Sun, Jan 27, 2013 at 12:11 PM, Stephan Beal  >wrote:
>
> > On Sun, Jan 27, 2013 at 5:53 PM, Shuki Sasson 
> > wrote:
> >
> > > Answer: The journal is organized in transactions that each of them is
> > > atomic, so all the buffered cache changes for such operation are put
> into
> > > the transaction. Only fully completed transaction are replayed when the
> > > system is recovering from a panic or power loss.
> > >
> >
> > Be careful not to confuse atomic here with atomic in the SQL sense. Any
> > given SQL write operation is made up of many fwrite() (or equivalent)
> > calls, each of which is (in the journaling FS case) atomic in the sense
> of
> > the whole write() call lands stably on disk or does not, but that has
> > absolutely nothing to do with the atomicity of the complete SQL-related
> > write (whether that be a single field of a single record, metadata for a
> > record, or a whole SQL transaction).
> >
> >
> > > well as data it makes all the sense in the world to run with
> > > synchronization = OFF and gain the additional performance benefits.
> > >
> >
> > Good luck with that. :)
> >
> > --
> > - stephan beal
> > http://wanderinghorse.net/home/stephan/
> > http://gplus.to/sgbeal



-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-28 Thread Shuki Sasson
UFS is not a fully journaled FS; it just journals the metadata.
With a fully journaled file system that journals both metadata and data there
is no possibility of losing unsaved data.
Anything that was handed to fwrite, and that fwrite returned an OK for, is
backed by the journal.
Read the following:
http://en.wikipedia.org/wiki/Journaling_file_system
Read the following from there:
Physical journals

A *physical journal* logs an advance copy of every block that will later be
written to the main file system. If there is a crash when the main file
system is being written to, the write can simply be replayed to completion
when the file system is next mounted. If there is a crash when the write is
being logged to the journal, the partial write will have a missing or
mismatched checksum and can be ignored at next mount.
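The replay-and-checksum behaviour that passage describes can be sketched as a toy model (illustrative only; record layout and names are made up, not any real file system's format):

```python
import zlib

# Toy physical journal: each record carries the block number, the block
# data, and a CRC32 checksum of both.
def journal_record(block_no: int, data: bytes) -> bytes:
    payload = block_no.to_bytes(4, "big") + data
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def replay(journal, disk):
    # "Mount": replay each record whose checksum verifies; a record that
    # was being logged when the crash hit has a mismatched checksum and
    # is ignored, exactly as the quoted passage describes.
    for rec in journal:
        payload, crc = rec[:-4], int.from_bytes(rec[-4:], "big")
        if zlib.crc32(payload) != crc:
            continue  # partial/corrupt record: skipped at next mount
        disk[int.from_bytes(payload[:4], "big")] = payload[4:]
    return disk

good = journal_record(1, b"new block 1")
bad = bytearray(journal_record(2, b"new block 2"))
bad[-1] ^= 0xFF                    # crash while logging: record tail garbled
disk = replay([good, bytes(bad)], {1: b"old", 2: b"old"})
print(disk)  # {1: b'new block 1', 2: b'old'}
```

Block 1's complete record replays to completion; block 2's garbled record is ignored, so block 2 keeps its old contents.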

Physical journals impose a significant performance penalty because every
changed block must be committed *twice* to storage, but may be acceptable
when *absolute fault protection is
required.*

Hope this clears things up.
Shuki
On Sun, Jan 27, 2013 at 10:14 PM, Pavel Ivanov  wrote:

> OK. I picked this one:
> http://www.freebsd.org/doc/en/articles/gjournal-desktop/article.html.
> It says:
>
> A journaling file system uses a log to record all transactions that
> take place in the file system, and preserves its integrity in the
> event of a system crash or power failure. Although it is still
> possible to lose unsaved changes to files, journaling almost
> completely eliminates the possibility of file system corruption caused
> by an unclean shutdown.
>
> So with UFS you have guarantees that file system won't corrupt. But
> there's absolutely no durability guarantees ("it is possible to lose
> unsaved changes") and I don't see guarantees that SQLite file format
> won't corrupt (FS may be non-corrupt while file data are bogus). While
> I agree the latter is arguable and could be preserved, durability is a
> big reason to use pragma synchronous = normal. Sure, if you don't care
> about it you may not use that, you may as well use WAL journal mode
> (which AFAIR can also lose some of last changed data with pragma
> synchronous = normal). But still your claim that UFS with full
> journaling is a complete replacement for pragma synchronous = normal
> is false.
>
>
> Pavel
>
> On Sun, Jan 27, 2013 at 5:20 PM, Shuki Sasson 
> wrote:
> > Pick up any book about UFS and read about the journal...
> >
> > Shuki
> >
> > On Sun, Jan 27, 2013 at 7:56 PM, Pavel Ivanov 
> wrote:
> >
> >> > So in any file system that supports journaling fwrite is blocked until
> >> all
> >> > metadata and data changes are made to the buffer cache and journal is
> >> > update with the changes.
> >>
> >> Please give us some links where did you get all this info with the
> >> benchmarks please. Because what you try to convince us is that with
> >> journaling FS write() doesn't return until the journal record is
> >> guaranteed to physically make it to disk. First of all I don't see
> >> what's the benefit of that compared to direct writing to disk not
> >> using write-back cache. And second do you realize that in this case
> >> you can't make more than 30-50 journal records per second? Do you
> >> really believe that for good OS performance it's enough to make less
> >> than 30 calls to write() per second (on any file, not on each file)? I
> >> won't believe that until I see data and benchmarks from reliable
> >> sources.
> >>
> >>
> >> Pavel
> >>
> >>
> >> On Sun, Jan 27, 2013 at 8:53 AM, Shuki Sasson 
> >> wrote:
> >> > Hi Pavel, thanks a lot for your answer. Assuming xWrite is using
> fwrite
> >> > here is what is going on the File System:
> >> > In a legacy UNIX File System (UFS) the journaling protects only the
> >> > metadata (inode structure directory block indirect block etc..) but
> not
> >> the
> >> > data itself.
> >> > In more modern File Systems (usually one that are enterprise based
> like
> >> EMC
> >> > OneFS on the Isilon product) both data and meta data are journaled.
> >> >
> >> > How journaling works?
> >> > The File System has a cache of the File System blocks it deals with
> (both
> >> > metadata and data) when changes are made to a buffer cached block it
> is
> >> > made to the memory only and the set of changes is save to the journal
> >> > persistently. When the persistent journal is on disk than saving both
> >> data
> >> > and meta data changes
> >> > takes too long and and only meta data changes are journaled. If the
> >> journal
> >> > is placed on NVRAM then it is fast enough to save both data and
> metadata
> >> > changes to the journal.
> >> > So in any file system that supports journaling fwrite is blocked until
> >> all
> >> > metadata and data changes are made to the buffer cache and journal is
> >> > update with the changes.
> >> > The only question tha

Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Pavel Ivanov
OK. I picked this one:
http://www.freebsd.org/doc/en/articles/gjournal-desktop/article.html.
It says:

A journaling file system uses a log to record all transactions that
take place in the file system, and preserves its integrity in the
event of a system crash or power failure. Although it is still
possible to lose unsaved changes to files, journaling almost
completely eliminates the possibility of file system corruption caused
by an unclean shutdown.

So with UFS you have guarantees that the file system won't corrupt. But
there are absolutely no durability guarantees ("it is possible to lose
unsaved changes") and I don't see guarantees that the SQLite file format
won't corrupt (the FS may be non-corrupt while the file data are bogus).
While I agree the latter is arguable and could be preserved, durability is
a big reason to use pragma synchronous = normal. Sure, if you don't care
about it you may skip that; you may as well use WAL journal mode (which
AFAIR can also lose some of the last changed data with pragma
synchronous = normal). But still, your claim that UFS with full
journaling is a complete replacement for pragma synchronous = normal
is false.
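The trade Pavel describes can be set up in a couple of pragmas (a sketch; the file name is made up): WAL with synchronous=NORMAL protects the database file's integrity while accepting that the most recent transactions may be lost on power failure.

```python
import os
import sqlite3
import tempfile

# Durability-vs-integrity trade: WAL journal mode plus synchronous=NORMAL.
# Integrity is preserved across a crash; the tail of recent transactions
# may be lost.
path = os.path.join(tempfile.mkdtemp(), "wal.db")
con = sqlite3.connect(path)
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
con.execute("PRAGMA synchronous=NORMAL")
sync = con.execute("PRAGMA synchronous").fetchone()[0]
print(mode, sync)  # wal 1  (1 == NORMAL)
con.close()
```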


Pavel

On Sun, Jan 27, 2013 at 5:20 PM, Shuki Sasson  wrote:
> Pick up any book about UFS and read about the journal...
>
> Shuki
>
> On Sun, Jan 27, 2013 at 7:56 PM, Pavel Ivanov  wrote:
>
>> > So in any file system that supports journaling fwrite is blocked until
>> all
>> > metadata and data changes are made to the buffer cache and journal is
>> > update with the changes.
>>
>> Please give us some links where did you get all this info with the
>> benchmarks please. Because what you try to convince us is that with
>> journaling FS write() doesn't return until the journal record is
>> guaranteed to physically make it to disk. First of all I don't see
>> what's the benefit of that compared to direct writing to disk not
>> using write-back cache. And second do you realize that in this case
>> you can't make more than 30-50 journal records per second? Do you
>> really believe that for good OS performance it's enough to make less
>> than 30 calls to write() per second (on any file, not on each file)? I
>> won't believe that until I see data and benchmarks from reliable
>> sources.
>>
>>
>> Pavel
>>
>>
>> On Sun, Jan 27, 2013 at 8:53 AM, Shuki Sasson 
>> wrote:
>> > Hi Pavel, thanks a lot for your answer. Assuming xWrite is using fwrite
>> > here is what is going on the File System:
>> > In a legacy UNIX File System (UFS) the journaling protects only the
>> > metadata (inode structure directory block indirect block etc..) but not
>> the
>> > data itself.
>> > In more modern File Systems (usually one that are enterprise based like
>> EMC
>> > OneFS on the Isilon product) both data and meta data are journaled.
>> >
>> > How journaling works?
>> > The File System has a cache of the File System blocks it deals with (both
>> > metadata and data) when changes are made to a buffer cached block it is
>> > made to the memory only and the set of changes is save to the journal
>> > persistently. When the persistent journal is on disk than saving both
>> data
>> > and meta data changes
>> > takes too long and and only meta data changes are journaled. If the
>> journal
>> > is placed on NVRAM then it is fast enough to save both data and metadata
>> > changes to the journal.
>> > So in any file system that supports journaling fwrite is blocked until
>> all
>> > metadata and data changes are made to the buffer cache and journal is
>> > update with the changes.
>> > The only question than is if the File System keeps a journal of both meta
>> > data and data , if your system has a file system that supports journaling
>> > to both metadata and data blocks than you are theoretically (if there are
>> > no bugs in the FS) guaranteed against data loss in case of system panic
>> or
>> > loss of power.
>> > So in short, fully journaled File System gives you the safety of
>> > synchronized = FULL (or even better) without the huge performance penalty
>> > associated with fsync (or fsyncdada).
>> >
>> > Additional Explanation: Why is cheaper to save the changes to the log
>> > rather the whole chached buffer (block)?
>> > Explanation: Each FileSystem block is 8K in size, some of the changes
>> > includes areas in the block that are smaller in size and only these
>> changes
>> > are recorders.
>> > What happens if a change to the File System involves multiple changes to
>> > data blocks as well as metadata blocks like when an fwrite operation
>> > increases the file size and induced an addition of an indirect meta data
>> > block?
>> > Answer: The journal is organized in transactions that each of them is
>> > atomic, so all the buffered cache changes for such operation are put into
>> > the transaction. Only fully completed transaction are replayed when the
>> > system is recovering from a panic or power loss.
>> >
>> > In short, in most file systems like UFS using synchronization = NORMAL
>> > makes a lot 

Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Shuki Sasson
Pick up any book about UFS and read about the journal...

Shuki

On Sun, Jan 27, 2013 at 7:56 PM, Pavel Ivanov  wrote:

> > So in any file system that supports journaling fwrite is blocked until
> all
> > metadata and data changes are made to the buffer cache and journal is
> > update with the changes.
>
> Please give us some links where did you get all this info with the
> benchmarks please. Because what you try to convince us is that with
> journaling FS write() doesn't return until the journal record is
> guaranteed to physically make it to disk. First of all I don't see
> what's the benefit of that compared to direct writing to disk not
> using write-back cache. And second do you realize that in this case
> you can't make more than 30-50 journal records per second? Do you
> really believe that for good OS performance it's enough to make less
> than 30 calls to write() per second (on any file, not on each file)? I
> won't believe that until I see data and benchmarks from reliable
> sources.
>
>
> Pavel
>
>
> On Sun, Jan 27, 2013 at 8:53 AM, Shuki Sasson 
> wrote:
> > Hi Pavel, thanks a lot for your answer. Assuming xWrite is using fwrite
> > here is what is going on the File System:
> > In a legacy UNIX File System (UFS) the journaling protects only the
> > metadata (inode structure directory block indirect block etc..) but not
> the
> > data itself.
> > In more modern File Systems (usually one that are enterprise based like
> EMC
> > OneFS on the Isilon product) both data and meta data are journaled.
> >
> > How journaling works?
> > The File System has a cache of the File System blocks it deals with (both
> > metadata and data) when changes are made to a buffer cached block it is
> > made to the memory only and the set of changes is save to the journal
> > persistently. When the persistent journal is on disk than saving both
> data
> > and meta data changes
> > takes too long and and only meta data changes are journaled. If the
> journal
> > is placed on NVRAM then it is fast enough to save both data and metadata
> > changes to the journal.
> > So in any file system that supports journaling fwrite is blocked until
> all
> > metadata and data changes are made to the buffer cache and journal is
> > update with the changes.
> > The only question than is if the File System keeps a journal of both meta
> > data and data , if your system has a file system that supports journaling
> > to both metadata and data blocks than you are theoretically (if there are
> > no bugs in the FS) guaranteed against data loss in case of system panic
> or
> > loss of power.
> > So in short, fully journaled File System gives you the safety of
> > synchronized = FULL (or even better) without the huge performance penalty
> > associated with fsync (or fsyncdada).
> >
> > Additional Explanation: Why is cheaper to save the changes to the log
> > rather the whole chached buffer (block)?
> > Explanation: Each FileSystem block is 8K in size, some of the changes
> > includes areas in the block that are smaller in size and only these
> changes
> > are recorders.
> > What happens if a change to the File System involves multiple changes to
> > data blocks as well as metadata blocks like when an fwrite operation
> > increases the file size and induced an addition of an indirect meta data
> > block?
> > Answer: The journal is organized in transactions that each of them is
> > atomic, so all the buffered cache changes for such operation are put into
> > the transaction. Only fully completed transaction are replayed when the
> > system is recovering from a panic or power loss.
> >
> > In short, in most file systems like UFS using synchronization = NORMAL
> > makes a lot of sense as data blocks are not protected by the journal,
> > however with more robust File System that have full journal for metadata
> as
> > well as data it makes all the sense in the world to run with
> > synchronization = OFF and gain the additional performance benefits.
> >
> > Let me know if I missed something and I hope this makes things clearer.
> > Shuki
> >
> >
> >
> >
> > On Sat, Jan 26, 2013 at 10:31 PM, Pavel Ivanov 
> wrote:
> >
> >> On Sat, Jan 26, 2013 at 6:50 PM, Shuki Sasson 
> >> wrote:
> >> >
> >> > Hi all, I read the documentation about the synchronization pragma.
> >> > It got to do with how often xSync method is called.
> >> > With synchronization = FULL xSync is called after each and every
> change
> >> to
> >> > the DataBase file as far as I understand...
> >> >
> >> > Observing the VFS interface used by the SQLITE:
> >> >
> >> > typedef struct sqlite3_io_methods sqlite3_io_methods;
> >> > struct sqlite3_io_methods {
> >> >   int iVersion;
> >> >   int (*xClose)(sqlite3_file*);
> >> >   int (*xRead)(sqlite3_file*, void*, int iAmt, sqlite3_int64 iOfst);
> >> >   *int (*xWrite)(sqlite3_file*, const void*, int iAmt, sqlite3_int64
> >> iOfst);*
> >> >   int (*xTruncate)(sqlite3_file*, sqlite3_int64 size);
> >> >  * int (*xSync)(sqlite3_file*, int flags);*

Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Pavel Ivanov
> So in any file system that supports journaling fwrite is blocked until all
> metadata and data changes are made to the buffer cache and journal is
> update with the changes.

Please give us some links to where you got all this info, with benchmarks.
Because what you are trying to convince us of is that with a journaling FS,
write() doesn't return until the journal record is guaranteed to
physically make it to disk. First of all, I don't see the benefit of that
compared to writing directly to disk without a write-back cache. And
second, do you realize that in this case you can't make more than 30-50
journal records per second? Do you really believe that fewer than 30
calls to write() per second (on any file, not on each file) is enough
for good OS performance? I won't believe that until I see data and
benchmarks from reliable sources.
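The 30-50 records/second figure follows from simple disk mechanics (assumed numbers for a 2013-era 7200 RPM disk; rough, not measured): if each journal record must reach the platter before write() returns, every record costs at least a seek plus rotational latency.

```python
# Back-of-the-envelope: cost of one fully synchronous journal append on
# a rotating disk (all figures assumed, not measured).
seek_ms = 9.0                          # assumed average seek time
rotational_ms = 0.5 * 60_000 / 7200    # half a revolution on average
per_sync_write_ms = seek_ms + rotational_ms
records_per_second = 1000 / per_sync_write_ms
print(round(records_per_second))  # 76 -- same order of magnitude as the
                                  # 30-50 estimate, which also budgets for
                                  # writing the data block itself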


Pavel


On Sun, Jan 27, 2013 at 8:53 AM, Shuki Sasson  wrote:
> Hi Pavel, thanks a lot for your answer. Assuming xWrite is using fwrite
> here is what is going on the File System:
> In a legacy UNIX File System (UFS) the journaling protects only the
> metadata (inode structure directory block indirect block etc..) but not the
> data itself.
> In more modern File Systems (usually one that are enterprise based like EMC
> OneFS on the Isilon product) both data and meta data are journaled.
>
> How journaling works?
> The File System has a cache of the File System blocks it deals with (both
> metadata and data) when changes are made to a buffer cached block it is
> made to the memory only and the set of changes is save to the journal
> persistently. When the persistent journal is on disk than saving both data
> and meta data changes
> takes too long and and only meta data changes are journaled. If the journal
> is placed on NVRAM then it is fast enough to save both data and metadata
> changes to the journal.
> So in any file system that supports journaling fwrite is blocked until all
> metadata and data changes are made to the buffer cache and journal is
> update with the changes.
> The only question than is if the File System keeps a journal of both meta
> data and data , if your system has a file system that supports journaling
> to both metadata and data blocks than you are theoretically (if there are
> no bugs in the FS) guaranteed against data loss in case of system panic or
> loss of power.
> So in short, fully journaled File System gives you the safety of
> synchronized = FULL (or even better) without the huge performance penalty
> associated with fsync (or fsyncdada).
>
> Additional Explanation: Why is cheaper to save the changes to the log
> rather the whole chached buffer (block)?
> Explanation: Each FileSystem block is 8K in size, some of the changes
> includes areas in the block that are smaller in size and only these changes
> are recorders.
> What happens if a change to the File System involves multiple changes to
> data blocks as well as metadata blocks like when an fwrite operation
> increases the file size and induced an addition of an indirect meta data
> block?
> Answer: The journal is organized in transactions that each of them is
> atomic, so all the buffered cache changes for such operation are put into
> the transaction. Only fully completed transaction are replayed when the
> system is recovering from a panic or power loss.
>
> In short, in most file systems like UFS using synchronization = NORMAL
> makes a lot of sense as data blocks are not protected by the journal,
> however with more robust File System that have full journal for metadata as
> well as data it makes all the sense in the world to run with
> synchronization = OFF and gain the additional performance benefits.
>
> Let me know if I missed something and I hope this makes things clearer.
> Shuki
>
>
>
>
> On Sat, Jan 26, 2013 at 10:31 PM, Pavel Ivanov  wrote:
>
>> On Sat, Jan 26, 2013 at 6:50 PM, Shuki Sasson 
>> wrote:
>> >
>> > Hi all, I read the documentation about the synchronization pragma.
>> > It got to do with how often xSync method is called.
>> > With synchronization = FULL xSync is called after each and every change
>> to
>> > the DataBase file as far as I understand...
>> >
>> > Observing the VFS interface used by the SQLITE:
>> >
>> > typedef struct sqlite3_io_methods sqlite3_io_methods;
>> > struct sqlite3_io_methods {
>> >   int iVersion;
>> >   int (*xClose)(sqlite3_file*);
>> >   int (*xRead)(sqlite3_file*, void*, int iAmt, sqlite3_int64 iOfst);
>> >   *int (*xWrite)(sqlite3_file*, const void*, int iAmt, sqlite3_int64
>> iOfst);*
>> >   int (*xTruncate)(sqlite3_file*, sqlite3_int64 size);
>> >  * int (*xSync)(sqlite3_file*, int flags);*
>> >
>> > *
>> > *
>> >
>> > I see both xWrite and xSync...
>> >
>> > Is this means that xWrite initiate  a FS write to the file?
>>
>> Yes, in a sense that subsequent read without power cut from the
>> machine will return written data.
>>
>> >
>> > Is that means that xSync makes sure that the FS buffered changes are
>> > synced to d

Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Shuki Sasson
No confusion here: the atomicity of the FS journal guarantees that the
fwrite will happen in full or not happen at all...
The atomicity of SQLite is guaranteed by SQLite mechanisms that are
completely independent of that.
However, the synchronization pragma has everything to do with the file
system's fwrite mechanism and nothing to do with the SQLite mechanism for
atomicity.

I hope that clears things up here too.. :-)

Shuki

On Sun, Jan 27, 2013 at 12:11 PM, Stephan Beal wrote:

> On Sun, Jan 27, 2013 at 5:53 PM, Shuki Sasson 
> wrote:
>
> > Answer: The journal is organized in transactions that each of them is
> > atomic, so all the buffered cache changes for such operation are put into
> > the transaction. Only fully completed transaction are replayed when the
> > system is recovering from a panic or power loss.
> >
>
> Be careful not to confuse atomic here with atomic in the SQL sense. Any
> given SQL write operation is made up of many fwrite() (or equivalent)
> calls, each of which is (in the journaling FS case) atomic in the sense of
> the whole write() call lands stably on disk or does not, but that has
> absolutely nothing to do with the atomicity of the complete SQL-related
> write (whether that be a single field of a single record, metadata for a
> record, or a whole SQL transaction).
>
>
> > well as data it makes all the sense in the world to run with
> > synchronization = OFF and gain the additional performance benefits.
> >
>
> Good luck with that. :)
>
> --
> - stephan beal
> http://wanderinghorse.net/home/stephan/
> http://gplus.to/sgbeal


Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Stephan Beal
On Sun, Jan 27, 2013 at 5:53 PM, Shuki Sasson  wrote:

> Answer: The journal is organized in transactions that each of them is
> atomic, so all the buffered cache changes for such operation are put into
> the transaction. Only fully completed transaction are replayed when the
> system is recovering from a panic or power loss.
>

Be careful not to confuse atomic here with atomic in the SQL sense. Any
given SQL write operation is made up of many fwrite() (or equivalent)
calls, each of which is (in the journaling-FS case) atomic in the sense
that the whole write() call lands stably on disk or does not, but that has
absolutely nothing to do with the atomicity of the complete SQL-related
write (whether that be a single field of a single record, metadata for a
record, or a whole SQL transaction).


> well as data it makes all the sense in the world to run with
> synchronization = OFF and gain the additional performance benefits.
>

Good luck with that. :)

--
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal


Re: [sqlite] [sqlite-dev] Can I safely use the pragma synchronization = OFF?

2013-01-27 Thread Shuki Sasson
Hi Pavel, thanks a lot for your answer. Assuming xWrite uses fwrite, here
is what goes on in the file system:
In a legacy UNIX File System (UFS), the journal protects only the
metadata (inode structure, directory blocks, indirect blocks, etc.), not
the data itself.
In more modern file systems (usually enterprise-grade ones, such as EMC
OneFS on the Isilon product), both data and metadata are journaled.

How does journaling work?
The file system keeps a cache of the blocks it deals with (both metadata
and data). When changes are made to a cached block, they are made in
memory only, and the set of changes is saved persistently to the journal.
When the journal itself lives on disk, saving both data and metadata
changes takes too long, so only metadata changes are journaled. If the
journal is placed on NVRAM, it is fast enough to record both data and
metadata changes.
So in any file system that supports journaling, fwrite blocks until all
metadata and data changes have been applied to the buffer cache and the
journal has been updated with them.
The only question, then, is whether the file system journals both
metadata and data. If it journals both, then you are theoretically
(barring bugs in the FS) protected against data loss in case of a system
panic or power failure.
So in short, a fully journaled file system gives you the safety of
synchronization = FULL (or even better) without the huge performance
penalty associated with fsync (or fdatasync).

Additional explanation: why is it cheaper to save the changes to the
journal rather than the whole cached buffer (block)?
Explanation: each file system block is 8K in size; many changes touch
only small areas within the block, and only those changed areas are
recorded.
What happens if a change to the file system involves multiple changes to
data blocks as well as metadata blocks, as when an fwrite operation
grows the file and induces the addition of an indirect metadata block?
Answer: the journal is organized into transactions, each of which is
atomic, so all the buffer-cache changes for such an operation are put
into one transaction. Only fully completed transactions are replayed when
the system recovers from a panic or power loss.

In short, with most file systems, such as UFS, using synchronization =
NORMAL makes a lot of sense, since data blocks are not protected by the
journal; with a more robust file system that fully journals data as well
as metadata, it makes all the sense in the world to run with
synchronization = OFF and gain the additional performance benefits.

Let me know if I missed something; I hope this makes things clearer.
Shuki




On Sat, Jan 26, 2013 at 10:31 PM, Pavel Ivanov  wrote:

> On Sat, Jan 26, 2013 at 6:50 PM, Shuki Sasson 
> wrote:
> >
> > Hi all, I read the documentation about the synchronization pragma.
> > It got to do with how often xSync method is called.
> > With synchronization = FULL xSync is called after each and every change
> to
> > the DataBase file as far as I understand...
> >
> > Observing the VFS interface used by the SQLITE:
> >
> > typedef struct sqlite3_io_methods sqlite3_io_methods;
> > struct sqlite3_io_methods {
> >   int iVersion;
> >   int (*xClose)(sqlite3_file*);
> >   int (*xRead)(sqlite3_file*, void*, int iAmt, sqlite3_int64 iOfst);
> >   int (*xWrite)(sqlite3_file*, const void*, int iAmt, sqlite3_int64 iOfst);
> >   int (*xTruncate)(sqlite3_file*, sqlite3_int64 size);
> >   int (*xSync)(sqlite3_file*, int flags);
> >   ...
> >
> > I see both xWrite and xSync...
> >
> > Does this mean that xWrite initiates an FS write to the file?
>
> Yes, in a sense that subsequent read without power cut from the
> machine will return written data.
>
> >
> > Does that mean that xSync makes sure that the FS-buffered changes are
> > synced to disk?
>
> Yes.
>
> > I guess it is calling fsync in case of LINUX /FreeBSD am I right?
>
> fdatasync() I think.
>
> > If the above is correct and SQLITE operates over a modern, reliable FS
> > that journals each write, then despite the fact that the write buffer
> > caches are not fully synced, they are protected by the FS journal,
> > which fully records all the changes to the file and is replayed when
> > the FS is mounted after a system crash.
> >
> > If my understanding is correct, then assuming the FS journaling is
> > bullet-proof, I can safely operate with synchronization = OFF with
> > SQLITE and still be fully protected by the FS journal in case of a
> > system crash, right?
>
> I really doubt journaling filesystems work like that. Yes, your file
> will be restored using the journal if the journal records made it to
> disk. But the FS just can't physically write every record of the journal
> to disk at the moment that record is created. If it did, your computer
> would be really slow. But as the FS doesn't do that, fdatasync