Pick up any book about UFS and read about the journal...

Shuki

On Sun, Jan 27, 2013 at 7:56 PM, Pavel Ivanov <paiva...@gmail.com> wrote:

> > So in any file system that supports journaling fwrite is blocked until
> > all metadata and data changes are made to the buffer cache and journal
> > is updated with the changes.
>
> Please give us some links showing where you got all this info, with
> benchmarks. Because what you are trying to convince us of is that with a
> journaling FS write() doesn't return until the journal record is
> guaranteed to physically make it to disk. First of all, I don't see
> what the benefit of that is compared to writing directly to disk without
> a write-back cache. And second, do you realize that in this case
> you can't make more than 30-50 journal records per second? Do you
> really believe that for good OS performance it's enough to make fewer
> than 30 calls to write() per second (on any file, not on each file)? I
> won't believe that until I see data and benchmarks from reliable
> sources.
>
>
> Pavel
>
>
> On Sun, Jan 27, 2013 at 8:53 AM, Shuki Sasson <gur.mons...@gmail.com>
> wrote:
> > Hi Pavel, thanks a lot for your answer. Assuming xWrite is using fwrite,
> > here is what is going on in the File System:
> > In a legacy UNIX File System (UFS) the journaling protects only the
> > metadata (inode structure, directory block, indirect block, etc.) but
> > not the data itself.
> > In more modern File Systems (usually ones that are enterprise based,
> > like EMC OneFS on the Isilon product) both data and metadata are
> > journaled.
> >
> > How does journaling work?
> > The File System has a cache of the File System blocks it deals with
> > (both metadata and data). When changes are made to a buffer-cached
> > block, they are made in memory only and the set of changes is saved to
> > the journal persistently. When the persistent journal is on disk, then
> > saving both data and metadata changes takes too long and only metadata
> > changes are journaled.
> > If the journal is placed on NVRAM then it is fast enough to save both
> > data and metadata changes to the journal.
> > So in any file system that supports journaling fwrite is blocked until
> > all metadata and data changes are made to the buffer cache and journal
> > is updated with the changes.
> > The only question then is whether the File System keeps a journal of
> > both metadata and data. If your system has a file system that supports
> > journaling of both metadata and data blocks then you are theoretically
> > (if there are no bugs in the FS) guaranteed against data loss in case
> > of system panic or loss of power.
> > So in short, a fully journaled File System gives you the safety of
> > synchronous = FULL (or even better) without the huge performance
> > penalty associated with fsync (or fdatasync).
> >
> > Additional Explanation: Why is it cheaper to save the changes to the
> > log rather than the whole cached buffer (block)?
> > Explanation: Each File System block is 8K in size. Some of the changes
> > cover areas in the block that are smaller in size, and only these
> > changes are recorded.
> > What happens if a change to the File System involves multiple changes
> > to data blocks as well as metadata blocks, like when an fwrite
> > operation increases the file size and induces the addition of an
> > indirect metadata block?
> > Answer: The journal is organized in transactions, each of which is
> > atomic, so all the buffer cache changes for such an operation are put
> > into one transaction. Only fully completed transactions are replayed
> > when the system is recovering from a panic or power loss.
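
To make the replay rule above concrete, here is a minimal in-memory sketch in C. Everything in it is hypothetical and invented for illustration (struct change, struct txn, journal_replay, the 64-byte change limit); it is not the on-disk format or recovery code of UFS, OneFS, or any real file system. It only shows the idea that partial-block changes are grouped into transactions and that recovery applies a transaction only if it was marked complete.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified journal structures for illustration only. */
enum { BLOCK_SIZE = 8192, MAX_CHANGES = 8, MAX_CHANGE_BYTES = 64 };

struct change {                 /* one partial-block modification */
    size_t block;               /* index of the cached block */
    size_t offset;              /* byte offset inside the block */
    size_t len;                 /* number of bytes changed */
    unsigned char bytes[MAX_CHANGE_BYTES];
};

struct txn {
    int committed;              /* set only after every change was logged */
    size_t nchanges;
    struct change changes[MAX_CHANGES];
};

/* A transaction is applied as a whole or not at all, so check it first. */
static int txn_is_valid(const struct txn *t, size_t nblocks)
{
    if (t->nchanges > MAX_CHANGES)
        return 0;
    for (size_t j = 0; j < t->nchanges; j++) {
        const struct change *c = &t->changes[j];
        if (c->block >= nblocks || c->len > MAX_CHANGE_BYTES ||
            c->offset > BLOCK_SIZE - c->len)
            return 0;
    }
    return 1;
}

/* Replay committed transactions in order; returns how many were applied. */
size_t journal_replay(const struct txn *journal, size_t ntxn,
                      unsigned char (*blocks)[BLOCK_SIZE], size_t nblocks)
{
    size_t applied = 0;
    for (size_t i = 0; i < ntxn; i++) {
        const struct txn *t = &journal[i];
        if (!t->committed || !txn_is_valid(t, nblocks))
            continue;           /* incomplete or malformed: discard whole txn */
        for (size_t j = 0; j < t->nchanges; j++) {
            const struct change *c = &t->changes[j];
            memcpy(blocks[c->block] + c->offset, c->bytes, c->len);
        }
        applied++;
    }
    return applied;
}
```

Note that the sketch says nothing about when the journal records themselves reach stable storage, which is the point being debated in this thread.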
> >
> > In short, in most file systems like UFS using synchronous = NORMAL
> > makes a lot of sense as data blocks are not protected by the journal.
> > However, with a more robust File System that has a full journal for
> > metadata as well as data, it makes all the sense in the world to run
> > with synchronous = OFF and gain the additional performance benefits.
> >
> > Let me know if I missed something and I hope this makes things clearer.
> > Shuki
> >
> >
> > On Sat, Jan 26, 2013 at 10:31 PM, Pavel Ivanov <paiva...@gmail.com>
> > wrote:
> >
> >> On Sat, Jan 26, 2013 at 6:50 PM, Shuki Sasson <gur.mons...@gmail.com>
> >> wrote:
> >> >
> >> > Hi all, I read the documentation about the synchronous pragma.
> >> > It has to do with how often the xSync method is called.
> >> > With synchronous = FULL xSync is called after each and every
> >> > change to the database file, as far as I understand...
> >> >
> >> > Observing the VFS interface used by SQLite:
> >> >
> >> > typedef struct sqlite3_io_methods sqlite3_io_methods;
> >> > struct sqlite3_io_methods {
> >> >   int iVersion;
> >> >   int (*xClose)(sqlite3_file*);
> >> >   int (*xRead)(sqlite3_file*, void*, int iAmt, sqlite3_int64 iOfst);
> >> >   int (*xWrite)(sqlite3_file*, const void*, int iAmt, sqlite3_int64 iOfst);
> >> >   int (*xTruncate)(sqlite3_file*, sqlite3_int64 size);
> >> >   int (*xSync)(sqlite3_file*, int flags);
> >> >
> >> > I see both xWrite and xSync...
> >> >
> >> > Does this mean that xWrite initiates a FS write to the file?
> >>
> >> Yes, in the sense that a subsequent read, without a power cut to the
> >> machine, will return the written data.
> >>
> >> >
> >> > Does that mean that xSync makes sure that the FS buffered changes are
> >> > synced to disk?
> >>
> >> Yes.
> >>
> >> > I guess it is calling fsync in the case of Linux/FreeBSD, am I right?
> >>
> >> fdatasync() I think.
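
The difference between the two calls discussed above can be sketched in C roughly as follows. This is a hypothetical helper (write_at is an invented name), not SQLite's actual VFS code, and it assumes a POSIX system where pwrite() and fdatasync() are available, such as Linux. Without the durable flag the data is only handed to the OS, which is roughly what xWrite gives you; with it, the call does not return until the OS reports the data as flushed, which is the job of xSync.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose pwrite() and fdatasync() */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper for illustration; not part of SQLite.
 * Writes len bytes at the given offset. If durable is nonzero, it also
 * asks the OS to push the file data to stable storage before returning.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int write_at(int fd, const void *buf, size_t len, off_t offset, int durable)
{
    const unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return -1;            /* no progress: give up instead of looping */
        p += n;
        len -= (size_t)n;
        offset += n;
    }

    /* Without this call the data may still sit only in the OS cache. */
    if (durable && fdatasync(fd) != 0)
        return -1;
    return 0;
}
```

Calling write_at(fd, buf, len, off, 0) is comparable to running with the sync step skipped: a later read returns the data, but nothing is promised if the power is cut before the OS flushes its cache.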
> >>
> >> > If the above is correct and SQLite operates over a modern reliable FS
> >> > that has journaling with each write, then despite the fact that the
> >> > write buffer caches are not fully synced they are protected by the FS
> >> > journal that fully records all the changes to the file and that is
> >> > going to be replayed on FS mount after a system crash.
> >> >
> >> > If my understanding is correct, then assuming the FS journaling is
> >> > bulletproof I can safely operate with synchronous = OFF with
> >> > SQLite and still be fully protected by the FS journal in case of a
> >> > system crash, right?
> >>
> >> I really doubt journaling filesystems work like that. Yes, your file
> >> will be restored using the journal if the journal records made it to
> >> disk. But the FS just can't physically write every record of the
> >> journal to disk at the moment of that record's creation. If it did
> >> that your computer would be really slow. But as the FS doesn't do
> >> that, fdatasync still makes sense if you want to guarantee that when
> >> COMMIT execution is finished it's safe to cut the power off or crash.
> >>
> >> > Meaning synchronous = NORMAL doesn't buy me anything; in fact it
> >> > severely slows the database operations.
> >> >
> >> > Am I missing something here?
> >>
> >> Please re-check the documentation on how journaling FSs work.
> >>
> >>
> >> Pavel
> >> _______________________________________________
> >> sqlite-dev mailing list
> >> sqlite-...@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-dev
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users