Re: Proposal for "proper" durable fsync() and fdatasync()
Jamie Lokier wrote: Jeff Garzik wrote: Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages than issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't. Oh certainly. That's why we have a VFS :) fsync for NFS will look quite different, too. But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function. Yep. That would immediately cover a bunch of filesystems. It's not optimal even then. Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases. My own idea is that we create a FLUSH command for blkdev request queues, to exist alongside READ, WRITE, and the current barrier implementation. Then FLUSH could be passed down through MD or DM. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 February 2008 17:29:13 +, Jamie Lokier wrote: > > You're right. Though, doesn't normal page writeback enqueue the COW > metadata changes? If not, how do they get written in a timely > fashion? It does. But this is not sufficient to guarantee that the pages in question have been safely committed to the device by the time sync_file_range() has returned. Jörn -- Joern's library part 5: http://www.faqs.org/faqs/compression-faq/part2/section-9.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote: > On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote: > > > > > One interesting aspect of this comes with COW filesystems like btrfs or > > > logfs. Writing out data pages is not sufficient, because those will get > > > lost unless their referencing metadata is written as well. So either we > > > have to call fsync for those filesystems or add another callback and let > > > filesystems override the default implementation. > > > > Doesn't the ->fsync callback get called in the sys_fdatasync() case, > > with appropriate arguments? > > My paragraph above was aimed at the sync_file_range() case. fsync and > fdatasync do the right thing within the limitations you brought up in > this thread. sync_file_range() without further changes will only write > data pages, not the metadata required to actually access those data > pages. This works just fine for non-COW filesystems, which covers all > currently merged ones. > > With COW filesystems it is currently impossible to do sync_file_range() > properly. The problem is orthogonal to your's, I just brought it up > since you were already mentioning sync_file_range(). You're right. Though, doesn't normal page writeback enqueue the COW metadata changes? If not, how do they get written in a timely fashion? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
> > One interesting aspect of this comes with COW filesystems like btrfs or
> > logfs.  Writing out data pages is not sufficient, because those will get
> > lost unless their referencing metadata is written as well.  So either we
> > have to call fsync for those filesystems or add another callback and let
> > filesystems override the default implementation.
>
> Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> with appropriate arguments?

My paragraph above was aimed at the sync_file_range() case. fsync and fdatasync do the right thing within the limitations you brought up in this thread. sync_file_range() without further changes will only write data pages, not the metadata required to actually access those data pages. This works just fine for non-COW filesystems, which covers all currently merged ones.

With COW filesystems it is currently impossible to do sync_file_range() properly. The problem is orthogonal to yours; I just brought it up since you were already mentioning sync_file_range().

Jörn

--
Joern's library part 10: http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
>
> Agreed... it's also disappointing that [unless I'm mistaken] you have
> to hack each filesystem to support barriers.
>
> It seems far easier to make sync_blkdev() Do The Right Thing, and
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages and then issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function.

It's not optimal even then.

Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases.

Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie
Re: Proposal for "proper" durable fsync() and fdatasync()
Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 Feb 2008 15:07:45 + Jamie Lokier <[EMAIL PROTECTED]> wrote: > SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty > pages which aren't already queued for write-out. It marks those with > a "write-out" flag, and starts write I/Os at some unspecified time in > the near future; it can be assumed writes for all the pages will > complete eventually if there's no errors. When I/O completes on a > page, it cleans the page and also clears the write-out flag. > > SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't > have the "write-out" flag set. > > SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking > pages for write-out. I don't actually see the point in this. Isn't a > preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making > BEFORE a redundant flag? Consider the case of pages which are dirty but are already under writeout. ie: someone redirtied the page after someone started writing the page out. For these pages the kernel needs to a) wait for the current writeout to complete b) start new writeout c) wait for that writeout to complete. those are the three stages of sync_file_range(). They are independently selectable and various combinations provide various results. The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that userspace can get as much data into the queue as possible, to permit the kernel to optimise IO scheduling better. If you perform a) and b) together (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed that all data which was dirty when sync_file_range() executed will be sent into the queue, but you won't get as much data into the queue if the kernel encounters dirty, under-writeout pages. This is especially hurtful if you're trying to feed a lot of little segments into the queue. 
In that case perhaps userspace should do an asynchronous pass (SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the queue, then a SYNC_FILE_RANGE_WAIT_AFTER pass, then a SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER pass to clean up any stragglers. Which mode is best very much depends on the application's file dirtying patterns. One would have to experiment with it, and tuning of sync_file_range() usage would occur alongside tuning of the application's write() design. It's an interesting problem, with potentially high payback.
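The three stages a), b) and c) described in this message can be pictured with a small toy model. This is only a sketch of the behaviour as described in the thread, not kernel code: the `toy_*` names and flag values are invented for illustration, and "waiting" is modelled as all in-flight I/O completing at once. It shows why a page that was redirtied while already under writeout stays dirty unless stage a) runs first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page state (hypothetical, not the kernel's page flags). */
struct toy_page {
    bool dirty;     /* contents changed since they were last queued */
    bool writeback; /* a write I/O for this page is in flight */
};

/* Made-up stand-ins for the three sync_file_range() stages. */
enum {
    TOY_WAIT_BEFORE = 1, /* stage a) wait for current writeout */
    TOY_WRITE       = 2, /* stage b) start new writeout */
    TOY_WAIT_AFTER  = 4  /* stage c) wait for that writeout */
};

/* Model a wait: every in-flight write in the range completes. */
static void toy_wait(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].writeback = false;
}

/* Queue dirty pages, skipping any page that is already under writeout. */
static void toy_start_write(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty && !pages[i].writeback) {
            pages[i].dirty = false;
            pages[i].writeback = true;
        }
    }
}

/* Run the selected stages in order; return how many pages are still dirty. */
size_t toy_sync_range(struct toy_page *pages, size_t n, int flags)
{
    assert(pages != NULL || n == 0);
    if (flags & TOY_WAIT_BEFORE)
        toy_wait(pages, n);
    if (flags & TOY_WRITE)
        toy_start_write(pages, n);
    if (flags & TOY_WAIT_AFTER)
        toy_wait(pages, n);

    size_t still_dirty = 0;
    for (size_t i = 0; i < n; i++)
        if (pages[i].dirty)
            still_dirty++;
    return still_dirty;
}
```

In this model, a page that starts out both dirty and under writeout is skipped by `TOY_WRITE | TOY_WAIT_AFTER` and is left dirty, while adding `TOY_WAIT_BEFORE` cleans it, at the cost of blocking before anything new is queued.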
Re: Proposal for "proper" durable fsync() and fdatasync()
Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior
> >has remaining consistently lame throughout Linux's history.
>
> Maybe I am confused, but isn't this is what fsync() does today whenever
> barriers are enabled (the fsync() invalidates the drive's write cache).

No, fsync() doesn't always flush the drive's write cache. It often does, and I think many people are under the impression it always does, but it doesn't.

Try this code on ext3:

    fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
    while (1) {
        char byte = 0;
        usleep (10);
        pwrite (fd, &byte, 1, 0);
        fsync (fd);
    }

It will do just over 10 write ops per second on an idle system (13 on mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write and fsync, you'll see at least 20 write ops and 20 flush ops per second, and you'll hear the disk seeking more. That's because the fchmod dirties the inode, so fsync() writes the inode with a journal commit.

It turns out even _that_ is not sufficient according to the kernel internals. A journal commit uses an ordered request, which isn't necessarily the same as a flush; it just happens to use a flush in this instance. I'm not sure if ordered requests are actually implemented by any drivers at the moment. If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it hurts performance a lot for database-like applications and virtual machines, due to unnecessary seeks. That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or equivalent (see FUA), only when required, and not a mere ordered barrier), no inode write, and to make sync_file_range()[*] offer the fancier applications finer controls which reflect what they actually need.

[*] - or whatever.

-- Jamie
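For anyone wanting to repeat the experiment, here is a bounded variant of the test loop with the fchmod() toggle as an option. It is a sketch only: the function name `fsync_probe`, the iteration count and the delay parameter are assumptions added for illustration, and the write and flush counts still have to be observed separately (for example in /proc/diskstats).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Rewrite one byte at offset 0 and fsync() it, `iterations` times.
 * With toggle_mode set, the file mode is flipped between the write and
 * the fsync() so that the inode is dirtied on every pass.
 * Returns the number of completed passes, or -1 if the file cannot be opened.
 */
int fsync_probe(const char *path, int iterations, unsigned int delay_us,
                bool toggle_mode)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    int done = 0;
    for (int i = 0; i < iterations; i++) {
        char byte = 0;

        usleep(delay_us);
        if (pwrite(fd, &byte, 1, 0) != 1)
            break;
        if (toggle_mode) {
            /* Dirty the inode: change the mode and change it back. */
            if (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0)
                break;
        }
        if (fsync(fd) != 0)
            break;
        done++;
    }
    close(fd);
    return done;
}
```

Running it once with `toggle_mode` false and once with it true, while watching the disk statistics, should show the difference described above on a filesystem that behaves like the ext3 of this thread; other filesystems and kernels may behave differently.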
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote: > On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: > > Yeah, sync_file_range has slightly unusual semantics and introduce > > the new concept, "writeout", to userspace (does "writeout" include > > "in drive cache"? the kernel doesn't think so, but the only way to > > make sync_file_range "safe" is if you do consider it writeout). > > If sync_file_range isn't safe, it should get replaced by a noop > implementation. There really is no point in promising "a little" > safety. Sometimes there is a point in "a little" safety. There's a spectrum of durability (meaning how safely stored the data is). In the cases we're imagining, it's application -> main memory cache -> disk cache -> disk surface. There are others. _None_ of those provide perfect safety for your data. They are a spectrum, and how far along you want data to be committed before you say "fine, the data is safe enough for me" depends on your application. For example, there are users who like to turn _off_ fdatasync() with their SQL database of choice. They prefer speed over safety, and they don't mind losing an hour's data and doing regular backups (we assume ;-) Some blogs fall into this category; who cares if a rare crash costs you a comment or two and a restore from backup; it's acceptable for the speed. There's users who would really like fdatasync() to commit data to the drive platters, so after their database says "done", they are very confident that a power failure won't cause committed data to be lost. Accepting credit cards is more at this end. So should be anyone using a virtual machine of any kind without a journalling fs in the guest! And there's users who like it where it is right now: a compromise, where a system crash won't lose committed data; but a power failure might. (I'm making assumptions about drive behaviour on reset here.) My problem with fdatasync() at the moment is, I can't choose what I want from it, and there's no mechanism to give me the safest option. 
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just isn't exported to userspace. (A quick aside: fdatasync() et al. are actually used for two _different_ things. 1: A program says "I've written it", it can say so with confidence, e.g. announcing email receipt. 2: It's used for write ordering with write-ahead logging: write, fdatasync, write. When you tease at the details, efficient implementations of them are different... Think SCSI tagged commands versus cache flushes.) > One interesting aspect of this comes with COW filesystems like btrfs or > logfs. Writing out data pages is not sufficient, because those will get > lost unless their referencing metadata is written as well. So either we > have to call fsync for those filesystems or add another callback and let > filesystems override the default implementation. Doesn't the ->fsync callback get called in the sys_fdatasync() case, with appropriate arguments? With barriers/flushes it certainly makes those a bit more complicated. You have to flush not just the disks with data pages, but the _other_ disks in a software RAID with data pointer metadata pages, but ideally not all of them (think database journal commit). That can be implemented with per-buffer pending-barrier/flush flags (like I described for pages in the first mail), which are equally useful when a database-like application uses a block device. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
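The second use mentioned in the aside (write, fdatasync, write for write-ahead logging) can be sketched as follows. The names `wal_open`, `wal_commit` and `pwrite_all` are hypothetical helpers, not part of any library. As this thread points out, fdatasync() may only be an ordering point as far as the kernel is concerned, so whether the log record really reaches the platters before the in-place update is exactly the open question.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open (or create) a file used by the sketch; returns an fd or -1. */
int wal_open(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write the whole buffer at the given offset, looping over short writes. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Write-ahead ordering: log record first, fdatasync() the log, and only
 * then update the data file in place. Returns 0 on success, -1 on error.
 */
int wal_commit(int log_fd, off_t log_off, int data_fd, off_t data_off,
               const void *rec, size_t len)
{
    assert(rec != NULL);
    if (pwrite_all(log_fd, rec, len, log_off) != 0)
        return -1;
    if (fdatasync(log_fd) != 0) /* intended ordering point */
        return -1;
    if (pwrite_all(data_fd, rec, len, data_off) != 0)
        return -1;
    return 0;
}
```

The first use (announcing "I've written it") would instead end with an fdatasync() on the data file itself before replying to the client.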
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: Jamie Lokier wrote: By durable, I mean that fsync() should actually commit writes to physical stable storage, Yes, it should. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. Maybe I am confused, but isn't this is what fsync() does today whenever barriers are enabled (the fsync() invalidates the drive's write cache). ric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> >
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
>
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.
>
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary metadata to reference those changed pages (btrfs tree etc.), but not non-data metadata.

It's the filesystem's responsibility to interpret that correctly. In-place writes don't need anything else. Phase-tree style writes do. Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of restricted-range asynchronous fdatasync().

By limiting the range of file data which must be written out, it becomes more refined for database and filesystem-in-a-file type applications. Just as fsync() is more refined than sync() - it's useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access metadata appropriately. It can sync more if it wants, but not less.

That's what I understand by

    sync_file_range(fd, start, length,
                    SYNC_FILE_RANGE_WAIT_BEFORE
                    | SYNC_FILE_RANGE_WRITE
                    | SYNC_FILE_RANGE_WAIT_AFTER);

Largely because the manual says to use that combination of flags for an equivalent to fdatasync().

The concept of "write-out" is not defined in the manual. I'm assuming it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty pages which aren't already queued for write-out. It marks those with a "write-out" flag, and starts write I/Os at some unspecified time in the near future; it can be assumed writes for all the pages will complete eventually if there are no errors. When I/O completes on a page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking pages for write-out. I don't actually see the point in this. Isn't a preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's not clear to me what that means.

All this implies that the "write-out" flag is a concept userspace can rely on. That's not so peculiar: WRITE seems to be equivalent to AIO-style fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(), it must also make for sync_file_range(), for the limited range of offsets.

-- Jamie
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: > > Yeah, sync_file_range has slightly unusual semantics and introduce > the new concept, "writeout", to userspace (does "writeout" include > "in drive cache"? the kernel doesn't think so, but the only way to > make sync_file_range "safe" is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising "a little" safety. One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Jörn -- There is no worse hell than that provided by the regrets for wasted opportunities. -- Andre-Louis Moreau in Scarabouche -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > [snip huge long proposal] > > Rather than invent new APIs, we should fix the existing ones to _really_ > flush data to physical media. Btw, one reason for the length is the current block request API isn't sufficient even to make fsync() durable with _no_ new APIs. It offers ordering barriers only, which aren't enough. I tried to explain, discuss some changes and then suggest optimisations. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.

Yeah, sync_file_range has slightly unusual semantics and introduces a new concept, "writeout", to userspace (does "writeout" include "in drive cache"? the kernel doesn't think so, but the only way to make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with the safe/unsafe and sync/async semantics that are part of the normal POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea IMO, and it is a bad bug that we don't do that :(
Re: Proposal for "proper" durable fsync() and fdatasync()
Andrew Morton wrote: > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote: > > > (It would be nicer if sync_file_range() > > took a vector of ranges for better elevator scheduling, but let's > > ignore that :-) > > Two passes: > > Pass 1: shove each of the segments into the queue with > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE > > Pass 2: wait for them all to complete and return accumulated result > with SYNC_FILE_RANGE_WAIT_AFTER Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > Jamie Lokier wrote: > >By durable, I mean that fsync() should actually commit writes to > >physical stable storage, > > Yes, it should. Glad we agree :-) > >I was surprised that fsync() doesn't do this already. There was a lot > >of effort put into block I/O write barriers during 2.5, so that > >journalling filesystems can force correct write ordering, using disk > >flush cache commands. > > > >After all that effort, I was very surprised to notice that Linux 2.6.x > >doesn't use that capability to ensure fsync() flushes the disk cache > >onto stable storage. > > It's surprising you are surprised, given that this [lame] fsync behavior > has remaining consistently lame throughout Linux's history. I was surprised because of the effort put into IDE write barriers to get it right for in-kernel filesystems, and the messages in 2004 telling concerned users that fsync would use barriers in 2.6, which it does sometimes but not always. > [snip huge long proposal] > > Rather than invent new APIs, we should fix the existing ones to _really_ > flush data to physical media. > > Linux should default to SAFE data storage, and permit users to retain > the older unsafe behavior via an option. It's completely ridiculous > that we default to an unsafe fsync. Well, I agree with you. Which is why the "new API" I suggested, being really just an extension of an existing one, allows fsync() to be SAFE if that's what people want. To be fair, fsync() is rather overkill for some apps. sync_file_range() is obviously the right place for fine tuning "less safe" variations. > And [anticipating a common response from others] it is completely > irrelevant that POSIX fsync(2) permits Linux's current behavior. The > current behavior is unsafe. > > Safety before performance -- ESPECIALLY when it comes to storing user data. Especially now that people work a lot in guest VMs, where the IDE barrier stuff doesn't work if the host fdatasync() doesn't work. 
Since it happened with Mac OS X, I wouldn't be surprised if changing fsync(), and only that, proved unpopular. Heck, you already get people asking "how to turn off fsync in PostgreSQL"... (Haven't those people heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to support databases' finely tuned commit needs, and then adoption of that by database vendors, perhaps nobody will mind fsync() becoming safe then. Nobody seems bothered by its performance for other things.

-- Jamie
Re: Proposal for "proper" durable fsync() and fdatasync()
Jamie Lokier wrote: By durable, I mean that fsync() should actually commit writes to physical stable storage, Yes, it should. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. [snip huge long proposal] Rather than invent new APIs, we should fix the existing ones to _really_ flush data to physical media. Linux should default to SAFE data storage, and permit users to retain the older unsafe behavior via an option. It's completely ridiculous that we default to an unsafe fsync. And [anticipating a common response from others] it is completely irrelevant that POSIX fsync(2) permits Linux's current behavior. The current behavior is unsafe. Safety before performance -- ESPECIALLY when it comes to storing user data. Regards, Jeff (Linux ATA driver dude) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote: > (It would be nicer if sync_file_range() > took a vector of ranges for better elevator scheduling, but let's > ignore that :-) Two passes: Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
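The two passes suggested here could be sketched like this in userspace. `struct byte_range` and `sync_ranges_two_pass` are made-up names for illustration; only sync_file_range() and its three flags are the real interface. Note that, per the rest of this thread, this only controls page writeout and by itself says nothing about the drive's write cache.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper type: one byte range of a file. */
struct byte_range {
    off_t offset;
    off_t length;
};

/*
 * Pass 1 pushes every range into the queue; pass 2 waits for all of them.
 * Returns 0 if every call succeeded, -1 if any call failed.
 */
int sync_ranges_two_pass(int fd, const struct byte_range *ranges, size_t count)
{
    assert(ranges != NULL || count == 0);
    int result = 0;

    /* Pass 1: wait out any writeout already in flight, then start new writeout. */
    for (size_t i = 0; i < count; i++) {
        if (sync_file_range(fd, ranges[i].offset, ranges[i].length,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) != 0)
            result = -1;
    }

    /* Pass 2: wait for the queued writeout and accumulate the result. */
    for (size_t i = 0; i < count; i++) {
        if (sync_file_range(fd, ranges[i].offset, ranges[i].length,
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            result = -1;
    }
    return result;
}
```

Submitting everything before waiting on anything is what gives the I/O scheduler a chance to reorder the segments, which is the stated reason for wanting a vector of ranges in the first place.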
Proposal for "proper" durable fsync() and fdatasync()
Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything. Or (hopefully) someone else thinks it's very important and does it themselves :-)

By durable, I mean that fsync() should actually commit writes to physical stable storage, not just the disk write cache when that is enabled. Databases and guest VMs need this, or an equivalent feature, if they aren't to face occasional corruption after power failure and perhaps some crashes.

The alternative is to disable the disk write cache. But that isn't modern practice or recommendation, since I/O write barriers were implemented and they are much faster.

I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage.

I noticed this following up discussions on the Qemu mailing list, about guest VMs and how their IDE flush cache command should translate to fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if the host machine is fine, and it isn't enough (on Linux host) if the host machine loses power or the hard disk crashes another way.)

Then I noticed it again, when I was designing a database engine with filesystem characteristics. I thought "how do I ensure ordered journal writes; can I use fdatasync()?" and was surprised to find the answer is no, I have to use hacks like calling hdparm, and the authors of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync() seems to be fine, if a bit slow.)
It isn't the first time this topic has come up:

http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that this would be fixed in 2.6. So I bet some people are under the illusion that it's fixed in 2.6...

For a while, I've been meaning to bring it up on linux-kernel...

The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote:
>
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; it sometimes only commits it to the hard
> > drive cache.
>
> That's a filesystem bug IMO. People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling is not reliable. Not only can it lose data on power loss, it can corrupt the guest filesystem too, due to reordering. This is contrary to what people expect, I think. I'm not sure if a system reset can cause similar loss; I don't know how disks react to that.

Also, for the person porting ZFS to run on FUSE, the same applies...

Linux fsync is faulty in two ways:

1. Database commits aren't _durable_ against power failure, because fsync doesn't flush the disk's cache. This means committed data is not guaranteed to have reached stable storage.

2. It's unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata.
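To make point 2 concrete, here is a minimal sketch of the kind of write-ahead commit an application might attempt. It is an assumption-laden illustration, not code from any real database: `wal_open_log()`, `write_all()`, `wal_commit()` and the record layout are all hypothetical. The pattern is only correct if the fdatasync() in step 2 really puts the log record on stable storage before step 3 touches the data file, which is exactly what cannot be relied on when the call stops at the drive's write cache.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) an append-only log file. */
int wal_open_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Write the whole buffer at the current position, retrying short writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical commit: log the change, sync the log, then update in place.
 * Returns 0 on success, -1 on failure (errno is set by the failing call;
 * a short pwrite() is also treated as failure).
 */
int wal_commit(int log_fd, int data_fd, off_t data_off,
               const void *buf, size_t len)
{
    /* Step 1: append the record (offset, length, payload) to the log. */
    if (write_all(log_fd, &data_off, sizeof data_off) != 0 ||
        write_all(log_fd, &len, sizeof len) != 0 ||
        write_all(log_fd, buf, len) != 0)
        return -1;

    /* Step 2: the log record must be on stable storage before step 3. */
    if (fdatasync(log_fd) != 0)
        return -1;

    /* Step 3: overwrite the data in place, then sync it as well. */
    if (pwrite(data_fd, buf, len, data_off) != (ssize_t)len)
        return -1;
    return fdatasync(data_fd);
}
```

If the drive is free to reorder the cached writes from steps 1 and 3, a power failure can leave the data file partly overwritten with no complete log record to recover from.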
With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and fewer than 20.

By the way, this shows a trick for fixing #2 (ordering): use fchmod() to toggle the file attributes, and that will force the next fsync() to write a journal entry, which _does_ issue a write barrier.

If you do that with each write as above (write, fchmod change, fsync 10 times a second), you will clearly see more write I/Os, and you'll hear the disk behaving differently: it's seeking more.

However, even this ugly trick has problems:

3. Using the fchmod() trick or good fortune, fsync() issues a write barrier. Right now, this does commit data (if