[To qemu-devel and Chris, I have started a thread on linux-kernel on this topic. I've copied the first few paragraphs here, so you can see what it's about since it's a response to a post here. But it's largely off topic for Qemu, and on topic for linux-kernel, so I didn't cross post lest linux-kernel replies come here.]
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Proposal for "proper" durable fsync() and fdatasync()
Message-ID: <[EMAIL PROTECTED]>
Date: Tue, 26 Feb 2008 07:26:49 +0000

Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux. First the problem, then a proposed solution "with benefits", so to speak.

[...]

By durable, I mean that fsync() should actually commit writes to physical stable storage, not just to the disk write cache when that is enabled. Databases and guest VMs need this, or an equivalent feature, if they aren't to face occasional corruption after power failure and perhaps some crashes.

The alternative is to disable the disk write cache. But that hasn't been modern practice or recommendation since I/O write barriers were implemented, and barriers are much faster.

I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage.

I noticed this following up discussions on the Qemu mailing list, about guest VMs and how their IDE flush cache command should translate to fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if the host machine is fine, and it isn't enough (on a Linux host) if the host machine loses power or the hard disk fails in some other way.)

Then I noticed it again, when I was designing a database engine with filesystem characteristics. I thought "how do I ensure ordered journal writes; can I use fdatasync()?" and was surprised to find the answer is no, I have to use hacks like calling hdparm, and the authors of major SQL databases seem to brush the problem under the carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync() seems to be fine, if a bit slow.)

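For illustration, the ordered-journal pattern in question looks roughly like this in C. This is only a sketch under stated assumptions: journal_commit() and write_all() are made-up names, not taken from any real database, and the file layout is hypothetical. The code expresses the intent (record on stable storage before the commit marker is written); as described above, on Linux 2.6 with the disk write cache enabled, fdatasync() may return before the data has left that cache, so the intended ordering and durability are not actually guaranteed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append a journal record, sync, then append the
 * commit marker and sync again. Returns 0 on success, -1 on any error. */
int journal_commit(const char *path, const char *record, const char *commit)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (write_all(fd, record, strlen(record)) == 0 &&
        fdatasync(fd) == 0 &&   /* intended barrier: record before commit */
        write_all(fd, commit, strlen(commit)) == 0 &&
        fdatasync(fd) == 0)     /* intended: commit marker reaches storage */
        rc = 0;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would invoke something like journal_commit("journal.log", "record\n", "commit\n") and treat a zero return as "committed". Whether that is really true after a power failure is exactly what this proposal is about.
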
It isn't the first time this topic has come up:

http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that this would be fixed in 2.6. So I bet some people are under the illusion that it's fixed in 2.6...

For a while, I've been meaning to bring it up on linux-kernel...

[More on linux-kernel].

Thanks,
-- Jamie