Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Sat, Jul 29 2006, Paul Brook wrote:
> > Easy to do with the fsync infrastructure, but probably not worth
> > doing since people are working on the AIO I/O backend, which would
> > allow multiple outstanding writes from a guest. That, in turn, means
> > I/O completion in the guest can be done when the data really hits
> > disk, but without a performance impact.
>
> Not entirely true. That only works if you allow multiple guest IO
> requests in parallel, ie. some form of tagged command queueing. This
> requires either improving the SCSI emulation, or implementing SATA
> emulation. AFAIK parallel IDE doesn't support command queueing.

Parallel IDE does support queuing, but it never gained widespread
support and the standard is quite broken as well (which is probably
_why_ it never got much adoption). It was also quite suboptimal from a
CPU efficiency POV.

Besides, async completion in itself is not enough, QEMU still needs to
honor ordered writes (barriers) and cache flushes.

> My impression was that the initial AIO implementation is just straight
> serial async operation. IO wouldn't actually go any faster, it just
> means the guest can do something else while it's waiting.

Depends on the app, if the io workload is parallel then you should see a
nice speedup as well (as QEMU is then no longer the serializing
bottleneck).

-- 
Jens Axboe

___
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Sat, Jul 29 2006, Rik van Riel wrote:
> Fabrice Bellard wrote:
> > Hi,
> >
> > Using O_SYNC for disk image access is not acceptable: QEMU relies on
> > the host OS to ensure that the data is written correctly.
>
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully virt) guests may not be
> preserved.
>
> Applications running on the host can count on fsync doing the right
> thing, meaning that if they call fsync, the data *will* have made it
> to disk. Applications running inside a guest have no guarantees that
> their data is actually going to make it anywhere when fsync returns...

Then the guest OS is broken. Applications issuing an fsync() should
issue a flush (or write-through), the guest OS should propagate this
knowledge through its io stack and the QEMU hard drive should get
notified. If the guest OS isn't doing what it's supposed to, QEMU can't
help you. And, in fact, running your app on the same host OS with write
back caching would screw you as well. The timing window will probably be
larger with QEMU, but the problem is essentially the same.

-- 
Jens Axboe
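For reference, the application-side half of the chain described above (write, then fsync, and only then treat the data as durable) might look roughly like the C sketch below. The helper names `write_all` and `durable_write` are made up for illustration and do not come from QEMU or from the patch; whether the data really reaches the platter still depends on every layer underneath honoring the flush, which is exactly what this thread is arguing about.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>   /* open() flags, for callers that create the fd */
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write()). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: report success only after fsync() has succeeded,
 * so the caller never treats merely cached data as durable. */
static int durable_write(int fd, const void *buf, size_t len)
{
    if (write_all(fd, buf, len) < 0)
        return -1;
    return fsync(fd);
}
```

Inside a guest, the same call sequence only gives this guarantee if the guest kernel turns the fsync into a cache flush on the emulated drive and the emulator in turn syncs the host file.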
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On 31 jul 2006, at 09:08, Jens Axboe wrote:
> > Applications running on the host can count on fsync doing the right
> > thing, meaning that if they call fsync, the data *will* have made it
> > to disk. Applications running inside a guest have no guarantees that
> > their data is actually going to make it anywhere when fsync returns...
>
> Then the guest OS is broken.

The problem is that supposedly many OS'es are broken in this way. See
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

Jonas
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Mon, Jul 31 2006, Jonas Maebe wrote:
> On 31 jul 2006, at 09:08, Jens Axboe wrote:
> > > Applications running on the host can count on fsync doing the right
> > > thing, meaning that if they call fsync, the data *will* have made it
> > > to disk. Applications running inside a guest have no guarantees that
> > > their data is actually going to make it anywhere when fsync returns...
> >
> > Then the guest OS is broken.
>
> The problem is that supposedly many OS'es are broken in this way. See
> http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

Well, as others have written here as well, then their OS are broken on
real hardware as well. I wouldn't be averse to a QEMU work-around, but
O_SYNC is clearly not a viable alternative!

We could make QEMU behave more like a real hard drive when it has aio
support, flushing dirty cache out in a manner more closely mimicking
what a drive would do instead of relying on the page cache writeout
deciding to write it out.

-- 
Jens Axboe
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Mon, Jul 31 2006, andrzej zaborowski wrote:
> On 30/07/06, Jamie Lokier [EMAIL PROTECTED] wrote:
> > Rik van Riel wrote:
> > > This may look like hair splitting, but so far I've lost a (test)
> > > postgresql database to this 3 times already. Not getting the guest
> > > application's data to disk when the application calls fsync is a
> > > recipe for disaster.
> >
> > Exactly the same thing happens with real IDE disks if IDE write
> > caching (on the drive itself) is enabled, which it is by default. It
> > is rarer, but it happens.
>
> The little difference with QEMU is that there are two caches above it:
> the host OS'es software cache and the IDE hardware cache. When a guest
> OS flushes its own software cache its precious data goes to the host's
> software cache while the guest thinks it's already the IDE cache. This
> is of course of less importance because data in both caches (hard- and
> software) is lost when the power is cut off.

But the drive cache does not let the dirty data linger for as long as
the OS page/buffer cache.

> IMHO what really makes IO unreliable in QEMU is that IO errors on the
> host are not reported to the guest by the IDE emulation and there's an
> exact place in hw/ide.c where they are arrogantly ignored.

Send a patch, I'm pretty sure nobody would disagree :-)

-- 
Jens Axboe
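To make the point about ignored host I/O errors concrete, here is a rough C sketch of what reporting them to the guest could look like. It is not the code in hw/ide.c: `struct ata_regs`, `emu_write_sectors`, and the `ATA_*` macro names are invented for this example. The bit values are the usual ATA ones (DRDY is bit 6 and ERR is bit 0 of the status register, ABRT is bit 2 of the error register), and aborting the command is only one possible choice of error to report.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>   /* open() flags, for callers that create the fd */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define ATA_SR_DRDY 0x40 /* status: drive ready */
#define ATA_SR_ERR  0x01 /* status: error, see error register */
#define ATA_ER_ABRT 0x04 /* error: command aborted */

/* Hypothetical, heavily reduced view of the guest-visible task file. */
struct ata_regs {
    uint8_t status;
    uint8_t error;
};

/* Write guest data to the host image file and reflect the host result
 * in the emulated registers instead of dropping it. Short writes are
 * treated as failures to keep the sketch small. */
static int emu_write_sectors(int fd, const void *buf, size_t len,
                             off_t off, struct ata_regs *r)
{
    ssize_t n;

    assert(r != NULL);
    n = pwrite(fd, buf, len, off);
    if (n < 0 || (size_t)n != len) {
        r->status = ATA_SR_DRDY | ATA_SR_ERR;
        r->error = ATA_ER_ABRT;
        return -1;
    }
    r->status = ATA_SR_DRDY;
    r->error = 0;
    return 0;
}
```

With something along these lines, a full host filesystem or a failing host disk would surface in the guest as a failed write rather than as silent data loss.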
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On 31/07/06, Jens Axboe [EMAIL PROTECTED] wrote:
> On Mon, Jul 31 2006, andrzej zaborowski wrote:
> > On 30/07/06, Jamie Lokier [EMAIL PROTECTED] wrote:
> > > Rik van Riel wrote:
> > > > This may look like hair splitting, but so far I've lost a (test)
> > > > postgresql database to this 3 times already. Not getting the
> > > > guest application's data to disk when the application calls
> > > > fsync is a recipe for disaster.
> > >
> > > Exactly the same thing happens with real IDE disks if IDE write
> > > caching (on the drive itself) is enabled, which it is by default.
> > > It is rarer, but it happens.
> >
> > The little difference with QEMU is that there are two caches above
> > it: the host OS'es software cache and the IDE hardware cache. When a
> > guest OS flushes its own software cache its precious data goes to
> > the host's software cache while the guest thinks it's already the
> > IDE cache. This is of course of less importance because data in both
> > caches (hard- and software) is lost when the power is cut off.
>
> But the drive cache does not let the dirty data linger for as long as
> the OS page/buffer cache.

I would say this is an argument speaking for actually using O_SYNC.

> > IMHO what really makes IO unreliable in QEMU is that IO errors on
> > the host are not reported to the guest by the IDE emulation and
> > there's an exact place in hw/ide.c where they are arrogantly
> > ignored.
>
> Send a patch, I'm pretty sure nobody would disagree :-)

Here's what I proposed:
http://lists.gnu.org/archive/html/qemu-devel/2005-12/msg00275.html
but I'm afraid it's not correct :P

> --
> Jens Axboe

-- 
balrog 2oo6

Dear Outlook users: Please remove me from your address books
http://www.newsforge.com/article.pl?sid=03/08/21/143258
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
Rik van Riel wrote:
> This may look like hair splitting, but so far I've lost a (test)
> postgresql database to this 3 times already. Not getting the guest
> application's data to disk when the application calls fsync is a
> recipe for disaster.

Exactly the same thing happens with real IDE disks if IDE write caching
(on the drive itself) is enabled, which it is by default. It is rarer,
but it happens.

I've seen this with Linux 2.4 kernels writing to ext3 (real, not
virtual). Filesystem metadata gets corrupted from time to time if power
is removed, because write ordering is not preserved. Disabling IDE write
caching fixes it, but the performance impact is huge on some systems.

Linux 2.6 kernels will issue IDE cache flush commands, at least with
ext3, to commit data to disk when fsync is called, and to preserve
journal/metadata ordering.

Doesn't qemu fsync the host file corresponding to the emulated disk,
when the guest OS issues an IDE cache flush?

For IDE emulation to be as reliable for data storage as a real disk, it
should:

  - fsync the host file whenever the guest OS issues an IDE cache flush
    command.

  - use O_SYNC (or fsync after each write or aio equivalent, etc.)
    _only_ when the guest OS disables the IDE disk cache (not done by
    default).

-- JAmie
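The two rules above can be sketched in C along the following lines. This is only an illustration of the policy, not QEMU code: `struct emu_disk` and the `emu_*` functions are invented names, the `syncs` counter exists purely so the behaviour can be observed, and "fsync after each write" is used here as the stand-in for O_SYNC when the guest has turned the write cache off.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical model of an emulated IDE disk backed by a host file. */
struct emu_disk {
    int fd;                   /* host file holding the disk image */
    bool write_cache_enabled; /* guest-visible write cache setting */
    unsigned long syncs;      /* host fsync() calls issued so far */
};

/* Open (or create) the image; like a real drive, the write cache
 * starts out enabled. Returns 0 on success, -1 on error. */
static int emu_open(struct emu_disk *d, const char *path)
{
    assert(d != NULL && path != NULL);
    d->fd = open(path, O_RDWR | O_CREAT, 0600);
    d->write_cache_enabled = true;
    d->syncs = 0;
    return d->fd < 0 ? -1 : 0;
}

static int emu_sync(struct emu_disk *d)
{
    d->syncs++;
    return fsync(d->fd);
}

/* Guest write: with the cache enabled the data may sit in the host
 * page cache; with it disabled every write is pushed out at once.
 * Short writes are treated as errors to keep the sketch small. */
static int emu_write(struct emu_disk *d, const void *buf, size_t len,
                     off_t off)
{
    ssize_t n = pwrite(d->fd, buf, len, off);

    if (n < 0 || (size_t)n != len)
        return -1;
    return d->write_cache_enabled ? 0 : emu_sync(d);
}

/* Guest issued an IDE cache flush command: sync the host file. */
static int emu_flush_cache(struct emu_disk *d)
{
    return emu_sync(d);
}
```

The design choice here is that the cost of synchronous I/O is paid only when the guest explicitly asks for it, either per flush command or by disabling the cache, which mirrors how a real drive behaves.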
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
Bill C. Riemers wrote:
> How about compromising, and making the patch a run-time option?
> Presumably this is only a problem when the virtual machine is not
> properly shut down. For those who want the extra security of knowing
> the data will be written regardless of the shutdown status they can
> enable the flag. By default it could be turned off. Then everybody can
> be happy.

Real disks don't provide that security unless you disable the disk's
cache, or issue cache flush instructions to the disk.

Modern guest OS filesystems are written with this in mind. With older
guest OSes, you have to disable the disk cache if you want that kind of
security with real disks.

Is there any reason why the emulation should be any different?

-- Jamie
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
Hi,

Using O_SYNC for disk image access is not acceptable: QEMU relies on the
host OS to ensure that the data is written correctly. Even the current
'fsync' support is questionable to say the least!

Please don't mix issues regarding QEMU disk handling and the underlying
hypervisor/host OS block device handling.

Regards,

Fabrice.

Rik van Riel wrote:
> This is the simple approach to making sure that disk writes actually
> hit disk before we tell the guest OS that IO has completed. Thanks
> to DMA_MULTI_THREAD the performance still seems to be adequate.
>
> A fancier solution would be to make the sync/non-sync behaviour of
> the qemu disk backing store tunable from the guest OS, by tuning
> the IDE disk write cache on/off with hdparm, and having hw/ide.c
> call ->fsync functions in the block backends.
>
> I'm willing to code up the fancy solution if people prefer that.
>
> Make sure disk writes really made it to disk before we report I/O
> completion to the guest domain. The DMA_MULTI_THREAD functionality
> from the qemu-dm IDE emulation should make the performance overhead
> of synchronous writes bearable, or at least comparable to native
> hardware.
>
> Signed-off-by: Rik van Riel [EMAIL PROTECTED]
>
> --- xen-unstable-10712/tools/ioemu/block-bochs.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block-bochs.c  2006-07-28 02:21:08.0 -0400
> @@ -91,7 +91,7 @@
>      int fd, i;
>      struct bochs_header bochs;
>
> -    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
> +    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (fd < 0) {
>          fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
>          if (fd < 0)
> --- xen-unstable-10712/tools/ioemu/block.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block.c  2006-07-28 02:19:27.0 -0400
> @@ -677,7 +677,7 @@
>      int rv;
>  #endif
>
> -    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
> +    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (fd < 0) {
>          fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
>          if (fd < 0)
> --- xen-unstable-10712/tools/ioemu/block-cloop.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block-cloop.c  2006-07-28 02:17:13.0 -0400
> @@ -55,7 +55,7 @@
>      BDRVCloopState *s = bs->opaque;
>      uint32_t offsets_size,max_compressed_block_size=1,i;
>
> -    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
> +    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (s->fd < 0)
>          return -1;
>      bs->read_only = 1;
> --- xen-unstable-10712/tools/ioemu/block-cow.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block-cow.c  2006-07-28 02:21:34.0 -0400
> @@ -69,7 +69,7 @@
>      struct cow_header_v2 cow_header;
>      int64_t size;
>
> -    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
> +    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (fd < 0) {
>          fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
>          if (fd < 0)
> --- xen-unstable-10712/tools/ioemu/block-qcow.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block-qcow.c  2006-07-28 02:20:05.0 -0400
> @@ -95,7 +95,7 @@
>      int fd, len, i, shift;
>      QCowHeader header;
>
> -    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
> +    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (fd < 0) {
>          fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
>          if (fd < 0)
> --- xen-unstable-10712/tools/ioemu/block-vmdk.c.osync  2006-07-28 02:15:56.0 -0400
> +++ xen-unstable-10712/tools/ioemu/block-vmdk.c  2006-07-28 02:20:20.0 -0400
> @@ -96,7 +96,7 @@
>      uint32_t magic;
>      int l1_size;
>
> -    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
> +    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
>      if (fd < 0) {
>          fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
>          if (fd < 0)
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
Fabrice Bellard wrote:
> Hi,
>
> Using O_SYNC for disk image access is not acceptable: QEMU relies on
> the host OS to ensure that the data is written correctly.

This means that write ordering is not preserved, and on a power failure
any data written by qemu (or Xen fully virt) guests may not be
preserved.

Applications running on the host can count on fsync doing the right
thing, meaning that if they call fsync, the data *will* have made it to
disk. Applications running inside a guest have no guarantees that their
data is actually going to make it anywhere when fsync returns...

This may look like hair splitting, but so far I've lost a (test)
postgresql database to this 3 times already. Not getting the guest
application's data to disk when the application calls fsync is a recipe
for disaster.

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. - Brian W. Kernighan
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
> Easy to do with the fsync infrastructure, but probably not worth doing
> since people are working on the AIO I/O backend, which would allow
> multiple outstanding writes from a guest. That, in turn, means I/O
> completion in the guest can be done when the data really hits disk,
> but without a performance impact.

Not entirely true. That only works if you allow multiple guest IO
requests in parallel, ie. some form of tagged command queueing. This
requires either improving the SCSI emulation, or implementing SATA
emulation. AFAIK parallel IDE doesn't support command queueing.

My impression was that the initial AIO implementation is just straight
serial async operation. IO wouldn't actually go any faster, it just
means the guest can do something else while it's waiting.

Paul
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
How about compromising, and making the patch a run-time option?
Presumably this is only a problem when the virtual machine is not
properly shut down. For those who want the extra security of knowing the
data will be written regardless of the shutdown status they can enable
the flag. By default it could be turned off. Then everybody can be
happy.

Bill

On 7/29/06, Rik van Riel [EMAIL PROTECTED] wrote:
> Fabrice Bellard wrote:
> > Hi,
> >
> > Using O_SYNC for disk image access is not acceptable: QEMU relies on
> > the host OS to ensure that the data is written correctly.
>
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully virt) guests may not be
> preserved.
>
> Applications running on the host can count on fsync doing the right
> thing, meaning that if they call fsync, the data *will* have made it
> to disk. Applications running inside a guest have no guarantees that
> their data is actually going to make it anywhere when fsync returns...
>
> This may look like hair splitting, but so far I've lost a (test)
> postgresql database to this 3 times already. Not getting the guest
> application's data to disk when the application calls fsync is a
> recipe for disaster.
>
> --
> Debugging is twice as hard as writing the code in the first place.
> Therefore, if you write the code as cleverly as possible, you are, by
> definition, not smart enough to debug it. - Brian W. Kernighan
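As a rough illustration of the run-time option suggested above, the open path touched by the patch could take the sync behaviour as a parameter instead of hard-coding O_SYNC. Everything named here is hypothetical: `image_open_flags`, `image_open`, and the `sync_writes` switch are not part of the posted patch or of QEMU, and the O_BINARY and O_LARGEFILE flags used in the patch are left out so the sketch builds with plain POSIX headers.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Build the open(2) flags for a disk image. O_SYNC is added only when
 * the (hypothetical) run-time switch asks for it and the image is
 * writable; it would serve no purpose on a read-only descriptor. */
static int image_open_flags(bool read_only, bool sync_writes)
{
    int flags = read_only ? O_RDONLY : O_RDWR;

    if (sync_writes && !read_only)
        flags |= O_SYNC;
    return flags;
}

/* Same fallback as in the patched code: try read-write first and fall
 * back to read-only if that fails. Returns the fd, or -1 on error. */
static int image_open(const char *path, bool sync_writes)
{
    int fd;

    assert(path != NULL);
    fd = open(path, image_open_flags(false, sync_writes));
    if (fd < 0)
        fd = open(path, image_open_flags(true, sync_writes));
    return fd;
}
```

A command-line flag or per-drive setting would then only need to set `sync_writes`, leaving the default behaviour unchanged for everyone else.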