Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread Jens Axboe
On Sat, Jul 29 2006, Paul Brook wrote:
  Easy to do with the fsync infrastructure, but probably not worth
  doing since people are working on the AIO I/O backend, which would
  allow multiple outstanding writes from a guest.  That, in turn,
  means I/O completion in the guest can be done when the data really
  hits disk, but without a performance impact.
 
 Not entirely true. That only works if you allow multiple guest IO
 requests in parallel, ie. some form of tagged command queueing. This
 requires either improving the SCSI emulation, or implementing SATA
 emulation. AFAIK parallel IDE doesn't support command queueing.

Parallel IDE does support queueing, but it never gained widespread
support and the standard is quite broken as well (which is probably
_why_ it never got much adoption). It was also quite suboptimal from a
CPU-efficiency POV.

Besides, async completion in itself is not enough, QEMU still needs to
honor ordered writes (barriers) and cache flushes.
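
To make that concrete, here is a minimal sketch of such a flush barrier
on top of plain POSIX AIO: drain the in-flight writes, fsync the backing
file, and only then report flush completion to the guest. This is
illustrative only; the request array and the backing fd are assumptions,
not the AIO backend being discussed.

#include <aio.h>
#include <errno.h>
#include <unistd.h>

/* Wait for every in-flight write, then force the host page cache out.
 * Only after this returns 0 should the guest's flush command complete. */
static int flush_barrier(int backing_fd, struct aiocb **reqs, int nreqs)
{
    int i;

    for (i = 0; i < nreqs; i++) {
        const struct aiocb *one[1] = { reqs[i] };

        while (aio_error(reqs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);      /* block until it completes */
        if (aio_return(reqs[i]) < 0)
            return -1;                      /* surface host I/O errors */
    }
    return fsync(backing_fd);               /* data really hits the disk */
}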

 My impression was that the initial AIO implementation is just
 straight serial async operation. IO wouldn't actually go any faster,
 it just means the guest can do something else while it's waiting.

Depends on the app: if the I/O workload is parallel, then you should see
a nice speedup as well (as QEMU is then no longer the serializing
bottleneck).

-- 
Jens Axboe





Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread Jens Axboe
On Sat, Jul 29 2006, Rik van Riel wrote:
 Fabrice Bellard wrote:
 Hi,
 
 Using O_SYNC for disk image access is not acceptable: QEMU relies on the 
 host OS to ensure that the data is written correctly.
 
 This means that write ordering is not preserved, and on a power
 failure any data written by qemu (or Xen fully virt) guests may
 not be preserved.
 
 Applications running on the host can count on fsync doing the
 right thing, meaning that if they call fsync, the data *will*
 have made it to disk.  Applications running inside a guest have
 no guarantees that their data is actually going to make it
 anywhere when fsync returns...

Then the guest OS is broken. When an application issues an fsync(), the
guest OS should issue a flush (or use write-through), propagate that
knowledge through its I/O stack, and the QEMU hard drive should get
notified. If the guest OS isn't doing what it's supposed to, QEMU can't
help you. And, in fact, running your app on the same host OS with
write-back caching would screw you as well. The timing window will
probably be larger with QEMU, but the problem is essentially the same.
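
For illustration, the host end of that chain could look roughly like the
sketch below. 0xe7 is the standard ATA FLUSH CACHE opcode; the function
and its arguments are invented for the example and are not the actual
hw/ide.c interface.

#include <unistd.h>

/* Sketch only: the last link of the chain
 *   guest app fsync() -> guest fs issues ATA FLUSH CACHE (0xe7)
 *                     -> IDE emulation -> fsync() on the image file. */
static int handle_ide_command(int backing_fd, unsigned char cmd)
{
    if (cmd == 0xe7)                   /* FLUSH CACHE */
        return fsync(backing_fd);      /* 0 on success, -1 on host error */
    return 0;                          /* other commands: out of scope here */
}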

-- 
Jens Axboe





Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread Jonas Maebe


On 31 jul 2006, at 09:08, Jens Axboe wrote:


  Applications running on the host can count on fsync doing the
  right thing, meaning that if they call fsync, the data *will*
  have made it to disk.  Applications running inside a guest have
  no guarantees that their data is actually going to make it
  anywhere when fsync returns...

 Then the guest OS is broken.


The problem is that supposedly many OSes are broken in this way. See
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html


Jonas




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread Jens Axboe
On Mon, Jul 31 2006, Jonas Maebe wrote:
 
 On 31 jul 2006, at 09:08, Jens Axboe wrote:
 
 Applications running on the host can count on fsync doing the
 right thing, meaning that if they call fsync, the data *will*
 have made it to disk.  Applications running inside a guest have
 no guarantees that their data is actually going to make it
 anywhere when fsync returns...
 
 Then the guest OS is broken.
 
 The problem is that supposedly many OSes are broken in this way. See
 http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

Well, as others have written here, their OSes are then broken on real
hardware as well.

I wouldn't be averse to a QEMU workaround, but O_SYNC is clearly not a
viable alternative! We could make QEMU behave more like a real hard
drive once it has AIO support, flushing dirty data out in a manner that
more closely mimics what a drive would do, instead of relying on the
page cache writeout to decide when to write it out.
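
As a sketch of that idea (invented names, not QEMU code): a background
thread could bound how long dirty guest data sits in the host page cache
by flushing the image file every few seconds, roughly the way a drive's
own write cache drains itself.

#include <pthread.h>
#include <unistd.h>

struct flusher {
    int fd;              /* backing image file */
    int interval_secs;   /* upper bound on how long dirty data may linger */
};

/* Thread body: periodically push dirty pages for this image to disk. */
static void *periodic_flush(void *arg)
{
    struct flusher *f = arg;

    for (;;) {
        sleep(f->interval_secs);
        fdatasync(f->fd);
    }
    return NULL;
}

A real implementation would start one such thread per image with
pthread_create(), stop it when the image is closed, and report
fdatasync() failures to the guest instead of ignoring them.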

-- 
Jens Axboe





Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread Jens Axboe
On Mon, Jul 31 2006, andrzej zaborowski wrote:
 On 30/07/06, Jamie Lokier [EMAIL PROTECTED] wrote:
 Rik van Riel wrote:
  This may look like hair splitting, but so far I've lost a
  (test) postgresql database to this 3 times already.  Not getting
  the guest application's data to disk when the application calls
  fsync is a recipe for disaster.
 
 Exactly the same thing happens with real IDE disks if IDE write
 caching (on the drive itself) is enabled, which it is by default.  It
 is rarer, but it happens.
 
 The little difference with QEMU is that there are two caches above it:
 the host OS's software cache and the IDE hardware cache. When a guest
 OS flushes its own software cache, its precious data goes to the host's
 software cache while the guest thinks it's already in the IDE cache. This
 is of course of less importance because data in both caches (hardware
 and software) is lost when the power is cut off.

But the drive cache does not let the dirty data linger for as long as
the OS page/buffer cache does.

 IMHO what really makes IO unreliable in QEMU is that IO errors on the
 host are not reported to the guest by the IDE emulation and there's an
 exact place in hw/ide.c where they are arrogantly ignored.

Send a patch, I'm pretty sure nobody would disagree :-)

-- 
Jens Axboe





Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-31 Thread andrzej zaborowski

On 31/07/06, Jens Axboe [EMAIL PROTECTED] wrote:

 On Mon, Jul 31 2006, andrzej zaborowski wrote:
  On 30/07/06, Jamie Lokier [EMAIL PROTECTED] wrote:
  Rik van Riel wrote:
   This may look like hair splitting, but so far I've lost a
   (test) postgresql database to this 3 times already.  Not getting
   the guest application's data to disk when the application calls
   fsync is a recipe for disaster.
 
  Exactly the same thing happens with real IDE disks if IDE write
  caching (on the drive itself) is enabled, which it is by default.  It
  is rarer, but it happens.
 
  The little difference with QEMU is that there are two caches above it:
  the host OS's software cache and the IDE hardware cache. When a guest
  OS flushes its own software cache, its precious data goes to the host's
  software cache while the guest thinks it's already in the IDE cache. This
  is of course of less importance because data in both caches (hardware
  and software) is lost when the power is cut off.
 
 But the drive cache does not let the dirty data linger for as long as
 the OS page/buffer cache does.


I would say this is an argument for actually using O_SYNC.



  IMHO what really makes IO unreliable in QEMU is that IO errors on the
  host are not reported to the guest by the IDE emulation and there's an
  exact place in hw/ide.c where they are arrogantly ignored.
 
 Send a patch, I'm pretty sure nobody would disagree :-)


Here's what I proposed:
http://lists.gnu.org/archive/html/qemu-devel/2005-12/msg00275.html but
I'm afraid it's not correct :P
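
For what it's worth, the general shape of reporting such an error to the
guest is small. The sketch below is illustrative only (invented struct
and helper, not the proposed patch): complete the command with the
standard ATA ERR/DRDY status bits and an ABRT error code instead of
pretending the transfer succeeded.

struct ide_regs {
    unsigned char status;   /* ATA status register as seen by the guest */
    unsigned char error;    /* ATA error register */
};

static void ide_fail_command(struct ide_regs *r, void (*raise_irq)(void))
{
    r->status = 0x40 /* DRDY */ | 0x01 /* ERR */;
    r->error  = 0x04 /* ABRT */;
    raise_irq();            /* let the guest driver notice the failure */
}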



 --
 Jens Axboe





--
balrog 2oo6

Dear Outlook users: Please remove me from your address books
http://www.newsforge.com/article.pl?sid=03/08/21/143258




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-30 Thread Jamie Lokier
Rik van Riel wrote:
 This may look like hair splitting, but so far I've lost a
 (test) postgresql database to this 3 times already.  Not getting
 the guest application's data to disk when the application calls
 fsync is a recipe for disaster.

Exactly the same thing happens with real IDE disks if IDE write
caching (on the drive itself) is enabled, which it is by default.  It
is rarer, but it happens.

I've seen this with Linux 2.4 kernels writing to ext3 (real, not
virtual).  Filesystem metadata gets corrupted from time to time if
power is removed, because write ordering is not preserved.  Disabling
IDE write caching fixes it, but the performance impact is huge on some
systems.

Linux 2.6 kernels will issue IDE cache flush commands, at least with
ext3, to commit data to disk when fsync is called, and to preserve
journal/metadata ordering.

Doesn't qemu fsync the host file corresponding to the emulated disk,
when the guest OS issues an IDE cache flush?

For IDE emulation to be as reliable for data storage as a real disk,
it should:

- fsync the host file whenever the guest OS issues an IDE cache
  flush command.

- use O_SYNC (or fsync after each write or aio equivalent, etc.) _only_
  when the guest OS disables the IDE disk cache (not done by default).
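
A sketch of how those two rules could fit together (invented structure
and names, not QEMU's): SET FEATURES subcommand 0x02 enables the drive
write cache and 0x82 disables it (this is what hdparm -W 1/0 sends), so
the emulation only needs to remember that flag and act on it in the
write path.

#include <sys/types.h>
#include <unistd.h>

struct vdisk {
    int fd;                 /* host image file */
    int write_cache_on;     /* guest-visible write cache state */
};

static void set_features(struct vdisk *d, unsigned char subcmd)
{
    if (subcmd == 0x02)
        d->write_cache_on = 1;      /* guest enabled the write cache */
    else if (subcmd == 0x82)
        d->write_cache_on = 0;      /* guest wants write-through */
}

static ssize_t vdisk_write(struct vdisk *d, const void *buf,
                           size_t len, off_t ofs)
{
    ssize_t ret = pwrite(d->fd, buf, len, ofs);

    /* Cache "disabled": the data must be durable before completion.
     * Cache enabled: durability waits for an explicit FLUSH CACHE. */
    if (ret >= 0 && !d->write_cache_on && fsync(d->fd) < 0)
        return -1;
    return ret;
}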

-- Jamie




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-30 Thread Jamie Lokier
Bill C. Riemers wrote:
How about compromising, and making the patch a run-time option.
Presumably this is only a problem when the virtual machine is not
properly shut down. Those who want the extra security of knowing
the data will be written regardless of the shutdown status can
enable the flag. By default it could be turned off. Then everybody
can be happy.

Real disks don't provide that security unless you disable the disk's
cache, or issue cache flush instructions to the disk.

Modern guest OS filesystems are written with this in mind.

With older guest OSes, you have to disable the disk cache if you want
that kind of security with real disks.

Is there any reason why the emulation should be any different?

-- Jamie




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-29 Thread Fabrice Bellard

Hi,

Using O_SYNC for disk image access is not acceptable: QEMU relies on the
host OS to ensure that the data is written correctly. Even the current
'fsync' support is questionable, to say the least!


Please don't mix issues regarding QEMU disk handling and the underlying 
hypervisor/host OS block device handling.


Regards,

Fabrice.

Rik van Riel wrote:

This is the simple approach to making sure that disk writes actually
hit disk before we tell the guest OS that IO has completed.  Thanks
to DMA_MULTI_THREAD the performance still seems to be adequate.

A fancier solution would be to make the sync/non-sync behaviour of
the qemu disk backing store tunable from the guest OS, by tuning
the IDE disk write cache on/off with hdparm, and having hw/ide.c
call ->fsync functions in the block backends.

I'm willing to code up the fancy solution if people prefer that.
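
As a sketch of the shape that could take (not the actual QEMU block
driver interface): each image format backend exposes an fsync hook, and
hw/ide.c calls it when the guest issues FLUSH CACHE, or after every
write while the guest has the write cache disabled.

#include <unistd.h>

struct block_backend {
    int fd;                                   /* backing file */
    int (*fsync)(struct block_backend *bs);   /* flush to stable storage */
};

/* Raw image files: a plain fsync on the file descriptor is enough.
 * Formats like qcow would flush their own metadata first. */
static int raw_fsync(struct block_backend *bs)
{
    return fsync(bs->fd);
}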




Make sure disk writes really made it to disk before we report I/O
completion to the guest domain.  The DMA_MULTI_THREAD functionality
from the qemu-dm IDE emulation should make the performance overhead
of synchronous writes bearable, or at least comparable to native
hardware.

Signed-off-by: Rik van Riel [EMAIL PROTECTED]

--- xen-unstable-10712/tools/ioemu/block-bochs.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block-bochs.c  2006-07-28 02:21:08.0 -0400
@@ -91,7 +91,7 @@
     int fd, i;
     struct bochs_header bochs;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block.c  2006-07-28 02:19:27.0 -0400
@@ -677,7 +677,7 @@
     int rv;
 #endif
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-cloop.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block-cloop.c  2006-07-28 02:17:13.0 -0400
@@ -55,7 +55,7 @@
     BDRVCloopState *s = bs->opaque;
     uint32_t offsets_size,max_compressed_block_size=1,i;
 
-    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
+    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE | O_SYNC);
     if (s->fd < 0)
         return -1;
     bs->read_only = 1;
--- xen-unstable-10712/tools/ioemu/block-cow.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block-cow.c  2006-07-28 02:21:34.0 -0400
@@ -69,7 +69,7 @@
     struct cow_header_v2 cow_header;
    int64_t size;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-qcow.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block-qcow.c  2006-07-28 02:20:05.0 -0400
@@ -95,7 +95,7 @@
     int fd, len, i, shift;
     QCowHeader header;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-vmdk.c.osync  2006-07-28 02:15:56.0 -0400
+++ xen-unstable-10712/tools/ioemu/block-vmdk.c  2006-07-28 02:20:20.0 -0400
@@ -96,7 +96,7 @@
     uint32_t magic;
     int l1_size;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)






Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-29 Thread Rik van Riel

Fabrice Bellard wrote:

Hi,

Using O_SYNC for disk image access is not acceptable: QEMU relies on the 
host OS to ensure that the data is written correctly.


This means that write ordering is not preserved, and on a power
failure any data written by qemu (or Xen fully virt) guests may
not be preserved.

Applications running on the host can count on fsync doing the
right thing, meaning that if they call fsync, the data *will*
have made it to disk.  Applications running inside a guest have
no guarantees that their data is actually going to make it
anywhere when fsync returns...

This may look like hair splitting, but so far I've lost a
(test) postgresql database to this 3 times already.  Not getting
the guest application's data to disk when the application calls
fsync is a recipe for disaster.

--
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it. - Brian W. Kernighan




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-29 Thread Paul Brook
 Easy to do with the fsync infrastructure, but probably not worth
 doing since people are working on the AIO I/O backend, which would
 allow multiple outstanding writes from a guest.  That, in turn,
 means I/O completion in the guest can be done when the data really
 hits disk, but without a performance impact.

Not entirely true. That only works if you allow multiple guest IO requests in 
parallel, ie. some form of tagged command queueing. This requires either 
improving the SCSI emulation, or implementing SATA emulation. AFAIK parallel 
IDE doesn't support command queueing.

My impression was that the initial AIO implementation is just straight serial
async operation. IO wouldn't actually go any faster, it just means the guest 
can do something else while it's waiting.

Paul




Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk

2006-07-29 Thread Bill C. Riemers
How about compromising, and making the patch a run-time option.
Presumably this is only a problem when the virtual machine is not
properly shut down. Those who want the extra security of knowing the
data will be written regardless of the shutdown status can enable the
flag. By default it could be turned off. Then everybody can be happy.
Bill

On 7/29/06, Rik van Riel [EMAIL PROTECTED] wrote:
 Fabrice Bellard wrote:
  Hi,
 
  Using O_SYNC for disk image access is not acceptable: QEMU relies on the
  host OS to ensure that the data is written correctly.
 
 This means that write ordering is not preserved, and on a power
 failure any data written by qemu (or Xen fully virt) guests may
 not be preserved.
 
 Applications running on the host can count on fsync doing the
 right thing, meaning that if they call fsync, the data *will*
 have made it to disk.  Applications running inside a guest have
 no guarantees that their data is actually going to make it
 anywhere when fsync returns...
 
 This may look like hair splitting, but so far I've lost a
 (test) postgresql database to this 3 times already.  Not getting
 the guest application's data to disk when the application calls
 fsync is a recipe for disaster.
 
 --
 Debugging is twice as hard as writing the code in the first place.
 Therefore, if you write the code as cleverly as possible, you are,
 by definition, not smart enough to debug it. - Brian W. Kernighan