Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-14 Thread Avi Kivity
Izik Eidus wrote: 
 But when using O_DIRECT you actually make the pages not swappable at
 all... or am I wrong?

Only for the duration of the I/O operation, which is typically very short.
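
As a concrete illustration of that point, here is a minimal userspace
sketch (not from the thread; the file name and the 4 KiB alignment are
assumptions): the kernel pins the pages backing the buffer only while the
pread() below is in flight, and they become ordinary swappable memory
again as soon as it returns.

/* O_DIRECT read with a suitably aligned buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;

    /* O_DIRECT requires the buffer, offset and length to be aligned,
     * typically to the device's logical block size; 4096 is a common
     * safe choice. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    int fd = open("disk.img", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The buffer's pages are pinned only for the duration of this call. */
    if (pread(fd, buf, 4096, 0) < 0)
        perror("pread");

    close(fd);
    free(buf);
    return 0;
}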


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Dor Laor

Avi Kivity wrote:

Chris Wright wrote:

I think it's safe to say the perf folks are concerned w/ data integrity
first, stable/reproducible results second, and raw performance third.

So seeing data cached in host was simply not what they expected.  I think
write through is sufficient.  However I think that uncached vs. wt will
show up on the radar under reproducible results (need to tune based on
cache size).  And in most overcommit scenarios memory is typically more
precious than cpu, it's unclear to me if the extra buffering is anything
other than memory overhead.  As long as it's configurable then it's
comparable and benchmarking and best practices can dictate best choice.


Getting good performance because we have a huge amount of free memory 
in the host is not a good benchmark.  Under most circumstances, the 
free memory will be used either for more guests, or will be given to 
the existing guests, which can utilize it more efficiently than the host.


I can see two cases where this is not true:

- using older, 32-bit guests which cannot utilize all of the cache.  I 
think Windows XP is limited to 512MB of cache, and usually doesn't 
utilize even that.  So if you have an application running on 32-bit 
Windows (or on 32-bit Linux with pae disabled), and a huge host, you 
will see a significant boost from cache=writethrough.  This is a case 
where performance can exceed native, simply because native cannot 
exploit all the resources of the host.


- if cache requirements vary in time across the different guests, and 
if some smart ballooning is not in place, having free memory on the 
host means we utilize it for whichever guest has the greatest need, so 
overall performance improves.




Another justification for O_DIRECT is that many production systems will 
use base images for their VMs.
It's mainly true for desktop virtualization, but probably also for some 
server virtualization deployments.
In these types of scenarios, we can have the whole base image chain 
opened by default with caching for read-only access, while the leaf 
images are opened with cache=off.
Since there is an ongoing effort (both by IT and by developers) to keep 
the base images as big as possible, this guarantees that this data is 
best suited for caching in the host while the private leaf images will 
be uncached.
This way we provide good performance and caching for the shared parent 
images while also promising correctness.

Actually this is what happens on mainline qemu with cache=off.
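
For what it's worth, here is a hedged sketch of the open policy described
above (this is not QEMU's actual block-layer code; the helper and the file
names are purely illustrative): read-only backing images go through the
host page cache, while the writable leaf bypasses it with O_DIRECT.

#define _GNU_SOURCE
#include <fcntl.h>

/* Illustrative only: cached read-only backing files, uncached leaf. */
static int open_image(const char *path, int writable_leaf)
{
    if (writable_leaf) {
        /* leaf image: read-write and uncached (cache=off behaviour) */
        return open(path, O_RDWR | O_DIRECT);
    }
    /* backing image: read-only and cached by the host */
    return open(path, O_RDONLY);
}

/* e.g. open_image("base.qcow2", 0); open_image("leaf.qcow2", 1); */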

Cheers,
Dor


Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Jamie Lokier
Dor Laor wrote:
 Actually this is what happens on mainline qemu with cache=off.

Have I understood right that cache=off on a qcow2 image only uses
O_DIRECT for the leaf image, and the chain of base images doesn't use
O_DIRECT?

Sometimes on a memory constrained host, where the (collective) guest
memory is nearly as big as the host memory, I'm not sure this is what
I want.

-- Jamie


Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Jamie Lokier
Chris Wright wrote:
 Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get
 it through to the host's storage subsystem, and I think that's been the core
 of the discussion (plus defaults, etc).

Just want to point out that the storage commitment from O_DIRECT can
be _weaker_ than O_DSYNC.

On Linux, O_DIRECT never uses storage-device barriers or
transactions, but O_DSYNC sometimes does, and fsync is even more
likely to than O_DSYNC.

I'm not certain, but I think the same applies to other host OSes too -
including Windows, which has its own equivalents to O_DSYNC and
O_DIRECT, and extra documented semantics when they are used together.

Although this is a host implementation detail, unfortunately it means
that O_DIRECT=no-cache and O_DSYNC=write-through-cache is not an
accurate characterisation.

Some might be misled into assuming that cache=off is as strongly
committing their data to hard storage as cache=wt would.

I think you can assume this only when the underlying storage devices'
write caches are disabled.  You cannot assume this if the host
filesystem uses barriers instead of disabling the storage devices'
write cache.

Unfortunately there's not a lot qemu can do about these various quirks,
but at least it should be documented, so that someone requiring
storage commitment (e.g. for a critical guest database) is advised to
investigate whether O_DIRECT and/or O_DSYNC give them what they
require with their combination of host kernel, filesystem, filesystem
options and storage device(s).
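
To make that advice concrete, here is an illustrative sketch (not code
from the thread; the path, alignment and helper are assumptions): an
O_DIRECT write bypasses the host page cache, but on many host setups it
does not by itself flush the drive's volatile write cache, so an
application that needs hard storage commitment can follow it with
fdatasync(), which lets the filesystem issue its barrier/flush machinery
where supported.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write one aligned block with O_DIRECT, then ask for a real flush. */
int durable_write(const char *path, const void *data, size_t len)
{
    void *buf;
    int fd, ret = -1;

    if (posix_memalign(&buf, 4096, 4096) != 0)
        return -1;
    memset(buf, 0, 4096);
    memcpy(buf, data, len > 4096 ? 4096 : len);

    fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    if (pwrite(fd, buf, 4096, 0) == 4096 &&
        fdatasync(fd) == 0)     /* barrier/cache flush, where supported */
        ret = 0;

    close(fd);
    free(buf);
    return ret;
}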

-- Jamie


Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Anthony Liguori

Jamie Lokier wrote:

Dor Laor wrote:
  

Actually this is what happens on mainline qemu with cache=off.



Have I understood right that cache=off on a qcow2 image only uses
O_DIRECT for the leaf image, and the chain of base images doesn't use
O_DIRECT?
  


Yeah, that's a bug IMHO and in my patch to add O_DSYNC, I fix that.  I 
think an argument for O_DIRECT in the leaf and wb in the base images is 
seriously flawed...


Regards,

Anthony Liguori


Sometimes on a memory constrained host, where the (collective) guest
memory is nearly as big as the host memory, I'm not sure this is what
I want.

-- Jamie


  




Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Anthony Liguori

Dor Laor wrote:

Avi Kivity wrote:

Since there is an ongoing effort (both by IT and by developers) to keep 
the base images as big as possible, this guarantees that this data is 
best suited for caching in the host while the private leaf images will 
be uncached.


A proper CAS solution is really a much better approach.  qcow2 
deduplication is an interesting concept, but such a hack :-)


This way we provide good performance and caching for the shared parent 
images while also promising correctness.


You get correctness by using O_DSYNC.  cache=off should disable the use 
of the page cache everywhere.
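
As a hedged illustration of the mapping being argued for here (this is not
QEMU's actual code, just a sketch of the semantics): cache=writethrough
keeps the host page cache but makes writes synchronous via O_DSYNC, while
cache=off uses O_DIRECT for every image in the chain so the page cache is
bypassed everywhere.

#define _GNU_SOURCE
#include <fcntl.h>

enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_OFF };

/* Illustrative translation of a cache mode into open(2) flags. */
static int cache_mode_to_flags(enum cache_mode mode)
{
    switch (mode) {
    case CACHE_WRITETHROUGH:
        return O_DSYNC;     /* cached reads, write-through writes */
    case CACHE_OFF:
        return O_DIRECT;    /* page cache bypassed for the whole chain */
    case CACHE_WRITEBACK:
    default:
        return 0;           /* cached, write-back (no commitment) */
    }
}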


Regards,

Anthony Liguori


Actually this is what happens on mainline qemu with cache=off.

Cheers,
Dor


Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-12 Thread Izik Eidus

Avi Kivity wrote:


LRU typically makes fairly bad decisions since it throws most of the
information it has away.  I recommend looking up LRU-K and similar
algorithms, just to get a feel for this; it is basically the simplest
possible algorithm short of random selection.

Note that Linux doesn't even have an LRU; it has to approximate since it
can't sample all of the pages all of the time.  With a hypervisor that
uses Intel's EPT, it's even worse since we don't have an accessed bit.
On silly benchmarks that just exercise the disk and touch no memory, and
if you tune the host very aggressively, LRU will win on long running
guests since it will eventually page out all unused guest memory (with
Linux guests, it will never even page guest memory in).  On real life
applications I don't think there is much chance.
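
(For readers unfamiliar with LRU-K, a toy sketch of the LRU-2 variant,
illustrative only and not tied to any of the code discussed here: instead
of evicting the page with the oldest single reference, evict the page
whose second-most-recent reference is oldest, which filters out one-off
scans.)

#include <stddef.h>
#include <stdint.h>

struct page_hist {
    uint64_t last_ref;  /* time of the most recent reference */
    uint64_t prev_ref;  /* time of the reference before that (0 = none) */
};

/* Record a reference to a page at time `now`. */
static void touch(struct page_hist *p, uint64_t now)
{
    p->prev_ref = p->last_ref;
    p->last_ref = now;
}

/* Pick a victim: the page with the oldest second-most-recent reference;
 * pages referenced only once (prev_ref == 0) are preferred victims. */
static size_t pick_victim(const struct page_hist *pages, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (pages[i].prev_ref < pages[victim].prev_ref)
            victim = i;
    return victim;
}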

  

But when using O_DIRECT you actually make the pages not swappable at all...
or am I wrong?
Maybe some kind of combination with the mm shrinker code could be good;
do_try_to_free_pages is a good point of reference.


Re: [Qemu-devel] [RFC] Disk integrity in QEMU

2008-10-11 Thread Chris Wright
* Mark Wagner ([EMAIL PROTECTED]) wrote:
 I think there are two distinct arguments going on here. My main concern is
 that I don't think this is a simple "what do we make the default cache
 policy be" issue. I think that regardless of the cache policy, if something
 in the guest requests O_DIRECT, the host must honor that and not cache the
 data.

OK, O_DIRECT in the guest is just one example of the guest requesting
data to be synchronously written to disk.  It bypasses the guest page cache,
but even page-cached writes need to be written at some point.  Any time
the disk driver issues an I/O where it expects the data to be on disk
(possible low-level storage subsystem caching aside) is the area of concern.

* Mark Wagner ([EMAIL PROTECTED]) wrote:
 Anthony Liguori wrote:
 It's extremely important to understand what the guarantee is.  The
 guarantee is that upon completion of write(), the data will have been
 reported as written by the underlying storage subsystem.  This does
 *not* mean that the data is on disk.

 I apologize if I worded it poorly, I assume that the guarantee is that
 the data has been sent to the storage controller and said controller
 sent an indication that the write has completed.  This could mean
 multiple things, like it's in the controller's cache, on the disk, etc.

 I do not believe that this means that the data is still sitting in the
 host cache.  I realize it may not yet be on a disk, but, at a minimum,
 I would expect that it has been sent to the storage controller.  Do you
 consider the host's cache to be part of the storage subsystem?

Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get
it through to the host's storage subsystem, and I think that's been the core
of the discussion (plus defaults, etc).

 In the case of KVM, even using write-back caching with the host page  
 cache, we are still honoring the guarantee of O_DIRECT.  We just have  
 another level of caching that happens to be write-back.

 I still don't get it.  If I have something running on the host that I
 open with O_DIRECT, do you still consider it not to be a violation of
 the system call if that data ends up in the host cache instead of being
 sent to the storage controller?

I suppose an argument could be made for host caching and write-back
to be considered part of the storage subsystem from the guest's point of
view, but then we also need to bring in the requirement for proper cache
flushing.  Given that a popular Linux guest fs can be a little fast and
loose, wb plus flushing isn't really the optimal choice for the integrity
case.

thanks,
-chris