Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-16 Thread Stefan Hajnoczi
On Sat, Apr 16, 2011 at 12:00 AM, Anthony Liguori  wrote:
> 3) We have no way to detect when we no longer need a work around which makes
> (2) really unappealing.

I agree.

> 4) That leaves us with:
>    a) waiting for NFS to get fixed properly and just living with worse
> performance on older kernels
>
>    b) having a user-tunable switch to enable bouncing
>
> I really dislike the idea of (b) because we're stuck with it forever and
> it's yet another switch for people to mistakenly depend on.

The user-tunable switch is potentially interesting for performance
troubleshooting.  We have seen another file system which has issues
with vectored direct I/O.  It would have been much easier to identify
the problem by telling the user "Try running it with linearize=on and
see if it makes a difference".
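Something like the following, purely hypothetical, invocation is what I
have in mind (the linearize option does not exist today; the path and
option name are only placeholders for illustration):

  qemu -drive file=/mnt/nfs/guest.img,format=raw,cache=none,linearize=on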

But let's try harder on linux-nfs.

Stefan



Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 04:33:51PM -0700, Badari Pulavarty wrote:
> To be honest with you, we should kill cache=none and just optimize only
> one case and live with it (like the other commercial hypervisors). :(

cache=none is the only sane mode for qemu, modulo bugs in nfs or similar weird
protocols that no sane person should use for VM storage.




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 03:21:36PM -0700, Badari Pulavarty wrote:
> When you say "you're" - you really meant RH, right?  RH should have caught
> this in their regression testing a year ago as part of their first beta.
> Correct?

By "you" I mean whoever cares.  Which apparently is no one but IBM.




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Badari Pulavarty

On 4/15/2011 4:00 PM, Anthony Liguori wrote:
> On 04/15/2011 05:21 PM, pbad...@linux.vnet.ibm.com wrote:
>> On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
>>> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>>>> True. That brings up a different question - whether we are doing
>>>> enough testing on mainline QEMU :(
>>>
>>> It seems you're clearly not doing enough testing on any qemu.  Even
>>> the RHEL6 qemu has had preadv/pwritev since the first beta.
>>
>> Christoph,
>>
>> When you say "you're" - you really meant RH, right?  RH should have
>> caught this in their regression testing a year ago as part of their
>> first beta. Correct?
>>
>> Unfortunately, you are picking on the person who spent time finding &
>> analyzing the regression, narrowing the problem area and suggesting
>> approaches to address the issue :(
>
> This is a pretty silly discussion to be having.
>
> The facts are:
>
> 1) NFS sucks with preadv/pwritev and O_DIRECT -- is anyone really
> surprised?
>
> 2) We could work around this in QEMU by doing something ugly
>
> 3) We have no way to detect when we no longer need a work around which
> makes (2) really unappealing.
>
> 4) That leaves us with:
>     a) waiting for NFS to get fixed properly and just living with
> worse performance on older kernels
>
>     b) having a user-tunable switch to enable bouncing
>
> I really dislike the idea of (b) because we're stuck with it forever
> and it's yet another switch for people to mistakenly depend on.
>
> I'm still waiting to see performance data without O_DIRECT.  I suspect
> that using cache=writethrough will make most of this problem go away
> in which case, we can just recommend that as a work around until NFS
> is properly fixed.

We need to run through all the cases and analyze the performance of
cache=writethrough. Our initial (smaller setup) analysis indicates that
it's better than unpatched O_DIRECT - but ~5% slower for sequential
writes, and 30%+ slower for random read/write and mixed I/O workloads.
(In the past, NFS O_SYNC performance was extremely poor compared to
O_DIRECT, with no scaling on older kernels due to congestion control
issues.)

Khoa will collect the data over the next few days.

To be honest with you, we should kill cache=none and just optimize only
one case and live with it (like the other commercial hypervisors). :(

Thanks,
Badari





Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Anthony Liguori

On 04/15/2011 05:21 PM, pbad...@linux.vnet.ibm.com wrote:
> On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
>> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>>> True. That brings up a different question - whether we are doing
>>> enough testing on mainline QEMU :(
>>
>> It seems you're clearly not doing enough testing on any qemu.  Even
>> the RHEL6 qemu has had preadv/pwritev since the first beta.
>
> Christoph,
>
> When you say "you're" - you really meant RH, right?  RH should have
> caught this in their regression testing a year ago as part of their
> first beta. Correct?
>
> Unfortunately, you are picking on the person who spent time finding &
> analyzing the regression, narrowing the problem area and suggesting
> approaches to address the issue :(


This is a pretty silly discussion to be having.

The facts are:

1) NFS sucks with preadv/pwritev and O_DIRECT -- is anyone really surprised?

2) We could work around this in QEMU by doing something ugly

3) We have no way to detect when we no longer need a work around which 
makes (2) really unappealing.


4) That leaves us with:
    a) waiting for NFS to get fixed properly and just living with worse
performance on older kernels

    b) having a user-tunable switch to enable bouncing

I really dislike the idea of (b) because we're stuck with it forever and 
it's yet another switch for people to mistakenly depend on.


I'm still waiting to see performance data without O_DIRECT.  I suspect 
that using cache=writethrough will make most of this problem go away in 
which case, we can just recommend that as a work around until NFS is 
properly fixed.
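As a sketch of that workaround (the image path below is just a
placeholder), the guest would be started with the host page cache and
write-through semantics instead of O_DIRECT:

  qemu -drive file=/mnt/nfs/guest.img,format=raw,cache=writethrough ...

At least in this era of QEMU, cache=writethrough opens the image without
O_DIRECT (using O_DSYNC and the host page cache), so the NFS vectored
direct I/O path is never hit.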


Regards,

Anthony Liguori



Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Badari Pulavarty

On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>> True. That brings up a different question - whether we are doing
>> enough testing on mainline QEMU :(
>
> It seems you're clearly not doing enough testing on any qemu.  Even
> the RHEL6 qemu has had preadv/pwritev since the first beta.

Christoph,

When you say "you're" - you really meant RH, right?  RH should have
caught this in their regression testing a year ago as part of their
first beta. Correct?

Unfortunately, you are picking on the person who spent time finding &
analyzing the regression, narrowing the problem area and suggesting
approaches to address the issue :(

BTW, I am not sure RH discussion is appropriate in this forum.

Thanks,
Badari





Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Badari Pulavarty
On Fri, 2011-04-15 at 13:09 -0500, Anthony Liguori wrote:
> On 04/15/2011 11:23 AM, Badari Pulavarty wrote:
> > On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:
> >> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> >>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:
> >>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
> >>>> adding crap like that.
> >>> That's totally fine if NFS client will be fixed in the near future but
> >>> this doesn't seem likely:
> >>>
> >>> http://www.spinics.net/lists/linux-nfs/msg20462.html
> >> The code to use preadv/pwritev has been in qemu for over 2 years,
> >> and it took until now for people to notice the NFS slowdown, so don't
> >> expect it to be fixed three days later.
> > True. That brings up a different question - whether we are doing
> > enough testing on mainline QEMU :(
> 
> The issue here is NFS, not QEMU.  

Sure. But we should have caught the regression on NFS when the
preadv/pwritev change went into QEMU, or before it went in -- shouldn't
we?  Since that was 2 years ago (like hch indicated), we could have
fixed NFS long ago :)

> Moreover, the real problem is that 
> we're using O_DIRECT.  O_DIRECT seems to result in nothing but problems 
> and it never seems to be tested well on any file system.

O_DIRECT was added with a specific use-case in mind - and it's supposed
to handle only that case well (pre-allocated files used by databases,
where the DB has its own caching layer). That case is well tested
by various DB vendors on most (important) local filesystems.
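For readers less familiar with that constraint, here is a minimal
(non-QEMU) sketch of what O_DIRECT demands of its callers: the buffer
address, file offset, and transfer length all have to be suitably
aligned, typically to the logical block size (512 bytes is assumed
below; the file name is a placeholder):

  #define _GNU_SOURCE             /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void)
  {
      void *buf;
      int fd = open("disk.img", O_RDONLY | O_DIRECT);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* The buffer must be aligned; plain malloc()'d memory generally is not. */
      if (posix_memalign(&buf, 512, 4096) != 0) {
          close(fd);
          return 1;
      }
      /* The offset and length must also be multiples of the block size. */
      if (pread(fd, buf, 4096, 0) < 0) {
          perror("pread");
      }
      free(buf);
      close(fd);
      return 0;
  }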

You know very well why we are on O_DIRECT path :)

> 
> I think the fundamental problem we keep running into really boils down 
> to O_DIRECT being a second class interface within Linux.

It's by design. It's a special case for a specific use.

Thanks,
Badari





Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Anthony Liguori

On 04/15/2011 11:23 AM, Badari Pulavarty wrote:

On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:

On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:

On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:

NAK. ?Just wait for the bloody NFS client fix to get in instead of
adding crap like that.

That's totally fine if NFS client will be fixed in the near future but
this doesn't seem likely:

http://www.spinics.net/lists/linux-nfs/msg20462.html

The code to use preadv/pwritev has been in qemu for over 2 years,
and it took people to notice the NFS slowdown until now, so don't
expect it to be fixed three days layer.

True. That brings up a different question - whether we are doing
enough testing on mainline QEMU :(


The issue here is NFS, not QEMU.  Moreover, the real problem is that 
we're using O_DIRECT.  O_DIRECT seems to result in nothing but problems 
and it never seems to be tested well on any file system.


I think the fundamental problem we keep running into really boils down 
to O_DIRECT being a second class interface within Linux.


Regards,

Anthony Liguori




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
> True. That brings up a different question - whether we are doing
> enough testing on mainline QEMU :(

It seems you're clearly not doing enough testing on any qemu.  Even
the RHEL6 qemu has had preadv/pwritev since the first beta.

> I asked Stefan to work on a QEMU patch - since my QEMU skills are
> limited and Stefan is a more appropriate person to deal with QEMU :)
> 
> As mentioned earlier, we don't see performance with any of the temporary
> solutions suggested so far (RPC table slot + ASYNC) coming anywhere
> close to the readv/writev support patch or a linearized QEMU.

So push harder for the real patch.  I haven't seen a clear NAK from
Trond, just that he'd like to avoid it.  For extra brownie points get
him to release his read/write rework and integrate it with that.




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 11:10:37AM -0500, Anthony Liguori wrote:
> In general, since we are userspace, we should try to run well on whatever 
> kernel we're on.

It should run with the interfaces given, but hacking around performance
bugs in a gross way is not something qemu should do.




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Badari Pulavarty
On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:
> > > NAK.  Just wait for the bloody NFS client fix to get in instead of
> > > adding crap like that.
> > 
> > That's totally fine if NFS client will be fixed in the near future but
> > this doesn't seem likely:
> > 
> > http://www.spinics.net/lists/linux-nfs/msg20462.html
> 
> The code to use preadv/pwritev has been in qemu for over 2 years,
> and it took until now for people to notice the NFS slowdown, so don't
> expect it to be fixed three days later.

True. That brings up a different question - whether we are doing
enough testing on mainline QEMU :(

> I can't even see you in the relevant threads arguing for getting it
> fixed, so don't complain.

I asked Stefan to work on a QEMU patch - since my QEMU skills are
limited and Stefan is a more appropriate person to deal with QEMU :)

As mentioned earlier, we don't see performance with any of the temporary
solutions suggested so far (RPC table slot + ASYNC) coming anywhere
close to the readv/writev support patch or a linearized QEMU.

Thanks,
Badari




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Stefan Hajnoczi
On Fri, Apr 15, 2011 at 5:10 PM, Anthony Liguori  wrote:
> On 04/15/2011 10:34 AM, Christoph Hellwig wrote:
>>
>> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
>>>
>>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:

>>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
>>>> adding crap like that.
>>>
>>> That's totally fine if NFS client will be fixed in the near future but
>>> this doesn't seem likely:
>>>
>>> http://www.spinics.net/lists/linux-nfs/msg20462.html
>>
>> The code to use preadv/pwritev has been in qemu for over 2 years,
>> and it took until now for people to notice the NFS slowdown, so don't
>> expect it to be fixed three days later.
>>
>> I can't even see you in the relevant threads arguing for getting it
>> fixed, so don't complain.
>
> In general, since we are userspace, we should try to run well on whatever
> kernel we're on.
>
> What I don't like about this patch is the likelihood of false positives.
>  We check for Linux and for NFS but that means an old userspace is doing
> suboptimal things on newer kernels.  Even if we had a kernel version check,
> if the fix gets backported to an older kernel, we'll still get a false
> positive.
>
> Ideally, we'd be able to query the kernel to see whether we should bounce or
> not.  But AFAIK there is nothing even close to an interface to do this
> today.

Bah, good point.  I was planning to sneak in a uname() later but that
really doesn't cut it due to backports/downstream.
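For what it's worth, the check I had in mind was roughly the following
sketch -- and, as noted, a distro backport of the NFS fix defeats it,
which is exactly why it doesn't cut it:

  #include <stdbool.h>
  #include <stdio.h>
  #include <sys/utsname.h>

  /* Sketch only: decide whether to bounce based on the running kernel
   * version.  A backported fix in an older kernel makes this report a
   * false positive, so it cannot be relied on. */
  static bool kernel_release_older_than(int want_major, int want_minor)
  {
      struct utsname uts;
      int major = 0, minor = 0;

      if (uname(&uts) != 0) {
          return false;               /* cannot tell, assume new enough */
      }
      if (sscanf(uts.release, "%d.%d", &major, &minor) != 2) {
          return false;
      }
      return major < want_major ||
             (major == want_major && minor < want_minor);
  }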

Stefan



Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Anthony Liguori

On 04/15/2011 10:34 AM, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:
>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
>>> adding crap like that.
>>
>> That's totally fine if NFS client will be fixed in the near future but
>> this doesn't seem likely:
>>
>> http://www.spinics.net/lists/linux-nfs/msg20462.html
>
> The code to use preadv/pwritev has been in qemu for over 2 years,
> and it took until now for people to notice the NFS slowdown, so don't
> expect it to be fixed three days later.
>
> I can't even see you in the relevant threads arguing for getting it
> fixed, so don't complain.


In general, since we are userspace, we should try to run well on 
whatever kernel we're on.


What I don't like about this patch is the likelihood of false
positives.  We check for Linux and for NFS but that means an old
userspace is doing suboptimal things on newer kernels.  Even if we had a
kernel version check, if the fix gets backported to an older kernel,
we'll still get a false positive.


Ideally, we'd be able to query the kernel to see whether we should 
bounce or not.  But AFAIK there is nothing even close to an interface to 
do this today.


Regards,

Anthony Liguori




Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:
> > NAK.  Just wait for the bloody NFS client fix to get in instead of
> > adding crap like that.
> 
> That's totally fine if NFS client will be fixed in the near future but
> this doesn't seem likely:
> 
> http://www.spinics.net/lists/linux-nfs/msg20462.html

The code to use preadv/pwritev has been in qemu for over 2 years,
and it took until now for people to notice the NFS slowdown, so don't
expect it to be fixed three days later.

I can't even see you in the relevant threads arguing for getting it
fixed, so don't complain.



Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Stefan Hajnoczi
On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig  wrote:
> NAK.  Just wait for the bloody NFS client fix to get in instead of
> adding crap like that.

That's totally fine if NFS client will be fixed in the near future but
this doesn't seem likely:

http://www.spinics.net/lists/linux-nfs/msg20462.html

Stefan



Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Christoph Hellwig
NAK.  Just wait for the bloody NFS client fix to get in instead of
adding crap like that.




[Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS

2011-04-15 Thread Stefan Hajnoczi
The Linux NFS client issues separate NFS requests for vectored direct
I/O writes.  For example, a pwritev() with 8 elements results in 8 write
requests to the server.  This is very inefficient and a kernel-side fix
is not trivial or likely to be available soon.

This patch detects files on NFS and uses the QEMU_AIO_MISALIGNED flag to
force requests to bounce through a linear buffer.
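To make the idea concrete, here is a minimal sketch of the bounce-buffer
concept (not the code path this patch actually takes -- the patch reuses
the existing QEMU_AIO_MISALIGNED handling): the scattered buffers are
copied into one contiguous buffer and submitted as a single write, so
the NFS client issues one request instead of one per iovec element.

  /* Sketch: linearize an iovec array into one bounce buffer and issue a
   * single write.  Error handling and the O_DIRECT alignment of the
   * bounce buffer are omitted for brevity. */
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  static ssize_t pwritev_linearized(int fd, const struct iovec *iov,
                                    int iovcnt, off_t offset)
  {
      size_t total = 0, pos = 0;
      ssize_t ret;
      char *bounce;
      int i;

      for (i = 0; i < iovcnt; i++) {
          total += iov[i].iov_len;
      }
      bounce = malloc(total);      /* real code needs a sector-aligned buffer */
      if (!bounce) {
          return -1;
      }
      for (i = 0; i < iovcnt; i++) {
          memcpy(bounce + pos, iov[i].iov_base, iov[i].iov_len);
          pos += iov[i].iov_len;
      }
      ret = pwrite(fd, bounce, total, offset);   /* one request, not iovcnt */
      free(bounce);
      return ret;
  }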

Khoa Huynh  reports the following ffsb benchmark
results over 1 Gbit Ethernet:

Test (threads=8)                 unpatched   patched
                                    (MB/s)    (MB/s)
Large File Creates (bs=256 KB)        20.5     112.0
Sequential Reads (bs=256 KB)          58.7     112.0
Large File Creates (bs=8 KB)           5.2       5.8
Sequential Reads (bs=8 KB)            46.7      80.9
Random Reads (bs=8 KB)                 8.7      23.4
Random Writes (bs=8 KB)               39.6      44.0
Mail Server (bs=8 KB)                 10.2      23.6

Test (threads=1)                 unpatched   patched
                                    (MB/s)    (MB/s)
Large File Creates (bs=256 KB)        14.5      49.8
Sequential Reads (bs=256 KB)          87.9      83.9
Large File Creates (bs=8 KB)           4.8       4.8
Sequential Reads (bs=8 KB)            23.2      23.1
Random Reads (bs=8 KB)                 4.8       4.7
Random Writes (bs=8 KB)                9.4      12.8
Mail Server (bs=8 KB)                  5.4       7.3

Signed-off-by: Stefan Hajnoczi 
---
 block/raw-posix.c |   55 
 1 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 6b72470..40b7c61 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -49,8 +49,10 @@
 #ifdef __linux__
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
 #include 
@@ -124,6 +126,7 @@ typedef struct BDRVRawState {
 #endif
 uint8_t *aligned_buf;
 unsigned aligned_buf_size;
+bool force_linearize;
 #ifdef CONFIG_XFS
 bool is_xfs : 1;
 #endif
@@ -136,6 +139,32 @@ static int64_t raw_getlength(BlockDriverState *bs);
 static int cdrom_reopen(BlockDriverState *bs);
 #endif
 
+#if defined(__linux__)
+static bool is_vectored_io_slow(int fd, int open_flags)
+{
+struct statfs stfs;
+int ret;
+
+do {
+ret = fstatfs(fd, &stfs);
+} while (ret != 0 && errno == EINTR);
+
+/*
+ * Linux NFS client splits vectored direct I/O requests into separate NFS
+ * requests so it is faster to submit a single buffer instead.
+ */
+if (!ret && stfs.f_type == NFS_SUPER_MAGIC && (open_flags & O_DIRECT)) {
+return true;
+}
+return false;
+}
+#else /* !defined(__linux__) */
+static bool is_vectored_io_slow(int fd, int open_flags)
+{
+return false;
+}
+#endif
+
 static int raw_open_common(BlockDriverState *bs, const char *filename,
int bdrv_flags, int open_flags)
 {
@@ -167,6 +196,7 @@ static int raw_open_common(BlockDriverState *bs, const char 
*filename,
 }
 s->fd = fd;
 s->aligned_buf = NULL;
+s->force_linearize = is_vectored_io_slow(fd, s->open_flags);
 
 if ((bdrv_flags & BDRV_O_NOCACHE)) {
 /*
@@ -536,20 +566,27 @@ static BlockDriverAIOCB *raw_aio_submit(BlockDriverState 
*bs,
 return NULL;
 
 /*
+ * Check if buffers need to be copied into a single linear buffer.
+ */
+if (s->force_linearize && qiov->niov > 1) {
+type |= QEMU_AIO_MISALIGNED;
+}
+
+/*
  * If O_DIRECT is used the buffer needs to be aligned on a sector
- * boundary.  Check if this is the case or telll the low-level
+ * boundary.  Check if this is the case or tell the low-level
  * driver that it needs to copy the buffer.
  */
-if (s->aligned_buf) {
-if (!qiov_is_aligned(bs, qiov)) {
-type |= QEMU_AIO_MISALIGNED;
+if (s->aligned_buf && !qiov_is_aligned(bs, qiov)) {
+type |= QEMU_AIO_MISALIGNED;
+}
+
 #ifdef CONFIG_LINUX_AIO
-} else if (s->use_aio) {
-return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
-   nb_sectors, cb, opaque, type);
-#endif
-}
+if (s->use_aio && (type & QEMU_AIO_MISALIGNED) == 0) {
+return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
+   nb_sectors, cb, opaque, type);
 }
+#endif
 
 return paio_submit(bs, s->fd, sector_num, qiov, nb_sectors,
cb, opaque, type);
-- 
1.7.4.1