Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-12 Thread Dr. David Alan Gilbert
* piaojun (piao...@huawei.com) wrote:
> 
> 
> On 2019/8/12 18:05, Stefan Hajnoczi wrote:
> > On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
> >> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> >>> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
>  * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >to eliminate the bad address problem?
> 
>  Are you thinking of doing all read/writes that way, or just the corner
>  cases? It doesn't seem worth it for the corner cases unless you're
>  finding them cropping up in real work loads.
> >>>
> >>> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> >>>
> >>> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> >>> MKDIR, etc).
> >>>
> >>
> >> Sorry for not catching your point. I would prefer virtiofsd to do the
> >> READ/WRITE requests and QEMU to handle the metadata requests, as
> >> virtiofsd is good at processing dataplane work thanks to its thread
> >> pool and CPU affinity (maybe in the future). As you said, virtiofsd is
> >> just acting as a vhost-user device, which should care less about ctrl
> >> requests.
> >>
> >> If our concern is improving mmap/write/read performance, why not add a
> >> delayed worker for unmap, which could reduce the number of unmap
> >> operations? Maybe virtiofsd could still handle both data and metadata
> >> requests that way.
> > 
> > Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
> > have access to guest RAM regions.  If a guest transfers other memory,
> > like an address in the DAX Window, to/from the vhost-user device then
> > virtqueue buffer address translation fails.
> > 
> > Dave added a code path that bounces such accesses through the QEMU
> > process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
> > be simpler, faster, and cleaner to do I/O in QEMU in the first place.
> > 
> > What I don't like about moving READ/WRITE into QEMU is that we need to
> > use even more virtqueues for multiqueue operation :).
> > 
> > Stefan
> 
> Thanks for your detailed explanation. If DAX is not good for small
> files, shall we just let users choose the I/O path according to their
> use cases?

The problem is how/when to decide, and where to keep a policy like that.
My understanding is that it's also tricky in the kernel to flip any one
file from DAX to non-DAX.

So without knowing the access patterns it's hard to choose.

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-12 Thread piaojun



On 2019/8/12 18:05, Stefan Hajnoczi wrote:
> On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
>> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
>>> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
 * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>to eliminate the bad address problem?

 Are you thinking of doing all read/writes that way, or just the corner
 cases? It doesn't seem worth it for the corner cases unless you're
 finding them cropping up in real work loads.
>>>
>>> Send all READ/WRITE requests to QEMU instead of virtiofsd.
>>>
>>> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
>>> MKDIR, etc).
>>>
>>
>> Sorry for not catching your point. I would prefer virtiofsd to do the
>> READ/WRITE requests and QEMU to handle the metadata requests, as
>> virtiofsd is good at processing dataplane work thanks to its thread
>> pool and CPU affinity (maybe in the future). As you said, virtiofsd is
>> just acting as a vhost-user device, which should care less about ctrl
>> requests.
>>
>> If our concern is improving mmap/write/read performance, why not add a
>> delayed worker for unmap, which could reduce the number of unmap
>> operations? Maybe virtiofsd could still handle both data and metadata
>> requests that way.
> 
> Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
> have access to guest RAM regions.  If a guest transfers other memory,
> like an address in the DAX Window, to/from the vhost-user device then
> virtqueue buffer address translation fails.
> 
> Dave added a code path that bounces such accesses through the QEMU
> process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
> be simpler, faster, and cleaner to do I/O in QEMU in the first place.
> 
> What I don't like about moving READ/WRITE into QEMU is that we need to
> use even more virtqueues for multiqueue operation :).
> 
> Stefan

Thanks for your detailed explanation. If DAX is not good for small
files, shall we just let users choose the I/O path according to their
use cases?




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-12 Thread Stefan Hajnoczi
On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> > On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> >> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> >>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> >>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >>>to eliminate the bad address problem?
> >>
> >> Are you thinking of doing all read/writes that way, or just the corner
> >> cases? It doesn't seem worth it for the corner cases unless you're
> >> finding them cropping up in real work loads.
> > 
> > Send all READ/WRITE requests to QEMU instead of virtiofsd.
> > 
> > Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> > MKDIR, etc).
> > 
> 
> Sorry for not catching your point. I would prefer virtiofsd to do the
> READ/WRITE requests and QEMU to handle the metadata requests, as
> virtiofsd is good at processing dataplane work thanks to its thread
> pool and CPU affinity (maybe in the future). As you said, virtiofsd is
> just acting as a vhost-user device, which should care less about ctrl
> requests.
>
> If our concern is improving mmap/write/read performance, why not add a
> delayed worker for unmap, which could reduce the number of unmap
> operations? Maybe virtiofsd could still handle both data and metadata
> requests that way.

Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
have access to guest RAM regions.  If a guest transfers other memory,
like an address in the DAX Window, to/from the vhost-user device then
virtqueue buffer address translation fails.

Dave added a code path that bounces such accesses through the QEMU
process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
be simpler, faster, and cleaner to do I/O in QEMU in the first place.

What I don't like about moving READ/WRITE into QEMU is that we need to
use even more virtqueues for multiqueue operation :).

Stefan




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-10 Thread piaojun



On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
>> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
>>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
>>> 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
>>
>> I think there's two things to solve here that I don't currently know the
>> answer to:
>>   2a) We'd need to get the fd to qemu for the thing to mmap;
>>   we might be able to cache the fd on the qemu side for existing
>>   mappings, so when asking for a new mapping for an existing file then
>>   it would already have the fd.
>>
>>   2b) Running a device with a mix of queues inside QEMU and on
>>   vhost-user; I don't think we have anything with that mix
> 
> vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
> and the rx/tx queues by the vhost device.  This is in fact how vhost was
> initially designed: the vhost device is not a full virtio device, only
> the dataplane.

Agreed.

> 
>>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>>>to eliminate the bad address problem?
>>
>> Are you thinking of doing all read/writes that way, or just the corner
>> cases? It doesn't seem worth it for the corner cases unless you're
>> finding them cropping up in real work loads.
> 
> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> 
> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> MKDIR, etc).
> 

Sorry for not catching your point. I would prefer virtiofsd to do the
READ/WRITE requests and QEMU to handle the metadata requests, as
virtiofsd is good at processing dataplane work thanks to its thread pool
and CPU affinity (maybe in the future). As you said, virtiofsd is just
acting as a vhost-user device, which should care less about ctrl
requests.

If our concern is improving mmap/write/read performance, why not add a
delayed worker for unmap, which could reduce the number of unmap
operations? Maybe virtiofsd could still handle both data and metadata
requests that way.
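
A minimal sketch of that delayed-unmap idea follows (purely illustrative,
not virtiofsd code; names like dax_unmap_later()/dax_do_unmap() are
hypothetical, and cancelling a queued unmap when the same range is mapped
again is omitted for brevity):

/*
 * REMOVEMAPPING work is queued and executed by a worker after a short
 * grace period; the real win would come from cancelling a queued unmap
 * when the same range is set up again (not shown).
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct pending_unmap {
    uint64_t moffset;            /* offset into the DAX window */
    uint64_t len;
    struct pending_unmap *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct pending_unmap *queue;

/* Placeholder: the real code would tear the range out of the DAX window. */
static void dax_do_unmap(uint64_t moffset, uint64_t len)
{
    (void)moffset;
    (void)len;
}

/* Called where REMOVEMAPPING is handled today: just enqueue the range. */
void dax_unmap_later(uint64_t moffset, uint64_t len)
{
    struct pending_unmap *p = malloc(sizeof(*p));

    p->moffset = moffset;
    p->len = len;
    pthread_mutex_lock(&lock);
    p->next = queue;
    queue = p;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Worker thread: wait for work, then batch-process unmaps after a delay. */
void *unmap_worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!queue) {
            pthread_cond_wait(&cond, &lock);
        }
        struct pending_unmap *batch = queue;
        queue = NULL;
        pthread_mutex_unlock(&lock);

        usleep(10 * 1000);       /* grace period before tearing down */
        while (batch) {
            struct pending_unmap *next = batch->next;
            dax_do_unmap(batch->moffset, batch->len);
            free(batch);
            batch = next;
        }
    }
    return NULL;
}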

Jun



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-10 Thread Liu Bo
On Thu, Aug 08, 2019 at 08:53:20AM -0400, Vivek Goyal wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > > (atleast from kernel's point of view).
> > > 
> > > Okay, there is still room to experiment with how MAP and UNMAP are
> > > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > > the bottleneck.
> > > 
> > > One possible optimization is to eliminate REMOVEMAPPING requests when
> > > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > > the following request pattern in a fio randread iodepth=64 job:
> > > 
> > >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, 
> > > moffset=859832320, flags=0)
> > >  unique: 995348, success, outsize: 16
> > >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > >  unique: 995350, success, outsize: 16
> > >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, 
> > > moffset=861929472, flags=0)
> > >  unique: 995352, success, outsize: 16
> > >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > >  unique: 995354, success, outsize: 16
> > >   virtio_send_msg: elem 9: with 1 in desc of length 16
> > >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, 
> > > moffset=864026624, flags=0)
> > >  unique: 995356, success, outsize: 16
> > >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > > 
> > > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > > of the old mapping instead of taking the extra step of removing it
> > > first.
> > 
> > Yep, those should go - I think Vivek likes to keep them for testing
> > since they make things fail more completely if there's a screwup.
> 
> I like to keep them because otherwise they keep the resources busy
> on host. If DAX range is being used immediately, then this optimization
> makes more sense. I will keep this in mind.
>

Other than the resources not being released, do you think there would be
any stale data problem if we don't do REMOVEMAPPING at all, with neither
background reclaim nor inline reclaim?
(truncate/punch_hole/evict_inode would still need to remove mappings,
though)

thanks,
-liubo

> > 
> > > Some more questions to consider for DAX performance optimization:
> > > 
> > > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> > 
> > Probably for cases where the data is only accessed once, and you can't
> > preemptively map.
> > Another variant on (1) is whether we could do read/writes while the mmap
> > is happening to absorb the latency.
> 
> For small random I/O, dax might not be very effective. Overhead of
> setting up mapping and tearing it down is significant.
> 
> Vivek
> 



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-10 Thread Liu Bo
On Fri, Aug 09, 2019 at 09:21:02AM +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
> > 
> > I think there's two things to solve here that I don't currently know the
> > answer to:
> >   2a) We'd need to get the fd to qemu for the thing to mmap;
> >   we might be able to cache the fd on the qemu side for existing
> >   mappings, so when asking for a new mapping for an existing file then
> >   it would already have the fd.
> > 
> >   2b) Running a device with a mix of queues inside QEMU and on
> >   vhost-user; I don't think we have anything with that mix
> 
> vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
> and the rx/tx queues by the vhost device.  This is in fact how vhost was
> initially designed: the vhost device is not a full virtio device, only
> the dataplane.

> > > 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> > >to eliminate the bad address problem?
> > 
> > Are you thinking of doing all read/writes that way, or just the corner
> > cases? It doesn't seem worth it for the corner cases unless you're
> > finding them cropping up in real work loads.
> 
> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> 
> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> MKDIR, etc).

For now QEMU is not aware of virtio-fs's fds, but I think that is
doable. I like the idea.

thanks,
-liubo
> 
> > > I'm not going to tackle DAX optimization myself right now but wanted to
> > > share these ideas.
> > 
> > One I was thinking about that feels easier than (2) was to change the
> > vhost slave protocol to be split transaction; it wouldn't do anything
> > for the latency but it would be able to do some in parallel if we can
> > get the kernel to feed it.
> 
> There are two cases:
> 1. mmapping multiple inodes.  This should benefit from parallelism,
>although mmap is still expensive because it involves TLB shootdown
>for all other threads running this process.
> 2. mmapping the same inode.  Here the host kernel is likely to serialize
>mmaps even more, making it hard to gain performance.
> 
> It's probably worth writing a tiny benchmark first to evaluate the
> potential gains.
> 
> Stefan






Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-09 Thread Stefan Hajnoczi
On Thu, Aug 08, 2019 at 08:53:20AM -0400, Vivek Goyal wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > > (atleast from kernel's point of view).
> > > 
> > > Okay, there is still room to experiment with how MAP and UNMAP are
> > > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > > the bottleneck.
> > > 
> > > One possible optimization is to eliminate REMOVEMAPPING requests when
> > > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > > the following request pattern in a fio randread iodepth=64 job:
> > > 
> > >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, 
> > > moffset=859832320, flags=0)
> > >  unique: 995348, success, outsize: 16
> > >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > >  unique: 995350, success, outsize: 16
> > >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, 
> > > moffset=861929472, flags=0)
> > >  unique: 995352, success, outsize: 16
> > >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > >  unique: 995354, success, outsize: 16
> > >   virtio_send_msg: elem 9: with 1 in desc of length 16
> > >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, 
> > > pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, 
> > > moffset=864026624, flags=0)
> > >  unique: 995356, success, outsize: 16
> > >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, 
> > > pid: 12
> > > 
> > > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > > of the old mapping instead of taking the extra step of removing it
> > > first.
> > 
> > Yep, those should go - I think Vivek likes to keep them for testing
> > since they make things fail more completely if there's a screwup.
> 
> I like to keep them because otherwise they keep the resources busy
> on host. If DAX range is being used immediately, then this optimization
> makes more sense. I will keep this in mind.

Skipping all unmaps has drawbacks, as you've said.  I'm just thinking
about the case where a mapping is replaced with a new one.

> > 
> > > Some more questions to consider for DAX performance optimization:
> > > 
> > > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> > 
> > Probably for cases where the data is only accessed once, and you can't
> > preemptively map.
> > Another variant on (1) is whether we could do read/writes while the mmap
> > is happening to absorb the latency.
> 
> For small random I/O, dax might not be very effective. Overhead of
> setting up mapping and tearing it down is significant.

Plus there is still an EPT violation and the host page cache needs to be
filled if we haven't prefetched it.  So I imagine FUSE_READ/FUSE_WRITE
will be faster than DAX here.  DAX will be better for repeated,
long-lived accesses.

Stefan




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-09 Thread Stefan Hajnoczi
On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
> 
> I think there's two things to solve here that I don't currently know the
> answer to:
>   2a) We'd need to get the fd to qemu for the thing to mmap;
>   we might be able to cache the fd on the qemu side for existing
>   mappings, so when asking for a new mapping for an existing file then
>   it would already have the fd.
> 
>   2b) Running a device with a mix of queues inside QEMU and on
>   vhost-user; I don't think we have anything with that mix

vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
and the rx/tx queues by the vhost device.  This is in fact how vhost was
initially designed: the vhost device is not a full virtio device, only
the dataplane.

> > 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >to eliminate the bad address problem?
> 
> Are you thinking of doing all read/writes that way, or just the corner
> cases? It doesn't seem worth it for the corner cases unless you're
> finding them cropping up in real work loads.

Send all READ/WRITE requests to QEMU instead of virtiofsd.

Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
MKDIR, etc).

> > I'm not going to tackle DAX optimization myself right now but wanted to
> > share these ideas.
> 
> One I was thinking about that feels easier than (2) was to change the
> vhost slave protocol to be split transaction; it wouldn't do anything
> for the latency but it would be able to do some in parallel if we can
> get the kernel to feed it.

There are two cases:
1. mmapping multiple inodes.  This should benefit from parallelism,
   although mmap is still expensive because it involves a TLB shootdown
   for all other threads of this process.
2. mmapping the same inode.  Here the host kernel is likely to serialize
   mmaps even more, making it hard to gain performance.

It's probably worth writing a tiny benchmark first to evaluate the
potential gains.
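
A tiny benchmark along those lines could look like the sketch below
(illustrative only): pass the same file N times for the same-inode case,
or N different files for the multi-inode case, and compare the wall
time.  Build with "cc -O2 -pthread bench.c".

/* Measures how well concurrent mmap()/munmap() cycles scale. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define ITERS      100000
#define MAPLEN     (2UL * 1024 * 1024)   /* 2 MiB, roughly one DAX chunk */
#define MAXTHREADS 64

static void *map_loop(void *arg)
{
    int fd = *(int *)arg;

    for (int i = 0; i < ITERS; i++) {
        void *p = mmap(NULL, MAPLEN, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        munmap(p, MAPLEN);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc - 1;
    pthread_t tids[MAXTHREADS];
    int fds[MAXTHREADS];
    struct timespec t0, t1;

    if (nthreads < 1 || nthreads > MAXTHREADS) {
        fprintf(stderr, "usage: %s FILE [FILE...]\n", argv[0]);
        return 1;
    }
    for (int i = 0; i < nthreads; i++) {
        fds[i] = open(argv[i + 1], O_RDONLY);
        if (fds[i] < 0) {
            perror("open");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        pthread_create(&tids[i], NULL, map_loop, &fds[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%ld map/unmap cycles in %.2f s\n", (long)nthreads * ITERS,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}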

Stefan




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-08 Thread Vivek Goyal
On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > (atleast from kernel's point of view).
> > 
> > Okay, there is still room to experiment with how MAP and UNMAP are
> > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > the bottleneck.
> > 
> > One possible optimization is to eliminate REMOVEMAPPING requests when
> > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > the following request pattern in a fio randread iodepth=64 job:
> > 
> >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> > 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, 
> > moffset=859832320, flags=0)
> >  unique: 995348, success, outsize: 16
> >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 
> > 12
> >  unique: 995350, success, outsize: 16
> >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> > 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, 
> > moffset=861929472, flags=0)
> >  unique: 995352, success, outsize: 16
> >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 
> > 12
> >  unique: 995354, success, outsize: 16
> >   virtio_send_msg: elem 9: with 1 in desc of length 16
> >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> > 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, 
> > moffset=864026624, flags=0)
> >  unique: 995356, success, outsize: 16
> >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 
> > 12
> > 
> > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > of the old mapping instead of taking the extra step of removing it
> > first.
> 
> Yep, those should go - I think Vivek likes to keep them for testing
> since they make things fail more completely if there's a screwup.

I like to keep them because otherwise the old mappings keep resources
busy on the host. If the DAX range is going to be reused immediately,
then this optimization makes more sense. I will keep this in mind.

> 
> > Some more questions to consider for DAX performance optimization:
> > 
> > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> 
> Probably for cases where the data is only accessed once, and you can't
> preemptively map.
> Another variant on (1) is whether we could do read/writes while the mmap
> is happening to absorb the latency.

For small random I/O, dax might not be very effective. Overhead of
setting up mapping and tearing it down is significant.

Vivek



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-08 Thread Stefan Hajnoczi
On Thu, Aug 08, 2019 at 04:10:00PM +0800, piaojun wrote:
> From my test, your patch set of multithreading improves iops greatly as
> below:

Thank you for sharing your results!

Stefan




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-08 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > (at least from the kernel's point of view).
> 
> Okay, there is still room to experiment with how MAP and UNMAP are
> handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> the bottleneck.
> 
> One possible optimization is to eliminate REMOVEMAPPING requests when
> the guest driver knows a SETUPMAPPING will follow immediately.  I see
> the following request pattern in a fio randread iodepth=64 job:
> 
>   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, 
> moffset=859832320, flags=0)
>  unique: 995348, success, outsize: 16
>   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
>  unique: 995350, success, outsize: 16
>   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, 
> moffset=861929472, flags=0)
>  unique: 995352, success, outsize: 16
>   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
>  unique: 995354, success, outsize: 16
>   virtio_send_msg: elem 9: with 1 in desc of length 16
>   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 
> 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, 
> moffset=864026624, flags=0)
>  unique: 995356, success, outsize: 16
>   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> 
> The REMOVEMAPPING requests are unnecessary since we can map over the top
> of the old mapping instead of taking the extra step of removing it
> first.

Yep, those should go - I think Vivek likes to keep them for testing
since they make things fail more completely if there's a screwup.

> Some more questions to consider for DAX performance optimization:
> 
> 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?

Probably for cases where the data is only accessed once, and you can't
preemptively map.
Another variant on (1) is whether we could do read/writes while the mmap
is happening to absorb the latency.

> 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?

I think there are two things to solve here that I don't currently know
the answer to:
  2a) We'd need to get the fd to qemu for the thing to mmap;
  we might be able to cache the fd on the qemu side for existing
  mappings, so when asking for a new mapping for an existing file then
  it would already have the fd (see the sketch after this list).

  2b) Running a device with a mix of queues inside QEMU and on
  vhost-user; I don't think we have anything with that mix
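
For (2a), a sketch of what such an fd cache on the QEMU side might look
like, using the GLib hash table QEMU already links against (illustrative
only, not existing QEMU code; the fd_cache_* names and the nodeid key are
hypothetical):

#include <glib.h>
#include <stdint.h>
#include <unistd.h>

static GHashTable *fd_cache;   /* guint64 nodeid -> int fd */

static void fd_cache_init(void)
{
    fd_cache = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                     g_free, NULL);
}

/* Return a cached fd for this inode, or -1 if QEMU has not seen it yet. */
static int fd_cache_lookup(uint64_t nodeid)
{
    gpointer val;

    if (g_hash_table_lookup_extended(fd_cache, &nodeid, NULL, &val)) {
        return GPOINTER_TO_INT(val);
    }
    return -1;
}

/* Remember the fd that came with the first mapping request for an inode. */
static void fd_cache_insert(uint64_t nodeid, int fd)
{
    guint64 *key = g_new(guint64, 1);

    *key = nodeid;
    g_hash_table_insert(fd_cache, key, GINT_TO_POINTER(fd));
}

/* Drop the entry (and close the fd) when the inode is forgotten. */
static void fd_cache_remove(uint64_t nodeid)
{
    gpointer val;

    if (g_hash_table_lookup_extended(fd_cache, &nodeid, NULL, &val)) {
        close(GPOINTER_TO_INT(val));
        g_hash_table_remove(fd_cache, &nodeid);
    }
}

Subsequent mappings for the same inode could then skip the fd transfer
entirely and only send the offsets.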
 
> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>to eliminate the bad address problem?

Are you thinking of doing all read/writes that way, or just the corner
cases? It doesn't seem worth it for the corner cases unless you're
finding them cropping up in real workloads.

> 4. Can OPEN+MAP be fused into a single request for small files, avoiding
>the 2nd request?

Sounds possible.

> I'm not going to tackle DAX optimization myself right now but wanted to
> share these ideas.

One I was thinking about that feels easier than (2) was to change the
vhost slave protocol to be split transaction; it wouldn't do anything
for the latency but it would be able to do some in parallel if we can
get the kernel to feed it.
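
Purely as an illustration of that split-transaction idea (these
structures are hypothetical and not part of the actual vhost-user
protocol), each slave-channel request would carry an id so replies can
complete out of order:

#include <stdint.h>

/* Hypothetical request on the slave channel, tagged with an id. */
struct fs_slave_req {
    uint64_t req_id;      /* echoed back in the reply */
    uint32_t op;          /* e.g. MAP, UNMAP, IO (illustrative values) */
    uint64_t fs_offset;   /* offset in the backing file */
    uint64_t dax_offset;  /* offset in the DAX window */
    uint64_t len;
};

/* Hypothetical reply; the daemon matches req_id against a table of
 * outstanding requests, so several MAP/UNMAPs can be in flight at once. */
struct fs_slave_reply {
    uint64_t req_id;
    int32_t  error;       /* 0 on success, -errno otherwise */
};

The latency of a single request stays the same, but the daemon no longer
has to wait for each reply before sending the next request.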

Dave

> Stefan




--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-08 Thread Stefan Hajnoczi
On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> multiple jobs operating on different inodes to see parallel MAP/UNMAP
> (at least from the kernel's point of view).

Okay, there is still room to experiment with how MAP and UNMAP are
handled by virtiofsd and QEMU even if the host kernel ultimately becomes
the bottleneck.

One possible optimization is to eliminate REMOVEMAPPING requests when
the guest driver knows a SETUPMAPPING will follow immediately.  I see
the following request pattern in a fio randread iodepth=64 job:

  unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, 
moffset=859832320, flags=0)
 unique: 995348, success, outsize: 16
  unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
 unique: 995350, success, outsize: 16
  unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, 
moffset=861929472, flags=0)
 unique: 995352, success, outsize: 16
  unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
 unique: 995354, success, outsize: 16
  virtio_send_msg: elem 9: with 1 in desc of length 16
  unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, 
moffset=864026624, flags=0)
 unique: 995356, success, outsize: 16
  unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12

The REMOVEMAPPING requests are unnecessary since we can map over the top
of the old mapping instead of taking the extra step of removing it
first.
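
A minimal sketch of that map-over-the-top approach (window_base and the
chunk size are illustrative): mmap() with MAP_FIXED atomically replaces
whatever mapping currently occupies that part of the DAX window, so no
separate munmap()/REMOVEMAPPING step is needed.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Map (or remap) one chunk of the DAX window to a new file offset. */
static int dax_map_chunk(void *window_base, size_t chunk_size, int fd,
                         off_t foffset, off_t moffset)
{
    void *dst = (char *)window_base + moffset;

    /* MAP_FIXED replaces any existing mapping at dst in a single step. */
    if (mmap(dst, chunk_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, foffset) == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    return 0;
}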

Some more questions to consider for DAX performance optimization:

1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
   to eliminate the bad address problem?
4. Can OPEN+MAP be fused into a single request for small files, avoiding
   the 2nd request?

I'm not going to tackle DAX optimization myself right now but wanted to
share these ideas.

Stefan




Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-08 Thread piaojun
Hi Stefan,

From my test, your multithreading patch set improves IOPS greatly, as
shown below:

Guest configuration:
8 vCPU
8GB RAM
Linux 5.1 (vivek-aug-06-2019)

Host configuration:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores x 4 threads)
32GB RAM
Linux 3.10.0
EXT4 + LVM + local HDD

---
Before:
# fio -direct=1 -time_based -iodepth=64 -rw=randread -ioengine=libaio -bs=4k 
-size=1G -numjob=1 -runtime=30 -group_reporting -name=file 
-filename=/mnt/virtiofs/file
Jobs: 1 (f=1): [r(1)] [100.0% done] [1177KB/0KB/0KB /s] [294/0/0 iops] [eta 
00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6037: Thu Aug  8 23:18:59 2019
  read : io=35148KB, bw=1169.9KB/s, iops=292, runt= 30045msec

After:
Jobs: 1 (f=1): [r(1)] [100.0% done] [6246KB/0KB/0KB /s] [1561/0/0 iops] [eta 
00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=5850: Thu Aug  8 23:21:22 2019
  read : io=191216KB, bw=6370.7KB/s, iops=1592, runt= 30015msec
---

But there is no IOPS improvement when I change from HDD to ramdisk. I
guess this is because a ramdisk completes requests so quickly that no
real I/O depth builds up.
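
For context, the behaviour being measured here -- each virtqueue handing
its requests to a pool of workers -- can be sketched with GLib's
GThreadPool roughly as follows (illustrative only, not the actual
patches; FVRequest and handle_request are placeholders):

#include <glib.h>

typedef struct FVRequest FVRequest;

static void handle_request(gpointer data, gpointer user_data)
{
    FVRequest *req = data;

    /* ... process the FUSE request, possibly blocking on host I/O ... */
    (void)req;
    (void)user_data;
}

/* Created once per virtqueue; 64 corresponds to --thread-pool-size=64. */
static GThreadPool *queue_pool_new(void)
{
    return g_thread_pool_new(handle_request, NULL, 64, FALSE, NULL);
}

/* Virtqueue thread: hand the request to the pool instead of processing
 * it inline, so one blocking request no longer stalls the whole queue. */
static void dispatch(GThreadPool *pool, FVRequest *req)
{
    g_thread_pool_push(pool, req, NULL);
}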

Thanks,
Jun

On 2019/8/8 2:03, Stefan Hajnoczi wrote:
> On Thu, Aug 01, 2019 at 05:54:05PM +0100, Stefan Hajnoczi wrote:
>> Performance
>> ---
>> Please try these patches out and share your results.
> 
> Here are the performance numbers:
> 
>   Threadpool size | iodepth=1 | iodepth=64
>   ----------------+-----------+-----------
>   None            |    4451   |     4876
>   1               |    4360   |     4858
>   64              |    4359   |   33,266
> 
> A graph is available here:
> https://vmsplice.net/~stefan/virtiofsd-threadpool-performance.png
> 
> Summary:
> 
>  * iodepth=64 performance is increased by 6.8 times.
>  * iodepth=1 performance degrades by 2%.
>  * DAX is bottlenecked by QEMU's single-threaded
>VHOST_USER_SLAVE_FS_MAP/UNMAP handler.
> 
> Threadpool size "none" is virtiofsd commit 813a824b707 ("virtiofsd: use
> fuse_lowlevel_is_virtio() in fuse_session_destroy()") without any of the
> multithreading preparation patches.  I benchmarked this to check whether
> the patches introduce a regression for iodepth=1.  They do, but it's
> only around 2%.
> 
> I also ran with DAX but found there was not much difference between
> iodepth=1 and iodepth=64.  This might be because the host mmap(2)
> syscall becomes the bottleneck and a serialization point.  QEMU only
> processes one VHOST_USER_SLAVE_FS_MAP/UNMAP at a time.  If we want to
> accelerate DAX it may be necessary to parallelize mmap, assuming the
> host kernel can do them in parallel on a single file.  This performance
> optimization is future work and not directly related to this patch
> series.
> 
> The following fio job was run with cache=none and no DAX:
> 
>   [global]
>   runtime=60
>   ramp_time=30
>   filename=/var/tmp/fio.dat
>   direct=1
>   rw=randread
>   bs=4k
>   size=4G
>   ioengine=libaio
>   iodepth=1
> 
>   [read]
> 
> Guest configuration:
> 1 vCPU
> 4 GB RAM
> Linux 5.1 (vivek-aug-06-2019)
> 
> Host configuration:
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz (2 cores x 2 threads)
> 8 GB RAM
> Linux 5.1.20-300.fc30.x86_64
> XFS + dm-thin + dm-crypt
> Toshiba THNSFJ256GDNU (256 GB SATA SSD)
> 
> Stefan
> 
> 
> 
> 



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-07 Thread Vivek Goyal
On Wed, Aug 07, 2019 at 07:03:55PM +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 01, 2019 at 05:54:05PM +0100, Stefan Hajnoczi wrote:
> > Performance
> > ---
> > Please try these patches out and share your results.
> 
> Here are the performance numbers:
> 
>   Threadpool size | iodepth=1 | iodepth=64
>   ----------------+-----------+-----------
>   None            |    4451   |     4876
>   1               |    4360   |     4858
>   64              |    4359   |   33,266
> 
> A graph is available here:
> https://vmsplice.net/~stefan/virtiofsd-threadpool-performance.png
> 
> Summary:
> 
>  * iodepth=64 performance is increased by 6.8 times.
>  * iodepth=1 performance degrades by 2%.
>  * DAX is bottlenecked by QEMU's single-threaded
>VHOST_USER_SLAVE_FS_MAP/UNMAP handler.
> 
> Threadpool size "none" is virtiofsd commit 813a824b707 ("virtiofsd: use
> fuse_lowlevel_is_virtio() in fuse_session_destroy()") without any of the
> multithreading preparation patches.  I benchmarked this to check whether
> the patches introduce a regression for iodepth=1.  They do, but it's
> only around 2%.
> 
> I also ran with DAX but found there was not much difference between
> iodepth=1 and iodepth=64.  This might be because the host mmap(2)
> syscall becomes the bottleneck and a serialization point.  QEMU only
> processes one VHOST_USER_SLAVE_FS_MAP/UNMAP at a time.  If we want to
> accelerate DAX it may be necessary to parallelize mmap, assuming the
> host kernel can do them in parallel on a single file.  This performance
> optimization is future work and not directly related to this patch
> series.

Good to see nice improvement with higher queue depth.

Kernel also serializes MAP/UNMAP on one inode. So you will need to run
multiple jobs operating on different inodes to see parallel MAP/UNMAP
(at least from the kernel's point of view).

Thanks
Vivek



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-05 Thread piaojun
Hi Stefan,

On 2019/8/5 16:01, Stefan Hajnoczi wrote:
> On Mon, Aug 05, 2019 at 10:52:21AM +0800, piaojun wrote:
>> # fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M 
>> -size=1G -numjob=1 -runtime=30 -group_reporting -name=file 
>> -filename=/mnt/9pshare/file
> 
> This benchmark configuration (--iodepth=1 --numjobs=1) cannot benefit
> from virtiofsd multithreading.  Have you measured a regression when this
> patch series is applied?
> 
> If your benchmark is unrelated to this patch series, let's discuss it in
> a separate email thread.  fuse_buf_copy() can be optimized to use
> pwritev().
> 
> Thanks,
> Stefan

Yes, I have not applied your patch series in that test. As you said, I
will discuss *pwritev* in a separate email thread.

Thanks,
Jun

> 



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-05 Thread Stefan Hajnoczi
On Mon, Aug 05, 2019 at 10:52:21AM +0800, piaojun wrote:
> # fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M 
> -size=1G -numjob=1 -runtime=30 -group_reporting -name=file 
> -filename=/mnt/9pshare/file

This benchmark configuration (--iodepth=1 --numjobs=1) cannot benefit
from virtiofsd multithreading.  Have you measured a regression when this
patch series is applied?

If your benchmark is unrelated to this patch series, let's discuss it in
a separate email thread.  fuse_buf_copy() can be optimized to use
pwritev().

Thanks,
Stefan



Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3

2019-08-04 Thread piaojun
Hi Stefan,

From my test, 9p has better bandwidth than virtio-fs, as shown below:

---
9p Test:
# mount -t 9p -o 
trans=virtio,version=9p2000.L,rw,nodev,msize=10,access=client 9pshare 
/mnt/9pshare

# fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M 
-size=1G -numjob=1 -runtime=30 -group_reporting -name=file 
-filename=/mnt/9pshare/file
file: (g=0): rw=randwrite, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
fio-2.13
Starting 1 process
file: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/1091MB/0KB /s] [0/1091/0 iops] [eta 
00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6187: Mon Aug  5 17:55:44 2019
  write: io=35279MB, bw=1175.1MB/s, iops=1175, runt= 30001msec
slat (usec): min=589, max=4211, avg=844.04, stdev=124.04
clat (usec): min=1, max=24, avg= 2.53, stdev= 1.16
 lat (usec): min=591, max=4214, avg=846.57, stdev=124.14
clat percentiles (usec):
 |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[2], 40.00th=[2], 50.00th=[2], 60.00th=[3],
 | 70.00th=[3], 80.00th=[3], 90.00th=[3], 95.00th=[3],
 | 99.00th=[4], 99.50th=[   13], 99.90th=[   18], 99.95th=[   20],
 | 99.99th=[   22]
lat (usec) : 2=0.04%, 4=98.27%, 10=1.15%, 20=0.48%, 50=0.06%
  cpu  : usr=23.83%, sys=5.24%, ctx=105843, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=35279/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=1
---

---
virtiofs Test:
# ./virtiofsd -o vhost_user_socket=/tmp/vhostqemu -o source=/mnt/virtiofs/ -o 
cache=none

# mount -t virtio_fs myfs /mnt/virtiofs -o rootmode=04,user_id=0,group_id=0

# fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M 
-size=1G -numjob=1 -runtime=30 -group_reporting -name=file 
-filename=/mnt/virtiofs/file
file: (g=0): rw=randwrite, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
fio-2.13
Starting 1 process
file: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/895.1MB/0KB /s] [0/895/0 iops] [eta 
00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6046: Mon Aug  5 17:54:58 2019
  write: io=23491MB, bw=801799KB/s, iops=783, runt= 30001msec
slat (usec): min=93, max=390, avg=233.40, stdev=64.22
clat (usec): min=849, max=4083, avg=1039.32, stdev=178.98
 lat (usec): min=971, max=4346, avg=1272.72, stdev=200.34
clat percentiles (usec):
 |  1.00th=[  972],  5.00th=[  980], 10.00th=[  988], 20.00th=[  988],
 | 30.00th=[  996], 40.00th=[ 1004], 50.00th=[ 1012], 60.00th=[ 1012],
 | 70.00th=[ 1020], 80.00th=[ 1032], 90.00th=[ 1032], 95.00th=[ 1384],
 | 99.00th=[ 1560], 99.50th=[ 1768], 99.90th=[ 3664], 99.95th=[ 4016],
 | 99.99th=[ 4048]
lat (usec) : 1000=37.21%
lat (msec) : 2=62.39%, 4=0.34%, 10=0.06%
  cpu  : usr=15.39%, sys=4.03%, ctx=23496, majf=0, minf=10
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=23491/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=1
---

The backend filesystem is ext4 on a ramdisk, and iostat shows that 9p
reaches a deeper queue depth than virtiofs. Checking the code, I found
that 9p uses pwritev while virtiofs uses pwrite. I wonder if virtiofs
could also use an iovec to improve its performance.

I'd like to help contribute the patch in the future.
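
A rough illustration of the pwritev() idea (the iovec layout is
hypothetical; in virtiofsd the change would land somewhere around
fuse_buf_copy()): one scatter-gather syscall writes all the pieces of a
request instead of copying the data into one contiguous buffer first.

#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Write the scattered pieces of one FUSE_WRITE request at offset off. */
static ssize_t write_request(int fd, struct iovec *iov, int iovcnt,
                             off_t off)
{
    ssize_t ret = pwritev(fd, iov, iovcnt, off);

    if (ret < 0) {
        perror("pwritev");
    }
    return ret;
}

Partial writes (ret shorter than the total iovec length) would still need
a retry loop, just as with pwrite().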

Thanks,
Jun

On 2019/8/2 0:54, Stefan Hajnoczi wrote:
> This patch series introduces the virtiofsd --thread-pool-size=NUM and sets the
> default value to 64.  Each virtqueue has its own thread pool for processing
> requests.  Blocking requests no longer pause virtqueue processing and I/O
> performance should be greatly improved when the queue depth is greater than 1.
> 
> Linux boot and pjdfstest have been tested with these patches and the default
> thread pool size of 64.
> 
> I have now concluded the thread-safety code audit.  Please let me know if you
> have concerns about things I missed.
> 
> Performance
> ---
> Please try these patches out and share your results.
> 
> Scalability
> ---
> There are several synchronization primitives that are taken by the virtqueue
> processing thread or the thread pool worker.  Unfortunately this is necessary
> to protect shared state.  It means that thread pool workers contend on or at
> least access thread synchronization primitives.  If anyone has suggestions for
> improving this situation, please discuss.
> 
> 1. vu_dispatch_rwlock -