Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-29 Thread Stefan Hajnoczi
On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> Virtiofs has its own queuing mechanism, but still requests are first queued
> on fiq->pending to be immediately dequeued and queued onto the virtio
> queue.
> 
> The queuing on fiq->pending is unnecessary and might even have some
> performance impact due to being a contention point.
> 
> Forget requests are handled similarly.
> 
> Move the queuing of requests and forgets into the fiq->ops->*.
> fuse_iqueue_ops are renamed to reflect the new semantics.
> 
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/fuse/dev.c   | 159 
>  fs/fuse/fuse_i.h|  19 ++
>  fs/fuse/virtio_fs.c |  41 
>  3 files changed, 106 insertions(+), 113 deletions(-)

This is a little scary but I can't think of a scenario where directly
dispatching requests to virtqueues is a problem.

Is there someone who can run single and multiqueue virtiofs performance
benchmarks?

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-29 Thread Jingbo Xu



On 5/29/24 11:52 PM, Miklos Szeredi wrote:
> Virtiofs has its own queuing mechanism, but still requests are first queued
> on fiq->pending to be immediately dequeued and queued onto the virtio
> queue.
> 
> The queuing on fiq->pending is unnecessary and might even have some
> performance impact due to being a contention point.
> 
> Forget requests are handled similarly.
> 
> Move the queuing of requests and forgets into the fiq->ops->*.
> fuse_iqueue_ops are renamed to reflect the new semantics.
> 
> Signed-off-by: Miklos Szeredi 
> ---

[...]

> +static void fuse_dev_queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
> +{
> +        spin_lock(&fiq->lock);
> +        if (list_empty(&req->intr_entry)) {
> +                list_add_tail(&req->intr_entry, &fiq->interrupts);
> +                /*
> +                 * Pairs with smp_mb() implied by test_and_set_bit()
> +                 * from fuse_request_end().
> +                 */
> +                smp_mb();
> +                if (test_bit(FR_FINISHED, &req->flags)) {
> +                        list_del_init(&req->intr_entry);
> +                        spin_unlock(&fiq->lock);
                                                ^
missing "return" here?
> +                }
> +                fuse_dev_wake_and_unlock(fiq);
> +        } else {
> +                spin_unlock(&fiq->lock);
> +        }
> +}
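
i.e. I would expect that branch to end up looking something like this
(untested, just to illustrate the suggested early return):

                if (test_bit(FR_FINISHED, &req->flags)) {
                        list_del_init(&req->intr_entry);
                        spin_unlock(&fiq->lock);
                        /* request already finished, nothing to wake up */
                        return;
                }
                fuse_dev_wake_and_unlock(fiq);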

[...]

>  static void fuse_adjust_compat(struct fuse_conn *fc, struct fuse_args *args)
> @@ -581,7 +605,6 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>  {
>          struct fuse_req *req;
>          struct fuse_iqueue *fiq = &fm->fc->iq;
> -        int err = 0;
>  
>          req = fuse_get_req(fm, false);
>          if (IS_ERR(req))
> @@ -592,16 +615,9 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>  
>          fuse_args_to_req(req, args);
>  
> -        spin_lock(&fiq->lock);
> -        if (fiq->connected) {
> -                queue_request_and_unlock(fiq, req);
> -        } else {
> -                err = -ENODEV;
> -                spin_unlock(&fiq->lock);
> -                fuse_put_request(req);
> -        }
> +        fuse_send_one(fiq, req);
>  
> -        return err;
> +        return 0;
>  }

There's a minor behavior change visible to users.  Prior to the patch,
FUSE_NOTIFY_RETRIEVE would return -ENODEV when the connection is
aborted, but now it returns 0.

It seems only example/notify_store_retrieve.c uses FUSE_NOTIFY_RETRIEVE
in libfuse, so I'm not sure whether this change really matters.

Maybe we could check req->out.h.error after fuse_send_one() returns?


-- 
Thanks,
Jingbo



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-30 Thread Miklos Szeredi
On Thu, 30 May 2024 at 05:20, Jingbo Xu  wrote:

> > + if (test_bit(FR_FINISHED, &req->flags)) {
> > + list_del_init(&req->intr_entry);
> > + spin_unlock(&fiq->lock);
>                          ^
> missing "return" here?

Well spotted.  Thanks.

> > - err = -ENODEV;
> > - spin_unlock(&fiq->lock);
> > - fuse_put_request(req);
> > - }
> > + fuse_send_one(fiq, req);
> >
> > - return err;
> > + return 0;
> >  }
>
> There's a minor behavior change visible to users.  Prior to the patch,
> FUSE_NOTIFY_RETRIEVE would return -ENODEV when the connection is
> aborted, but now it returns 0.
>
> It seems only example/notify_store_retrieve.c uses FUSE_NOTIFY_RETRIEVE
> in libfuse, so I'm not sure whether this change really matters.

It will return -ENOTCONN from  fuse_simple_notify_reply() ->
fuse_get_req().  The -ENODEV would be a very short transient error
during the abort, so it doesn't matter.
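
For reference, the connection check sits right at the start of request
allocation, roughly like this (paraphrased from fuse_get_req(), not the
exact code):

        err = -ENOTCONN;
        if (!fc->connected)
                goto out;       /* surfaces as ERR_PTR(-ENOTCONN) to the caller */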

Thanks,
Miklos



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-30 Thread Miklos Szeredi
On Wed, 29 May 2024 at 20:32, Stefan Hajnoczi  wrote:

> This is a little scary but I can't think of a scenario where directly
> dispatching requests to virtqueues is a problem.

It would be scary if it wasn't a simple transformation:

insert element onto an empty list, remove the only element from the list -> do nothing
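
In pseudo-code, the virtiofs request path goes from roughly this
(simplified, not the exact patch):

        /* before: queue on fiq->pending, wake the transport, which
         * immediately dequeues that single entry again */
        spin_lock(&fiq->lock);
        list_add_tail(&req->list, &fiq->pending);
        fiq->ops->wake_pending_and_unlock(fiq);

to this:

        /* after: hand the request straight to the transport callback,
         * which puts it on a virtqueue; the /dev/fuse callback now does
         * the fiq->pending queuing itself */
        fiq->ops->send_req(fiq, req);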

Thanks,
Miklos



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-30 Thread Peter-Jan Gootzen
On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > Virtiofs has its own queuing mechanism, but still requests are first
> > queued
> > on fiq->pending to be immediately dequeued and queued onto the
> > virtio
> > queue.
> > 
> > The queuing on fiq->pending is unnecessary and might even have some
> > performance impact due to being a contention point.
> > 
> > Forget requests are handled similarly.
> > 
> > Move the queuing of requests and forgets into the fiq->ops->*.
> > fuse_iqueue_ops are renamed to reflect the new semantics.
> > 
> > Signed-off-by: Miklos Szeredi 
> > ---
> >  fs/fuse/dev.c   | 159 -
> > ---
> >  fs/fuse/fuse_i.h    |  19 ++
> >  fs/fuse/virtio_fs.c |  41 
> >  3 files changed, 106 insertions(+), 113 deletions(-)
> 
> This is a little scary but I can't think of a scenario where directly
> dispatching requests to virtqueues is a problem.
> 
> Is there someone who can run single and multiqueue virtiofs
> performance
> benchmarks?

Yes, we can provide that with our BlueField DPU setup. I will review,
test and run some experiments on the patch, and get back to you all
with some numbers.

> 
> Reviewed-by: Stefan Hajnoczi 

- Peter-Jan



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-30 Thread Jingbo Xu



On 5/30/24 5:00 PM, Miklos Szeredi wrote:
> On Thu, 30 May 2024 at 05:20, Jingbo Xu  wrote:
> 
>>> + if (test_bit(FR_FINISHED, &req->flags)) {
>>> + list_del_init(&req->intr_entry);
>>> + spin_unlock(&fiq->lock);
>>                         ^
>> missing "return" here?
> 
> Well spotted.  Thanks.
> 
>>> - err = -ENODEV;
>>> - spin_unlock(&fiq->lock);
>>> - fuse_put_request(req);
>>> - }
>>> + fuse_send_one(fiq, req);
>>>
>>> - return err;
>>> + return 0;
>>>  }
>>
>> There's a minor behavior change visible to users.  Prior to the patch,
>> FUSE_NOTIFY_RETRIEVE would return -ENODEV when the connection is
>> aborted, but now it returns 0.
>>
>> It seems only example/notify_store_retrieve.c uses FUSE_NOTIFY_RETRIEVE
>> in libfuse, so I'm not sure whether this change really matters.
> 
> It will return -ENOTCONN from  fuse_simple_notify_reply() ->
> fuse_get_req().  The -ENODEV would be a very short transient error
> during the abort, so it doesn't matter.

Okay, fair enough.  Feel free to add:

Reviewed-by: Jingbo Xu 


-- 
Thanks,
Jingbo



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-05-30 Thread Bernd Schubert



On 5/30/24 11:00, Miklos Szeredi wrote:
> On Thu, 30 May 2024 at 05:20, Jingbo Xu  wrote:
> 
>>> + if (test_bit(FR_FINISHED, &req->flags)) {
>>> + list_del_init(&req->intr_entry);
>>> + spin_unlock(&fiq->lock);
>>                         ^
>> missing "return" here?
> 
> Well spotted.  Thanks.
> 
>>> - err = -ENODEV;
>>> - spin_unlock(&fiq->lock);
>>> - fuse_put_request(req);
>>> - }
>>> + fuse_send_one(fiq, req);
>>>
>>> - return err;
>>> + return 0;
>>>  }
>>
>> There's a minor behavior change visible to users.  Prior to the patch,
>> FUSE_NOTIFY_RETRIEVE would return -ENODEV when the connection is
>> aborted, but now it returns 0.
>>
>> It seems only example/notify_store_retrieve.c uses FUSE_NOTIFY_RETRIEVE
>> in libfuse, so I'm not sure whether this change really matters.
> 
> It will return -ENOTCONN from  fuse_simple_notify_reply() ->
> fuse_get_req().  The -ENODEV would be a very short transient error
> during the abort, so it doesn't matter.

example/notify_store_retrieve and example/notify_inval_inode actually
get EBADF on umount. The issue is that FUSE_DESTROY is only sent very
late, when the inodes have already been released. If the server side
sends a notification at the wrong moment, it gets EBADF. We would need
something like a FUSE_PREPARE_DESTROY that is sent when umount
processing starts, to stop notifications. I'm not sure whether the VFS
provides such a hook and haven't had time yet to check.


Thanks,
Bernd



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-06-05 Thread Peter-Jan Gootzen
On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > Virtiofs has its own queuing mechanism, but still requests are first
> > queued
> > on fiq->pending to be immediately dequeued and queued onto the
> > virtio
> > queue.
> > 
> > The queuing on fiq->pending is unnecessary and might even have some
> > performance impact due to being a contention point.
> > 
> > Forget requests are handled similarly.
> > 
> > Move the queuing of requests and forgets into the fiq->ops->*.
> > fuse_iqueue_ops are renamed to reflect the new semantics.
> > 
> > Signed-off-by: Miklos Szeredi 
> > ---
> >  fs/fuse/dev.c   | 159 -
> > ---
> >  fs/fuse/fuse_i.h    |  19 ++
> >  fs/fuse/virtio_fs.c |  41 
> >  3 files changed, 106 insertions(+), 113 deletions(-)
> 
> This is a little scary but I can't think of a scenario where directly
> dispatching requests to virtqueues is a problem.
> 
> Is there someone who can run single and multiqueue virtiofs
> performance
> benchmarks?
> 
> Reviewed-by: Stefan Hajnoczi 

I ran some tests and experiments on the patch (on top of v6.10-rc2) with
our multi-queue capable virtio-fs device. No issues were found.

Experimental system setup (which is not the fastest possible setup nor
the most optimized setup!):
# Host:
   - Dell PowerEdge R7525
   - CPU: 2x AMD EPYC 7413 24-Core
   - VM: QEMU KVM with 24 cores, vCPUs locked to the NUMA nodes on which
the DPU is attached. VFIO-pci device to passthrough the DPU.   
Running a default x86_64 ext4 buildroot with fio installed.
# Virtio-fs device:
   - BlueField-3 DPU
   - CPU: ARM Cortex-A78AE, 16 cores
   - One thread per queue, each busy polling on one request queue
   - Each queue is 1024 descriptors deep
# Workload (deviations are specified in the table):
   - fio 3.34
   - sequential read
   - ioengine=io_uring, single 4GiB file, iodepth=128, bs=256KiB,
runtime=30s, ramp_time=10s, direct=1
   - T is the number of threads (numjobs=T with thread=1)
   - Q is the number of request queues

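Roughly, that corresponds to a fio job file like the one below (an
illustrative reconstruction from the parameters above, not the exact job
file; filename, time_based and group_reporting are assumptions):

        ; illustrative reconstruction, not the exact job file used
        [seqread]
        ioengine=io_uring
        rw=read
        ; assumed path of the single 4GiB test file on the virtiofs mount
        filename=/mnt/virtiofs/testfile
        size=4g
        bs=256k
        iodepth=128
        direct=1
        thread=1
        ; numjobs is T, varied per row of the table below
        numjobs=16
        runtime=30s
        ramp_time=10s
        time_based=1
        group_reporting=1
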
| Workload   | Before patch | After patch |
| -- |  | --- |
| T=1 Q=1| 9216MiB/s| 9201MiB/s   |
| T=2 Q=2| 10.8GiB/s| 10.7GiB/s   |
| T=4 Q=4| 12.6GiB/s| 12.2GiB/s   |
| T=8 Q=8| 19.5GiB/s| 19.7GiB/s   |
| T=16 Q=1   | 9451MiB/s| 9558MiB/s   |
| T=16 Q=2   | 13.5GiB/s| 13.4GiB/s   |
| T=16 Q=4   | 11.8GiB/s| 11.4GiB/s   |
| T=16 Q=8   | 11.1GiB/s| 10.8GiB/s   |
| T=24 Q=24  | 26.5GiB/s| 26.5GiB/s   |
| T=24 Q=24 24 files | 26.5GiB/s| 26.6GiB/s   |
| T=24 Q=24 4k   | 948MiB/s | 955MiB/s|

Averaging out those results, the difference is within a reasonable
margin of error (less than 1%), so in this setup's case we see no
difference in performance.
However, if the virtio-fs device were more optimized, e.g. if it didn't
copy the data into its own memory, then the bottleneck could shift to
the driver side and this patch could show some benefit at those higher
message rates.

So although I would have hoped for some performance increase already
with this setup, I still think this is a good patch and a logical
optimization for high performance virtio-fs devices that might show a
benefit in the future.

Tested-by: Peter-Jan Gootzen 
Reviewed-by: Peter-Jan Gootzen 



Re: [PATCH] fuse: cleanup request queuing towards virtiofs

2024-06-05 Thread Stefan Hajnoczi
On Wed, Jun 05, 2024 at 10:40:44AM +, Peter-Jan Gootzen wrote:
> On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> > On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > > Virtiofs has its own queuing mechanism, but still requests are first
> > > queued
> > > on fiq->pending to be immediately dequeued and queued onto the
> > > virtio
> > > queue.
> > > 
> > > The queuing on fiq->pending is unnecessary and might even have some
> > > performance impact due to being a contention point.
> > > 
> > > Forget requests are handled similarly.
> > > 
> > > Move the queuing of requests and forgets into the fiq->ops->*.
> > > fuse_iqueue_ops are renamed to reflect the new semantics.
> > > 
> > > Signed-off-by: Miklos Szeredi 
> > > ---
> > >  fs/fuse/dev.c   | 159 -
> > > ---
> > >  fs/fuse/fuse_i.h    |  19 ++
> > >  fs/fuse/virtio_fs.c |  41 
> > >  3 files changed, 106 insertions(+), 113 deletions(-)
> > 
> > This is a little scary but I can't think of a scenario where directly
> > dispatching requests to virtqueues is a problem.
> > 
> > Is there someone who can run single and multiqueue virtiofs
> > performance
> > benchmarks?
> > 
> > Reviewed-by: Stefan Hajnoczi 
> 
> I ran some tests and experiments on the patch (on top of v6.10-rc2) with
> our multi-queue capable virtio-fs device. No issues were found.
> 
> Experimental system setup (which is not the fastest possible setup nor
> the most optimized setup!):
> # Host:
>- Dell PowerEdge R7525
>- CPU: 2x AMD EPYC 7413 24-Core
>- VM: QEMU KVM with 24 cores, vCPUs locked to the NUMA nodes on which
> the DPU is attached. VFIO-pci device to passthrough the DPU.   
> Running a default x86_64 ext4 buildroot with fio installed.
> # Virtio-fs device:
>- BlueField-3 DPU
>- CPU: ARM Cortex-A78AE, 16 cores
>- One thread per queue, each busy polling on one request queue
>- Each queue is 1024 descriptors deep
> # Workload (deviations are specified in the table):
>- fio 3.34
>- sequential read
>- ioengine=io_uring, single 4GiB file, iodepth=128, bs=256KiB,
> runtime=30s, ramp_time=10s, direct=1
>- T is the number of threads (numjobs=T with thread=1)
>- Q is the number of request queues
> 
> | Workload   | Before patch | After patch |
> | -- |  | --- |
> | T=1 Q=1| 9216MiB/s| 9201MiB/s   |
> | T=2 Q=2| 10.8GiB/s| 10.7GiB/s   |
> | T=4 Q=4| 12.6GiB/s| 12.2GiB/s   |
> | T=8 Q=8| 19.5GiB/s| 19.7GiB/s   |
> | T=16 Q=1   | 9451MiB/s| 9558MiB/s   |
> | T=16 Q=2   | 13.5GiB/s| 13.4GiB/s   |
> | T=16 Q=4   | 11.8GiB/s| 11.4GiB/s   |
> | T=16 Q=8   | 11.1GiB/s| 10.8GiB/s   |
> | T=24 Q=24  | 26.5GiB/s| 26.5GiB/s   |
> | T=24 Q=24 24 files | 26.5GiB/s| 26.6GiB/s   |
> | T=24 Q=24 4k   | 948MiB/s | 955MiB/s|
> 
> Averaging out those results, the difference is within a reasonable
> margin of error (less than 1%), so in this setup's case we see no
> difference in performance.
> However, if the virtio-fs device were more optimized, e.g. if it didn't
> copy the data into its own memory, then the bottleneck could shift to
> the driver side and this patch could show some benefit at those higher
> message rates.
> 
> So although I would have hoped for some performance increase already
> with this setup, I still think this is a good patch and a logical
> optimization for high performance virtio-fs devices that might show a
> benefit in the future.
> 
> Tested-by: Peter-Jan Gootzen 
> Reviewed-by: Peter-Jan Gootzen 

Thank you!

Stefan


