[PATCH V4] virtio-fs: Improved request latencies when Virtio queue is full

2023-07-03 Thread Peter-Jan Gootzen via Virtualization
When the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
40us baseline without going to a remote server (4k, QD=1).
This patch queues requests when the Virtio queue is full,
and when a completed request is taken off, immediately fills
it back up with queued requests.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
queue depth 2x the size of the Virtio queue size, with a
DPU-powered virtio-fs device.

Signed-off-by: Peter-Jan Gootzen 
---
V4: Removed return value on error changes to simplify patch,
that should be changed in another patch.
V3: Fixed requests falling into the void when -ENOMEM and no new
incoming requests. Virtio-fs now always lets -ENOMEM bubble up to
userspace. Also made queue full condition more explicit with
-ENOSPC in `send_forget_request`.
V2: Not scheduling dispatch work anymore when not needed
and changed delayed_work structs to work_struct structs

 fs/fuse/virtio_fs.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4d8d4f16c727..a676297db09b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -45,7 +45,7 @@ struct virtio_fs_vq {
struct work_struct done_work;
struct list_head queued_reqs;
struct list_head end_reqs;  /* End these requests */
-   struct delayed_work dispatch_work;
+   struct work_struct dispatch_work;
struct fuse_dev *fud;
bool connected;
long in_flight;
@@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
}
 
flush_work(&fsvq->done_work);
-   flush_delayed_work(&fsvq->dispatch_work);
+   flush_work(&fsvq->dispatch_work);
 }
 
 static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
@@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
dec_in_flight_req(fsvq);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
spin_unlock(&fsvq->lock);
 }
 
@@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
 {
struct fuse_req *req;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
int ret;
 
pr_debug("virtio-fs: worker %s called.\n", __func__);
@@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
@@ -436,8 +437,6 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n",
 ret);
list_add_tail(&forget->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
if (!in_flight)
inc_in_flight_req(fsvq);
/* Queue is full */
@@ -469,7 +468,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
 {
struct virtio_fs_forget *forget;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
pr_debug("virtio-fs: worker %s called.\n", __func__);
while (1) {
spin_lock(&fsvq->lock);
@@ -647,6 +646,11 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
virtio_fs_request_complete(req, fsvq);
}
}
+
+   spin_lock(&fsvq->lock);
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
+   spin_unlock(&fsvq->lock);
 }
 
 /* Virtqueue interrupt handler */
@@ -670,12 +674,12 @@ static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
 
if (vq_type == VQ_REQUEST) {
INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(&fsvq->dispatch_work,
- 
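
In short, the hunks above make the completion paths re-kick the dispatch
worker whenever requests are still parked on queued_reqs, so a full
virtqueue is refilled as soon as a slot frees up rather than after a 1ms
delayed retry. A condensed sketch of the requests-queue completion handler
after the change (illustration only, not a literal hunk):

	/* Sketch: completion handler re-kicks the dispatch worker. */
	static void virtio_fs_requests_done_work(struct work_struct *work)
	{
		struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
							 done_work);

		/* ... reap completed requests from the virtqueue as before ... */

		spin_lock(&fsvq->lock);
		if (!list_empty(&fsvq->queued_reqs))
			schedule_work(&fsvq->dispatch_work);
		spin_unlock(&fsvq->lock);
	}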

[PATCH V3] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-02 Thread Peter-Jan Gootzen via Virtualization
When the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
40us baseline without going to a remote server (4k, QD=1).
This patch queues requests when the Virtio queue is full,
and when a completed request is taken off, immediately fills
it back up with queued requests.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
queue depth 2x the size of the Virtio queue size, with a
DPU-powered virtio-fs device.

Furthermore, the virtio-fs driver now also always lets -ENOMEM
errors go to userspace instead of retrying the request in the
driver.

Signed-off-by: Peter-Jan Gootzen 
---
V3: Fixed requests falling into the void when -ENOMEM and no new
incoming requests. Virtio-fs now always lets -ENOMEM bubble up to
userspace. Also made queue full condition more explicit with
-ENOSPC in `send_forget_request`.
V2: Not scheduling dispatch work anymore when not needed
and changed delayed_work structs to work_struct structs

 fs/fuse/virtio_fs.c | 46 ++---
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4d8d4f16c727..3a3231ddb9e7 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -45,7 +45,7 @@ struct virtio_fs_vq {
struct work_struct done_work;
struct list_head queued_reqs;
struct list_head end_reqs;  /* End these requests */
-   struct delayed_work dispatch_work;
+   struct work_struct dispatch_work;
struct fuse_dev *fud;
bool connected;
long in_flight;
@@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
}
 
flush_work(&fsvq->done_work);
-   flush_delayed_work(&fsvq->dispatch_work);
+   flush_work(&fsvq->dispatch_work);
 }
 
 static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
@@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
dec_in_flight_req(fsvq);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
spin_unlock(&fsvq->lock);
 }
 
@@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
 {
struct fuse_req *req;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
int ret;
 
pr_debug("virtio-fs: worker %s called.\n", __func__);
@@ -385,11 +388,9 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
 
ret = virtio_fs_enqueue_req(fsvq, req, true);
if (ret < 0) {
-   if (ret == -ENOMEM || ret == -ENOSPC) {
+   if (ret == -ENOSPC) {
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
@@ -405,8 +406,8 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
 }
 
 /*
- * Returns 1 if queue is full and sender should wait a bit before sending
- * next request, 0 otherwise.
+ * Returns 0 if request has been successfully sent, otherwise -ENOSPC
+ * when the queue is full.
  */
 static int send_forget_request(struct virtio_fs_vq *fsvq,
   struct virtio_fs_forget *forget,
@@ -432,16 +433,12 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
 
ret = virtqueue_add_outbuf(vq, &sg, 1, forget, GFP_ATOMIC);
if (ret < 0) {
-   if (ret == -ENOMEM || ret == -ENOSPC) {
+   if (ret == -ENOSPC) {
pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n",
 ret);
list_add_tail(&forget->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
if (!in_flight)
inc_in_flight_req(fsvq);
-   /* Queue is full */
-   ret = 1;
} else {
pr_debug("virtio-fs: Could not queue FORGET: err=%d. 
Dropping it.\n",
 ret);
@@ -469,7 +466,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
 {
struct virtio_fs_forget 

Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-02 Thread Peter-Jan Gootzen via Virtualization
On 01/06/2023 20:45, Vivek Goyal wrote:
> On Thu, Jun 01, 2023 at 10:08:50AM -0400, Stefan Hajnoczi wrote:
>> On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
>>> On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
 On 31/05/2023 21:18, Vivek Goyal wrote:
> On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
>> When the Virtio queue is full, a work item is scheduled
>> to execute in 1ms that retries adding the request to the queue.
>> This is a large amount of time on the scale on which a
>> virtio-fs device can operate. When using a DPU this is around
>> 40us baseline without going to a remote server (4k, QD=1).
>> This patch queues requests when the Virtio queue is full,
>> and when a completed request is taken off, immediately fills
>> it back up with queued requests.
>>
>> This reduces the 99.9th percentile latencies in our tests by
>> 60x and slightly increases the overall throughput, when using a
>> queue depth 2x the size of the Virtio queue size, with a
>> DPU-powered virtio-fs device.
>>
>> Signed-off-by: Peter-Jan Gootzen 
>> ---
>> V1 -> V2: Not scheduling dispatch work anymore when not needed
>> and changed delayed_work structs to work_struct structs
>>
>>  fs/fuse/virtio_fs.c | 32 +---
>>  1 file changed, 17 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 4d8d4f16c727..a676297db09b 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>>  struct work_struct done_work;
>>  struct list_head queued_reqs;
>>  struct list_head end_reqs;  /* End these requests */
>> -struct delayed_work dispatch_work;
>> +struct work_struct dispatch_work;
>>  struct fuse_dev *fud;
>>  bool connected;
>>  long in_flight;
>> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
>>  }
>>  
>>  flush_work(&fsvq->done_work);
>> -flush_delayed_work(&fsvq->dispatch_work);
>> +flush_work(&fsvq->dispatch_work);
>>  }
>>  
>>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
>> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
>>  dec_in_flight_req(fsvq);
>>  }
>>  } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
>> +
>> +if (!list_empty(&fsvq->queued_reqs))
>> +schedule_work(&fsvq->dispatch_work);
>>  spin_unlock(&fsvq->lock);
>>  }
>>  
>> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>>  {
>>  struct fuse_req *req;
>>  struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
>> - dispatch_work.work);
>> + dispatch_work);
>>  int ret;
>>  
>>  pr_debug("virtio-fs: worker %s called.\n", __func__);
>> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>>  if (ret == -ENOMEM || ret == -ENOSPC) {
>>  spin_lock(&fsvq->lock);
>>  list_add_tail(&req->list, &fsvq->queued_reqs);
>> -schedule_delayed_work(&fsvq->dispatch_work,
>> -  msecs_to_jiffies(1));
>
> Virtqueue being full is only one of the reasons for failure to queue
> the request. What if virtqueue is empty but we could not queue the
> request because lack of memory (-ENOMEM). In that case we will queue
> the request and it might not be dispatched because there is no completion.
> (Assume there is no further new request coming). That means deadlock?
>
> Thanks
> Vivek
>

 Good catch that will deadlock.

 Is default kernel behavior to indefinitely retry a file system
 request until memory is available?
>>>
>>> As of now that seems to be the behavior. I think I had copied this
>>> code from another driver. 
>>>
>>> But I guess one can argue that if memory is not available, then
>>> return -ENOMEM to user space instead of retrying in kernel.
>>>
>>> Stefan, Miklos, WDYT?
>>
>> My understanding is that file system syscalls may return ENOMEM, so this
>> is okay.
> 
> Ok. Fair enough. Thanks.
> 
> One more question. How do we know the virtqueue is full? Is -ENOSPC the
> correct error code to check and retry indefinitely? Are there other
> situations where -ENOSPC can be returned. Peter's 

Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-06-01 Thread Peter-Jan Gootzen via Virtualization
On 01/06/2023 16:08, Stefan Hajnoczi wrote:
> On Wed, May 31, 2023 at 04:49:39PM -0400, Vivek Goyal wrote:
>> On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
>>> On 31/05/2023 21:18, Vivek Goyal wrote:
 On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> When the Virtio queue is full, a work item is scheduled
> to execute in 1ms that retries adding the request to the queue.
> This is a large amount of time on the scale on which a
> virtio-fs device can operate. When using a DPU this is around
> 40us baseline without going to a remote server (4k, QD=1).
> This patch queues requests when the Virtio queue is full,
> and when a completed request is taken off, immediately fills
> it back up with queued requests.
>
> This reduces the 99.9th percentile latencies in our tests by
> 60x and slightly increases the overall throughput, when using a
> queue depth 2x the size of the Virtio queue size, with a
> DPU-powered virtio-fs device.
>
> Signed-off-by: Peter-Jan Gootzen 
> ---
> V1 -> V2: Not scheduling dispatch work anymore when not needed
> and changed delayed_work structs to work_struct structs
>
>  fs/fuse/virtio_fs.c | 32 +---
>  1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4d8d4f16c727..a676297db09b 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>   struct work_struct done_work;
>   struct list_head queued_reqs;
>   struct list_head end_reqs;  /* End these requests */
> - struct delayed_work dispatch_work;
> + struct work_struct dispatch_work;
>   struct fuse_dev *fud;
>   bool connected;
>   long in_flight;
> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
>   }
>  
>   flush_work(&fsvq->done_work);
> - flush_delayed_work(&fsvq->dispatch_work);
> + flush_work(&fsvq->dispatch_work);
>  }
>  
>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
>   dec_in_flight_req(fsvq);
>   }
>   } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> +
> + if (!list_empty(&fsvq->queued_reqs))
> + schedule_work(&fsvq->dispatch_work);
>   spin_unlock(&fsvq->lock);
>  }
>  
> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>  {
>   struct fuse_req *req;
>   struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
> -  dispatch_work.work);
> +  dispatch_work);
>   int ret;
>  
>   pr_debug("virtio-fs: worker %s called.\n", __func__);
> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   spin_lock(&fsvq->lock);
>   list_add_tail(&req->list, &fsvq->queued_reqs);
> - schedule_delayed_work(&fsvq->dispatch_work,
> -   msecs_to_jiffies(1));

 Virtqueue being full is only one of the reasons for failure to queue
 the request. What if virtqueue is empty but we could not queue the
 request because lack of memory (-ENOMEM). In that case we will queue
 the request and it might not be dispatched because there is no completion.
 (Assume there is no further new request coming). That means deadlock?

 Thanks
 Vivek

>>>
>>> Good catch that will deadlock.
>>>
>>> Is default kernel behavior to indefinitely retry a file system
>>> request until memory is available?
>>
>> As of now that seems to be the behavior. I think I had copied this
>> code from another driver. 
>>
>> But I guess one can argue that if memory is not available, then
>> return -ENOMEM to user space instead of retrying in kernel.
>>
>> Stefan, Miklos, WDYT?
> 
> My understanding is that file system syscalls may return ENOMEM, so this
> is okay.
> 
> Stefan

Then I propose only handling -ENOSPC as a special case and letting all
other errors go through to userspace.
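
Concretely, a minimal sketch of that handling in the dispatch worker (this
mirrors what the V3 patch above in this archive ends up doing; the error
path follows the existing code and is shown for illustration only):

	ret = virtio_fs_enqueue_req(fsvq, req, true);
	if (ret < 0) {
		if (ret == -ENOSPC) {
			/* Virtqueue full: park the request until a completion
			 * frees a slot; done_work re-kicks dispatch_work. */
			spin_lock(&fsvq->lock);
			list_add_tail(&req->list, &fsvq->queued_reqs);
			spin_unlock(&fsvq->lock);
			return;
		}
		/* Anything else, including -ENOMEM, goes back to userspace. */
		req->out.h.error = ret;
		spin_lock(&fsvq->lock);
		dec_in_flight_req(fsvq);
		spin_unlock(&fsvq->lock);
		pr_err("virtio-fs: virtio_fs_enqueue_req() failed %d\n", ret);
		fuse_request_end(req);
		return;
	}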

Noob Linux contributor question: how often should I send in a new revision of
the patch? Should I wait for more comments or send in a V3 with that fix now?

Best,
Peter-Jan


Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Peter-Jan Gootzen via Virtualization
On 31/05/2023 21:18, Vivek Goyal wrote:
> On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
>> When the Virtio queue is full, a work item is scheduled
>> to execute in 1ms that retries adding the request to the queue.
>> This is a large amount of time on the scale on which a
>> virtio-fs device can operate. When using a DPU this is around
>> 40us baseline without going to a remote server (4k, QD=1).
>> This patch queues requests when the Virtio queue is full,
>> and when a completed request is taken off, immediately fills
>> it back up with queued requests.
>>
>> This reduces the 99.9th percentile latencies in our tests by
>> 60x and slightly increases the overall throughput, when using a
>> queue depth 2x the size of the Virtio queue size, with a
>> DPU-powered virtio-fs device.
>>
>> Signed-off-by: Peter-Jan Gootzen 
>> ---
>> V1 -> V2: Not scheduling dispatch work anymore when not needed
>> and changed delayed_work structs to work_struct structs
>>
>>  fs/fuse/virtio_fs.c | 32 +---
>>  1 file changed, 17 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 4d8d4f16c727..a676297db09b 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>>  struct work_struct done_work;
>>  struct list_head queued_reqs;
>>  struct list_head end_reqs;  /* End these requests */
>> -struct delayed_work dispatch_work;
>> +struct work_struct dispatch_work;
>>  struct fuse_dev *fud;
>>  bool connected;
>>  long in_flight;
>> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
>>  }
>>  
>>  flush_work(&fsvq->done_work);
>> -flush_delayed_work(&fsvq->dispatch_work);
>> +flush_work(&fsvq->dispatch_work);
>>  }
>>  
>>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
>> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
>>  dec_in_flight_req(fsvq);
>>  }
>>  } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
>> +
>> +if (!list_empty(&fsvq->queued_reqs))
>> +schedule_work(&fsvq->dispatch_work);
>>  spin_unlock(&fsvq->lock);
>>  }
>>  
>> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>>  {
>>  struct fuse_req *req;
>>  struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
>> - dispatch_work.work);
>> + dispatch_work);
>>  int ret;
>>  
>>  pr_debug("virtio-fs: worker %s called.\n", __func__);
>> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>>  if (ret == -ENOMEM || ret == -ENOSPC) {
>>  spin_lock(&fsvq->lock);
>>  list_add_tail(&req->list, &fsvq->queued_reqs);
>> -schedule_delayed_work(&fsvq->dispatch_work,
>> -  msecs_to_jiffies(1));
> 
> Virtqueue being full is only one of the reasons for failure to queue
> the request. What if virtqueue is empty but we could not queue the
> request because lack of memory (-ENOMEM). In that case we will queue
> the request and it might not be dispatched because there is no completion.
> (Assume there is no further new request coming). That means deadlock?
> 
> Thanks
> Vivek
> 

Good catch that will deadlock.

Is default kernel behavior to indefinitely retry a file system
request until memory is available?



[PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Peter-Jan Gootzen via Virtualization
When the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
40us baseline without going to a remote server (4k, QD=1).
This patch queues requests when the Virtio queue is full,
and when a completed request is taken off, immediately fills
it back up with queued requests.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
queue depth 2x the size of the Virtio queue size, with a
DPU-powered virtio-fs device.

Signed-off-by: Peter-Jan Gootzen 
---
V1 -> V2: Not scheduling dispatch work anymore when not needed
and changed delayed_work structs to work_struct structs

 fs/fuse/virtio_fs.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4d8d4f16c727..a676297db09b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -45,7 +45,7 @@ struct virtio_fs_vq {
struct work_struct done_work;
struct list_head queued_reqs;
struct list_head end_reqs;  /* End these requests */
-   struct delayed_work dispatch_work;
+   struct work_struct dispatch_work;
struct fuse_dev *fud;
bool connected;
long in_flight;
@@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
}
 
flush_work(&fsvq->done_work);
-   flush_delayed_work(&fsvq->dispatch_work);
+   flush_work(&fsvq->dispatch_work);
 }
 
 static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
@@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
dec_in_flight_req(fsvq);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
spin_unlock(&fsvq->lock);
 }
 
@@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
 {
struct fuse_req *req;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
int ret;
 
pr_debug("virtio-fs: worker %s called.\n", __func__);
@@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
@@ -436,8 +437,6 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
pr_debug("virtio-fs: Could not queue FORGET: err=%d. Will try later\n",
 ret);
list_add_tail(&forget->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
if (!in_flight)
inc_in_flight_req(fsvq);
/* Queue is full */
@@ -469,7 +468,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
 {
struct virtio_fs_forget *forget;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
pr_debug("virtio-fs: worker %s called.\n", __func__);
while (1) {
spin_lock(&fsvq->lock);
@@ -647,6 +646,11 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
virtio_fs_request_complete(req, fsvq);
}
}
+
+   spin_lock(&fsvq->lock);
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
+   spin_unlock(&fsvq->lock);
 }
 
 /* Virtqueue interrupt handler */
@@ -670,12 +674,12 @@ static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
 
if (vq_type == VQ_REQUEST) {
INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(&fsvq->dispatch_work,
- virtio_fs_request_dispatch_work);
+   INIT_WORK(&fsvq->dispatch_work,
+ virtio_fs_request_dispatch_work);
} else {
INIT_WORK(&fsvq->done_work, virtio_fs_hiprio_done_work);
-   INIT_DELAYED_WORK(&fsvq->dispatch_work,
- virtio_fs_hiprio_dispatch_work);
+ 

[PATCH] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-22 Thread Peter-Jan Gootzen via Virtualization
When the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
40us baseline without going to a remote server (4k, QD=1).
This patch queues requests when the Virtio queue is full,
and when a completed request is taken off, immediately fills
it back up with queued requests.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
queue depth 2x the size of the Virtio queue size, with a
DPU-powered virtio-fs device.

Signed-off-by: Peter-Jan Gootzen 
---
 fs/fuse/virtio_fs.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4d8d4f16c727..8af9d3dc61d3 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -347,6 +347,8 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
spin_unlock(&fsvq->lock);
+
+   schedule_delayed_work(&fsvq->dispatch_work, 0);
 }
 
 static void virtio_fs_request_dispatch_work(struct work_struct *work)
@@ -388,8 +390,6 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
@@ -436,8 +436,6 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
pr_debug("virtio-fs: Could not queue FORGET: err=%d. 
Will try later\n",
 ret);
list_add_tail(>list, >queued_reqs);
-   schedule_delayed_work(>dispatch_work,
- msecs_to_jiffies(1));
if (!in_flight)
inc_in_flight_req(fsvq);
/* Queue is full */
@@ -647,6 +645,8 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
virtio_fs_request_complete(req, fsvq);
}
}
+
+   schedule_delayed_work(&fsvq->dispatch_work, 0);
 }
 
 /* Virtqueue interrupt handler */
@@ -1254,8 +1254,6 @@ __releases(fiq->lock)
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
inc_in_flight_req(fsvq);
-   schedule_delayed_work(&fsvq->dispatch_work,
-   msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
-- 
2.34.1



Re: virtio-fs: adding support for multi-queue

2023-03-07 Thread Peter-Jan Gootzen via Virtualization

On 22-02-2023 15:32, Stefan Hajnoczi wrote:

On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:

On 08/02/2023 11:43, Stefan Hajnoczi wrote:

On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:



On 07/02/2023 22:57, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:

Hi,



[cc German]


For my MSc thesis project in collaboration with IBM
(https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
performance of the virtio-fs driver in high throughput scenarios. We think
the main bottleneck is the fact that the virtio-fs driver does not support
multi-queue (while the spec does). A big factor in this is that our setup on
the virtio-fs device-side (a DPU) does not easily allow multiple cores to
tend to a single virtio queue.


This is an interesting limitation in DPU.


Virtqueues are single-consumer queues anyway. Sharing them between
multiple threads would be expensive. I think using multiqueue is natural
and not specific to DPUs.


Can we create multiple threads (a thread pool) on DPU and let these
threads process requests in parallel (While there is only one virt
queue).

So this is what we had done in virtiofsd. One thread is dedicated to
pull the requests from virt queue and then pass the request to thread
pool to process it. And that seems to help with performance in
certain cases.

Is that possible on DPU? That itself can give a nice performance
boost for certain workloads without having to implement multiqueue
actually.

Just curious. I am not opposed to the idea of multiqueue. I am
just curious about the kind of performance gain (if any) it can
provide. And will this be helpful for rust virtiofsd running on
host as well?

Thanks
Vivek


There is technically nothing preventing us from consuming a single queue on
multiple cores, however our current Virtio implementation (DPU-side) is set
up with the assumption that you should never want to do that (concurrency
mayhem around the Virtqueues and the DMAs). So instead of putting all the
work into reworking the implementation to support that and still incur the
big overhead, we see it more fitting to amend the virtio-fs driver with
multi-queue support.



Is it just a theory at this point of time or have you implemented
it and seeing significant performance benefit with multiqueue?


It is a theory, but we are currently seeing that using the single request
queue, the single core attending to that queue on the DPU is reasonably
close to being fully saturated.


And will this be helpful for rust virtiofsd running on
host as well?


I figure this would be dependent on the workload and the users-needs.
Having many cores concurrently pulling on their own virtq and then
immediately processing the request locally would of course improve performance.
But we are offloading all this work to the DPU, for providing
high-throughput cloud services.


I think Vivek is getting at whether your code processes requests
sequentially or in parallel. A single thread processing the virtqueue
that hands off requests to worker threads or uses io_uring to perform
I/O asynchronously will perform differently from a single thread that
processes requests sequentially in a blocking fashion. Multiqueue is not
necessary for parallelism, but the single queue might become a
bottleneck.


Requests are handled non-blocking with remote IO on the DPU. Our current
architecture is as follows:
T1: Tends to the Virtq, parses FUSE to remote IO and fires off the
asynchronous remote IO.
T2: Polls for completion on the remote IO and parses it back to FUSE, puts
the FUSE buffers in a completion queue of T1.
T1: Handles the Virtio completion and DMA of the requests in the CQ.

Thread 1 is busy polling on its two queues (Virtq and CQ) with equal
priority, thread 2 is busy polling as well. This setup is not really
optimal, but we are working within the constraints of both our DPU and
remote IO stack.


Why does T1 need to handle VIRTIO completion and DMA requests instead of
T2?

Stefan


No good reason other than the fact that the concurrency safety of our 
DPU's virtio-fs library requires this.


> I had been doing some performance benchmarking for virtio-fs and I found
> some old results.
>
> https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-10-2021
>
> While running on top of local fs, with bs=4K, with single queue I could
> achieve more than 600MB/s.
>
> NAME        WORKLOAD        Bandwidth   IOPS
> default     seqread-psync   625.0mb     156.2k
> no-tpool    seqread-psync   660.8mb     165.2k
>
> But catch here I think is that host is doing the caching. In your
> case I am assuming there is no caching at DPU and all the I/O is
> 

Re: virtio-fs: adding support for multi-queue

2023-02-08 Thread Peter-Jan Gootzen via Virtualization

On 08/02/2023 11:43, Stefan Hajnoczi wrote:

On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:



On 07/02/2023 22:57, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:

Hi,



[cc German]


For my MSc thesis project in collaboration with IBM
(https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
performance of the virtio-fs driver in high throughput scenarios. We think
the main bottleneck is the fact that the virtio-fs driver does not support
multi-queue (while the spec does). A big factor in this is that our setup on
the virtio-fs device-side (a DPU) does not easily allow multiple cores to
tend to a single virtio queue.


This is an interesting limitation in DPU.


Virtqueues are single-consumer queues anyway. Sharing them between
multiple threads would be expensive. I think using multiqueue is natural
and not specific to DPUs.


Can we create multiple threads (a thread pool) on DPU and let these
threads process requests in parallel (While there is only one virt
queue).

So this is what we had done in virtiofsd. One thread is dedicated to
pull the requests from virt queue and then pass the request to thread
pool to process it. And that seems to help with performance in
certain cases.

Is that possible on DPU? That itself can give a nice performance
boost for certain workloads without having to implement multiqueue
actually.

Just curious. I am not opposed to the idea of multiqueue. I am
just curious about the kind of performance gain (if any) it can
provide. And will this be helpful for rust virtiofsd running on
host as well?

Thanks
Vivek


There is technically nothing preventing us from consuming a single queue on
multiple cores, however our current Virtio implementation (DPU-side) is set
up with the assumption that you should never want to do that (concurrency
mayhem around the Virtqueues and the DMAs). So instead of putting all the
work into reworking the implementation to support that and still incur the
big overhead, we see it more fitting to amend the virtio-fs driver with
multi-queue support.



Is it just a theory at this point of time or have you implemented
it and seeing significant performance benefit with multiqueue?


It is a theory, but we are currently seeing that using the single request
queue, the single core attending to that queue on the DPU is reasonably
close to being fully saturated.


And will this be helpful for rust virtiofsd running on
host as well?


I figure this would be dependent on the workload and the users-needs.
Having many cores concurrently pulling on their own virtq and then
immediately processing the request locally would of course improve performance.
But we are offloading all this work to the DPU, for providing
high-throughput cloud services.


I think Vivek is getting at whether your code processes requests
sequentially or in parallel. A single thread processing the virtqueue
that hands off requests to worker threads or uses io_uring to perform
I/O asynchronously will perform differently from a single thread that
processes requests sequentially in a blocking fashion. Multiqueue is not
necessary for parallelism, but the single queue might become a
bottleneck.


Requests are handled non-blocking with remote IO on the DPU. Our current 
architecture is as follows:
T1: Tends to the Virtq, parses FUSE to remote IO and fires off the 
asynchronous remote IO.
T2: Polls for completion on the remote IO and parses it back to FUSE, 
puts the FUSE buffers in a completion queue of T1.

T1: Handles the Virtio completion and DMA of the requests in the CQ.

Thread 1 is busy polling on its two queues (Virtq and CQ) with equal 
priority, thread 2 is busy polling as well. This setup is not really 
optimal, but we are working within the constraints of both our DPU and 
remote IO stack.

Currently we are able to get with sequential single job 4k throughput:
Write: 246MiB/s
Read: 20MiB/s
We are not sure yet where the bottleneck is for reads, we hope to be 
able to match it to the write speed. For writes the two main bottlenecks 
we see are: the single Virtq (so limited parallelism on the DPU and 
remote-side) and that virtio-fs IO is constrained to the page size of 4k 
(NFS for example, who we are trying to replace, sees huge performance 
gains with larger block sizes).



This is what I remembered as well, but can't find it clearly in the source
right now, do you have references to the source for this?


virtio_blk.ko uses an irq_affinity descriptor to tell virtio_find_vqs()
to spread MSI interrupts across CPUs:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/virtio_blk.c#n609

The core blk-mq code has the blk_mq_virtio_map_queues() function to map
block layer queues to virtqueues:
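
For context, a rough sketch of that pattern (the helper name vfs_setup_vqs()
is made up for illustration and this is not virtio-fs code; the calls are
the standard virtio core API referenced above):

	#include <linux/interrupt.h>      /* struct irq_affinity */
	#include <linux/virtio_config.h>  /* virtio_find_vqs() */

	/* Spread per-virtqueue MSI-X vectors across CPUs, virtio_blk style. */
	static int vfs_setup_vqs(struct virtio_device *vdev, unsigned int nvqs,
				 struct virtqueue *vqs[],
				 vq_callback_t *callbacks[],
				 const char * const names[])
	{
		/* Keep vector 0 (e.g. a hiprio queue) out of the spreading. */
		struct irq_affinity desc = { .pre_vectors = 1 };

		/* The transport distributes the remaining vectors over CPUs;
		 * blk-mq drivers then map software queues onto these vqs with
		 * blk_mq_virtio_map_queues() in their .map_queues callback. */
		return virtio_find_vqs(vdev, nvqs, vqs, callbacks, names, &desc);
	}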

Re: virtio-fs: adding support for multi-queue

2023-02-08 Thread Peter-Jan Gootzen via Virtualization




On 07/02/2023 22:57, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:

On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:

On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:

Hi,



[cc German]


For my MSc thesis project in collaboration with IBM
(https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
performance of the virtio-fs driver in high throughput scenarios. We think
the main bottleneck is the fact that the virtio-fs driver does not support
multi-queue (while the spec does). A big factor in this is that our setup on
the virtio-fs device-side (a DPU) does not easily allow multiple cores to
tend to a single virtio queue.


This is an interesting limitation in DPU.


Virtqueues are single-consumer queues anyway. Sharing them between
multiple threads would be expensive. I think using multiqueue is natural
and not specific to DPUs.


Can we create multiple threads (a thread pool) on DPU and let these
threads process requests in parallel (While there is only one virt
queue).

So this is what we had done in virtiofsd. One thread is dedicated to
pull the requests from virt queue and then pass the request to thread
pool to process it. And that seems to help with performance in
certain cases.

Is that possible on DPU? That itself can give a nice performance
boost for certain workloads without having to implement multiqueue
actually.

Just curious. I am not opposed to the idea of multiqueue. I am
just curious about the kind of performance gain (if any) it can
provide. And will this be helpful for rust virtiofsd running on
host as well?

Thanks
Vivek

There is technically nothing preventing us from consuming a single queue 
on multiple cores, however our current Virtio implementation (DPU-side) 
is set up with the assumption that you should never want to do that 
(concurrency mayhem around the Virtqueues and the DMAs). So instead of 
putting all the work into reworking the implementation to support that 
and still incur the big overhead, we see it more fitting to amend the 
virtio-fs driver with multi-queue support.



> Is it just a theory at this point of time or have you implemented
> it and seeing significant performance benefit with multiqueue?

It is a theory, but we are currently seeing that using the single 
request queue, the single core attending to that queue on the DPU is 
reasonably close to being fully saturated.


> And will this be helpful for rust virtiofsd running on
> host as well?

I figure this would be dependent on the workload and the users-needs.
Having many cores concurrently pulling on their own virtq and then 
immediately processing the request locally would of course improve 
performance. But we are offloading all this work to the DPU, for 
providing high-throughput cloud services.


> Sounds good. Assigning vqs round-robin is the strategy that virtio-net
> and virtio-blk use. virtio-blk could be an interesting example as it's
> similar to virtiofs. The Linux multiqueue block layer and core virtio
> irq allocation handle CPU affinity in the case of virtio-blk.

So virtio-blk uses the queue assigned by the mq block layer and 
virtio-net the queue assigned by the net core layer, correct?


If I interpret you correctly, the round-robin strategy is done by 
assigning cores to queues round-robin, not per-request dynamically 
round-robin?
This is what I remembered as well, but can't find it clearly in the 
source right now, do you have references to the source for this?


> Which DPU are you targetting?

This is something I unfortunately can't disclose at the moment.

Thanks,
Peter-Jan