Re: Re: [PATCH v3 2/5] virtio-crypto: wait ctrl queue instead of busy polling

2022-04-22 Thread zhenwei pi

On 4/22/22 15:46, Jason Wang wrote:


On 2022/4/21 18:40, zhenwei pi wrote:

Originally, after submitting a request to the virtio crypto control
queue, the guest side polls the virtqueue for the result. This
works as follows:
 CPU0   CPU1   ...   CPUx   CPUy
  |      |            |      |
  \      \            /      /
   \--spin_lock(&vcrypto->ctrl_lock)--/
                  |
         virtqueue add & kick
                  |
         busy poll virtqueue
                  |
    spin_unlock(&vcrypto->ctrl_lock)
                 ...

There are two problems:
1, The queue depth is always 1, so the performance of a virtio crypto
    device is limited. Multiple user processes share a single control
    queue and contend on its spin lock. Tested on an Intel
    Platinum 8260, a single worker gets ~35K/s create/close session
    operations, and 8 workers get ~40K/s operations with 800% CPU
    utilization.
2, The control request is supposed to get handled immediately, but
    in the current implementation of QEMU (v6.2), the vCPU thread kicks
    another thread to do this work, so the latency also becomes unstable.
    Tracking the latency of virtio_crypto_alg_akcipher_close_session for 5s:
     usecs               : count
         0 -> 1          : 0
         2 -> 3          : 7
         4 -> 7          : 72
         8 -> 15         : 186485
        16 -> 31         : 687
        32 -> 63         : 5
        64 -> 127        : 3
       128 -> 255        : 1
       256 -> 511        : 0
       512 -> 1023       : 0
      1024 -> 2047       : 0
      2048 -> 4095       : 0
      4096 -> 8191       : 0
      8192 -> 16383      : 2
    This means that a CPU may hold vcrypto->ctrl_lock as long as 8192~16383us.
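
The bucket boundaries in the table above look like power-of-two ranges, the kind printed by log2 latency histograms. The message does not name the tracing tool, so treat that as an assumption. Under that assumption, a minimal sketch of how a latency maps to its bucket could look like this (latency_bucket, bucket_low and bucket_high are hypothetical helper names, not part of the driver):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the power-of-two bucket a latency falls in.
 * Bucket 0 covers 0..1 us, bucket 1 covers 2..3 us, bucket 2 covers 4..7 us,
 * and so on.
 */
unsigned int latency_bucket(uint64_t usecs)
{
	unsigned int idx = 0;

	while (usecs > 1) {
		usecs >>= 1;
		idx++;
	}
	return idx;
}

/* Inclusive lower bound of a bucket, in microseconds. */
uint64_t bucket_low(unsigned int idx)
{
	return idx == 0 ? 0 : (uint64_t)1 << idx;
}

/* Inclusive upper bound of a bucket, in microseconds. */
uint64_t bucket_high(unsigned int idx)
{
	assert(idx < 63);	/* keep the shift below well defined */
	return ((uint64_t)1 << (idx + 1)) - 1;
}
```

With this mapping, the two slowest samples land in bucket 13, whose bounds are 8192 and 16383 us, which matches the worst-case lock hold time quoted above.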


To improve the performance of the control queue, a request on the control queue
waits for completion instead of busy polling, which reduces lock contention, and
gets completed by the control queue callback.
 CPU0   CPU1   ...   CPUx   CPUy
  |      |            |      |
  \      \            /      /
   \--spin_lock(&vcrypto->ctrl_lock)--/
                  |
         virtqueue add & kick
                  |
  --spin_unlock(&vcrypto->ctrl_lock)--
  /      /            \      \
  |      |            |      |
 wait   wait         wait   wait

With this patch, the guest side gets ~200K/s operations with 300% CPU
utilization.

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Gonglei 
Signed-off-by: zhenwei pi 
---
  drivers/crypto/virtio/virtio_crypto_common.c | 42 +++-
  drivers/crypto/virtio/virtio_crypto_common.h |  8 
  drivers/crypto/virtio/virtio_crypto_core.c   |  2 +-
  3 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_common.c b/drivers/crypto/virtio/virtio_crypto_common.c
index e65125a74db2..93df73c40dd3 100644
--- a/drivers/crypto/virtio/virtio_crypto_common.c
+++ b/drivers/crypto/virtio/virtio_crypto_common.c
@@ -8,14 +8,21 @@
 
 #include "virtio_crypto_common.h"
 
+static void virtio_crypto_ctrlq_callback(struct virtio_crypto_ctrl_request *vc_ctrl_req)
+{
+	complete(&vc_ctrl_req->compl);
+}
+
 int virtio_crypto_ctrl_vq_request(struct virtio_crypto *vcrypto, struct scatterlist *sgs[],
 				  unsigned int out_sgs, unsigned int in_sgs,
 				  struct virtio_crypto_ctrl_request *vc_ctrl_req)
 {
 	int err;
-	unsigned int inlen;
 	unsigned long flags;
 
+	init_completion(&vc_ctrl_req->compl);
+	vc_ctrl_req->ctrl_cb =  virtio_crypto_ctrlq_callback;



Is there a chance that the cb would not be virtio_crypto_ctrlq_callback()?



Yes, it's the only callback function used for the control queue. Removing
this and calling virtio_crypto_ctrlq_callback directly in
virtcrypto_ctrlq_callback seems better. I'll fix this in the next version.
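
For illustration only, the simplification discussed above can be mimicked in userspace. This is a rough sketch under stated assumptions, not the driver code: struct completion, init_completion(), complete() and wait_for_completion() here are pthread-based stand-ins for the kernel primitives of the same names, and struct ctrl_request, fake_ctrlq_callback() and submit_and_wait() are hypothetical names made up for the example. The point it shows is that the request needs no ctrl_cb function pointer when the queue callback completes it directly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion (assumption). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

/* Hypothetical, simplified control request: no ctrl_cb pointer. */
struct ctrl_request {
	struct completion compl;
	int status;
};

void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Mark the completion as done and wake up a waiter. */
void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Sleep (rather than spin) until complete() has been called. */
void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Stand-in for the queue callback: completes the request directly. */
static void *fake_ctrlq_callback(void *arg)
{
	struct ctrl_request *req = arg;

	assert(req != NULL);
	req->status = 0;	/* pretend the device reported success */
	complete(&req->compl);
	return NULL;
}

/* Submit a request, then sleep until the "device" thread completes it. */
int submit_and_wait(struct ctrl_request *req)
{
	pthread_t dev;

	req->status = -1;
	init_completion(&req->compl);
	if (pthread_create(&dev, NULL, fake_ctrlq_callback, req) != 0)
		return -1;
	wait_for_completion(&req->compl);
	pthread_join(dev, NULL);
	return req->status;
}
```

The submitter initializes the completion before handing the request to the other thread, so the callback can never signal an uninitialized object. In this sketch no lock is held across the wait.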



+
 	spin_lock_irqsave(&vcrypto->ctrl_lock, flags);
 	err = virtqueue_add_sgs(vcrypto->ctrl_vq, sgs, out_sgs, in_sgs, vc_ctrl_req, GFP_ATOMIC);
 	if (err < 0) {
@@ -24,16 +31,31 @@ int virtio_crypto_ctrl_vq_request(struct virtio_crypto *vcrypto, struct scatterl
 	}
 
 	virtqueue_kick(vcrypto->ctrl_vq);
-
-    /*
- * 

Re: [PATCH v3 2/5] virtio-crypto: wait ctrl queue instead of busy polling

2022-04-22 Thread Jason Wang


On 2022/4/21 18:40, zhenwei pi wrote:

Originally, after submitting a request to the virtio crypto control
queue, the guest side polls the virtqueue for the result. This
works as follows:
 CPU0   CPU1   ...   CPUx   CPUy
  |      |            |      |
  \      \            /      /
   \--spin_lock(&vcrypto->ctrl_lock)--/
                  |
         virtqueue add & kick
                  |
         busy poll virtqueue
                  |
    spin_unlock(&vcrypto->ctrl_lock)
                 ...

There are two problems:
1, The queue depth is always 1, so the performance of a virtio crypto
device is limited. Multiple user processes share a single control
queue and contend on its spin lock. Tested on an Intel
Platinum 8260, a single worker gets ~35K/s create/close session
operations, and 8 workers get ~40K/s operations with 800% CPU
utilization.
2, The control request is supposed to get handled immediately, but
in the current implementation of QEMU (v6.2), the vCPU thread kicks
another thread to do this work, so the latency also becomes unstable.
Tracking the latency of virtio_crypto_alg_akcipher_close_session for 5s:
     usecs               : count
         0 -> 1          : 0
         2 -> 3          : 7
         4 -> 7          : 72
         8 -> 15         : 186485
        16 -> 31         : 687
        32 -> 63         : 5
        64 -> 127        : 3
       128 -> 255        : 1
       256 -> 511        : 0
       512 -> 1023       : 0
      1024 -> 2047       : 0
      2048 -> 4095       : 0
      4096 -> 8191       : 0
      8192 -> 16383      : 2
This means that a CPU may hold vcrypto->ctrl_lock as long as 8192~16383us.

To improve the performance of the control queue, a request on the control queue
waits for completion instead of busy polling, which reduces lock contention, and
gets completed by the control queue callback.
 CPU0   CPU1   ...   CPUx   CPUy
  |      |            |      |
  \      \            /      /
   \--spin_lock(&vcrypto->ctrl_lock)--/
                  |
         virtqueue add & kick
                  |
  --spin_unlock(&vcrypto->ctrl_lock)--
  /      /            \      \
  |      |            |      |
 wait   wait         wait   wait

With this patch, the guest side gets ~200K/s operations with 300% CPU
utilization.

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Gonglei 
Signed-off-by: zhenwei pi 
---
  drivers/crypto/virtio/virtio_crypto_common.c | 42 +++-
  drivers/crypto/virtio/virtio_crypto_common.h |  8 
  drivers/crypto/virtio/virtio_crypto_core.c   |  2 +-
  3 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_common.c b/drivers/crypto/virtio/virtio_crypto_common.c
index e65125a74db2..93df73c40dd3 100644
--- a/drivers/crypto/virtio/virtio_crypto_common.c
+++ b/drivers/crypto/virtio/virtio_crypto_common.c
@@ -8,14 +8,21 @@
 
 #include "virtio_crypto_common.h"
 
+static void virtio_crypto_ctrlq_callback(struct virtio_crypto_ctrl_request *vc_ctrl_req)
+{
+	complete(&vc_ctrl_req->compl);
+}
+
 int virtio_crypto_ctrl_vq_request(struct virtio_crypto *vcrypto, struct scatterlist *sgs[],
 				  unsigned int out_sgs, unsigned int in_sgs,
 				  struct virtio_crypto_ctrl_request *vc_ctrl_req)
 {
 	int err;
-	unsigned int inlen;
 	unsigned long flags;
 
+	init_completion(&vc_ctrl_req->compl);
+	vc_ctrl_req->ctrl_cb =  virtio_crypto_ctrlq_callback;



Is there a chance that the cb would not be virtio_crypto_ctrlq_callback()?



+
 	spin_lock_irqsave(&vcrypto->ctrl_lock, flags);
 	err = virtqueue_add_sgs(vcrypto->ctrl_vq, sgs, out_sgs, in_sgs, vc_ctrl_req, GFP_ATOMIC);
 	if (err < 0) {
@@ -24,16 +31,31 @@ int virtio_crypto_ctrl_vq_request(struct virtio_crypto *vcrypto, struct scatterl
 	}
 
 	virtqueue_kick(vcrypto->ctrl_vq);
-
-	/*
-	 * Trapping into the hypervisor, so the request should be
-	 * handled immediately.
-	 */
-	while (!virtqueue_get_buf(vcrypto->ctrl_vq, &inlen) &&
-	       !virtqueue_is