Re: [PATCH V7 4/5] virtio-scsi: introduce multiqueue support

2013-03-25 Thread Paolo Bonzini
Il 25/03/2013 08:25, Bart Van Assche ha scritto:
>>
>> +		queue_num = smp_processor_id();
>> +		while (unlikely(queue_num >= vscsi->num_queues))
>> +			queue_num -= vscsi->num_queues;
>> +
>> +		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
>> +	}
>> +
>> +	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>> +	return vq;
>> +}
> 
> Is there any reason why the smp_processor_id() % vscsi->num_queues
> computation in virtscsi_pick_vq() has been implemented as a loop instead
> of as an arithmetic operation? If so, I would appreciate it if that
> could be explained in a comment.

The expectation is that vscsi->num_queues is the same as the number of
CPUs in the virtual machine; see patch 5 too.  Since smp_processor_id()
is then already smaller than vscsi->num_queues, the loop body almost
never executes, and the subtraction avoids paying for an integer
division in the common case.
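
A minimal illustration (not part of the patch; the helper name is
hypothetical) of why the subtraction loop behaves like the modulo:

	/* Hypothetical helper: returns queue_num % num_queues for any
	 * input, but executes zero loop iterations in the expected case
	 * where queue_num is already smaller than num_queues, so no
	 * division instruction is needed. */
	static u32 pick_queue_num(u32 queue_num, u32 num_queues)
	{
		while (unlikely(queue_num >= num_queues))
			queue_num -= num_queues;
		return queue_num;
	}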

Paolo


Re: [PATCH V7 4/5] virtio-scsi: introduce multiqueue support

2013-03-25 Thread Bart Van Assche

On 03/23/13 12:28, Wanlong Gao wrote:

+static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
+					       struct virtio_scsi_target_state *tgt)
+{
+	struct virtio_scsi_vq *vq;
+	unsigned long flags;
+	u32 queue_num;
+
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+
+	/*
+	 * The memory barrier after atomic_inc_return matches
+	 * the smp_read_barrier_depends() in virtscsi_req_done.
+	 */
+	if (atomic_inc_return(&tgt->reqs) > 1)
+		vq = ACCESS_ONCE(tgt->req_vq);
+	else {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+
+		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+	}
+
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return vq;
+}


Is there any reason why the smp_processor_id() % vscsi->num_queues
computation in virtscsi_pick_vq() has been implemented as a loop instead
of as an arithmetic operation? If so, I would appreciate it if that
could be explained in a comment.


Thanks,

Bart.


[PATCH V7 4/5] virtio-scsi: introduce multiqueue support

2013-03-23 Thread Wanlong Gao
From: Paolo Bonzini 

This patch adds queue steering to virtio-scsi.  When a target is sent
multiple requests, we always drive them to the same queue so that FIFO
processing order is kept.  However, if a target was idle, we can choose
a queue arbitrarily.  In this case the queue is chosen according to the
current VCPU, so the driver expects the number of request queues to be
equal to the number of VCPUs.  This makes it easy and fast to select
the queue, and also lets the driver optimize the IRQ affinity for the
virtqueues (each virtqueue's affinity is set to the CPU that "owns"
the queue).

The speedup comes from improving cache locality and giving CPU affinity
to the virtqueues, which is why this scheme was selected.  Assuming that
the thread that is sending requests to the device is I/O-bound, it is
likely to be sleeping at the time the ISR is executed, and thus executing
the ISR on the same processor that sent the requests is cheap.

However, the kernel will not execute the ISR on the "best" processor
unless you explicitly set the affinity.  This is because in practice
you will have many such I/O-bound processes and thus many otherwise
idle processors.  Then the kernel will execute the ISR on a random
processor, rather than the one that is sending requests to the device.
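
Concretely, each request virtqueue's affinity hint is pointed at the CPU
that owns the queue.  A simplified sketch of the idea (not the exact
helper in this patch, which also handles clearing the hint when the
queue/CPU counts do not match):

	/* Assumes one request queue per online CPU, as this driver expects. */
	static void virtscsi_set_affinity_sketch(struct virtio_scsi *vscsi)
	{
		int i = 0, cpu;

		if (vscsi->num_queues != num_online_cpus())
			return;

		for_each_online_cpu(cpu)
			virtqueue_set_affinity(vscsi->req_vqs[i++].vq, cpu);
	}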

The alternative to per-CPU virtqueues is per-target virtqueues.  To
achieve the same locality, we could dynamically choose the virtqueue's
affinity based on the CPU of the last task that sent a request.  This
is less appealing because we do not set the affinity directly---we only
provide a hint to the irqbalance daemon running in userspace.  Dynamically
changing the affinity only works if userspace applies the hint
fast enough.

Cc: linux-s...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
Signed-off-by: Wanlong Gao 
Reviewed-by: Asias He 
Tested-by: Venkatesh Srinivas 
---
 drivers/scsi/virtio_scsi.c | 282 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 254 insertions(+), 28 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index dc2daec..8dcdef0 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -22,12 +22,14 @@
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_scsi.h>
+#include <linux/cpu.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_cmnd.h>
 
 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
+#define VIRTIO_SCSI_VQ_BASE 2
 
 /* Command queue element */
 struct virtio_scsi_cmd {
@@ -59,22 +61,58 @@ struct virtio_scsi_vq {
 	struct virtqueue *vq;
 };
 
-/* Per-target queue state */
+/*
+ * Per-target queue state.
+ *
+ * This struct holds the data needed by the queue steering policy.  When a
+ * target is sent multiple requests, we need to drive them to the same queue so
+ * that FIFO processing order is kept.  However, if a target was idle, we can
+ * choose a queue arbitrarily.  In this case the queue is chosen according to
+ * the current VCPU, so the driver expects the number of request queues to be
+ * equal to the number of VCPUs.  This makes it easy and fast to select the
+ * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
+ * (each virtqueue's affinity is set to the CPU that "owns" the queue).
+ *
+ * An interesting effect of this policy is that only writes to req_vq need to
+ * take the tgt_lock.  Reads can be done outside the lock because:
+ *
+ * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
+ *   In that case, no other CPU is reading req_vq: even if they were in
+ *   virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
+ *
+ * - reads of req_vq only occur when the target is not idle (reqs != 0).
+ *   A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
+ *
+ * Similarly, decrements of reqs are never concurrent with writes of req_vq.
+ * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * an atomic_t.
+ */
 struct virtio_scsi_target_state {
-	/* Never held at the same time as vq_lock.  */
+	/* This spinlock is never held at the same time as vq_lock. */
 	spinlock_t tgt_lock;
+
+	/* Count of outstanding requests. */
+	atomic_t reqs;
+
+	/* Currently active virtqueue for requests sent to this target. */
+	struct virtio_scsi_vq *req_vq;
 };
 
 /* Driver instance state */
 struct virtio_scsi {
 	struct virtio_device *vdev;
 
-	struct virtio_scsi_vq ctrl_vq;
-	struct virtio_scsi_vq event_vq;
-	struct virtio_scsi_vq req_vq;
-
 	/* Get some buffers ready for event vq */
 	struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
+
+	u32 num_queues;
+
+	/* If the affinity hint is set for virtqueues */
+	bool affinity_hint_set;
+
+	struct virtio_scsi_vq ctrl_vq;
+	struct virtio_scsi_vq event_vq;
+	struct virtio_scsi_vq req_vqs[];
 };
 
 static struct kmem_cache *virtscsi_cmd_cache;
@@ -109,6 +147,8 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, vo
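
One consequence of the req_vqs[] flexible array member introduced above
is that the per-queue state must be allocated together with the driver
instance state.  A sketch of the sizing; the host template name and the
num_queues variable are illustrative assumptions, not quotes from the
patch:

	struct Scsi_Host *shost;

	/* A single allocation holds struct virtio_scsi plus one
	 * virtio_scsi_vq per request queue, so req_vqs[] can be
	 * indexed directly. */
	shost = scsi_host_alloc(&virtscsi_host_template,
				sizeof(struct virtio_scsi) +
				num_queues * sizeof(struct virtio_scsi_vq));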