Re: virtio-scsi: two questions related with picking up queue

2014-05-08 Thread Ming Lei
On Thu, May 8, 2014 at 9:21 PM, Paolo Bonzini  wrote:
> Il 08/05/2014 14:55, Ming Lei ha scritto:
>
>> On Thu, May 8, 2014 at 8:17 PM, Paolo Bonzini  wrote:
>>>
>>> Il 08/05/2014 12:44, Ming Lei ha scritto:


 On Wed, 07 May 2014 18:43:45 +0200
 Paolo Bonzini  wrote:

>
> Per-CPU spinlocks have bad scalability problems, especially if you're
> overcommitting.  Writing req_vq is not at all rare.



 OK, I thought about it further, and I believe seqcount may
 be a match for this case; could you take a look at the patch below?

 diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
 index 13dd500..1adbad7 100644
 --- a/drivers/scsi/virtio_scsi.c
 +++ b/drivers/scsi/virtio_scsi.c
 @@ -26,6 +26,7 @@
  #include 
  #include 
  #include 
 +#include <linux/seqlock.h>

  #define VIRTIO_SCSI_MEMPOOL_SZ 64
  #define VIRTIO_SCSI_EVENT_LEN 8
 @@ -73,18 +74,16 @@ struct virtio_scsi_vq {
  * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
   * (each virtqueue's affinity is set to the CPU that "owns" the queue).
   *
 - * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
 - * could be done locklessly, but we do not do it yet.
 + * tgt_seq is held to serialize reading and writing req_vq.
   *
  * Decrements of reqs are never concurrent with writes of req_vq: before the
  * decrement reqs will be != 0; after the decrement the virtqueue completion
  * routine will not use the req_vq so it can be changed by a new request.
 - * Thus they can happen outside the tgt_lock, provided of course we make reqs
 + * Thus they can happen outside the tgt_seq, provided of course we make reqs
  * an atomic_t.
   */
  struct virtio_scsi_target_state {
 -   /* This spinlock never held at the same time as vq_lock. */
 -   spinlock_t tgt_lock;
 +   seqcount_t tgt_seq;

 /* Count of outstanding requests. */
 atomic_t reqs;
 @@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
 unsigned long flags;
 u32 queue_num;

 -   spin_lock_irqsave(&tgt->tgt_lock, flags);
 +   local_irq_save(flags);
 +   if (atomic_inc_return(&tgt->reqs) > 1) {
 +   unsigned long seq;
 +
 +   do {
 +   seq = read_seqcount_begin(&tgt->tgt_seq);
 +   vq = tgt->req_vq;
 +   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
 +   } else {
 +   /* no writes can be concurrent because of atomic_t */
 +   write_seqcount_begin(&tgt->tgt_seq);
 +
 +   /* keep the previous req_vq if a reader is found */
 +   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
 +   vq = tgt->req_vq;
 +   goto unlock;
 +   }

 queue_num = smp_processor_id();
 while (unlikely(queue_num >= vscsi->num_queues))
 queue_num -= vscsi->num_queues;
 tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
 + unlock:
 +   write_seqcount_end(&tgt->tgt_seq);
 }
 +   local_irq_restore(flags);
>>>
>>>
>>>
>>> I find this harder to think about than the double-check with a
>>> spin_lock_irqsave in the middle,
>>
>>
>> Sorry, could you explain it a bit? With seqcount, the spin_lock
>> isn't needed; it had only been used to serialize reads
>> and writes.
>
>
> Yes, the spin lock is not needed but you are still potentially spinning on
> the read side.

IMO the spinning is required anyway, because a consistent req_vq is
needed; it only happens while a writer is in progress, and that
window is surely very short.

>
>
>>> and the read side is not lock free anymore.
>>
>>
It is still lock free, because a reader won't block other readers,
and both read_seqcount_begin and read_seqcount_retry only
check whether a writer is in progress or has completed; the
two helpers are very cheap.
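
For concreteness, the read/retry protocol described here can be modeled in
userspace, with C11 atomics standing in for the kernel's seqcount_t. This is
only a sketch for reasoning about the protocol, not kernel code, and all
names are illustrative:

#include <stdatomic.h>
#include <stdio.h>

struct target {
	atomic_uint seq;	/* even: no writer; odd: write in progress */
	atomic_int  req_vq;	/* stands in for tgt->req_vq */
};

/* Read side: retry until an even, unchanged sequence brackets the read,
 * which is what read_seqcount_begin/read_seqcount_retry do. */
static int read_req_vq(struct target *t)
{
	unsigned int s;
	int vq;

	do {
		/* spin while a write is in progress (sequence is odd) */
		while ((s = atomic_load_explicit(&t->seq,
						 memory_order_acquire)) & 1u)
			;
		vq = atomic_load_explicit(&t->req_vq, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		/* retry if a writer ran between the two sequence reads */
	} while (atomic_load_explicit(&t->seq, memory_order_relaxed) != s);

	return vq;
}

/* Write side: bump the sequence to odd, update, bump back to even.  The
 * patch guarantees a single writer through the atomic reqs counter. */
static void write_req_vq(struct target *t, int vq)
{
	atomic_fetch_add_explicit(&t->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&t->req_vq, vq, memory_order_relaxed);
	atomic_fetch_add_explicit(&t->seq, 1, memory_order_release);
}

int main(void)
{
	struct target t = { 0, 0 };

	write_req_vq(&t, 3);
	printf("req_vq = %d\n", read_req_vq(&t));
	return 0;
}

As in the patch, readers never write shared state, so they never bounce a
lock's cache line; they only re-read when they race with the (rare) writer.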
>
>
> Lock-free has a precise meaning, which is that the system will progress
> regardless of scheduling.  In this case, readers won't progress while the
> writer is preempted between write_seqcount_begin and write_seqcount_end.

The writer won't be preempted, since local irqs are disabled.

> My cmpxchg example had a lock-free read-side and a blocking write-side,
> while your code is the opposite, the write-side is lock-free and the
> read-side is blocking.

The read-side isn't blocking; it just waits for the write to complete,
which is inevitable with concurrent readers and a writer. And your
cmpxchg example is still 'blocking' if there are readers, which
should be w

Re: virtio-scsi: two questions related with picking up queue

2014-05-08 Thread Paolo Bonzini

Il 08/05/2014 14:55, Ming Lei ha scritto:

On Thu, May 8, 2014 at 8:17 PM, Paolo Bonzini  wrote:

Il 08/05/2014 12:44, Ming Lei ha scritto:


On Wed, 07 May 2014 18:43:45 +0200
Paolo Bonzini  wrote:



Per-CPU spinlocks have bad scalability problems, especially if you're
overcommitting.  Writing req_vq is not at all rare.



OK, I thought about it further, and I believe seqcount may
be a match for this case; could you take a look at the patch below?

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13dd500..1adbad7 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/seqlock.h>

 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
@@ -73,18 +74,16 @@ struct virtio_scsi_vq {
 * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
  * (each virtqueue's affinity is set to the CPU that "owns" the queue).
  *
- * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
- * could be done locklessly, but we do not do it yet.
+ * tgt_seq is held to serialize reading and writing req_vq.
  *
 * Decrements of reqs are never concurrent with writes of req_vq: before the
 * decrement reqs will be != 0; after the decrement the virtqueue completion
 * routine will not use the req_vq so it can be changed by a new request.
- * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * Thus they can happen outside the tgt_seq, provided of course we make reqs
 * an atomic_t.
  */
 struct virtio_scsi_target_state {
-   /* This spinlock never held at the same time as vq_lock. */
-   spinlock_t tgt_lock;
+   seqcount_t tgt_seq;

/* Count of outstanding requests. */
atomic_t reqs;
@@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
unsigned long flags;
u32 queue_num;

-   spin_lock_irqsave(&tgt->tgt_lock, flags);
+   local_irq_save(flags);
+   if (atomic_inc_return(&tgt->reqs) > 1) {
+   unsigned long seq;
+
+   do {
+   seq = read_seqcount_begin(&tgt->tgt_seq);
+   vq = tgt->req_vq;
+   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
+   } else {
+   /* no writes can be concurrent because of atomic_t */
+   write_seqcount_begin(&tgt->tgt_seq);
+
+   /* keep the previous req_vq if a reader is found */
+   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
+   vq = tgt->req_vq;
+   goto unlock;
+   }

queue_num = smp_processor_id();
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;
tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+ unlock:
+   write_seqcount_end(&tgt->tgt_seq);
}
+   local_irq_restore(flags);



I find this harder to think about than the double-check with a
spin_lock_irqsave in the middle,


Sorry, could you explain it a bit? With seqcount, the spin_lock
isn't needed; it had only been used to serialize reads
and writes.


Yes, the spin lock is not needed but you are still potentially spinning 
on the read side.



and the read side is not lock free anymore.


It is still lock free, because a reader won't block other readers,
and both read_seqcount_begin and read_seqcount_retry only
check whether a writer is in progress or has completed; the
two helpers are very cheap.


Lock-free has a precise meaning, which is that the system will progress 
regardless of scheduling.  In this case, readers won't progress while 
the writer is preempted between write_seqcount_begin and write_seqcount_end.


My cmpxchg example had a lock-free read-side and a blocking write-side, 
while your code is the opposite, the write-side is lock-free and the 
read-side is blocking.


I'm not sure which is better.  You can try both and the current code, 
and show that some benchmarks improve.  Otherwise, it's better to leave 
the current code.  Simple code is better than complex code that was 
never benchmarked (which is why in the end I and Wanlong Gao settled for 
the simple spinlock).


Paolo


Re: virtio-scsi: two questions related with picking up queue

2014-05-08 Thread Ming Lei
On Thu, May 8, 2014 at 8:17 PM, Paolo Bonzini  wrote:
> Il 08/05/2014 12:44, Ming Lei ha scritto:
>>
>> On Wed, 07 May 2014 18:43:45 +0200
>> Paolo Bonzini  wrote:
>>
>>>
>>> Per-CPU spinlocks have bad scalability problems, especially if you're
>>> overcommitting.  Writing req_vq is not at all rare.
>>
>>
>> OK, I thought about it further, and I believe seqcount may
>> be a match for this case; could you take a look at the patch below?
>>
>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>> index 13dd500..1adbad7 100644
>> --- a/drivers/scsi/virtio_scsi.c
>> +++ b/drivers/scsi/virtio_scsi.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <linux/seqlock.h>
>>
>>  #define VIRTIO_SCSI_MEMPOOL_SZ 64
>>  #define VIRTIO_SCSI_EVENT_LEN 8
>> @@ -73,18 +74,16 @@ struct virtio_scsi_vq {
>>   * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
>>   * (each virtqueue's affinity is set to the CPU that "owns" the queue).
>>   *
>> - * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
>> - * could be done locklessly, but we do not do it yet.
>> + * tgt_seq is held to serialize reading and writing req_vq.
>>   *
>>   * Decrements of reqs are never concurrent with writes of req_vq: before the
>>   * decrement reqs will be != 0; after the decrement the virtqueue completion
>>   * routine will not use the req_vq so it can be changed by a new request.
>> - * Thus they can happen outside the tgt_lock, provided of course we make reqs
>> + * Thus they can happen outside the tgt_seq, provided of course we make reqs
>>   * an atomic_t.
>>   */
>>  struct virtio_scsi_target_state {
>> -   /* This spinlock never held at the same time as vq_lock. */
>> -   spinlock_t tgt_lock;
>> +   seqcount_t tgt_seq;
>>
>> /* Count of outstanding requests. */
>> atomic_t reqs;
>> @@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
>> unsigned long flags;
>> u32 queue_num;
>>
>> -   spin_lock_irqsave(&tgt->tgt_lock, flags);
>> +   local_irq_save(flags);
>> +   if (atomic_inc_return(&tgt->reqs) > 1) {
>> +   unsigned long seq;
>> +
>> +   do {
>> +   seq = read_seqcount_begin(&tgt->tgt_seq);
>> +   vq = tgt->req_vq;
>> +   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
>> +   } else {
>> +   /* no writes can be concurrent because of atomic_t */
>> +   write_seqcount_begin(&tgt->tgt_seq);
>> +
>> +   /* keep the previous req_vq if a reader is found */
>> +   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
>> +   vq = tgt->req_vq;
>> +   goto unlock;
>> +   }
>>
>> queue_num = smp_processor_id();
>> while (unlikely(queue_num >= vscsi->num_queues))
>> queue_num -= vscsi->num_queues;
>> tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
>> + unlock:
>> +   write_seqcount_end(&tgt->tgt_seq);
>> }
>> +   local_irq_restore(flags);
>
>
> I find this harder to think about than the double-check with a
> spin_lock_irqsave in the middle,

Sorry, could you explain it a bit? With seqcount, the spin_lock
isn't needed; it had only been used to serialize reads
and writes.

> and the read side is not lock free anymore.

It is still lock free, because a reader won't block other readers,
and both read_seqcount_begin and read_seqcount_retry only
check whether a writer is in progress or has completed; the
two helpers are very cheap.



Thanks,
-- 
Ming Lei


Re: virtio-scsi: two questions related with picking up queue

2014-05-08 Thread Paolo Bonzini

Il 08/05/2014 12:44, Ming Lei ha scritto:

On Wed, 07 May 2014 18:43:45 +0200
Paolo Bonzini  wrote:



Per-CPU spinlocks have bad scalability problems, especially if you're
overcommitting.  Writing req_vq is not at all rare.


OK, I thought about it further, and I believe seqcount may
be a match for this case; could you take a look at the patch below?

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13dd500..1adbad7 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/seqlock.h>

 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
@@ -73,18 +74,16 @@ struct virtio_scsi_vq {
  * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
  * (each virtqueue's affinity is set to the CPU that "owns" the queue).
  *
- * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
- * could be done locklessly, but we do not do it yet.
+ * tgt_seq is held to serialize reading and writing req_vq.
  *
  * Decrements of reqs are never concurrent with writes of req_vq: before the
  * decrement reqs will be != 0; after the decrement the virtqueue completion
  * routine will not use the req_vq so it can be changed by a new request.
- * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * Thus they can happen outside the tgt_seq, provided of course we make reqs
  * an atomic_t.
  */
 struct virtio_scsi_target_state {
-   /* This spinlock never held at the same time as vq_lock. */
-   spinlock_t tgt_lock;
+   seqcount_t tgt_seq;

/* Count of outstanding requests. */
atomic_t reqs;
@@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
unsigned long flags;
u32 queue_num;

-   spin_lock_irqsave(&tgt->tgt_lock, flags);
+   local_irq_save(flags);
+   if (atomic_inc_return(&tgt->reqs) > 1) {
+   unsigned long seq;
+
+   do {
+   seq = read_seqcount_begin(&tgt->tgt_seq);
+   vq = tgt->req_vq;
+   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
+   } else {
+   /* no writes can be concurrent because of atomic_t */
+   write_seqcount_begin(&tgt->tgt_seq);
+
+   /* keep the previous req_vq if a reader is found */
+   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
+   vq = tgt->req_vq;
+   goto unlock;
+   }

queue_num = smp_processor_id();
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;
tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+ unlock:
+   write_seqcount_end(&tgt->tgt_seq);
}
+   local_irq_restore(flags);


I find this harder to think about than the double-check with a 
spin_lock_irqsave in the middle, and the read side is not lock free anymore.


Paolo


Re: virtio-scsi: two questions related with picking up queue

2014-05-08 Thread Ming Lei
On Wed, 07 May 2014 18:43:45 +0200
Paolo Bonzini  wrote:

> 
> Per-CPU spinlocks have bad scalability problems, especially if you're 
> overcommitting.  Writing req_vq is not at all rare.

OK, I thought about it further, and I believe seqcount may
be a match for this case; could you take a look at the patch below?

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13dd500..1adbad7 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/seqlock.h>
 
 #define VIRTIO_SCSI_MEMPOOL_SZ 64
 #define VIRTIO_SCSI_EVENT_LEN 8
@@ -73,18 +74,16 @@ struct virtio_scsi_vq {
  * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
  * (each virtqueue's affinity is set to the CPU that "owns" the queue).
  *
- * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
- * could be done locklessly, but we do not do it yet.
+ * tgt_seq is held to serialize reading and writing req_vq.
  *
  * Decrements of reqs are never concurrent with writes of req_vq: before the
  * decrement reqs will be != 0; after the decrement the virtqueue completion
  * routine will not use the req_vq so it can be changed by a new request.
- * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * Thus they can happen outside the tgt_seq, provided of course we make reqs
  * an atomic_t.
  */
 struct virtio_scsi_target_state {
-   /* This spinlock never held at the same time as vq_lock. */
-   spinlock_t tgt_lock;
+   seqcount_t tgt_seq;
 
/* Count of outstanding requests. */
atomic_t reqs;
@@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
unsigned long flags;
u32 queue_num;
 
-   spin_lock_irqsave(&tgt->tgt_lock, flags);
+   local_irq_save(flags);
+   if (atomic_inc_return(&tgt->reqs) > 1) {
+   unsigned long seq;
+
+   do {
+   seq = read_seqcount_begin(&tgt->tgt_seq);
+   vq = tgt->req_vq;
+   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
+   } else {
+   /* no writes can be concurrent because of atomic_t */
+   write_seqcount_begin(&tgt->tgt_seq);
+
+   /* keep the previous req_vq if a reader is found */
+   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
+   vq = tgt->req_vq;
+   goto unlock;
+   }
 
-   if (atomic_inc_return(&tgt->reqs) > 1)
-   vq = tgt->req_vq;
-   else {
queue_num = smp_processor_id();
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;
-
tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+ unlock:
+   write_seqcount_end(&tgt->tgt_seq);
}
+   local_irq_restore(flags);
 
-   spin_unlock_irqrestore(&tgt->tgt_lock, flags);
return vq;
 }
 
@@ -623,7 +636,7 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
if (!tgt)
return -ENOMEM;
 
-   spin_lock_init(&tgt->tgt_lock);
+   seqcount_init(&tgt->tgt_seq);
atomic_set(&tgt->reqs, 0);
tgt->req_vq = NULL;
 



Thanks,
-- 
Ming Lei


Re: virtio-scsi: two questions related with picking up queue

2014-05-07 Thread Paolo Bonzini

Il 07/05/2014 18:24, Ming Lei ha scritto:

On Tue, 06 May 2014 15:17:15 +0200
Paolo Bonzini  wrote:


Il 06/05/2014 11:26, Ming Lei ha scritto:

Hi Paolo and All,

One question is about ACCESS_ONCE() in virtscsi_pick_vq():
it looks unneeded, since both reading and writing tgt->req_vq hold
tgt->tgt_lock.


You're right.  It should be possible to avoid the lock in
virtscsi_pick_vq like this:

value = atomic_read(&tgt->reqs);
retry:
if (value != 0) {
old_value = atomic_cmpxchg(&tgt->reqs, value, value + 1);
if (old_value != value) {
value = old_value;
goto retry;
}

smp_mb__after_atomic_cmpxchg(); // you get the idea :)
vq = ACCESS_ONCE(tgt->req_vq);
} else {
spin_lock_irqsave(&tgt->tgt_lock, flags);

// tgt->reqs may not be 0 anymore, need to recheck
value = atomic_read(&tgt->reqs);
if (value != 0) {
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
goto retry;
}

// tgt->reqs now will remain fixed to 0.
...
tgt->req_vq = vq = ...;
smp_wmb();
atomic_set(&tgt->reqs, 1);
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
}

return vq;



Another approach I thought of is to use percpu spinlocks, and
the idea is simple:

- all percpu locks are held for writing req_vq, and
- only the local percpu lock is needed for reading req_vq.

What do you think about the patch below?


Per-CPU spinlocks have bad scalability problems, especially if you're 
overcommitting.  Writing req_vq is not at all rare.


Paolo


diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 697fa53..00deab4 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -82,7 +82,7 @@ struct virtio_scsi_vq {
  */
 struct virtio_scsi_target_state {
/* This spinlock never held at the same time as vq_lock. */
-   spinlock_t tgt_lock;
+   spinlock_t __percpu *lock;

/* Count of outstanding requests. */
atomic_t reqs;
@@ -517,21 +517,46 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
 {
struct virtio_scsi_vq *vq;
unsigned long flags;
-   u32 queue_num;
+   u32 cpu = get_cpu();
+   spinlock_t  *lock = per_cpu_ptr(tgt->lock, cpu);

-   spin_lock_irqsave(&tgt->tgt_lock, flags);
+   spin_lock_irqsave(lock, flags);

if (atomic_inc_return(&tgt->reqs) > 1)
vq = tgt->req_vq;
else {
-   queue_num = smp_processor_id();
+   u32 queue_num = cpu;
+   int i;
+
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;

-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+   /*
+* there should be only one writer because of the atomic
+* counter, so we don't worry about deadlock, but we
+* might need to teach lockdep not to complain about it
+*/
+   for_each_possible_cpu(i) {
+   spinlock_t *other = per_cpu_ptr(tgt->lock, i);
+   if (i != cpu)
+   spin_lock(other);
+   }
+
+   /* only update req_vq when reqs is one */
+   if (atomic_read(&tgt->reqs) == 1)
+   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+   else
+   vq = tgt->req_vq;
+
+   for_each_possible_cpu(i) {
+   spinlock_t *other = per_cpu_ptr(tgt->lock, i);
+   if (i != cpu)
+   spin_unlock(other);
+   }
}

-   spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+   spin_unlock_irqrestore(lock, flags);
+   put_cpu();
return vq;
 }

@@ -618,10 +643,22 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 {
struct virtio_scsi_target_state *tgt =
kmalloc(sizeof(*tgt), GFP_KERNEL);
+   int i;
+
if (!tgt)
return -ENOMEM;

-   spin_lock_init(&tgt->tgt_lock);
+   tgt->lock = alloc_percpu(spinlock_t);
+   if (!tgt->lock) {
+   kfree(tgt);
+   return -ENOMEM;
+   }
+
+   for_each_possible_cpu(i) {
+   spinlock_t *lock = per_cpu_ptr(tgt->lock, i);
+   spin_lock_init(lock);
+   }
+
atomic_set(&tgt->reqs, 0);
tgt->req_vq = NULL;

@@ -632,6 +669,7 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
struct virtio_scsi_target_state *tgt = starget->hostdata;

Re: virtio-scsi: two questions related with picking up queue

2014-05-07 Thread Ming Lei
On Tue, 06 May 2014 15:17:15 +0200
Paolo Bonzini  wrote:

> Il 06/05/2014 11:26, Ming Lei ha scritto:
> > Hi Paolo and All,
> >
> > One question is about ACCESS_ONCE() in virtscsi_pick_vq():
> > it looks unneeded, since both reading and writing tgt->req_vq hold
> > tgt->tgt_lock.
> 
> You're right.  It should be possible to avoid the lock in 
> virtscsi_pick_vq like this:
> 
>   value = atomic_read(&tgt->reqs);
> retry:
>   if (value != 0) {
>   old_value = atomic_cmpxchg(&tgt->reqs, value, value + 1);
>   if (old_value != value) {
>   value = old_value;
>   goto retry;
>   }
> 
>   smp_mb__after_atomic_cmpxchg(); // you get the idea :)
>   vq = ACCESS_ONCE(tgt->req_vq);
>   } else {
>   spin_lock_irqsave(&tgt->tgt_lock, flags);
> 
>   // tgt->reqs may not be 0 anymore, need to recheck
>   value = atomic_read(&tgt->reqs);
>   if (value != 0) {
>   spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>   goto retry;
>   }
> 
>   // tgt->reqs now will remain fixed to 0.
>   ...
>   tgt->req_vq = vq = ...;
>   smp_wmb();
>   atomic_set(&tgt->reqs, 1);
>   spin_unlock_irqrestore(&tgt->tgt_lock, flags);
>   }
> 
>   return vq;
> 

Another approach I thought of is to use percpu spinlocks, and
the idea is simple:

- all percpu locks are held for writing req_vq, and
- only the local percpu lock is needed for reading req_vq.

What do you think about the patch below?

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 697fa53..00deab4 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -82,7 +82,7 @@ struct virtio_scsi_vq {
  */
 struct virtio_scsi_target_state {
/* This spinlock never held at the same time as vq_lock. */
-   spinlock_t tgt_lock;
+   spinlock_t __percpu *lock;
 
/* Count of outstanding requests. */
atomic_t reqs;
@@ -517,21 +517,46 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
 {
struct virtio_scsi_vq *vq;
unsigned long flags;
-   u32 queue_num;
+   u32 cpu = get_cpu();
+   spinlock_t  *lock = per_cpu_ptr(tgt->lock, cpu);
 
-   spin_lock_irqsave(&tgt->tgt_lock, flags);
+   spin_lock_irqsave(lock, flags);
 
if (atomic_inc_return(&tgt->reqs) > 1)
vq = tgt->req_vq;
else {
-   queue_num = smp_processor_id();
+   u32 queue_num = cpu;
+   int i;
+
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;
 
-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+   /*
+* there should be only one writer because of the atomic
+* counter, so we don't worry about deadlock, but we
+* might need to teach lockdep not to complain about it
+*/
+   for_each_possible_cpu(i) {
+   spinlock_t *other = per_cpu_ptr(tgt->lock, i);
+   if (i != cpu)
+   spin_lock(other);
+   }
+
+   /* only update req_vq when reqs is one */
+   if (atomic_read(&tgt->reqs) == 1)
+   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+   else
+   vq = tgt->req_vq;
+
+   for_each_possible_cpu(i) {
+   spinlock_t *other = per_cpu_ptr(tgt->lock, i);
+   if (i != cpu)
+   spin_unlock(other);
+   }
}
 
-   spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+   spin_unlock_irqrestore(lock, flags);
+   put_cpu();
return vq;
 }
 
@@ -618,10 +643,22 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 {
struct virtio_scsi_target_state *tgt =
kmalloc(sizeof(*tgt), GFP_KERNEL);
+   int i;
+
if (!tgt)
return -ENOMEM;
 
-   spin_lock_init(&tgt->tgt_lock);
+   tgt->lock = alloc_percpu(spinlock_t);
+   if (!tgt->lock) {
+   kfree(tgt);
+   return -ENOMEM;
+   }
+
+   for_each_possible_cpu(i) {
+   spinlock_t *lock = per_cpu_ptr(tgt->lock, i);
+   spin_lock_init(lock);
+   }
+
atomic_set(&tgt->reqs, 0);
tgt->req_vq = NULL;
 
@@ -632,6 +669,7 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
struct virtio_scsi_target_state *tgt = starget->hostdata;
+   free_percpu(tgt->lock);
kfree(tgt);
 }
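
As an aside, the per-CPU locking scheme above can be modeled in userspace
roughly as follows, with pthread mutexes standing in for the per-CPU
spinlocks and a fixed slot count standing in for the number of CPUs; this is
a sketch only, and all names are illustrative:

#include <pthread.h>
#include <stdio.h>

#define NSLOTS 4			/* stands in for the CPU count */

static pthread_mutex_t slot_lock[NSLOTS];
static int req_vq;			/* the value being protected */

/* Read side: take only this slot's lock; it is uncontended
 * except while a write is running. */
static int read_slot(int slot)
{
	int v;

	pthread_mutex_lock(&slot_lock[slot]);
	v = req_vq;
	pthread_mutex_unlock(&slot_lock[slot]);
	return v;
}

/* Write side: exclude every reader by taking all slot locks, in a
 * fixed order to avoid deadlock between writers. */
static void write_all(int v)
{
	int i;

	for (i = 0; i < NSLOTS; i++)
		pthread_mutex_lock(&slot_lock[i]);
	req_vq = v;
	for (i = NSLOTS - 1; i >= 0; i--)
		pthread_mutex_unlock(&slot_lock[i]);
}

int main(void)
{
	int i;

	for (i = 0; i < NSLOTS; i++)
		pthread_mutex_init(&slot_lock[i], NULL);
	write_all(7);
	printf("req_vq = %d\n", read_slot(2));
	return 0;
}

The trade-off is visible here: the writer's cost grows with the number of
slots (CPUs), which is exactly the scalability objection raised above.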
 

Thanks,
-- 
Ming Lei

Re: virtio-scsi: two questions related with picking up queue

2014-05-07 Thread Paolo Bonzini

Il 07/05/2014 03:07, Ming Lei ha scritto:

Hi Paolo,

On Tue, May 6, 2014 at 9:17 PM, Paolo Bonzini  wrote:

Il 06/05/2014 11:26, Ming Lei ha scritto:


Hi Paolo and All,

One question is about ACCESS_ONCE() in virtscsi_pick_vq():
it looks unneeded, since both reading and writing tgt->req_vq hold
tgt->tgt_lock.



You're right.  It should be possible to avoid the lock in virtscsi_pick_vq
like this:

value = atomic_read(&tgt->reqs);


I am wondering if atomic_read() is OK, because atomic_read()
should be treated as a simple C statement, and it may not reflect
the latest value of tgt->reqs.


It would be caught by cmpxchg below.


retry:
if (value != 0) {
old_value = atomic_cmpxchg(&tgt->reqs, value, value + 1);
if (old_value != value) {


Maybe ' if (old_value != value && !old_value) ' is a bit better.


No, because if you have failed the cmpxchg you haven't incremented 
tgt->reqs.



value = old_value;
goto retry;
}

vq = ACCESS_ONCE(tgt->req_vq);
} else {
spin_lock_irqsave(&tgt->tgt_lock, flags);

// tgt->reqs may not be 0 anymore, need to recheck
value = atomic_read(&tgt->reqs);
if (value != 0) {
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
goto retry;
}


Same as above: if atomic_read() still returns zero even
after it has been increased in the read path from another CPU, then
an obsolete vq pointer might be seen in the read path.


If I understand you correctly, then the CPUs wouldn't be cache-coherent.
You would have bigger problems.




// tgt->reqs now will remain fixed to 0.
...
tgt->req_vq = vq = ...;
smp_wmb();
atomic_set(&tgt->reqs, 1);
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
}

return vq;

and then you would need the ACCESS_ONCE but I'm not sure it's worthwhile.



Yes, I agree, :-)




Another one is about the comment in virtscsi_req_done(), which
says smp_read_barrier_depends() is needed to avoid reordering
between reading req_vq and decrementing tgt->reqs.
But if I understand correctly, in virtscsi_req_done(), req_vq is
read from vscsi->req_vqs[vq->index - VIRTIO_SCSI_VQ_BASE],
instead of tgt->req_vq, and the former won't change wrt.
inc/dec tgt->reqs, so can the barrier be removed?



Right.  The comment is obsolete and dates to before vq->index existed.


OK, I will cook a patch to remove the barrier and cleanup the comment.


Thanks.  Please remove the ACCESS_ONCE too.

Paolo



Re: virtio-scsi: two questions related with picking up queue

2014-05-06 Thread Ming Lei
Hi Paolo,

On Tue, May 6, 2014 at 9:17 PM, Paolo Bonzini  wrote:
> Il 06/05/2014 11:26, Ming Lei ha scritto:
>
>> Hi Paolo and All,
>>
>> One question is about ACCESS_ONCE() in virtscsi_pick_vq():
>> it looks unneeded, since both reading and writing tgt->req_vq hold
>> tgt->tgt_lock.
>
>
> You're right.  It should be possible to avoid the lock in virtscsi_pick_vq
> like this:
>
> value = atomic_read(&tgt->reqs);

I am wondering if atomic_read() is OK, because atomic_read()
should be treated as a simple C statement, and it may not reflect
the latest value of tgt->reqs.

> retry:
> if (value != 0) {
> old_value = atomic_cmpxchg(&tgt->reqs, value, value + 1);
> if (old_value != value) {

Maybe ' if (old_value != value && !old_value) ' is a bit better.

> value = old_value;
> goto retry;
> }
>
> smp_mb__after_atomic_cmpxchg(); // you get the idea :)

Yes, but it looks like it isn't needed, since atomic_cmpxchg()
implies smp_mb().

> vq = ACCESS_ONCE(tgt->req_vq);
> } else {
> spin_lock_irqsave(&tgt->tgt_lock, flags);
>
> // tgt->reqs may not be 0 anymore, need to recheck
> value = atomic_read(&tgt->reqs);
> if (value != 0) {
> spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> goto retry;
> }

Same as above: if atomic_read() still returns zero even
after it has been increased in the read path from another CPU, then
an obsolete vq pointer might be seen in the read path.

>
> // tgt->reqs now will remain fixed to 0.
> ...
> tgt->req_vq = vq = ...;
> smp_wmb();
> atomic_set(&tgt->reqs, 1);
> spin_unlock_irqrestore(&tgt->tgt_lock, flags);
> }
>
> return vq;
>
> and then you would need the ACCESS_ONCE but I'm not sure it's worthwhile.
>

Yes, I agree, :-)

>
>> Another one is about the comment in virtscsi_req_done(), which
>> says smp_read_barrier_depends() is needed to avoid reordering
>> between reading req_vq and decrementing tgt->reqs.
>> But if I understand correctly, in virtscsi_req_done(), req_vq is
>> read from vscsi->req_vqs[vq->index - VIRTIO_SCSI_VQ_BASE],
>> instead of tgt->req_vq, and the former won't change wrt.
>> inc/dec tgt->reqs, so can the barrier be removed?
>
>
> Right.  The comment is obsolete and dates to before vq->index existed.

OK, I will cook a patch to remove the barrier and cleanup the comment.


Thanks,
-- 
Ming Lei


Re: virtio-scsi: two questions related with picking up queue

2014-05-06 Thread Paolo Bonzini

Il 06/05/2014 11:26, Ming Lei ha scritto:

Hi Paolo and All,

One question is about ACCESS_ONCE() in virtscsi_pick_vq():
it looks unneeded, since both reading and writing tgt->req_vq hold
tgt->tgt_lock.


You're right.  It should be possible to avoid the lock in 
virtscsi_pick_vq like this:


value = atomic_read(&tgt->reqs);
retry:
if (value != 0) {
old_value = atomic_cmpxchg(&tgt->reqs, value, value + 1);
if (old_value != value) {
value = old_value;
goto retry;
}

smp_mb__after_atomic_cmpxchg(); // you get the idea :)
vq = ACCESS_ONCE(tgt->req_vq);
} else {
spin_lock_irqsave(&tgt->tgt_lock, flags);

// tgt->reqs may not be 0 anymore, need to recheck
value = atomic_read(&tgt->reqs);
if (value != 0) {
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
goto retry;
}

// tgt->reqs now will remain fixed to 0.
...
tgt->req_vq = vq = ...;
smp_wmb();
atomic_set(&tgt->reqs, 1);
spin_unlock_irqrestore(&tgt->tgt_lock, flags);
}

return vq;

and then you would need the ACCESS_ONCE but I'm not sure it's worthwhile.
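
For concreteness, the increment-only-if-nonzero fast path sketched above can
be written as a compilable userspace model, with C11 compare_exchange
standing in for atomic_cmpxchg(); the names are illustrative, not the
driver's:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int reqs;

/* Returns true if reqs was incremented (fast path: req_vq may be read),
 * false if reqs was 0 (the caller must fall back to the locked slow path). */
static bool inc_reqs_if_nonzero(void)
{
	int value = atomic_load(&reqs);

	while (value != 0) {
		/* on failure, compare_exchange reloads 'value' with the
		 * current contents -- the "goto retry" in the sketch */
		if (atomic_compare_exchange_weak(&reqs, &value, value + 1))
			return true;
	}
	return false;
}

int main(void)
{
	printf("fast path: %d\n", inc_reqs_if_nonzero());	/* 0: slow path */
	atomic_store(&reqs, 1);
	printf("fast path: %d\n", inc_reqs_if_nonzero());	/* 1 */
	printf("reqs = %d\n", atomic_load(&reqs));		/* 2 */
	return 0;
}

This also shows why any failed cmpxchg must retry: failure means the
increment did not happen, whatever the old value was.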


Another one is about the comment in virtscsi_req_done(), which
says smp_read_barrier_depends() is needed to avoid reordering
between reading req_vq and decrementing tgt->reqs.
But if I understand correctly, in virtscsi_req_done(), req_vq is
read from vscsi->req_vqs[vq->index - VIRTIO_SCSI_VQ_BASE],
instead of tgt->req_vq, and the former won't change wrt.
inc/dec tgt->reqs, so can the barrier be removed?


Right.  The comment is obsolete and dates to before vq->index existed.

Paolo
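
For reference, the completion path in question had roughly this shape at the
time; this is reconstructed from memory, so treat it as approximate rather
than a verbatim quote. The point is that the virtio_scsi_vq is recomputed
from the interrupting virtqueue's own index, so tgt->req_vq is never
dereferenced here and no ordering against it is involved:

static void virtscsi_req_done(struct virtqueue *vq)
{
	struct Scsi_Host *sh = virtio_scsi_host(vq->vdev);
	struct virtio_scsi *vscsi = shost_priv(sh);
	int index = vq->index - VIRTIO_SCSI_VQ_BASE;
	struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];

	virtscsi_vq_done(vscsi, req_vq, virtscsi_complete_cmd);
}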


Any comments about the above?

Thanks,


