Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-22 Thread Bart Van Assche

On 22/09/2014 8:37, Christoph Hellwig wrote:

One thing that is missing is generating multiqueue-aware tags at the
blk-mq level, which should be as simple as always adding a queue
prefix in the tag allocation code.


Hello Christoph,

Adding a queue prefix in the tag allocation code is an interesting idea. 
Encoding the hardware context index in the upper bits of the 'tag' field 
in 'struct request' is something I have considered. The reason I have 
not done that is because I think several block drivers assume that the 
rq->tag field is a number in the range [0..queue_depth-1]. Here is just 
one example from the mtip32xx driver:


fis->sect_count  = ((rq->tag << 3) | (rq->tag >> 5));


Did you consider switching srp to use the block layer provided tags?


This is on my to-do list. The only reason I have not yet done this is 
that I have not yet had the time to work on it. Another item on my 
to-do list is to eliminate per-request memory allocation and instead 
use your patch that added a "cmd_size" field to the SCSI host 
template.



Also do you have any performance numbers for just using multiple
queues inside srp vs using blk-mq exposed queues?


So far I have only rerun the multithreaded write test. For that test I 
see about 15% more IOPS with this patch series (exploiting multiple 
hardware queues and a 1:1 mapping between hardware context and RDMA 
queue pair) compared to the previous implementation (one hardware queue 
and multiple RDMA queue pairs). Please keep in mind that in that test 
the CPUs of the target system are saturated, so the performance 
potential of using multiple hardware queues is probably larger than the 
difference I measured.


Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-22 Thread Jens Axboe

On 2014-09-22 10:31, Jens Axboe wrote:

On 2014-09-22 10:25, Bart Van Assche wrote:

On 22/09/2014 8:37, Christoph Hellwig wrote:

One thing that is missing is generating multiqueue-aware tags at the
blk-mq level, which should be as simple as always adding a queue
prefix in the tag allocation code.


Hello Christoph,

Adding a queue prefix in the tag allocation code is an interesting idea.
Encoding the hardware context index in the upper bits of the 'tag' field
in 'struct request' is something I have considered. The reason I have
not done that is because I think several block drivers assume that the
rq->tag field is a number in the range [0..queue_depth-1]. Here is just
one example from the mtip32xx driver:

 fis->sect_count  = ((rq->tag << 3) | (rq->tag >> 5));


Most drivers assume that the tag is within a certain range; the queue
prefix would only work for drivers where the tag number is just some
arbitrary "cookie". So for SCSI it should work, and I don't think we
need it anywhere else.

Alternatively, we can wrap tag retrieval in some function to mask off
the queue prefix for the cases where we just want an index.


Would probably be better to have a tag generation function instead, that 
uses rq->tag and ORs in the queue prefix, actually. That way rq->tag 
would remain being a plain index, and the issued tag in the driver could 
then use some function to add the queue prefix, if it needs unique tags 
across queues in a shared tag set.


--
Jens Axboe



Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-22 Thread Jens Axboe

On 2014-09-22 10:25, Bart Van Assche wrote:

On 22/09/2014 8:37, Christoph Hellwig wrote:

One thing that is missing is generating multiqueue-aware tags at the
blk-mq level, which should be as simple as always adding a queue
prefix in the tag allocation code.


Hello Christoph,

Adding a queue prefix in the tag allocation code is an interesting idea.
Encoding the hardware context index in the upper bits of the 'tag' field
in 'struct request' is something I have considered. The reason I have
not done that is because I think several block drivers assume that the
rq->tag field is a number in the range [0..queue_depth-1]. Here is just
one example from the mtip32xx driver:

 fis->sect_count  = ((rq->tag << 3) | (rq->tag >> 5));


Most drivers assume that the tag is within a certain range; the queue 
prefix would only work for drivers where the tag number is just some 
arbitrary "cookie". So for SCSI it should work, and I don't think we 
need it anywhere else.


Alternatively, we can wrap tag retrieval in some function to mask off 
the queue prefix for the cases where we just want an index.



--
Jens Axboe



Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-22 Thread Christoph Hellwig
Hi Bart,

I like these changes modulo the minor comments we had.

One thing that is missing is generating multiqueue-aware tags at the
blk-mq level, which should be as simple as always adding a queue
prefix in the tag allocation code.  Did you consider switching srp
to use the block layer provided tags?

Also do you have any performance numbers for just using multiple
queues inside srp vs using blk-mq exposed queues?


Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-19 Thread Jens Axboe
On 09/19/2014 06:55 AM, Bart Van Assche wrote:
> Hello,
> 
> Although the SRP protocol supports multichannel operation, RDMA HCAs
> that provide multiple completion vectors have been available for a
> considerable time, and multichannel operation yields better performance
> than using a single channel, the Linux SRP initiator does not yet
> support it. While adding multichannel support to the SRP initiator I
> encountered a few challenges that I think need wider discussion. The
> topics I would like to invite wider discussion about are as follows:
> - How to avoid unneeded inter-socket cache traffic. Should the blk-mq
>   layer e.g. assign CPU cores to hardware queues such that all CPU cores
>   associated with a single hardware queue reside on the same CPU socket?
>   (see also patch 1/8)

Right now these are deliberately symmetric, and hence can result in
hardware queues being left unused. Whether it makes sense to make this
change or not, I think, will greatly depend on how many queues are
available. There's a crossover point where having more queues doesn't
help, and it can even make things worse (interrupt load, etc). For your
specific case of 4 queues, it probably DOES make sense to use them all,
however.

-- 
Jens Axboe



Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-19 Thread Sagi Grimberg
On 9/19/2014 3:56 PM, Bart Van Assche wrote:
> [PATCH 1/8] blk-mq: Use all available hardware queues
>
> Suppose that a system has two CPU sockets, three cores per socket,
> that it does not support hyperthreading and that four hardware
> queues are provided by a block driver. With the current algorithm
> this will lead to the following assignment of CPU cores to hardware
> queues:
>
>HWQ 0: 0 1
>HWQ 1: 2 3
>HWQ 2: 4 5
>HWQ 3: (none)
>
> This patch changes the queue assignment into:
>
>HWQ 0: 0 1
>HWQ 1: 2
>HWQ 2: 3 4
>HWQ 3: 5
>
> In other words, this patch has the following three effects:
> - All four hardware queues are used instead of only three.
> - CPU cores are spread more evenly over hardware queues. For the
>above example the range of the number of CPU cores associated
>with a single HWQ is reduced from [0..2] to [1..2].
> - If the number of HWQ's is a multiple of the number of CPU sockets
>it is now guaranteed that all CPU cores associated with a single
>HWQ reside on the same CPU socket.
>
> Signed-off-by: Bart Van Assche 
> Cc: Jens Axboe 
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> ---
>   block/blk-mq-cpumap.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 1065d7c..8e56455 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -17,7 +17,7 @@
>   static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
>const int cpu)
>   {
> - return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
> + return cpu * nr_queues / nr_cpus;
>   }
>
>   static int get_first_sibling(unsigned int cpu)
>

Seems reasonable enough.

Reviewed-by: Sagi Grimberg 


Re: [PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-19 Thread Bart Van Assche
[PATCH 1/8] blk-mq: Use all available hardware queues

Suppose that a system has two CPU sockets, three cores per socket,
that it does not support hyperthreading and that four hardware
queues are provided by a block driver. With the current algorithm
this will lead to the following assignment of CPU cores to hardware
queues:

  HWQ 0: 0 1
  HWQ 1: 2 3
  HWQ 2: 4 5
  HWQ 3: (none)

This patch changes the queue assignment into:

  HWQ 0: 0 1
  HWQ 1: 2
  HWQ 2: 3 4
  HWQ 3: 5

In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the
  above example the range of the number of CPU cores associated
  with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQ's is a multiple of the number of CPU sockets
  it is now guaranteed that all CPU cores associated with a single
  HWQ reside on the same CPU socket.

Signed-off-by: Bart Van Assche 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Ming Lei 
---
 block/blk-mq-cpumap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1065d7c..8e56455 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -17,7 +17,7 @@
 static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
  const int cpu)
 {
-   return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
+   return cpu * nr_queues / nr_cpus;
 }
 
 static int get_first_sibling(unsigned int cpu)
-- 
1.8.4.5



[PATCH RFC 0/8] IB/srp: Add multichannel support

2014-09-19 Thread Bart Van Assche

Hello,

Although the SRP protocol supports multichannel operation, RDMA HCAs 
that provide multiple completion vectors have been available for a 
considerable time, and multichannel operation yields better performance 
than using a single channel, the Linux SRP initiator does not yet 
support it. While adding multichannel support to the SRP initiator I 
encountered a few challenges that I think need wider discussion. The 
topics I would like to invite wider discussion about are as follows:

- How to avoid unneeded inter-socket cache traffic. Should the blk-mq
  layer e.g. assign CPU cores to hardware queues such that all CPU cores
  associated with a single hardware queue reside on the same CPU socket?
  (see also patch 1/8)
- How to pass the hardware context selected by the block layer to the
  SCSI LLD queuecommand() callback function? (see also patches 2/8 and
  3/8)
- Which approach should a SCSI LLD follow for selecting an MSI-X
  completion vector to ensure that the interrupt handler is invoked on
  the same CPU socket as the blk-mq hardware context data structures?
  As one can see, patch 8/8 relies on the assumption that completion
  vectors have been spread evenly over CPU sockets. If an HCA supports
  e.g. eight completion vectors, that means that in a system with two
  CPU sockets vectors 0-3 are associated with a CPU core on the first
  CPU socket and vectors 4-7 with a CPU core on the second CPU socket.

The patches in this series are:
0001-blk-mq-Use-all-available-hardware-queues.patch
0002-scsi-mq-Add-support-for-multiple-hardware-queues.patch
0003-scsi-mq-Pass-hctx-to-low-level-SCSI-drivers.patch
0004-IB-srp-Move-ib_destroy_cm_id-call-into-srp_free_ch_i.patch
0005-IB-srp-Remove-stale-connection-retry-mechanism.patch
0006-IB-srp-Avoid-that-I-O-hangs-due-to-a-cable-pull-duri.patch
0007-IB-srp-Separate-target-and-channel-variables.patch
0008-IB-srp-Add-multichannel-support.patch

Note: a predecessor of this patch series has been used to measure the 
performance of Christoph's scsi-mq patches that have been merged in 
kernel version v3.17-rc1.
