Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-26 Thread Bart Van Assche
On 02/17/14 23:00, Christoph Hellwig wrote:
 Most of the scsi multiqueue work so far has been about modifying the
 block layer, so I'm definitely not shy about doing that where needed.
 And I think we will eventually need to be able to have an n:m queue-to-hctx
 mapping instead of the current 1:n one.

I think it would be great if the blk-mq core would support an n:m queue
to hctx mapping. One of the advantages of that approach would be that
the blk-mq layer already keeps track of how many commands are queued per
hctx, which would allow us to eliminate the host_busy counter from
the SCSI mid-layer. However, this involves a change in semantics. I hope
it's fine to change the semantics of the shost->can_queue parameter from
the maximum number of commands queued per SCSI host to the maximum number
of commands queued per hctx?
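
For illustration, a rough sketch of what that semantic change could look
like: the can_queue budget would simply be divided over the hardware
contexts when the queues are set up, and blk-mq's per-hctx depth
accounting would then take over the role of host_busy. Note that
scsi_mq_hctx_depth() is a hypothetical helper, not something from any
posted series.

/*
 * Hypothetical sketch only: split shost->can_queue evenly across the
 * host's hardware contexts so that blk-mq's per-hctx depth accounting
 * can replace the host_busy counter.
 */
static unsigned int scsi_mq_hctx_depth(struct Scsi_Host *shost,
				       unsigned int nr_hw_queues)
{
	/* each hctx gets an equal share, but at least one command */
	return max_t(unsigned int, 1, shost->can_queue / nr_hw_queues);
}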

Bart.


Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-17 Thread Bart Van Assche
On 02/10/14 21:09, Jens Axboe wrote:
 On Mon, Feb 10 2014, Christoph Hellwig wrote:
 I also think we should be getting more utility out of threading
 guarantees.  So, if there's only one thread active per device we don't
 need any device counters to be atomic.  Likewise, u32 read/write is an
 atomic operation, so we might be able to use sloppy counters for the
 target and host stuff (one per CPU that are incremented/decremented on
 that CPU ... this will only work using CPU locality ... completion on
 same CPU but that seems to be an element of a lot of stuff nowadays).

 The blk-mq code is aiming for CPU locality, but there are no hard
 guarantees.  I'm also not sure always bouncing around the I/O submission
 is a win, but it might be something to play around with at the block
 layer.

 Jens, did you try something like this earlier?
 
 Nope, I've always thought that if you needed to bounce submission
 around, you would already have lost. [ ... ]

This comment makes a lot of sense to me. The approach that has been
taken in the scsi-mq patches that have been posted on February 5 is to
associate one blk-mq device with each LUN. That blk-mq device has one
hctx with queue depth shost->cmd_per_lun. So if there are multiple LUNs
per SCSI host, the combined hctx queue depth of its LUNs can exceed
shost->can_queue. I'm not sure whether it's possible to prevent this
without modifying the block layer. How about modifying the block layer
such that a single hctx can be shared by multiple block devices and
adding cmd_per_lun support in the block layer? I think that would make it
possible to avoid bouncing submissions in the SCSI mid-layer.
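
To make that a bit more concrete, here is a rough sketch of how a
block-layer cmd_per_lun limit on top of a shared hctx could look. The
names (sdev_mq_get_budget/sdev_mq_put_budget) are made up and this is not
taken from any posted patch:

/*
 * Illustrative only: per-LUN admission check layered on a shared hctx.
 * The caller must re-run the queue later when false is returned.
 */
struct scsi_device_budget {
	atomic_t	queued;		/* commands outstanding on this LUN */
	int		cmd_per_lun;	/* copied from shost->cmd_per_lun   */
};

static bool sdev_mq_get_budget(struct scsi_device_budget *b)
{
	if (atomic_inc_return(&b->queued) > b->cmd_per_lun) {
		atomic_dec(&b->queued);		/* over the per-LUN limit */
		return false;
	}
	return true;
}

static void sdev_mq_put_budget(struct scsi_device_budget *b)
{
	atomic_dec(&b->queued);
}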

Thanks,

Bart.



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-17 Thread Christoph Hellwig
On Mon, Feb 17, 2014 at 10:36:20AM +0100, Bart Van Assche wrote:
 This comment makes a lot of sense to me. The approach that has been
 taken in the scsi-mq patches that have been posted on February 5 is to
 associate one blk-mq device with each LUN. That blk-mq device has one
 hctx with queue depth shost->cmd_per_lun. So if there are multiple LUNs
 per SCSI host, the combined hctx queue depth of its LUNs can exceed
 shost->can_queue. I'm not sure whether it's possible to prevent this
 without modifying the block layer. How about modifying the block layer
 such that a single hctx can be shared by multiple block devices and
 adding cmd_per_lun support in the block layer? I think that would make it
 possible to avoid bouncing submissions in the SCSI mid-layer.

Most of the scsi multiqueue work so far has been about modifying the
block layer, so I'm definitely not shy about doing that where needed.
And I think we will eventually need to be able to have an n:m queue-to-hctx
mapping instead of the current 1:n one.

I think the biggest issue will be to sort out the tag allocation.  I'm
going to look into this whole cluster of issues once the basic
functionality is fully working.
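
For what it's worth, what a host-wide tag space shared by all of a host's
queues boils down to is roughly the sketch below. The real blk-mq tag
allocator is far more sophisticated (per-CPU caching, wait queues); this
only illustrates the invariant that at most shost->can_queue tags can be
live at once, and the names are made up:

/*
 * Conceptual sketch of a shared, host-wide tag space: one bit per tag,
 * at most 'depth' (== shost->can_queue) tags in flight at any time.
 */
struct shared_tags {
	unsigned int	depth;
	unsigned long	*bitmap;	/* one bit per tag */
};

static int shared_tag_get(struct shared_tags *t)
{
	unsigned int tag;

	do {
		tag = find_first_zero_bit(t->bitmap, t->depth);
		if (tag >= t->depth)
			return -1;	/* host is full, try again later */
	} while (test_and_set_bit(tag, t->bitmap));

	return tag;
}

static void shared_tag_put(struct shared_tags *t, unsigned int tag)
{
	clear_bit(tag, t->bitmap);
}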



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-10 Thread Christoph Hellwig
On Thu, Feb 06, 2014 at 08:56:59AM -0800, James Bottomley wrote:
 I'm dubious about replacing a locked set of checks and increments with
 atomics for the simple reason that atomics are pretty expensive on
 non-x86, so you've likely slowed the critical path down for them.  Even
 on x86, atomics can be very expensive because of the global bus lock.  I
 think about three of them in a row is where you might as well stick with
 the lock.

The three of them replace at least two locks when using blk-mq.  Until
we use blk-mq, which avoids the queue_lock, we could keep the
per-device counters as-is.
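
For reference, the kind of check being discussed ends up looking roughly
like the sketch below once host_busy has been converted to an atomic_t as
in this series (simplified; host_blocked and the starved-list handling in
the real patch are left out):

/*
 * Simplified illustration, assuming an atomic_t host_busy.
 */
static bool scsi_host_queue_ready_atomic(struct Scsi_Host *shost)
{
	if (scsi_host_in_recovery(shost))
		return false;

	if (atomic_inc_return(&shost->host_busy) > shost->can_queue) {
		/* over the host limit: undo and have the caller retry later */
		atomic_dec(&shost->host_busy);
		return false;
	}
	return true;
}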

As Bart's numbers have shown, this definitely is a major improvement
on x86; for other architectures we'd need someone to run benchmarks
on useful hardware.  Maybe some of the IBM people on the list could
help out on PPC and S/390?

 I also think we should be getting more utility out of threading
 guarantees.  So, if there's only one thread active per device we don't
 need any device counters to be atomic.  Likewise, u32 read/write is an
 atomic operation, so we might be able to use sloppy counters for the
 target and host stuff (one per CPU that are incremented/decremented on
 that CPU ... this will only work using CPU locality ... completion on
 same CPU but that seems to be an element of a lot of stuff nowadays).

The blk-mq code is aiming for CPU locality, but there are no hard
guarantees.  I'm also not sure always bouncing around the I/O submission
is a win, but it might be something to play around with at the block
layer.

Jens, did you try something like this earlier?



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-10 Thread Nicholas A. Bellinger
On Mon, 2014-02-10 at 04:09 -0800, Christoph Hellwig wrote:
 On Sun, Feb 09, 2014 at 12:26:48AM -0800, Nicholas A. Bellinger wrote:
  Again, try NOP'ing all REQ_TYPE_FS type commands immediately in
  ->queuecommand() in order to determine a baseline without any other LLD
  overhead involved.
 
 Seems like this duplicates the fake_rw parameter.  Removing the needless
 host_lock in queuecommand and maybe a few more optimizations would
 certainly be more useful than duplicating the fake_rw parameter.
 

Fine by me.

--nab

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 80b8b10..631f5d2 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3603,7 +3603,7 @@ static void sdebug_remove_adapter(void)
 }
 
 static
-int scsi_debug_queuecommand_lck(struct scsi_cmnd *SCpnt, done_funct_t done)
+int __scsi_debug_queuecommand_lck(struct scsi_cmnd *SCpnt, done_funct_t done)
 {
unsigned char *cmd = (unsigned char *) SCpnt->cmnd;
int len, k;
@@ -3775,8 +3775,6 @@ read:
errsts = check_readiness(SCpnt, 0, devip);
if (errsts)
break;
-   if (scsi_debug_fake_rw)
-   break;
get_data_transfer_info(cmd, &lba, &num, &ei_lba);
errsts = resp_read(SCpnt, lba, num, devip, ei_lba);
if (inj_recovered && (0 == errsts)) {
@@ -3825,8 +3823,6 @@ write:
errsts = check_readiness(SCpnt, 0, devip);
if (errsts)
break;
-   if (scsi_debug_fake_rw)
-   break;
get_data_transfer_info(cmd, &lba, &num, &ei_lba);
errsts = resp_write(SCpnt, lba, num, devip, ei_lba);
if (inj_recovered && (0 == errsts)) {
@@ -3903,8 +3899,6 @@ write:
errsts = check_readiness(SCpnt, 0, devip);
if (errsts)
break;
-   if (scsi_debug_fake_rw)
-   break;
get_data_transfer_info(cmd, &lba, &num, &ei_lba);
errsts = resp_read(SCpnt, lba, num, devip, ei_lba);
if (errsts)
@@ -3952,7 +3946,21 @@ write:
 (delay_override ? 0 : scsi_debug_delay));
 }
 
-static DEF_SCSI_QCMD(scsi_debug_queuecommand)
+static DEF_SCSI_QCMD(__scsi_debug_queuecommand)
+
+static int scsi_debug_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *sc)
+{
+   struct request *rq = sc->request;
+
+   if (scsi_debug_fake_rw && rq->cmd_type == REQ_TYPE_FS) {
+   set_host_byte(sc, DID_OK);
+   sc->result |= SAM_STAT_GOOD;
+   sc->scsi_done(sc);
+   return 0;
+   }
+
+   return __scsi_debug_queuecommand(host, sc);
+}
 
 static struct scsi_host_template sdebug_driver_template = {
.show_info =scsi_debug_show_info,
@@ -3973,7 +3981,7 @@ static struct scsi_host_template sdebug_driver_template = {
.can_queue =SCSI_DEBUG_CANQUEUE,
.this_id =  7,
.sg_tablesize = 256,
-   .cmd_per_lun =  16,
+   .cmd_per_lun =  64,
.max_sectors =  0x,
.use_clustering =   DISABLE_CLUSTERING,
.module =   THIS_MODULE,
-- 
1.7.10.4



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-10 Thread James Bottomley

On Mon, 2014-02-10 at 03:39 -0800, Christoph Hellwig wrote:
 On Thu, Feb 06, 2014 at 08:56:59AM -0800, James Bottomley wrote:
  I'm dubious about replacing a locked set of checks and increments with
  atomics for the simple reason that atomics are pretty expensive on
  non-x86, so you've likely slowed the critical path down for them.  Even
  on x86, atomics can be very expensive because of the global bus lock.  I
  think about three of them in a row is where you might as well stick with
  the lock.
 
 The three of them replace at least two locks when using blk-mq.  Until
 we use blk-mq, which avoids the queue_lock, we could keep the
 per-device counters as-is.

I'm not saying never, I'm just really dubious about this bit and the
potential cost on other platforms ... at the very least, surely the
device busy can still be a counter because of the current threading
guarantees?

 As Bart's numbers have shown, this definitely is a major improvement
 on x86; for other architectures we'd need someone to run benchmarks
 on useful hardware.  Maybe some of the IBM people on the list could
 help out on PPC and S/390?

Well, no, they haven't.  The numbers were an assertion, not a benchmark,
and 3.8% can be within the error margin of anybody's tests.  Even if the
actual benchmark were published, I can't see it being convincing because
there's too much potential variation (other architectures, different
ways of testing) for such a small result.

I'm happy to take shortening critical sections and removing atomics in
get/put because they're obvious wins without needing benchmark
justification.  I don't really want to take the dubious stuff (like spin
lock replacement with atomic) until we see the shape of what we can do
with the block mq stuff and what's necessary and what isn't.

James




Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-09 Thread Nicholas A. Bellinger
On Sat, 2014-02-08 at 12:00 +0100, Bart Van Assche wrote:
 On 02/07/14 20:30, Nicholas A. Bellinger wrote:
  All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is calling
  scsi_cmd->done() as soon as the descriptor has been dispatched into LLD
  ->queuecommand() code.
  
  It's useful for determining an absolute performance ceiling between
  scsi_mq vs. scsi_request_fn() based codepaths without involving any
  other LLD specific overheads.
 
 Sorry but I'm not convinced that the scsi_debug driver is well suited
 for testing scsi-mq. In source file drivers/scsi/scsi_debug.c of kernel
 3.14-rc1 I found the following:
 
 static DEF_SCSI_QCMD(scsi_debug_queuecommand)
 
 From include/scsi/scsi_host.h:
 
 /*
  * Temporary #define for host lock push down. Can be removed when all
  * drivers have been updated to take advantage of unlocked
  * queuecommand.
  *
  */
 #define DEF_SCSI_QCMD(func_name) \
 int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd)  \
 {  \
 unsigned long irq_flags;   \
 int rc;\
 spin_lock_irqsave(shost->host_lock, irq_flags);\
 scsi_cmd_get_serial(shost, cmd);   \
 rc = func_name##_lck (cmd, cmd->scsi_done);\
 spin_unlock_irqrestore(shost->host_lock, irq_flags);   \
 return rc; \
 }
 
 In other words, all scsi_debug_queuecommand() calls are serialized. I think
 that for testing the scsi-mq code properly a SCSI LLD driver is needed that
 allows concurrent queuecommand() calls.
 

Again, try NOP'ing all REQ_TYPE_FS type commands immediately in
->queuecommand() in order to determine a baseline without any other LLD
overhead involved.

Here's what's been used so far in target-pending/scsi-mq:

From 0f312a951eedc87adc4c00adfec8ab317727efdd Mon Sep 17 00:00:00 2001
From: Nicholas Bellinger n...@linux-iscsi.org
Date: Fri, 24 May 2013 05:11:38 +
Subject: scsi-debug: Enable scsi-mq operation

v4 changes:
  - Bump can_queue to 64 for performance testing
  - Enable scsi-mq DIF support
  - Add nop_fs_io module parameter for performance testing

Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org
---
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 80b8b10..612d36d 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -119,6 +119,7 @@ static const char * scsi_debug_version_date = 20100324;
 #define DEF_VIRTUAL_GB   0
 #define DEF_VPD_USE_HOSTNO 1
 #define DEF_WRITESAME_LENGTH 0x
+#define DEF_NOP_FS_IO 0
 
 /* bit mask values for scsi_debug_opts */
 #define SCSI_DEBUG_OPT_NOISE   1
@@ -195,6 +196,7 @@ static unsigned int scsi_debug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
 static unsigned int scsi_debug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
 static unsigned int scsi_debug_write_same_length = DEF_WRITESAME_LENGTH;
 static bool scsi_debug_removable = DEF_REMOVABLE;
+static unsigned int scsi_debug_nop_fs_io = DEF_NOP_FS_IO;
 
 static int scsi_debug_cmnd_count = 0;
 
@@ -2779,6 +2781,8 @@ module_param_named(vpd_use_hostno, scsi_debug_vpd_use_hostno, int,
   S_IRUGO | S_IWUSR);
 module_param_named(write_same_length, scsi_debug_write_same_length, int,
   S_IRUGO | S_IWUSR);
+module_param_named(nop_fs_io, scsi_debug_nop_fs_io, int,
+  S_IRUGO | S_IWUSR);
 
 MODULE_AUTHOR("Eric Youngdale + Douglas Gilbert");
 MODULE_DESCRIPTION("SCSI debug adapter driver");
@@ -2820,6 +2824,7 @@ MODULE_PARM_DESC(unmap_max_desc, max # of ranges that can be unmapped in one cm
 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte size (def=0 -> use dev_size_mb)");
 MODULE_PARM_DESC(vpd_use_hostno, "0 -> dev ids ignore hostno (def=1 -> unique dev ids)");
 MODULE_PARM_DESC(write_same_length, "Maximum blocks per WRITE SAME cmd (def=0x)");
+MODULE_PARM_DESC(nop_fs_io, "Turn REQ_TYPE_FS I/O into NOPs");
 
 static char sdebug_info[256];
 
@@ -3954,6 +3959,20 @@ write:
 
 static DEF_SCSI_QCMD(scsi_debug_queuecommand)
 
+static int scsi_debug_queuecommand_mq(struct Scsi_Host *host, struct scsi_cmnd *sc)
+{
+   struct request *rq = sc->request;
+
+   if (scsi_debug_nop_fs_io && rq->cmd_type == REQ_TYPE_FS) {
+   set_host_byte(sc, DID_OK);
+   sc->result |= SAM_STAT_GOOD;
+   sc->scsi_done(sc);
+   return 0;
+   }
+
+   return scsi_debug_queuecommand_lck(sc, sc->scsi_done);
+}
+
 static struct scsi_host_template sdebug_driver_template = {
.show_info =scsi_debug_show_info,
.write_info =   scsi_debug_write_info,
@@ -3965,6 +3984,8 @@ static struct scsi_host_template sdebug_driver_template = {
.slave_destroy =scsi_debug_slave_destroy,
.ioctl =scsi_debug_ioctl,

Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-08 Thread Bart Van Assche
On 02/07/14 20:30, Nicholas A. Bellinger wrote:
 All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is calling
 scsi_cmd->done() as soon as the descriptor has been dispatched into LLD
 ->queuecommand() code.
 
 It's useful for determining an absolute performance ceiling between
 scsi_mq vs. scsi_request_fn() based codepaths without involving any
 other LLD specific overheads.

Sorry but I'm not convinced that the scsi_debug driver is well suited
for testing scsi-mq. In source file drivers/scsi/scsi_debug.c of kernel
3.14-rc1 I found the following:

static DEF_SCSI_QCMD(scsi_debug_queuecommand)

From include/scsi/scsi_host.h:

/*
 * Temporary #define for host lock push down. Can be removed when all
 * drivers have been updated to take advantage of unlocked
 * queuecommand.
 *
 */
#define DEF_SCSI_QCMD(func_name) \
int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd)  \
{  \
unsigned long irq_flags;   \
int rc;\
spin_lock_irqsave(shost->host_lock, irq_flags);\
scsi_cmd_get_serial(shost, cmd);   \
rc = func_name##_lck (cmd, cmd->scsi_done);\
spin_unlock_irqrestore(shost->host_lock, irq_flags);   \
return rc; \
}

In other words, all scsi_debug_queuecommand() calls are serialized. I think
that for testing the scsi-mq code properly a SCSI LLD driver is needed that
allows concurrent queuecommand() calls.
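
(For comparison, the minimal way to get concurrent queuecommand calls out
of scsi_debug would be to bypass DEF_SCSI_QCMD and call the _lck body
directly, along the lines of the sketch below. This is not a mainline
patch, and the serial-number bookkeeping normally done under the host
lock is simply skipped.)

/*
 * Sketch only: let ->queuecommand() run concurrently on multiple CPUs
 * by not taking the host_lock and not calling scsi_cmd_get_serial().
 */
static int scsi_debug_queuecommand_unlocked(struct Scsi_Host *shost,
					    struct scsi_cmnd *cmd)
{
	return scsi_debug_queuecommand_lck(cmd, cmd->scsi_done);
}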

Bart.



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-07 Thread Bart Van Assche
On 02/06/14 22:58, Nicholas A. Bellinger wrote:
 Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS commands
 to measure improvements would be a better baseline vs. the existing
 scsi_request_fn() code than what you're doing above.
 
 That way at least it's easier to measure specific scsi-core overhead
 down to the LLD's queuecommand without everything else in the way.

Running initiator and target workload on the same system makes initiator
and target code share CPUs and hence makes it hard to analyze which
performance change is caused by initiator optimizations and which
performance change is caused by interactions between initiator and
target workloads.

Bart.


Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-07 Thread Bart Van Assche
On 02/06/14 19:41, James Bottomley wrote:
 On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
 On 02/06/14 17:56, James Bottomley wrote:
 Could you benchmark this lot and show what the actual improvement is
 just for this series, if any?

 I see a performance improvement of 12% with the SRP protocol for the
 SCSI core optimizations alone (I am still busy measuring the impact of
 the blk-mq conversion but I can already see that it is really
 significant). Please note that the performance impact depends a lot on
 the workload (number of LUNs per SCSI host e.g.) so maybe the workload I
 chose is not doing justice to Christoph's work. And it's also important
 to mention that with the workload I ran I was saturating the target
 system CPU (a quad core Intel i5). In other words, results might be
 better with a more powerful target system.
 
 On what?  Just the patches I indicated or the whole series?  My specific
 concern is that swapping a critical section for atomics may not buy us
 anything even on x86 and may slow down non-x86.  That's the bit I'd like
 benchmarks to explore.

The numbers I mentioned in my previous e-mail referred to the "SCSI data
path micro-optimizations" patch series and the "A different approach for
using blk-mq in the SCSI layer" series as a whole. I have run a new test
in which I compared the performance between a kernel with these two
patch series applied versus a kernel in which the four patches that
convert host_busy, target_busy and device_busy into atomics have been
reverted. For a workload with a single SCSI host, a single LUN, a block
size of 512 bytes, the SRP protocol and a single CPU thread submitting
I/O requests I see a performance improvement of 0.5% when using atomics.
For a workload with a single SCSI host, eight LUNs and eight CPU threads
submitting I/O I see a performance improvement of 3.8% when using
atomics. Please note that these measurements have been run on a single
socket system, that cache line misses are more expensive on NUMA systems
and hence that the performance impact of these patches on a NUMA system
will be more substantial.

Bart.



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-07 Thread Nicholas A. Bellinger
On Fri, 2014-02-07 at 11:32 +0100, Bart Van Assche wrote:
 On 02/06/14 22:58, Nicholas A. Bellinger wrote:
  Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS commands
  to measure improvements would be a better baseline vs. the existing
  scsi_request_fn() code than what you're doing above.
  
  That way at least it's easier to measure specific scsi-core overhead
  down to the LLD's queuecommand without everything else in the way.
 
 Running initiator and target workload on the same system makes initiator
 and target code share CPUs and hence makes it hard to analyze which
 performance change is caused by initiator optimizations and which
 performance change is caused by interactions between initiator and
 target workloads.
 

Huh..?  There is no 'target workload' involved.

All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is calling
scsi_cmd->done() as soon as the descriptor has been dispatched into LLD
->queuecommand() code.

It's useful for determining an absolute performance ceiling between
scsi_mq vs. scsi_request_fn() based codepaths without involving any
other LLD specific overheads.

--nab



Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-06 Thread James Bottomley

On Wed, 2014-02-05 at 04:39 -0800, Christoph Hellwig wrote:
 Prepare for not taking a host-wide lock in the dispatch path by pushing
 the lock down into the places that actually need it.  Note that this
 patch is just a preparation step, as it will actually increase lock
 roundtrips and thus decrease performance on its own.

I'm not really convinced by the rest of the series.  I think patches
4,8,9,10,11,12 (lock elimination from fast path and get/put reduction)
are where the improvements are at and they mostly look ready to apply
modulo some author and signoff fixes.

I'm dubious about replacing a locked set of checks and increments with
atomics for the simple reason that atomics are pretty expensive on
non-x86, so you've likely slowed the critical path down for them.  Even
on x86, atomics can be very expensive because of the global bus lock.  I
think about three of them in a row is where you might as well stick with
the lock.

Could you benchmark this lot and show what the actual improvement is
just for this series, if any?

I also think we should be getting more utility out of threading
guarantees.  So, if there's only one thread active per device we don't
need any device counters to be atomic.  Likewise, u32 read/write is an
atomic operation, so we might be able to use sloppy counters for the
target and host stuff (one per CPU that are incremented/decremented on
that CPU ... this will only work using CPU locality ... completion on
same CPU but that seems to be an element of a lot of stuff nowadays).
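
As a purely illustrative example of the sloppy-counter idea: per-CPU
counts that are only summed when an approximate total is needed, which
only stays consistent if the increment and the matching decrement happen
on the same CPU (the CPU-locality assumption above). Nothing like this is
in the posted series:

/*
 * Per-CPU "sloppy" busy counter.  Cheap to update, approximate to read;
 * relies on the CPU-locality assumption mentioned above.
 */
static DEFINE_PER_CPU(int, target_busy_pcpu);

static inline void target_busy_inc(void)
{
	this_cpu_inc(target_busy_pcpu);
}

static inline void target_busy_dec(void)
{
	this_cpu_dec(target_busy_pcpu);
}

static int target_busy_approx(void)
{
	int cpu, sum = 0;

	for_each_possible_cpu(cpu)
		sum += per_cpu(target_busy_pcpu, cpu);
	return sum;
}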

I'm not saying this will all pan out, but I am saying I don't think just
making all the counters atomic to reduce locking will buy us a huge
amount, so I'd appreciate careful benchmarking to confirm or invalidate
this hypothesis first.

James




Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-06 Thread Bart Van Assche
On 02/06/14 17:56, James Bottomley wrote:
 Could you benchmark this lot and show what the actual improvement is
 just for this series, if any?

I see a performance improvement of 12% with the SRP protocol for the
SCSI core optimizations alone (I am still busy measuring the impact of
the blk-mq conversion but I can already see that it is really
significant). Please note that the performance impact depends a lot on
the workload (number of LUNs per SCSI host e.g.) so maybe the workload I
chose is not doing justice to Christoph's work. And it's also important
to mention that with the workload I ran I was saturating the target
system CPU (a quad core Intel i5). In other words, results might be
better with a more powerful target system.

Bart.


Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-06 Thread James Bottomley
On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
 On 02/06/14 17:56, James Bottomley wrote:
  Could you benchmark this lot and show what the actual improvement is
  just for this series, if any?
 
 I see a performance improvement of 12% with the SRP protocol for the
 SCSI core optimizations alone (I am still busy measuring the impact of
 the blk-mq conversion but I can already see that it is really
 significant). Please note that the performance impact depends a lot on
 the workload (number of LUNs per SCSI host e.g.) so maybe the workload I
 chose is not doing justice to Christoph's work. And it's also important
 to mention that with the workload I ran I was saturating the target
 system CPU (a quad core Intel i5). In other words, results might be
 better with a more powerful target system.

On what?  Just the patches I indicated or the whole series?  My specific
concern is that swapping a critical section for atomics may not buy us
anything even on x86 and may slow down non-x86.  That's the bit I'd like
benchmarks to explore.

James




Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-06 Thread Nicholas A. Bellinger
On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
 On 02/06/14 17:56, James Bottomley wrote:
  Could you benchmark this lot and show what the actual improvement is
  just for this series, if any?
 
 I see a performance improvement of 12% with the SRP protocol for the
 SCSI core optimizations alone (I am still busy measuring the impact of
 the blk-mq conversion but I can already see that it is really
 significant). Please note that the performance impact depends a lot on
 the workload (number of LUNs per SCSI host e.g.) so maybe the workload I
 chose is not doing justice to Christoph's work. And it's also important
 to mention that with the workload I ran I was saturating the target
 system CPU (a quad core Intel i5). In other words, results might be
 better with a more powerful target system.
 

Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS commands
to measure improvements would be a better baseline vs. the existing
scsi_request_fn() code than what you're doing above.

That way at least it's easier to measure specific scsi-core overhead
down to the LLD's queuecommand without everything else in the way.

--nab



[PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready

2014-02-05 Thread Christoph Hellwig
Prepare for not taking a host-wide lock in the dispatch path by pushing
the lock down into the places that actually need it.  Note that this
patch is just a preparation step, as it will actually increase lock
roundtrips and thus decrease performance on its own.

Signed-off-by: Christoph Hellwig h...@lst.de
---
 drivers/scsi/scsi_lib.c |   75 ---
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 44740c0..295e038 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1318,18 +1318,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 /*
  * scsi_target_queue_ready: checks if there we can send commands to target
  * @sdev: scsi device on starget to check.
- *
- * Called with the host lock held.
  */
 static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
   struct scsi_device *sdev)
 {
struct scsi_target *starget = scsi_target(sdev);
+   int ret = 0;
 
+   spin_lock_irq(shost->host_lock);
if (starget->single_lun) {
if (starget->starget_sdev_user &&
starget->starget_sdev_user != sdev)
-   return 0;
+   goto out;
starget->starget_sdev_user = sdev;
}
 
@@ -1337,57 +1337,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
/*
 * unblock after target_blocked iterates to zero
 */
-   if (--starget->target_blocked == 0) {
-   SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
-"unblocking target at zero depth\n"));
-   } else
-   return 0;
+   if (--starget->target_blocked != 0)
+   goto out;
+
+   SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
+"unblocking target at zero depth\n"));
}
 
if (scsi_target_is_busy(starget)) {
list_move_tail(&sdev->starved_entry, &shost->starved_list);
-   return 0;
+   goto out;
}
 
-   return 1;
+   scsi_target(sdev)->target_busy++;
+   ret = 1;
+out:
+   spin_unlock_irq(shost->host_lock);
+   return ret;
 }
 
 /*
  * scsi_host_queue_ready: if we can send requests to shost, return 1 else
  * return 0. We must end up running the queue again whenever 0 is
  * returned, else IO can hang.
- *
- * Called with host_lock held.
  */
 static inline int scsi_host_queue_ready(struct request_queue *q,
   struct Scsi_Host *shost,
   struct scsi_device *sdev)
 {
+   int ret = 0;
+
+   spin_lock_irq(shost->host_lock);
+
if (scsi_host_in_recovery(shost))
-   return 0;
+   goto out;
if (shost->host_busy == 0 && shost->host_blocked) {
/*
 * unblock after host_blocked iterates to zero
 */
-   if (--shost->host_blocked == 0) {
-   SCSI_LOG_MLQUEUE(3,
-   printk("scsi%d unblocking host at zero depth\n",
-   shost->host_no));
-   } else {
-   return 0;
-   }
+   if (--shost->host_blocked != 0)
+   goto out;
+
+   SCSI_LOG_MLQUEUE(3,
+   printk("scsi%d unblocking host at zero depth\n",
+   shost->host_no));
}
if (scsi_host_is_busy(shost)) {
if (list_empty(&sdev->starved_entry))
list_add_tail(&sdev->starved_entry, &shost->starved_list);
-   return 0;
+   goto out;
}
 
/* We're OK to process the command, so we can't be starved */
if (!list_empty(&sdev->starved_entry))
list_del_init(&sdev->starved_entry);
 
-   return 1;
+   shost->host_busy++;
+   ret = 1;
+out:
+   spin_unlock_irq(shost->host_lock);
+   return ret;
 }
 
 /*
@@ -1551,7 +1560,7 @@ static void scsi_request_fn(struct request_queue *q)
blk_start_request(req);
sdev->device_busy++;
 
-   spin_unlock(q->queue_lock);
+   spin_unlock_irq(q->queue_lock);
cmd = req->special;
if (unlikely(cmd == NULL)) {
printk(KERN_CRIT "impossible request in %s.\n"
@@ -1561,7 +1570,6 @@ static void scsi_request_fn(struct request_queue *q)
blk_dump_rq_flags(req, "foo");
BUG();
}
-   spin_lock(shost->host_lock);
 
/*
 * We hit this when the driver is using a host wide
@@ -1572,9 +1580,11 @@ static void