Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/17/14 23:00, Christoph Hellwig wrote:
> Most of the scsi multiqueue work so far has been about modifying the
> block layer, so I'm definitely not shy about doing that where needed.
> And I think we will eventually need to be able to have an n:m queue to
> hctx mapping instead of the current 1:n one.

I think it would be great if the blk-mq core would support an n:m queue to hctx mapping. One of the advantages of that approach would be that the blk-mq layer already keeps track of how many commands are queued per hctx, and hence that would allow eliminating the host_busy counter from the SCSI mid-layer. However, this involves a change in semantics. I hope it's fine to change the semantics of the host's can_queue parameter from the maximum number of commands queued per SCSI host into the maximum number of commands queued per hctx?

Bart.
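For illustration of the two semantics being discussed, here is a minimal sketch (hypothetical code, not part of any posted series; the function name and the per_hctx flag are made up) of how a driver could derive a per-hctx queue depth under each interpretation of can_queue:

/*
 * Sketch only: per-hctx queue depth under the two interpretations of
 * can_queue discussed above.  Not part of the posted patches.
 */
static unsigned int sdebug_hctx_depth(struct Scsi_Host *shost,
				      unsigned int nr_hw_queues,
				      bool can_queue_is_per_hctx)
{
	if (can_queue_is_per_hctx)
		/* proposed semantics: each hctx gets the full depth */
		return shost->can_queue;
	/* current semantics: the host-wide limit, split across hctxs */
	return max_t(unsigned int, 1, shost->can_queue / nr_hw_queues);
}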
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/10/14 21:09, Jens Axboe wrote:
> On Mon, Feb 10 2014, Christoph Hellwig wrote:
>>> I also think we should be getting more utility out of threading
>>> guarantees. So, if there's only one thread active per device we don't
>>> need any device counters to be atomic. Likewise, u32 read/write is an
>>> atomic operation, so we might be able to use sloppy counters for the
>>> target and host stuff (one per CPU that are incremented/decremented
>>> on that CPU ... this will only work using CPU locality ... completion
>>> on the same CPU, but that seems to be an element of a lot of stuff
>>> nowadays).
>> The blk-mq code is aiming for CPU locality, but there are no hard
>> guarantees. I'm also not sure always bouncing around the I/O
>> submission is a win, but it might be something to play around with at
>> the block layer. Jens, did you try something like this earlier?
> Nope, I've always thought that if you needed to bounce submission
> around, you would already have lost.

[ ... ]

This comment makes a lot of sense to me. The approach taken in the scsi-mq patches posted on February 5 is to associate one blk-mq device with each LUN. That blk-mq device has one hctx with queue depth shost->cmd_per_lun. So if there are multiple LUNs per SCSI host, the combined hctx queue depth of its LUNs can exceed shost->can_queue. I'm not sure whether it's possible to prevent this without modifying the block layer. How about modifying the block layer such that a single hctx can be shared by multiple block devices, and adding cmd_per_lun support in the block layer? I think that would avoid having to bounce submission in the SCSI mid-layer.

Thanks,

Bart.
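A rough sketch of what that proposal could look like (purely hypothetical: neither this hook nor an atomic mq_device_busy field existed in the 3.14 block layer; it only illustrates enforcing cmd_per_lun on top of a shared hctx):

/*
 * Hypothetical: a single hctx shared by all LUNs of a host, enforcing
 * a per-LUN limit (cmd_per_lun) on top of the per-hctx tag depth that
 * would bound the host-wide limit (can_queue).
 */
static bool shared_hctx_may_queue(struct request *rq)
{
	struct scsi_device *sdev = rq->q->queuedata;

	/*
	 * Per-LUN check; the tag allocator of the shared hctx already
	 * caps the total number of commands per host.
	 */
	if (atomic_inc_return(&sdev->mq_device_busy) > sdev->queue_depth) {
		atomic_dec(&sdev->mq_device_busy);
		return false;
	}
	return true;
}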
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Mon, Feb 17, 2014 at 10:36:20AM +0100, Bart Van Assche wrote:
> This comment makes a lot of sense to me. The approach taken in the
> scsi-mq patches posted on February 5 is to associate one blk-mq device
> with each LUN. That blk-mq device has one hctx with queue depth
> shost->cmd_per_lun. So if there are multiple LUNs per SCSI host, the
> combined hctx queue depth of its LUNs can exceed shost->can_queue. I'm
> not sure whether it's possible to prevent this without modifying the
> block layer. How about modifying the block layer such that a single
> hctx can be shared by multiple block devices, and adding cmd_per_lun
> support in the block layer? I think that would avoid having to bounce
> submission in the SCSI mid-layer.

Most of the scsi multiqueue work so far has been about modifying the block layer, so I'm definitely not shy about doing that where needed. And I think we will eventually need to be able to have an n:m queue to hctx mapping instead of the current 1:n one.

I think the biggest issue will be to sort out the tag allocation. I'm going to look into this whole cluster of issues once the basic functionality is fully working.
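To make the tag allocation problem concrete, one simple scheme (illustrative only; not what blk-mq ended up implementing) is a static partition of the host-wide tag space across hctxs, so tags stay unique per host without cross-hctx synchronization:

/*
 * Illustration only: with a host-wide tag space split statically
 * across nr_hw_queues hctxs, a (hctx, local_tag) pair maps to a
 * unique host tag without any cross-hctx synchronization.
 */
static inline unsigned int scsi_mq_host_tag(unsigned int hctx_idx,
					    unsigned int local_tag,
					    unsigned int depth_per_hctx)
{
	return hctx_idx * depth_per_hctx + local_tag;
}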
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Thu, Feb 06, 2014 at 08:56:59AM -0800, James Bottomley wrote:
> I'm dubious about replacing a locked set of checks and increments with
> atomics for the simple reason that atomics are pretty expensive on
> non-x86, so you've likely slowed the critical path down for them. Even
> on x86, atomics can be very expensive because of the global bus lock.
> I think about three of them in a row is where you might as well stick
> with the lock.

The three of them replace two locks at least when using blk-mq. Until we use blk-mq and those avoid the queue_lock we could keep the per-device counters as-is.

As Bart's numbers have shown, this definitely shows a major improvement on x86; for other architectures we'd need someone to run benchmarks on useful hardware. Maybe some of the IBM people on the list could help out on PPC and S/390?

> I also think we should be getting more utility out of threading
> guarantees. So, if there's only one thread active per device we don't
> need any device counters to be atomic. Likewise, u32 read/write is an
> atomic operation, so we might be able to use sloppy counters for the
> target and host stuff (one per CPU that are incremented/decremented on
> that CPU ... this will only work using CPU locality ... completion on
> the same CPU, but that seems to be an element of a lot of stuff
> nowadays).

The blk-mq code is aiming for CPU locality, but there are no hard guarantees. I'm also not sure always bouncing around the I/O submission is a win, but it might be something to play around with at the block layer. Jens, did you try something like this earlier?
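For reference, the kind of conversion being debated looks roughly like the sketch below. It assumes host_busy were changed to atomic_t, which the posted series does in its later patches; the function name here is made up:

/*
 * Sketch of the check-and-increment with an atomic_t host_busy:
 * one atomic read-modify-write instead of a host_lock round-trip.
 */
static inline bool scsi_host_try_inc_busy(struct Scsi_Host *shost)
{
	if (atomic_inc_return(&shost->host_busy) > shost->can_queue) {
		atomic_dec(&shost->host_busy);	/* host saturated */
		return false;
	}
	return true;
}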
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Mon, 2014-02-10 at 04:09 -0800, Christoph Hellwig wrote:
> On Sun, Feb 09, 2014 at 12:26:48AM -0800, Nicholas A. Bellinger wrote:
>> Again, try NOP'ing all REQ_TYPE_FS type commands immediately in
>> ->queuecommand() in order to determine a baseline without any other
>> LLD overhead involved.
> Seems like this duplicates the fake_rw parameter. Removing the needless
> host_lock in queuecommand and maybe a few more optimizations would
> certainly be more useful than duplicating the fake_rw parameter.

Fine by me.

--nab

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 80b8b10..631f5d2 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3603,7 +3603,7 @@ static void sdebug_remove_adapter(void)
 }
 
 static
-int scsi_debug_queuecommand_lck(struct scsi_cmnd *SCpnt, done_funct_t done)
+int __scsi_debug_queuecommand_lck(struct scsi_cmnd *SCpnt, done_funct_t done)
 {
 	unsigned char *cmd = (unsigned char *) SCpnt->cmnd;
 	int len, k;
@@ -3775,8 +3775,6 @@ read:
 		errsts = check_readiness(SCpnt, 0, devip);
 		if (errsts)
 			break;
-		if (scsi_debug_fake_rw)
-			break;
 		get_data_transfer_info(cmd, &lba, &num, &ei_lba);
 		errsts = resp_read(SCpnt, lba, num, devip, ei_lba);
 		if (inj_recovered && (0 == errsts)) {
@@ -3825,8 +3823,6 @@ write:
 		errsts = check_readiness(SCpnt, 0, devip);
 		if (errsts)
 			break;
-		if (scsi_debug_fake_rw)
-			break;
 		get_data_transfer_info(cmd, &lba, &num, &ei_lba);
 		errsts = resp_write(SCpnt, lba, num, devip, ei_lba);
 		if (inj_recovered && (0 == errsts)) {
@@ -3903,8 +3899,6 @@ write:
 		errsts = check_readiness(SCpnt, 0, devip);
 		if (errsts)
 			break;
-		if (scsi_debug_fake_rw)
-			break;
 		get_data_transfer_info(cmd, &lba, &num, &ei_lba);
 		errsts = resp_read(SCpnt, lba, num, devip, ei_lba);
 		if (errsts)
@@ -3952,7 +3946,21 @@ write:
 			     (delay_override ? 0 : scsi_debug_delay));
 }
 
-static DEF_SCSI_QCMD(scsi_debug_queuecommand)
+static DEF_SCSI_QCMD(__scsi_debug_queuecommand)
+
+static int scsi_debug_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *sc)
+{
+	struct request *rq = sc->request;
+
+	if (scsi_debug_fake_rw && rq->cmd_type == REQ_TYPE_FS) {
+		set_host_byte(sc, DID_OK);
+		sc->result |= SAM_STAT_GOOD;
+		sc->scsi_done(sc);
+		return 0;
+	}
+
+	return __scsi_debug_queuecommand(host, sc);
+}
 
 static struct scsi_host_template sdebug_driver_template = {
 	.show_info =		scsi_debug_show_info,
@@ -3973,7 +3981,7 @@ static struct scsi_host_template sdebug_driver_template = {
 	.can_queue =		SCSI_DEBUG_CANQUEUE,
 	.this_id =		7,
 	.sg_tablesize =		256,
-	.cmd_per_lun =		16,
+	.cmd_per_lun =		64,
 	.max_sectors =		0xffff,
 	.use_clustering =	DISABLE_CLUSTERING,
 	.module =		THIS_MODULE,
-- 
1.7.10.4
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Mon, 2014-02-10 at 03:39 -0800, Christoph Hellwig wrote:
> On Thu, Feb 06, 2014 at 08:56:59AM -0800, James Bottomley wrote:
>> I'm dubious about replacing a locked set of checks and increments with
>> atomics for the simple reason that atomics are pretty expensive on
>> non-x86, so you've likely slowed the critical path down for them. Even
>> on x86, atomics can be very expensive because of the global bus lock.
>> I think about three of them in a row is where you might as well stick
>> with the lock.
> The three of them replace two locks at least when using blk-mq. Until
> we use blk-mq and those avoid the queue_lock we could keep the
> per-device counters as-is.

I'm not saying never, I'm just really dubious about this bit and the potential cost on other platforms ... at the very least, surely the device busy can still be a counter because of the current threading guarantees?

> As Bart's numbers have shown, this definitely shows a major improvement
> on x86; for other architectures we'd need someone to run benchmarks on
> useful hardware. Maybe some of the IBM people on the list could help
> out on PPC and S/390?

Well, no, they haven't. The numbers were an assertion, not a benchmark, and 3.8% can be within the error margin of anybody's tests. Even if the actual benchmark were published, I can't see it being convincing because there's too much potential variation (other architectures, different ways of testing) for such a small result.

I'm happy to take shortening critical sections and removing atomics in get/put because they're obvious wins without needing benchmark justification. I don't really want to take the dubious stuff (like spin lock replacement with atomics) until we see the shape of what we can do with the block mq stuff and what's necessary and what isn't.

James
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Sat, 2014-02-08 at 12:00 +0100, Bart Van Assche wrote:
> On 02/07/14 20:30, Nicholas A. Bellinger wrote:
>> All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is
>> calling scsi_cmd->done() as soon as the descriptor has been dispatched
>> into LLD ->queuecommand() code. It's useful for determining an
>> absolute performance ceiling between scsi_mq vs. scsi_request_fn()
>> based codepaths without involving any other LLD specific overheads.
> Sorry, but I'm not convinced that the scsi_debug driver is well suited
> for testing scsi-mq. In source file drivers/scsi/scsi_debug.c of
> kernel 3.14-rc1 I found the following:
> 
> static DEF_SCSI_QCMD(scsi_debug_queuecommand)
> 
> From include/scsi/scsi_host.h:
> 
> /*
>  * Temporary #define for host lock push down. Can be removed when all
>  * drivers have been updated to take advantage of unlocked
>  * queuecommand.
>  *
>  */
> #define DEF_SCSI_QCMD(func_name) \
> 	int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd)	\
> 	{								\
> 		unsigned long irq_flags;				\
> 		int rc;							\
> 		spin_lock_irqsave(shost->host_lock, irq_flags);		\
> 		scsi_cmd_get_serial(shost, cmd);			\
> 		rc = func_name##_lck (cmd, cmd->scsi_done);		\
> 		spin_unlock_irqrestore(shost->host_lock, irq_flags);	\
> 		return rc;						\
> 	}
> 
> In other words, all scsi_debug_queuecommand() calls are serialized. I
> think that for testing the scsi-mq code properly a SCSI LLD driver is
> needed that allows concurrent queuecommand() calls.

Again, try NOP'ing all REQ_TYPE_FS type commands immediately in ->queuecommand() in order to determine a baseline without any other LLD overhead involved. Here's what's been used so far in target-pending/scsi-mq:

From 0f312a951eedc87adc4c00adfec8ab317727efdd Mon Sep 17 00:00:00 2001
From: Nicholas Bellinger <n...@linux-iscsi.org>
Date: Fri, 24 May 2013 05:11:38 +0000
Subject: scsi-debug: Enable scsi-mq operation

v4 changes:
- Bump can_queue to 64 for performance testing
- Enable scsi-mq DIF support
- Add nop_fs_io module parameter for performance testing

Signed-off-by: Nicholas Bellinger <n...@linux-iscsi.org>
---
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 80b8b10..612d36d 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -119,6 +119,7 @@ static const char * scsi_debug_version_date = "20100324";
 #define DEF_VIRTUAL_GB 0
 #define DEF_VPD_USE_HOSTNO 1
 #define DEF_WRITESAME_LENGTH 0xFFFF
+#define DEF_NOP_FS_IO 0
 
 /* bit mask values for scsi_debug_opts */
 #define SCSI_DEBUG_OPT_NOISE 1
@@ -195,6 +196,7 @@ static unsigned int scsi_debug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
 static unsigned int scsi_debug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
 static unsigned int scsi_debug_write_same_length = DEF_WRITESAME_LENGTH;
 static bool scsi_debug_removable = DEF_REMOVABLE;
+static unsigned int scsi_debug_nop_fs_io = DEF_NOP_FS_IO;
 
 static int scsi_debug_cmnd_count = 0;
@@ -2779,6 +2781,8 @@ module_param_named(vpd_use_hostno, scsi_debug_vpd_use_hostno, int,
 		 S_IRUGO | S_IWUSR);
 module_param_named(write_same_length, scsi_debug_write_same_length, int,
 		 S_IRUGO | S_IWUSR);
+module_param_named(nop_fs_io, scsi_debug_nop_fs_io, int,
+		 S_IRUGO | S_IWUSR);
 
 MODULE_AUTHOR("Eric Youngdale + Douglas Gilbert");
 MODULE_DESCRIPTION("SCSI debug adapter driver");
@@ -2820,6 +2824,7 @@ MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)");
 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte size (def=0 -> use dev_size_mb)");
 MODULE_PARM_DESC(vpd_use_hostno, "0 -> dev ids ignore hostno (def=1 -> unique dev ids)");
 MODULE_PARM_DESC(write_same_length, "Maximum blocks per WRITE SAME cmd (def=0xFFFF)");
+MODULE_PARM_DESC(nop_fs_io, "Turn REQ_TYPE_FS I/O into NOPs");
 
 static char sdebug_info[256];
@@ -3954,6 +3959,20 @@ write:
 static DEF_SCSI_QCMD(scsi_debug_queuecommand)
 
+static int scsi_debug_queuecommand_mq(struct Scsi_Host *host, struct scsi_cmnd *sc)
+{
+	struct request *rq = sc->request;
+
+	if (scsi_debug_nop_fs_io && rq->cmd_type == REQ_TYPE_FS) {
+		set_host_byte(sc, DID_OK);
+		sc->result |= SAM_STAT_GOOD;
+		sc->scsi_done(sc);
+		return 0;
+	}
+
+	return scsi_debug_queuecommand_lck(sc, sc->scsi_done);
+}
+
 static struct scsi_host_template sdebug_driver_template = {
 	.show_info =		scsi_debug_show_info,
 	.write_info =		scsi_debug_write_info,
@@ -3965,6 +3984,8 @@ static struct scsi_host_template sdebug_driver_template = {
 	.slave_destroy =	scsi_debug_slave_destroy,
 	.ioctl =		scsi_debug_ioctl,
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/07/14 20:30, Nicholas A. Bellinger wrote:
> All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is
> calling scsi_cmd->done() as soon as the descriptor has been dispatched
> into LLD ->queuecommand() code. It's useful for determining an absolute
> performance ceiling between scsi_mq vs. scsi_request_fn() based
> codepaths without involving any other LLD specific overheads.

Sorry, but I'm not convinced that the scsi_debug driver is well suited for testing scsi-mq. In source file drivers/scsi/scsi_debug.c of kernel 3.14-rc1 I found the following:

static DEF_SCSI_QCMD(scsi_debug_queuecommand)

From include/scsi/scsi_host.h:

/*
 * Temporary #define for host lock push down. Can be removed when all
 * drivers have been updated to take advantage of unlocked
 * queuecommand.
 *
 */
#define DEF_SCSI_QCMD(func_name) \
	int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd)	\
	{								\
		unsigned long irq_flags;				\
		int rc;							\
		spin_lock_irqsave(shost->host_lock, irq_flags);		\
		scsi_cmd_get_serial(shost, cmd);			\
		rc = func_name##_lck (cmd, cmd->scsi_done);		\
		spin_unlock_irqrestore(shost->host_lock, irq_flags);	\
		return rc;						\
	}

In other words, all scsi_debug_queuecommand() calls are serialized. I think that for testing the scsi-mq code properly a SCSI LLD driver is needed that allows concurrent queuecommand() calls.

Bart.
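For comparison, this is roughly what an unlocked queuecommand would look like. It is a sketch only, assuming scsi_debug's _lck body were made safe for concurrent callers (the function name is made up; no such patch is posted here):

/*
 * Sketch: a queuecommand without the DEF_SCSI_QCMD wrapper.  No
 * host_lock is taken, so calls may run concurrently on different
 * CPUs -- which is only correct if the _lck body does its own
 * fine-grained locking.
 */
static int scsi_debug_queuecommand_unlocked(struct Scsi_Host *shost,
					    struct scsi_cmnd *cmd)
{
	return scsi_debug_queuecommand_lck(cmd, cmd->scsi_done);
}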
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/06/14 22:58, Nicholas A. Bellinger wrote:
> Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS
> commands to measure improvements would be a better baseline vs. the
> existing scsi_request_fn() code than what you're doing above. That way
> at least it's easier to measure specific scsi-core overhead down to
> the LLD's queuecommand without everything else in the way.

Running the initiator and target workload on the same system makes initiator and target code share CPUs, and hence makes it hard to analyze which performance change is caused by initiator optimizations and which performance change is caused by interactions between initiator and target workloads.

Bart.
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/06/14 19:41, James Bottomley wrote:
> On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
>> On 02/06/14 17:56, James Bottomley wrote:
>>> Could you benchmark this lot and show what the actual improvement is
>>> just for this series, if any?
>> I see a performance improvement of 12% with the SRP protocol for the
>> SCSI core optimizations alone (I am still busy measuring the impact of
>> the blk-mq conversion but I can already see that it is really
>> significant). Please note that the performance impact depends a lot on
>> the workload (number of LUNs per SCSI host e.g.) so maybe the workload
>> I chose is not doing justice to Christoph's work. And it's also
>> important to mention that with the workload I ran I was saturating the
>> target system CPU (a quad core Intel i5). In other words, results
>> might be better with a more powerful target system.
> On what? Just the patches I indicated or the whole series? My specific
> concern is that swapping a critical section for atomics may not buy us
> anything even on x86 and may slow down non-x86. That's the bit I'd
> like benchmarks to explore.

The numbers I mentioned in my previous e-mail referred to the "SCSI data path micro-optimizations" patch series and the "A different approach for using blk-mq in the SCSI layer" series as a whole. I have run a new test in which I compared the performance between a kernel with these two patch series applied versus a kernel in which the four patches that convert host_busy, target_busy and device_busy into atomics have been reverted. For a workload with a single SCSI host, a single LUN, a block size of 512 bytes, the SRP protocol and a single CPU thread submitting I/O requests, I see a performance improvement of 0.5% when using atomics. For a workload with a single SCSI host, eight LUNs and eight CPU threads submitting I/O, I see a performance improvement of 3.8% when using atomics. Please note that these measurements have been run on a single-socket system, that cache line misses are more expensive on NUMA systems, and hence that the performance impact of these patches on a NUMA system will be more substantial.

Bart.
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Fri, 2014-02-07 at 11:32 +0100, Bart Van Assche wrote:
> On 02/06/14 22:58, Nicholas A. Bellinger wrote:
>> Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS
>> commands to measure improvements would be a better baseline vs. the
>> existing scsi_request_fn() code than what you're doing above. That way
>> at least it's easier to measure specific scsi-core overhead down to
>> the LLD's queuecommand without everything else in the way.
> Running the initiator and target workload on the same system makes
> initiator and target code share CPUs, and hence makes it hard to
> analyze which performance change is caused by initiator optimizations
> and which performance change is caused by interactions between
> initiator and target workloads.

Huh..? There is no 'target workload' involved. All that scsi_debug with NOP'ed REQ_TYPE_FS commands is doing is calling scsi_cmd->done() as soon as the descriptor has been dispatched into LLD ->queuecommand() code. It's useful for determining an absolute performance ceiling between scsi_mq vs. scsi_request_fn() based codepaths without involving any other LLD specific overheads.

--nab
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Wed, 2014-02-05 at 04:39 -0800, Christoph Hellwig wrote:
> Prepare for not taking a host-wide lock in the dispatch path by pushing
> the lock down into the places that actually need it. Note that this
> patch is just a preparation step, as it will actually increase lock
> roundtrips and thus decrease performance on its own.

I'm not really convinced by the rest of the series. I think patches 4,8,9,10,11,12 (lock elimination from the fast path and get/put reduction) are where the improvements are at, and they mostly look ready to apply modulo some author and signoff fixes.

I'm dubious about replacing a locked set of checks and increments with atomics for the simple reason that atomics are pretty expensive on non-x86, so you've likely slowed the critical path down for them. Even on x86, atomics can be very expensive because of the global bus lock. I think about three of them in a row is where you might as well stick with the lock. Could you benchmark this lot and show what the actual improvement is just for this series, if any?

I also think we should be getting more utility out of threading guarantees. So, if there's only one thread active per device we don't need any device counters to be atomic. Likewise, u32 read/write is an atomic operation, so we might be able to use sloppy counters for the target and host stuff (one per CPU that are incremented/decremented on that CPU ... this will only work using CPU locality ... completion on the same CPU, but that seems to be an element of a lot of stuff nowadays).

I'm not saying this will all pan out, but I am saying I don't think just making all the counters atomic to reduce locking will buy us a huge amount, so I'd appreciate careful benchmarking to confirm or invalidate this hypothesis first.

James
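A minimal sketch of the sloppy-counter idea (illustration only; the names are made up, and the kernel's percpu_counter in lib/percpu_counter.c implements a comparable scheme with batched folding):

#include <linux/percpu.h>

/*
 * One counter per CPU: submission increments and completion decrements
 * on the local CPU, avoiding cache-line bouncing.  Reading the total
 * sums all CPUs and is only approximate while I/O is in flight --
 * hence "sloppy".
 */
struct sloppy_counter {
	int __percpu *count;	/* from alloc_percpu(int) */
};

static inline void sloppy_inc(struct sloppy_counter *c)
{
	this_cpu_inc(*c->count);
}

static inline void sloppy_dec(struct sloppy_counter *c)
{
	this_cpu_dec(*c->count);
}

static int sloppy_read(struct sloppy_counter *c)
{
	int cpu, sum = 0;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(c->count, cpu);
	return sum;
}

As noted above, this only pays off if completions run on the CPU that submitted the I/O; otherwise the decrements land on remote counters and the cache-line traffic returns.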
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On 02/06/14 17:56, James Bottomley wrote:
> Could you benchmark this lot and show what the actual improvement is
> just for this series, if any?

I see a performance improvement of 12% with the SRP protocol for the SCSI core optimizations alone (I am still busy measuring the impact of the blk-mq conversion but I can already see that it is really significant). Please note that the performance impact depends a lot on the workload (number of LUNs per SCSI host e.g.) so maybe the workload I chose is not doing justice to Christoph's work. And it's also important to mention that with the workload I ran I was saturating the target system CPU (a quad core Intel i5). In other words, results might be better with a more powerful target system.

Bart.
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
> On 02/06/14 17:56, James Bottomley wrote:
>> Could you benchmark this lot and show what the actual improvement is
>> just for this series, if any?
> I see a performance improvement of 12% with the SRP protocol for the
> SCSI core optimizations alone (I am still busy measuring the impact of
> the blk-mq conversion but I can already see that it is really
> significant). Please note that the performance impact depends a lot on
> the workload (number of LUNs per SCSI host e.g.) so maybe the workload
> I chose is not doing justice to Christoph's work. And it's also
> important to mention that with the workload I ran I was saturating the
> target system CPU (a quad core Intel i5). In other words, results might
> be better with a more powerful target system.

On what? Just the patches I indicated or the whole series? My specific concern is that swapping a critical section for atomics may not buy us anything even on x86 and may slow down non-x86. That's the bit I'd like benchmarks to explore.

James
Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
On Thu, 2014-02-06 at 18:10 +0100, Bart Van Assche wrote:
> On 02/06/14 17:56, James Bottomley wrote:
>> Could you benchmark this lot and show what the actual improvement is
>> just for this series, if any?
> I see a performance improvement of 12% with the SRP protocol for the
> SCSI core optimizations alone (I am still busy measuring the impact of
> the blk-mq conversion but I can already see that it is really
> significant). Please note that the performance impact depends a lot on
> the workload (number of LUNs per SCSI host e.g.) so maybe the workload
> I chose is not doing justice to Christoph's work. And it's also
> important to mention that with the workload I ran I was saturating the
> target system CPU (a quad core Intel i5). In other words, results might
> be better with a more powerful target system.

Starting with a baseline using scsi_debug that NOPs REQ_TYPE_FS commands to measure improvements would be a better baseline vs. the existing scsi_request_fn() code than what you're doing above. That way at least it's easier to measure specific scsi-core overhead down to the LLD's queuecommand without everything else in the way.

--nab
[PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready
Prepare for not taking a host-wide lock in the dispatch path by pushing the lock down into the places that actually need it. Note that this patch is just a preparation step, as it will actually increase lock roundtrips and thus decrease performance on its own.

Signed-off-by: Christoph Hellwig <h...@lst.de>
---
 drivers/scsi/scsi_lib.c | 75 +++++++++++++++++++++++++++----------------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 44740c0..295e038 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1318,18 +1318,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 /*
  * scsi_target_queue_ready: checks if there we can send commands to target
  * @sdev: scsi device on starget to check.
- *
- * Called with the host lock held.
  */
 static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 					  struct scsi_device *sdev)
 {
 	struct scsi_target *starget = scsi_target(sdev);
+	int ret = 0;
 
+	spin_lock_irq(shost->host_lock);
 	if (starget->single_lun) {
 		if (starget->starget_sdev_user &&
 		    starget->starget_sdev_user != sdev)
-			return 0;
+			goto out;
 		starget->starget_sdev_user = sdev;
 	}
 
@@ -1337,57 +1337,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 		/*
 		 * unblock after target_blocked iterates to zero
 		 */
-		if (--starget->target_blocked == 0) {
-			SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
-					 "unblocking target at zero depth\n"));
-		} else
-			return 0;
+		if (--starget->target_blocked != 0)
+			goto out;
+
+		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
+				 "unblocking target at zero depth\n"));
 	}
 
 	if (scsi_target_is_busy(starget)) {
 		list_move_tail(&sdev->starved_entry, &shost->starved_list);
-		return 0;
+		goto out;
 	}
 
-	return 1;
+	scsi_target(sdev)->target_busy++;
+	ret = 1;
+out:
+	spin_unlock_irq(shost->host_lock);
+	return ret;
 }
 
 /*
  * scsi_host_queue_ready: if we can send requests to shost, return 1 else
  * return 0. We must end up running the queue again whenever 0 is
  * returned, else IO can hang.
- *
- * Called with host_lock held.
  */
 static inline int scsi_host_queue_ready(struct request_queue *q,
 				   struct Scsi_Host *shost,
 				   struct scsi_device *sdev)
 {
+	int ret = 0;
+
+	spin_lock_irq(shost->host_lock);
+
 	if (scsi_host_in_recovery(shost))
-		return 0;
+		goto out;
 	if (shost->host_busy == 0 && shost->host_blocked) {
 		/*
 		 * unblock after host_blocked iterates to zero
 		 */
-		if (--shost->host_blocked == 0) {
-			SCSI_LOG_MLQUEUE(3,
-				printk("scsi%d unblocking host at zero depth\n",
-					shost->host_no));
-		} else {
-			return 0;
-		}
+		if (--shost->host_blocked != 0)
+			goto out;
+
+		SCSI_LOG_MLQUEUE(3,
+			printk("scsi%d unblocking host at zero depth\n",
+				shost->host_no));
 	}
 
 	if (scsi_host_is_busy(shost)) {
 		if (list_empty(&sdev->starved_entry))
 			list_add_tail(&sdev->starved_entry, &shost->starved_list);
-		return 0;
+		goto out;
 	}
 
 	/* We're OK to process the command, so we can't be starved */
 	if (!list_empty(&sdev->starved_entry))
 		list_del_init(&sdev->starved_entry);
 
-	return 1;
+	shost->host_busy++;
+	ret = 1;
+out:
+	spin_unlock_irq(shost->host_lock);
+	return ret;
 }
 
 /*
@@ -1551,7 +1560,7 @@ static void scsi_request_fn(struct request_queue *q)
 		blk_start_request(req);
 		sdev->device_busy++;
 
-		spin_unlock(q->queue_lock);
+		spin_unlock_irq(q->queue_lock);
 		cmd = req->special;
 		if (unlikely(cmd == NULL)) {
 			printk(KERN_CRIT "impossible request in %s.\n"
@@ -1561,7 +1570,6 @@ static void scsi_request_fn(struct request_queue *q)
 			blk_dump_rq_flags(req, "foo");
 			BUG();
 		}
-		spin_lock(shost->host_lock);
 
 		/*
 		 * We hit this when the driver is using a host wide
@@ -1572,9 +1580,11 @@ static void