Re: Device or HBA level QD throttling creates randomness in sequential workload
On 01/30/2017 11:28 AM, Kashyap Desai wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:ax...@kernel.dk]
>> Sent: Monday, January 30, 2017 10:03 PM
>> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
>> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
>> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
>> Subject: Re: Device or HBA level QD throttling creates randomness in
>> sequential workload
>>
>> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
>>> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>>>> -	if (atomic_inc_return(&instance->fw_outstanding) >
>>>> -			instance->host->can_queue) {
>>>> -		atomic_dec(&instance->fw_outstanding);
>>>> -		return SCSI_MLQUEUE_HOST_BUSY;
>>>> -	}
>>>> +	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
>>>> +		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>>>> +		/* For rotational device wait for sometime to get fusion command
>>>> +		 * from pool. This is just to reduce proactive re-queue at mid
>>>> +		 * layer which is not sending sorted IO in SCSI.MQ mode.
>>>> +		 */
>>>> +		if (!is_nonrot)
>>>> +			udelay(100);
>>>> +	}
>>>
>>> The SCSI core does not allow to sleep inside the queuecommand()
>>> callback function.
>>
>> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
>> NOT a great idea. We want to fix the reordering due to requeues, not
>> introduce random busy delays to work around it.
>
> Thanks for the feedback. I do realize that udelay() would be very odd in
> the queuecommand() callback; I will keep this in mind. The preferred
> solution is the blk-mq scheduler patches.

It's coming in 4.11, so you don't have to wait long.

--
Jens Axboe
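For readers following the thread: the blk-mq IO scheduler work referenced
here landed in 4.11, which added (among others) the mq-deadline scheduler.
A minimal sketch of selecting it, assuming a 4.11+ kernel; the device name
is illustrative:

    # List the schedulers available on a blk-mq queue (brackets mark the active one)
    cat /sys/block/sdy/queue/scheduler
    # Switch to mq-deadline, which restores request sorting/merging for rotational media
    echo mq-deadline > /sys/block/sdy/queue/scheduler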
RE: Device or HBA level QD throttling creates randomness in sequential workload
> -----Original Message-----
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -	if (atomic_inc_return(&instance->fw_outstanding) >
> >> -			instance->host->can_queue) {
> >> -		atomic_dec(&instance->fw_outstanding);
> >> -		return SCSI_MLQUEUE_HOST_BUSY;
> >> -	}
> >> +	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
> >> +		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +		/* For rotational device wait for sometime to get fusion command
> >> +		 * from pool. This is just to reduce proactive re-queue at mid
> >> +		 * layer which is not sending sorted IO in SCSI.MQ mode.
> >> +		 */
> >> +		if (!is_nonrot)
> >> +			udelay(100);
> >> +	}
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
> NOT a great idea. We want to fix the reordering due to requeues, not
> introduce random busy delays to work around it.

Thanks for the feedback. I do realize that udelay() would be very odd in
the queuecommand() callback; I will keep this in mind. The preferred
solution is the blk-mq scheduler patches.

>
> --
> Jens Axboe
Re: Device or HBA level QD throttling creates randomness in sequential workload
On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
>> -	if (atomic_inc_return(&instance->fw_outstanding) >
>> -			instance->host->can_queue) {
>> -		atomic_dec(&instance->fw_outstanding);
>> -		return SCSI_MLQUEUE_HOST_BUSY;
>> -	}
>> +	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
>> +		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
>> +		/* For rotational device wait for sometime to get fusion command
>> +		 * from pool. This is just to reduce proactive re-queue at mid
>> +		 * layer which is not sending sorted IO in SCSI.MQ mode.
>> +		 */
>> +		if (!is_nonrot)
>> +			udelay(100);
>> +	}
>
> The SCSI core does not allow to sleep inside the queuecommand() callback
> function.

udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a great idea. We want to fix the reordering due to requeues, not
introduce random busy delays to work around it.

--
Jens Axboe
Re: Device or HBA level QD throttling creates randomness in sequential workload
On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> -	if (atomic_inc_return(&instance->fw_outstanding) >
> -			instance->host->can_queue) {
> -		atomic_dec(&instance->fw_outstanding);
> -		return SCSI_MLQUEUE_HOST_BUSY;
> -	}
> +	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
> +		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> +		/* For rotational device wait for sometime to get fusion command
> +		 * from pool. This is just to reduce proactive re-queue at mid
> +		 * layer which is not sending sorted IO in SCSI.MQ mode.
> +		 */
> +		if (!is_nonrot)
> +			udelay(100);
> +	}

The SCSI core does not allow to sleep inside the queuecommand() callback
function.

Bart.
RE: Device or HBA level QD throttling creates randomness in sequential workload
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and can confirm that the issue
reported in this thread is resolved. Both MQ and SQ mode now result in a
sequential IO pattern while IO is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it
reasonable to pause IO for a few usec in the LLD? I mean, I want to avoid
the driver asking the SML/block layer to re-queue the IO (if it is
sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver: this driver exposes can_queue,
but it internally consumes commands for RAID 1 and the fast path. In the
worst case, can_queue/2 will consume all firmware resources and the
driver will re-queue further IOs to the SML as below:

	if (atomic_inc_return(&instance->fw_outstanding) >
			instance->host->can_queue) {
		atomic_dec(&instance->fw_outstanding);
		return SCSI_MLQUEUE_HOST_BUSY;
	}

I want to avoid the above SCSI_MLQUEUE_HOST_BUSY. I need your suggestion
for the changes below:

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
 	u32 index;
-	struct fusion_context *fusion;
+	bool is_nonrot;
+	u32 safe_can_queue;
+	u32 num_cpus;
+	struct fusion_context *fusion;
+
+	fusion = instance->ctrl_context;
+
+	num_cpus = num_online_cpus();
+	safe_can_queue = instance->cur_can_queue - num_cpus;

 	fusion = instance->ctrl_context;

@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
 		return SCSI_MLQUEUE_DEVICE_BUSY;
 	}

-	if (atomic_inc_return(&instance->fw_outstanding) >
-			instance->host->can_queue) {
-		atomic_dec(&instance->fw_outstanding);
-		return SCSI_MLQUEUE_HOST_BUSY;
-	}
+	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+		/* For rotational device wait for sometime to get fusion command
+		 * from pool. This is just to reduce proactive re-queue at mid
+		 * layer which is not sending sorted IO in SCSI.MQ mode.
+		 */
+		if (!is_nonrot)
+			udelay(100);
+	}

 	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-ker...@vger.kernel.org'; 'linux-
> bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> Jens - Replied inline.
>
> Omar - I tested your WIP repo and figured out that the system hangs only
> if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
> fine, but I am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is a snippet of blktrace. In the case of a higher per-device
> QD, I see Requeue (R) requests in blktrace.
>
> 65,128 10     6268     2.432404509 18594  P   N [fio]
> 65,128 10     6269     2.432405013 18594  U   N [fio] 1
> 65,128 10     6270     2.432405143 18594  I  WS 148800 + 8 [fio]
> 65,128 10     6271     2.432405740 18594  R  WS 148800 + 8 [0]
> 65,128 10     6272     2.432409794 18594  Q  WS 148808 + 8 [fio]
> 65,128 10     6273     2.432410234 18594  G  WS 148808 + 8 [fio]
> 65,128 10     6274     2.432410424 18594  S  WS 148808 + 8 [fio]
> 65,128 23     3626     2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
> 65,128 22     3279     2.432973482     0  C  WS 147432 + 8 [0]
> 65,128  7     6126     2.433032637 18594  P   N [fio]
> 65,128  7     6127     2.433033204 18594  U   N [fio] 1
> 65,128  7     6128     2.433033346 18594  I  WS 148808 + 8 [fio]
> 65,128  7     6129     2.433033871 18594  D  WS 148808 + 8 [fio]
> 65,128  7     6130     2.433034559 18594  R  WS 148808 + 8 [0]
> 65,128  7     6131     2.433039796 18594  Q  WS 148816 + 8 [fio]
> 65,128  7     6132     2.433040206 18594  G  WS 148816 + 8 [fio]
> 65,128  7     6133     2.433040351 18594  S  WS 148816 + 8 [fio]
> 65,128  9     6392     2.433133729     0  C  WS 147240 + 8 [0]
> 65,128  9     6393
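For reference on the can_queue/safe_can_queue arithmetic above: both queue
depths involved are visible in sysfs at runtime. A sketch; the host and
device names are illustrative:

    # HBA-level queue depth the LLD exposes to the SCSI midlayer (can_queue)
    cat /sys/class/scsi_host/host18/can_queue
    # Per-device queue depth that sdev-level throttling enforces
    cat /sys/block/sdy/device/queue_depth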
RE: Device or HBA level QD throttling creates randomness in sequential workload
Jens - Replied inline.

Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I am looking for scsi_mod.use_blk_mq=Y.

Also below is a snippet of blktrace. In the case of a higher per-device
QD, I see Requeue (R) requests in blktrace.

65,128 10     6268     2.432404509 18594  P   N [fio]
65,128 10     6269     2.432405013 18594  U   N [fio] 1
65,128 10     6270     2.432405143 18594  I  WS 148800 + 8 [fio]
65,128 10     6271     2.432405740 18594  R  WS 148800 + 8 [0]
65,128 10     6272     2.432409794 18594  Q  WS 148808 + 8 [fio]
65,128 10     6273     2.432410234 18594  G  WS 148808 + 8 [fio]
65,128 10     6274     2.432410424 18594  S  WS 148808 + 8 [fio]
65,128 23     3626     2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
65,128 22     3279     2.432973482     0  C  WS 147432 + 8 [0]
65,128  7     6126     2.433032637 18594  P   N [fio]
65,128  7     6127     2.433033204 18594  U   N [fio] 1
65,128  7     6128     2.433033346 18594  I  WS 148808 + 8 [fio]
65,128  7     6129     2.433033871 18594  D  WS 148808 + 8 [fio]
65,128  7     6130     2.433034559 18594  R  WS 148808 + 8 [0]
65,128  7     6131     2.433039796 18594  Q  WS 148816 + 8 [fio]
65,128  7     6132     2.433040206 18594  G  WS 148816 + 8 [fio]
65,128  7     6133     2.433040351 18594  S  WS 148816 + 8 [fio]
65,128  9     6392     2.433133729     0  C  WS 147240 + 8 [0]
65,128  9     6393     2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
65,128  7     6134     2.433167450 18594  P   N [fio]
65,128  7     6135     2.433167911 18594  U   N [fio] 1
65,128  7     6136     2.433168074 18594  I  WS 148816 + 8 [fio]
65,128  7     6137     2.433168492 18594  D  WS 148816 + 8 [fio]
65,128  7     6138     2.433174016 18594  Q  WS 148824 + 8 [fio]
65,128  7     6139     2.433174282 18594  G  WS 148824 + 8 [fio]
65,128  7     6140     2.433174613 18594  S  WS 148824 + 8 [fio]

CPU0 (sdy):
 Reads Queued:           0,        0KiB   Writes Queued:          79,      316KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:       67, 18,446,744,073PiB
 Reads Requeued:         0                Writes Requeued:        86
 Reads Completed:        0,        0KiB   Writes Completed:       98,      392KiB
 Read Merges:            0,        0KiB   Write Merges:            0,        0KiB
 Read depth:             0                Write depth:             5
 IO unplugs:            79                Timer unplugs:           0

` Kashyap

> -----Original Message-----
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> bl...@vger.kernel.org; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but
> rather the fact that we don't really guarantee any sort of hard FIFO
> behavior between the software queues. Can you try this test patch to
> see if it changes the behavior for you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern for an
expected sequential run. I am intentionally running the re-queue case and
seeing the issue at the time of re-queue. If there is no requeue, I see
no issue at the LLD.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3d27a6dee09..5404ca9c71b2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
> +}
> +
>  /*
>   * Run this hardware queue, pulling any software queues mapped to it in.
>   * Note that this function currently has various problems around ordering
> @@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	}
>
>  	/*
> +	 * If the device is rotational, sort the list sanely to avoid
> +	 * unnecessary seeks. The software queues are roughly FIFO, but
> +	 * only roughly, there are no hard guarantees.
> +	 */
> +	if (!blk_queue_nonrot(q))
> +		list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> +	/*
>  	 * Start off with dptr being NULL, so we start the first request
>  	 * immediately, even if we have more pending.
>  	 */
>
> --
> Jens Axboe
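A trace like the snippet above, including the per-CPU summary with the
"Writes Requeued" count, can be captured with blktrace and blkparse while
the fio job runs. A sketch; the device name and duration are illustrative:

    # Trace /dev/sdy for 30 seconds, then decode; blkparse ends with
    # per-CPU summaries like the "CPU0 (sdy)" block quoted above
    blktrace -d /dev/sdy -w 30 -o seqtrace
    blkparse -i seqtrace | less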
Re: Device or HBA level QD throttling creates randomness in sequential workload
Hi,

One guess would be that this isn't around a requeue condition, but rather
the fact that we don't really guarantee any sort of hard FIFO behavior
between the software queues. Can you try this test patch to see if it
changes the behavior for you? Warning: untested...

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3d27a6dee09..5404ca9c71b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }

+static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct request *rqa = container_of(a, struct request, queuelist);
+	struct request *rqb = container_of(b, struct request, queuelist);
+
+	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	}

 	/*
+	 * If the device is rotational, sort the list sanely to avoid
+	 * unnecessary seeks. The software queues are roughly FIFO, but
+	 * only roughly, there are no hard guarantees.
+	 */
+	if (!blk_queue_nonrot(q))
+		list_sort(NULL, &rq_list, rq_pos_cmp);
+
+	/*
 	 * Start off with dptr being NULL, so we start the first request
 	 * immediately, even if we have more pending.
 	 */

--
Jens Axboe
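To try a test patch like this against a linux-block checkout, the flow is
roughly the following. A sketch; the patch file name is hypothetical:

    # Apply the diff above to the tree it was generated against, rebuild, reboot
    cd linux-block
    git apply blk-mq-list-sort.patch   # hypothetical file name for the patch above
    make -j"$(nproc)" && make modules_install && make install
    # then boot the test kernel and rerun the sequential fio job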
Re: Device or HBA level QD throttling creates randomness in sequential workload
On Tue, Oct 25, 2016 at 12:24:24AM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Omar Sandoval [mailto:osan...@osandov.com]
> > Sent: Monday, October 24, 2016 9:11 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> > bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig;
> > paolo.vale...@linaro.org
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> > sequential workload
> >
> > On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > > Hi -
> > > > >
> > > > > I found the conversation below, and it is along the same line as
> > > > > the input I wanted from the mailing list.
> > > > >
> > > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > > >
> > > > > I can do testing on any WIP item as Omar mentioned in the above
> > > > > discussion.
> > > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> > >
> > > I tried building a kernel using this repo, but it looks like it is
> > > not able to boot due to some changes in the layer.
> >
> > Did you build the most up-to-date version of that branch? I've been
> > force pushing to it, so the commit id that you built would be useful.
> > What boot failure are you seeing?
>
> Below is the latest commit on the repo:
>
> commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
> Author: Omar Sandoval
> Date:   Tue Sep 20 11:20:03 2016 -0700
>
>     [WIP] blk-mq: limit bio queue depth
>
> I have the latest repo from 4.9/scsi-next maintained by Martin, which
> boots fine. The only delta is that "CONFIG_SBITMAP" is enabled in the
> WIP blk-mq-iosched branch. I could not get any meaningful data on the
> boot hang, so I am going to try one more time tomorrow.

The blk-mq-bio-queueing branch has the latest work there separated out.
Not sure that it'll help in this case.

> > > > Are you using blk-mq for this disk? If not, then the work there
> > > > won't affect you.
> > >
> > > YES. I am using blk-mq for my test. I also confirm that if
> > > use_blk_mq is disabled, the sequential workload issue is not seen
> > > and scheduling works well.
> >
> > Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> > tell exactly what's going on without the details. A sequential
> > workload with just one submitter is about as easy as it gets, so this
> > _should_ be behaving nicely.
>
> ; setup numa policy for each thread
> ; 'numactl --show' to determine the maximum numa nodes
> [global]
> ioengine=libaio
> buffered=0
> rw=write
> bssplit=4K/100
> iodepth=256
> numjobs=1
> direct=1
> runtime=60s
> allow_mounted_write=0
>
> [job1]
> filename=/dev/sdd
> ..
> [job24]
> filename=/dev/sdaa

Okay, so you have one high-iodepth job per disk, got it.

> When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
> io scheduler detail (it is in blk-mq mode):
> /sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none
>
> When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the io
> scheduler picked by the SML is:
> /sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]
>
> I see blk-mq performance is very low for the sequential write workload,
> and I confirm that blk-mq converts the sequential workload into a
> random stream due to the io scheduler change in blk-mq vs. the legacy
> block layer.

Since this happens when the fio iodepth exceeds the per-device QD, my
best guess is that requests are getting requeued and scrambled when that
happens. Do you have the blktrace lying around?

> > > > > Is there any workaround/alternative in the latest upstream
> > > > > kernel, if a user wants to see limited penalty for a sequential
> > > > > workload on HDD?
> > > > >
> > > > > ` Kashyap
> >
> > P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> > seems to mark just about everything I get from Broadcom as spam due
> > to failed DMARC.
> >
> > --
> > Omar

--
Omar
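Note for anyone reproducing the two modes compared above: use_blk_mq is a
module parameter, so it is set at module load or boot time rather than at
runtime. A sketch; the device name is illustrative:

    # e.g. add scsi_mod.use_blk_mq=Y to the kernel command line, then verify:
    cat /sys/module/scsi_mod/parameters/use_blk_mq
    # blk-mq of this era reports "none"; the legacy path reports e.g. "noop deadline [cfq]"
    cat /sys/block/sdq/queue/scheduler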
RE: Device or HBA level QD throttling creates randomness in sequential workload
> -----Original Message-----
> From: Omar Sandoval [mailto:osan...@osandov.com]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig;
> paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found the conversation below, and it is along the same line as
> > > > the input I wanted from the mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item as Omar mentioned in the above
> > > > discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried building a kernel using this repo, but it looks like it is not
> > able to boot due to some changes in the layer.
>
> Did you build the most up-to-date version of that branch? I've been
> force pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit on the repo:

commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval
Date:   Tue Sep 20 11:20:03 2016 -0700

    [WIP] blk-mq: limit bio queue depth

I have the latest repo from 4.9/scsi-next maintained by Martin, which
boots fine. The only delta is that "CONFIG_SBITMAP" is enabled in the WIP
blk-mq-iosched branch. I could not get any meaningful data on the boot
hang, so I am going to try one more time tomorrow.

> > > Are you using blk-mq for this disk? If not, then the work there
> > > won't affect you.
> >
> > YES. I am using blk-mq for my test. I also confirm that if use_blk_mq
> > is disabled, the sequential workload issue is not seen and scheduling
> > works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> tell exactly what's going on without the details. A sequential workload
> with just one submitter is about as easy as it gets, so this _should_
> be behaving nicely.

; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
io scheduler detail (it is in blk-mq mode):

/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the io
scheduler picked by the SML is:

/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

I see blk-mq performance is very low for the sequential write workload,
and I confirm that blk-mq converts the sequential workload into a random
stream due to the io scheduler change in blk-mq vs. the legacy block
layer.

> > > > Is there any workaround/alternative in the latest upstream
> > > > kernel, if a user wants to see limited penalty for a sequential
> > > > workload on HDD?
> > > >
> > > > ` Kashyap
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to mark just about everything I get from Broadcom as spam due to
> failed DMARC.
>
> --
> Omar
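With 24 JBOD targets in the job file, the active scheduler on every device
can be confirmed in one pass rather than checking sysfs paths by hand. A
sketch, assuming bash and the sdd..sdaa device naming from the job file:

    for dev in /sys/block/sd{d..z} /sys/block/sdaa; do
        [ -e "$dev" ] && echo "$dev: $(cat "$dev/queue/scheduler")"
    done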
Re: Device or HBA level QD throttling creates randomness in sequential workload
On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > Hi -
> > >
> > > I found the conversation below, and it is along the same line as the
> > > input I wanted from the mailing list.
> > >
> > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > >
> > > I can do testing on any WIP item as Omar mentioned in the above
> > > discussion.
> > > https://github.com/osandov/linux/tree/blk-mq-iosched
>
> I tried building a kernel using this repo, but it looks like it is not
> able to boot due to some changes in the layer.

Did you build the most up-to-date version of that branch? I've been force
pushing to it, so the commit id that you built would be useful.
What boot failure are you seeing?

> > Are you using blk-mq for this disk? If not, then the work there won't
> > affect you.
>
> YES. I am using blk-mq for my test. I also confirm that if use_blk_mq
> is disabled, the sequential workload issue is not seen and scheduling
> works well.

Ah, okay, perfect. Can you send the fio job file you're using? Hard to
tell exactly what's going on without the details. A sequential workload
with just one submitter is about as easy as it gets, so this _should_ be
behaving nicely.

> > > Is there any workaround/alternative in the latest upstream kernel,
> > > if a user wants to see limited penalty for a sequential workload on
> > > HDD?
> > >
> > > ` Kashyap

P.S., your emails are being marked as spam by Gmail. Actually, Gmail
seems to mark just about everything I get from Broadcom as spam due to
failed DMARC.

--
Omar
RE: Device or HBA level QD throttling creates randomness in sequential workload
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found the conversation below, and it is along the same line as the
> > input I wanted from the mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in the above
> > discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried building a kernel using this repo, but it looks like it is not
able to boot due to some changes in the layer.

> Are you using blk-mq for this disk? If not, then the work there won't
> affect you.

YES. I am using blk-mq for my test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and scheduling works
well.

> > Is there any workaround/alternative in the latest upstream kernel, if
> > a user wants to see limited penalty for a sequential workload on HDD?
> >
> > ` Kashyap
Re: Device or HBA level QD throttling creates randomness in sequential workload
On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> Hi -
>
> I found the conversation below, and it is along the same line as the
> input I wanted from the mailing list.
>
> http://marc.info/?l=linux-kernel&m=147569860526197&w=2
>
> I can do testing on any WIP item as Omar mentioned in the above
> discussion.
> https://github.com/osandov/linux/tree/blk-mq-iosched

Are you using blk-mq for this disk? If not, then the work there won't
affect you.

> Is there any workaround/alternative in the latest upstream kernel, if a
> user wants to see limited penalty for a sequential workload on HDD?
>
> ` Kashyap
>
> > -----Original Message-----
> > From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> > Sent: Thursday, October 20, 2016 3:39 PM
> > To: linux-scsi@vger.kernel.org
> > Subject: Device or HBA level QD throttling creates randomness in
> > sequential workload
> >
> > [ Apologies if you find more than one instance of my email.
> > The web-based email client has some issue, so I am now trying git
> > send-email. ]
> >
> > Hi,
> >
> > I am doing some performance tuning in the MR driver to understand how
> > sdev queue depth and HBA queue depth play a role in IO submission
> > from the layer above. I have 24 JBODs connected to an MR 12Gb/s
> > controller, and I see the following performance for a 4K sequential
> > workload.
> >
> > The HBA QD for the MR controller is 4065 and the per-device QD is set
> > to 32:
> >
> > 	queue depth 256 reports 300K IOPS
> > 	queue depth 128 reports 330K IOPS
> > 	queue depth  64 reports 360K IOPS
> > 	queue depth  32 reports 510K IOPS
> >
> > In the MR driver I added a debug print and confirmed that more IO
> > comes to the driver as random IO whenever I have a queue depth
> > greater than 32.
> >
> > I have debugged using the scsi logging level and blktrace as well.
> > Below is a snippet of logs using the scsi logging level. In summary,
> > if the SML does flow control of IO due to the device QD or HBA QD,
> > the IO coming to the LLD has a more random pattern.
> >
> > I see the IO coming to the driver is not sequential.
> >
> > [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> > [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> > [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
> >
> > After LBA "00 03 c0 3c" the next command is for LBA "00 03 c0 5b".
> > Two sequences are overlapped due to sdev QD throttling.
> >
> > [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> > [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> > [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> > [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> > [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
> >
> > If scsi_request_fn() breaks due to unavailability of the device queue
> > (due to the check below), will there be any side effect, as I observe?
> >
> > 	if (!scsi_dev_queue_ready(q, sdev))
> > 		break;
> >
> > If I reduce the HBA QD and make sure IO from the layer above is
> > throttled due to the HBA QD, there is the same impact.
> > The MR driver uses a host-wide shared tag map.
> >
> > Can someone help me whether this can be tunable in the LLD by
> > providing additional settings, or whether it is expected behavior?
> > The problem I am facing is that I am not able to figure out the
> > optimal device queue depth for different configurations and workloads.
> >
> > Thanks, Kashyap

--
Omar
RE: Device or HBA level QD throttling creates randomness in sequential workload
Hi -

I found the conversation below, and it is along the same line as the
input I wanted from the mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item as Omar mentioned in the above
discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel, if a
user wants to see limited penalty for a sequential workload on HDD?

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-scsi@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> [ Apologies if you find more than one instance of my email.
> The web-based email client has some issue, so I am now trying git
> send-email. ]
>
> Hi,
>
> I am doing some performance tuning in the MR driver to understand how
> sdev queue depth and HBA queue depth play a role in IO submission from
> the layer above. I have 24 JBODs connected to an MR 12Gb/s controller,
> and I see the following performance for a 4K sequential workload.
>
> The HBA QD for the MR controller is 4065 and the per-device QD is set
> to 32:
>
> 	queue depth 256 reports 300K IOPS
> 	queue depth 128 reports 330K IOPS
> 	queue depth  64 reports 360K IOPS
> 	queue depth  32 reports 510K IOPS
>
> In the MR driver I added a debug print and confirmed that more IO comes
> to the driver as random IO whenever I have a queue depth greater than
> 32.
>
> I have debugged using the scsi logging level and blktrace as well.
> Below is a snippet of logs using the scsi logging level. In summary, if
> the SML does flow control of IO due to the device QD or HBA QD, the IO
> coming to the LLD has a more random pattern.
>
> I see the IO coming to the driver is not sequential.
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
>
> After LBA "00 03 c0 3c" the next command is for LBA "00 03 c0 5b".
> Two sequences are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
> If scsi_request_fn() breaks due to unavailability of the device queue
> (due to the check below), will there be any side effect, as I observe?
>
> 	if (!scsi_dev_queue_ready(q, sdev))
> 		break;
>
> If I reduce the HBA QD and make sure IO from the layer above is
> throttled due to the HBA QD, there is the same impact.
> The MR driver uses a host-wide shared tag map.
>
> Can someone help me whether this can be tunable in the LLD by providing
> additional settings, or whether it is expected behavior? The problem I
> am facing is that I am not able to figure out the optimal device queue
> depth for different configurations and workloads.
>
> Thanks, Kashyap
Device or HBA level QD throttling creates randomness in sequential workload
[ Apologies if you find more than one instance of my email.
The web-based email client has some issue, so I am now trying git
send-email. ]

Hi,

I am doing some performance tuning in the MR driver to understand how
sdev queue depth and HBA queue depth play a role in IO submission from
the layer above. I have 24 JBODs connected to an MR 12Gb/s controller,
and I see the following performance for a 4K sequential workload.

The HBA QD for the MR controller is 4065 and the per-device QD is set
to 32:

	queue depth 256 reports 300K IOPS
	queue depth 128 reports 330K IOPS
	queue depth  64 reports 360K IOPS
	queue depth  32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO comes
to the driver as random IO whenever I have a queue depth greater than 32.

I have debugged using the scsi logging level and blktrace as well. Below
is a snippet of logs using the scsi logging level. In summary, if the SML
does flow control of IO due to the device QD or HBA QD, the IO coming to
the LLD has a more random pattern.

I see the IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00

After LBA "00 03 c0 3c" the next command is for LBA "00 03 c0 5b".
Two sequences are overlapped due to sdev QD throttling.

[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00

If scsi_request_fn() breaks due to unavailability of the device queue
(due to the check below), will there be any side effect, as I observe?

	if (!scsi_dev_queue_ready(q, sdev))
		break;

If I reduce the HBA QD and make sure IO from the layer above is throttled
due to the HBA QD, there is the same impact.
The MR driver uses a host-wide shared tag map.

Can someone help me whether this can be tunable in the LLD by providing
additional settings, or whether it is expected behavior? The problem I am
facing is that I am not able to figure out the optimal device queue depth
for different configurations and workloads.

Thanks, Kashyap
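The per-device QD sweep above can be reproduced by writing the sdev queue
depth via sysfs before each run. A sketch; the device list and job file
name are illustrative:

    for qd in 256 128 64 32; do
        for dev in /sys/block/sd{d..z} /sys/block/sdaa; do
            [ -e "$dev" ] && echo "$qd" > "$dev/device/queue_depth"
        done
        fio seqwrite.fio   # hypothetical name for the 4K sequential write job
    done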