Re: blk-mq problem on proliant DL380 G3 (cciss)
On 03-11-2014 12:08, Christoph Hellwig wrote:
> Meelis, can you give the patch below a try? This only tries to lock the
> door on devices that actually were reset. Given that on a reset device
> we fail all commands before resuming operations it should work fine
> there as all tags should be released.

Works fine on both DL380G3 and the other server with MPT and IDE CD.

--
Meelis Roos

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: blk-mq problem on proliant DL380 G3 (cciss)
Meelis, can you give the patch below a try? This only tries to lock the
door on devices that actually were reset. Given that on a reset device
we fail all commands before resuming operations it should work fine
there as all tags should be released.

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index fa7b5ec..7af43cb 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2016,8 +2016,10 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 	 * is no point trying to lock the door of an off-line device.
 	 */
 	shost_for_each_device(sdev, shost) {
-		if (scsi_device_online(sdev) && sdev->locked)
+		if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
 			scsi_eh_lock_door(sdev);
+			sdev->was_reset = 0;
+		}
 	}
 
 	/*
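The condition in the patch above can be exercised outside the kernel. The following is only a userspace sketch, not kernel code: `struct model_sdev`, `model_lock_door()` and `model_restart_operations()` are hypothetical stand-ins for `struct scsi_device`, `scsi_eh_lock_door()` and the loop in `scsi_restart_operations()`, and the `relocks` counter exists purely so the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few scsi_device fields the patch looks at. */
struct model_sdev {
	bool online;    /* models scsi_device_online(sdev) */
	bool locked;    /* models sdev->locked */
	bool was_reset; /* models sdev->was_reset */
	int relocks;    /* illustration only: how often the door lock was re-issued */
};

/* Stand-in for scsi_eh_lock_door(): just record that it was called. */
static void model_lock_door(struct model_sdev *sdev)
{
	sdev->relocks++;
}

/*
 * Mirrors the patched loop: re-issue the door lock only for devices that
 * are online, had their door locked, and were actually reset; then clear
 * the reset flag. Returns the number of lock requests issued.
 */
static int model_restart_operations(struct model_sdev *devs, size_t n)
{
	int issued = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct model_sdev *sdev = &devs[i];

		if (sdev->online && sdev->was_reset && sdev->locked) {
			model_lock_door(sdev);
			sdev->was_reset = false;
			issued++;
		}
	}
	return issued;
}
```

In this model a device that was not reset never triggers a lock request, so a request it still holds cannot block the error handler on that path; a second pass over the same devices issues nothing because the flag has been cleared.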
Re: blk-mq problem on proliant DL380 G3 (cciss)
On 2014-10-30 11:45, Christoph Hellwig wrote:
> On Thu, Oct 30, 2014 at 07:32:52PM +0200, Meelis Roos wrote:
>>> can you try the patch below? It's a hack and not a proper fix, but it
>>> addresses what seems to be your culprit, given that it is the only
>>> place allocating a request from the error handler.
>>
>> Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it
>> booted up fine.
>
> Jens, any idea what we could do here? We want to lock the door again
> ASAP after potentially resetting the device state as far as I can read
> the code (the commit message for it is utterly meaningless).
>
> Right now the code allocates the request from the scsi EH thread, which
> already is dangerous but mostly works for the !blk-mq case, but with the
> strict only allocate a request if a tag is available policy this breaks
> down if we still have BLOCK_PC requests that have references on them
> blocking another request queued (ATA cdroms tend to have a queue depth
> of 1).
>
> Given that this always was best effort anyway we might want to move it
> to a separate workqueue to not block EH?

So what we usually do for tagged devices that need some command for
error handling etc, is to have one tag reserved. The lock/unlock should
probably be using a reserved request, given how it is invoked as error
handling. Right now we don't reserve a tag for untagged things like PATA
cdrom, but we could, since they don't care about the tag anyway. And if
we had that and reserved grab in the scsi_eh_lock_door(), it should just
work.

--
Jens Axboe
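The reserved-tag idea described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the blk-mq implementation: `struct model_tag_pool`, `model_pool_init()`, `model_get_tag()`, `model_put_tag()` and `MODEL_NO_TAG` are hypothetical names, the pool is capped at 8 tags for simplicity, and allocation returns `MODEL_NO_TAG` at the point where the kernel allocator would sleep waiting for a free tag.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_TAGS 8
#define MODEL_NO_TAG (-1)

/* Hypothetical tag pool: tags [0, reserved) are reserved, the rest are normal. */
struct model_tag_pool {
	unsigned int depth;    /* total number of tags */
	unsigned int reserved; /* tags held back for error handling */
	bool busy[MODEL_MAX_TAGS];
};

static void model_pool_init(struct model_tag_pool *p, unsigned int depth,
			    unsigned int reserved)
{
	assert(depth <= MODEL_MAX_TAGS && reserved <= depth);
	p->depth = depth;
	p->reserved = reserved;
	for (unsigned int i = 0; i < MODEL_MAX_TAGS; i++)
		p->busy[i] = false;
}

/*
 * Non-blocking allocation. Normal callers only see the normal range,
 * error-handling callers (want_reserved) only see the reserved range.
 * Returns MODEL_NO_TAG where a blocking allocator would go to sleep.
 */
static int model_get_tag(struct model_tag_pool *p, bool want_reserved)
{
	unsigned int lo = want_reserved ? 0 : p->reserved;
	unsigned int hi = want_reserved ? p->reserved : p->depth;

	for (unsigned int i = lo; i < hi; i++) {
		if (!p->busy[i]) {
			p->busy[i] = true;
			return (int)i;
		}
	}
	return MODEL_NO_TAG;
}

static void model_put_tag(struct model_tag_pool *p, int tag)
{
	assert(tag >= 0 && (unsigned int)tag < p->depth && p->busy[tag]);
	p->busy[tag] = false;
}
```

With a depth of 1 and no reserved tag, a second allocation fails while the first tag is still held, which corresponds to the hang reported in this thread. With one extra tag set aside as reserved, the error-handling path can still obtain a tag while the single normal tag is pinned by an outstanding request.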
Re: blk-mq problem on proliant DL380 G3 (cciss)
On Thu, Oct 30, 2014 at 07:32:52PM +0200, Meelis Roos wrote:
> > can you try the patch below? It's a hack and not a proper fix, but it
> > addresses what seems to be your culprit, given that it is the only
> > place allocating a request from the error handler.
>
> Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it
> booted up fine.

Jens, any idea what we could do here? We want to lock the door again
ASAP after potentially resetting the device state as far as I can read
the code (the commit message for it is utterly meaningless).

Right now the code allocates the request from the scsi EH thread, which
already is dangerous but mostly works for the !blk-mq case, but with the
strict only allocate a request if a tag is available policy this breaks
down if we still have BLOCK_PC requests that have references on them
blocking another request queued (ATA cdroms tend to have a queue depth
of 1).

Given that this always was best effort anyway we might want to move it
to a separate workqueue to not block EH?
Re: blk-mq problem on proliant DL380 G3 (cciss)
> can you try the patch below? It's a hack and not a proper fix, but it
> addresses what seems to be your culprit, given that it is the only
> place allocating a request from the error handler.

Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it
booted up fine.

> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index fa7b5ec..5804ea0 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -2010,6 +2010,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>  	struct scsi_device *sdev;
>  	unsigned long flags;
>
> +#if 0
>  	/*
>  	 * If the door was locked, we need to insert a door lock request
>  	 * onto the head of the SCSI request queue for the device. There
> @@ -2019,6 +2020,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>  		if (scsi_device_online(sdev) && sdev->locked)
>  			scsi_eh_lock_door(sdev);
>  	}
> +#endif
>
>  	/*
>  	 * next free up anything directly waiting upon the host. this

--
Meelis Roos (mr...@linux.ee)
Re: blk-mq problem on proliant DL380 G3 (cciss)
Meelis,

can you try the patch below? It's a hack and not a proper fix, but it
addresses what seems to be your culprit, given that it is the only
place allocating a request from the error handler.

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index fa7b5ec..5804ea0 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2010,6 +2010,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 	struct scsi_device *sdev;
 	unsigned long flags;
 
+#if 0
 	/*
 	 * If the door was locked, we need to insert a door lock request
 	 * onto the head of the SCSI request queue for the device. There
@@ -2019,6 +2020,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 		if (scsi_device_online(sdev) && sdev->locked)
 			scsi_eh_lock_door(sdev);
 	}
+#endif
 
 	/*
 	 * next free up anything directly waiting upon the host. this
Re: blk-mq problem on proliant DL380 G3 (cciss)
> >> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >>>> Another test server with MPT SCSI RAID has similar problem,
> >>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> >>>> serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >>>
> >>> The first issue looks like scsi cdrom and error handling, it must be
> >>> leaking requests hence we hang on allocation of a new one. cciss
> >>> doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> >>> also have a libata driven cdrom?
> >>
> >> cciss does use scsi for CDROMs and other external devices, it is a bit
> >> of a mess.
> >>
> >> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> >> was 3.18-rc2 the first kernel you tested?
> >
> > Both machines ran 3.17 successfully. I turned on scsi-mq option as soon
> > as it appeared in Kconfig as a new option. But I am not sure when the
> > option appeared, before or after 3.17 release.
>
> So just to be fully clear, you never enabled scsi-mq on 3.17? To do
> that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
> The scsi-mq kconfig option did not show up until after 3.17 release.

Re-tested DL380G3 with 3.17 and manual scsi_mod.use_blk_mq=1 option. The
problem happens with 3.17 too with blk-mq.

--
Meelis Roos (mr...@linux.ee)
Re: blk-mq problem on proliant DL380 G3 (cciss)
> >> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >>>> Another test server with MPT SCSI RAID has similar problem,
> >>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> >>>> serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >>>
> >>> The first issue looks like scsi cdrom and error handling, it must be
> >>> leaking requests hence we hang on allocation of a new one. cciss
> >>> doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> >>> also have a libata driven cdrom?
> >>
> >> cciss does use scsi for CDROMs and other external devices, it is a bit
> >> of a mess.
> >>
> >> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> >> was 3.18-rc2 the first kernel you tested?
> >
> > Both machines ran 3.17 successfully. I turned on scsi-mq option as soon
> > as it appeared in Kconfig as a new option. But I am not sure when the
> > option appeared, before or after 3.17 release.
>
> So just to be fully clear, you never enabled scsi-mq on 3.17? To do
> that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
> The scsi-mq kconfig option did not show up until after 3.17 release.

Yes, I never enabled it via command line, only noticed it when the
question was asked during make oldconfig.

Will try 3.17 with use_blk_mq today.

--
Meelis Roos (mr...@linux.ee)
RE: blk-mq problem on proliant DL380 G3 (cciss)
> -----Original Message-----
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Meelis Roos
> Sent: Wednesday, 29 October, 2014 10:38 AM
> To: Jens Axboe
> Cc: linux-scsi@vger.kernel.org; Christoph Hellwig
> Subject: Re: blk-mq problem on proliant DL380 G3 (cciss)
>
> > On 2014-10-29 05:46, Meelis Roos wrote:
> > > > I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3
> > > > (with HP CCISS RAID controller). It fails late in the bootup with
> > > > "task scsi_eh_1:720 blocked for more than 120 seconds." messages.
> > > >
> > > > Booting with scsi_mod.use_blk_mq=0 fixes the problem.
> > >
> > > Another test server with MPT SCSI RAID has similar problem,
> > > scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> > > serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >
> > The first issue looks like scsi cdrom and error handling, it must be
> > leaking requests hence we hang on allocation of a new one. cciss
> > doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> > also have a libata driven cdrom?
>
> Yes, it does.
>
> --
> Meelis Roos (mr...@linux.ee)

In the log, the first soft lockup for scsi_eh_1 means the thread
for host1, which is a pata controller:

...
[ 15.069114] scsi host1: pata_serverworks
[ 15.173491] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti
[ 15.184512] scsi host2: pata_serverworks
[ 15.184673] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x2000 irq 14
[ 15.184675] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x2008 irq 15
[ 15.460452] ata1.00: ATAPI: COMPAQ CD-ROM SN-124, N104, max PIO4
[ 15.476445] ata1.00: configured for PIO4
[ 15.477110] scsi 1:0:0:0: CD-ROM COMPAQ CD-ROM SN-124 N104 PQ: 0 ANSI: 5
...
[ 240.704040] INFO: task scsi_eh_1:720 blocked for more than 120 seconds.
[ 240.783198] Not tainted 3.18.0-rc2-dirty #22
[ 240.840485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.934172] scsi_eh_1 D c1264d7f 0 720 2 0x
[ 241.010385] f5bdbe54 0046 f5bdbde0 c1264d7f f5bdbe00 1412 2be89db4
[ 241.103850] 0004 2be8b1c6 0004 c1534000 f63bca10 c10892f5 6f223d9e 0132
[ 241.197335] 066087ce f5bdbe50 c10892f5 6f22478a 0132 066087ce
[ 241.290803] Call Trace:
[ 241.320039] [] ? put_device+0xf/0x20
[ 241.374205] [] ? ktime_get+0x45/0x110
[ 241.429416] [] ? ktime_get+0x45/0x110
[ 241.484631] [] schedule+0x1e/0x60
[ 241.535679] [] io_schedule+0x77/0xc0
[ 241.589854] [] bt_get+0xc3/0x140
[ 241.639867] [] ? __wake_up_sync+0x20/0x20
[ 241.699240] [] blk_mq_get_tag+0x9e/0xc0
...
Re: blk-mq problem on proliant DL380 G3 (cciss)
On 10/29/2014 02:06 PM, Meelis Roos wrote:
>> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
>>>> Another test server with MPT SCSI RAID has similar problem,
>>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
>>>> serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
>>>
>>> The first issue looks like scsi cdrom and error handling, it must be
>>> leaking requests hence we hang on allocation of a new one. cciss
>>> doesn't use blk_mq regardless of the scsi setting. Does the mpt box
>>> also have a libata driven cdrom?
>>
>> cciss does use scsi for CDROMs and other external devices, it is a bit
>> of a mess.
>>
>> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
>> was 3.18-rc2 the first kernel you tested?
>
> Both machines ran 3.17 successfully. I turned on scsi-mq option as soon
> as it appeared in Kconfig as a new option. But I am not sure when the
> option appeared, before or after 3.17 release.

So just to be fully clear, you never enabled scsi-mq on 3.17? To do
that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
The scsi-mq kconfig option did not show up until after 3.17 release.

--
Jens Axboe
Re: blk-mq problem on proliant DL380 G3 (cciss)
> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> > >Another test server with MPT SCSI RAID has similar problem,
> > >scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> > >serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >
> > The first issue looks like scsi cdrom and error handling, it must be
> > leaking requests hence we hang on allocation of a new one. cciss
> > doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> > also have a libata driven cdrom?
>
> cciss does use scsi for CDROMs and other external devices, it is a bit
> of a mess.
>
> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> was 3.18-rc2 the first kernel you tested?

Both machines ran 3.17 successfully. I turned on scsi-mq option as soon
as it appeared in Kconfig as a new option. But I am not sure when the
option appeared, before or after 3.17 release.

--
Meelis Roos (mr...@linux.ee)
Re: blk-mq problem on proliant DL380 G3 (cciss)
On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >Another test server with MPT SCSI RAID has similar problem,
> >scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> >serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
>
> The first issue looks like scsi cdrom and error handling, it must be
> leaking requests hence we hang on allocation of a new one. cciss
> doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> also have a libata driven cdrom?

cciss does use scsi for CDROMs and other external devices, it is a bit
of a mess.

Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
was 3.18-rc2 the first kernel you tested?
Re: blk-mq problem on proliant DL380 G3 (cciss)
> On 2014-10-29 05:46, Meelis Roos wrote:
> > > I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3
> > > (with HP CCISS RAID controller). It fails late in the bootup with
> > > "task scsi_eh_1:720 blocked for more than 120 seconds." messages.
> > >
> > > Booting with scsi_mod.use_blk_mq=0 fixes the problem.
> >
> > Another test server with MPT SCSI RAID has similar problem,
> > scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> > serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
>
> The first issue looks like scsi cdrom and error handling, it must be
> leaking requests hence we hang on allocation of a new one. cciss
> doesn't use blk_mq regardless of the scsi setting. Does the mpt box
> also have a libata driven cdrom?

Yes, it does.

--
Meelis Roos (mr...@linux.ee)
Re: blk-mq problem on proliant DL380 G3 (cciss)
On 2014-10-29 05:46, Meelis Roos wrote:
>> I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3
>> (with HP CCISS RAID controller). It fails late in the bootup with
>> "task scsi_eh_1:720 blocked for more than 120 seconds." messages.
>>
>> Booting with scsi_mod.use_blk_mq=0 fixes the problem.
>
> Another test server with MPT SCSI RAID has similar problem,
> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no
> serial console). 3.18.0-rc2-00043-gf7e87a4 was tested there.

The first issue looks like scsi cdrom and error handling, it must be
leaking requests hence we hang on allocation of a new one. cciss doesn't
use blk_mq regardless of the scsi setting. Does the mpt box also have a
libata driven cdrom?

--
Jens Axboe
Re: blk-mq problem on proliant DL380 G3 (cciss)
> I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3 (with HP
> CCISS RAID controller). It fails late in the bootup with "task
> scsi_eh_1:720 blocked for more than 120 seconds." messages.
>
> Booting with scsi_mod.use_blk_mq=0 fixes the problem.

Another test server with MPT SCSI RAID has similar problem,
scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
console). 3.18.0-rc2-00043-gf7e87a4 was tested there.

--
Meelis Roos (mr...@linux.ee)