Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-11-03 Thread Meelis Roos

On 03-11-2014 12:08, Christoph Hellwig wrote:
> Meelis,
>
> can you give the patch below a try? This only tries to lock the door
> on devices that actually were reset. Given that on a reset device we
> fail all commands before resuming operations it should work fine there
> as all tags should be released.

Works fine on both DL380G3 and the other server with MPT and IDE CD.

--
Meelis Roos
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-11-03 Thread Christoph Hellwig
Meelis,

can you give the patch below a try?  This only tries to lock the door
on devices that actually were reset. Given that on a reset device we
fail all commands before resuming operations it should work fine there
as all tags should be released.

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index fa7b5ec..7af43cb 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2016,8 +2016,10 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 	 * is no point trying to lock the door of an off-line device.
 	 */
 	shost_for_each_device(sdev, shost) {
-		if (scsi_device_online(sdev) && sdev->locked)
+		if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
 			scsi_eh_lock_door(sdev);
+			sdev->was_reset = 0;
+		}
 	}
 
 	/*
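
The condition in the patch above can be modelled outside the kernel. What
follows is a minimal, self-contained sketch in plain C, not kernel code:
struct toy_sdev and toy_relock_doors() are hypothetical names invented for
illustration, and the relocks counter merely stands in for a call to
scsi_eh_lock_door().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the few struct scsi_device fields the patch touches. */
struct toy_sdev {
	bool online;     /* models scsi_device_online(sdev) */
	bool locked;     /* models sdev->locked */
	bool was_reset;  /* models sdev->was_reset */
	int  relocks;    /* counts simulated scsi_eh_lock_door() calls */
};

/*
 * Hypothetical helper modelling the patched loop: only devices that are
 * online, locked and were actually reset get a door-lock request, and
 * the was_reset flag is cleared once the request has been issued.
 * Returns the number of door-lock requests issued.
 */
static int toy_relock_doors(struct toy_sdev *devs, size_t n)
{
	int issued = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct toy_sdev *sdev = &devs[i];

		if (sdev->online && sdev->was_reset && sdev->locked) {
			sdev->relocks++;          /* would be scsi_eh_lock_door(sdev) */
			sdev->was_reset = false;
			issued++;
		}
	}
	return issued;
}
```

In this model a locked device that was not reset is skipped, so no request
(and hence no tag) is needed for it, and a second pass over the same devices
issues nothing because was_reset has been cleared.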


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-11-02 Thread Jens Axboe

On 2014-10-30 11:45, Christoph Hellwig wrote:
> On Thu, Oct 30, 2014 at 07:32:52PM +0200, Meelis Roos wrote:
>>> can you try the patch below?  It's a hack and not a proper fix, but it
>>> addresses what seems to be your culprit, given that it is the only
>>> place allocating a request from the error handler.
>>
>> Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it
>> booted up fine.
>
> Jens,
>
> any idea what we could do here?  We want to lock the door again ASAP
> after potentially resetting the device state as far as I can read
> the code (the commit message for it is utterly meaningless).
>
> Right now the code allocates the request from the scsi EH thread, which
> already is dangerous but mostly works for the !blk-mq case, but with the
> strict only allocate a request if a tag is available policy this breaks
> down if we still have BLOCK_PC requests that have references on them
> blocking another request queued (ATA cdroms tend to have a queue depth
> of 1).
>
> Given that this always was best effort anyway we might want to move it
> to a separate workqueue to not block EH?


So what we usually do for tagged devices that need some command for 
error handling etc, is to have one tag reserved. The lock/unlock should 
probably be using a reserved request, given how it is invoked as error 
handling. Right now we don't reserve a tag for untagged things like PATA 
cdrom, but we could, since they don't care about the tag anyway. And if 
we had that and reserved grab in the scsi_eh_lock_door(), it should just 
work.
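
The reserved-tag idea can be sketched as a toy allocator. This is a minimal
model in plain C under stated assumptions, not the blk-mq implementation:
struct rsv_tag_pool and the rsv_* functions are hypothetical names, and
where a real allocator would sleep waiting for a tag this one simply
returns -1.

```c
#include <assert.h>
#include <stdbool.h>

#define RSV_MAX_TAGS 32

/* Toy tag pool: tags [0, reserved) are set aside for error handling. */
struct rsv_tag_pool {
	int  depth;                  /* total tags, including reserved ones */
	int  reserved;               /* number of reserved tags */
	bool busy[RSV_MAX_TAGS];
};

static void rsv_pool_init(struct rsv_tag_pool *p, int depth, int reserved)
{
	assert(depth > 0 && depth <= RSV_MAX_TAGS);
	assert(reserved >= 0 && reserved < depth);
	p->depth = depth;
	p->reserved = reserved;
	for (int i = 0; i < RSV_MAX_TAGS; i++)
		p->busy[i] = false;
}

/*
 * Normal I/O may only take tags from [reserved, depth); a caller passing
 * use_reserved = true (e.g. the door-lock path) takes from [0, reserved).
 * Returns the tag, or -1 where a blocking allocator would have to wait.
 */
static int rsv_tag_try_get(struct rsv_tag_pool *p, bool use_reserved)
{
	int start = use_reserved ? 0 : p->reserved;
	int end = use_reserved ? p->reserved : p->depth;

	for (int i = start; i < end; i++) {
		if (!p->busy[i]) {
			p->busy[i] = true;
			return i;
		}
	}
	return -1;
}

static void rsv_tag_put(struct rsv_tag_pool *p, int tag)
{
	assert(tag >= 0 && tag < p->depth && p->busy[tag]);
	p->busy[tag] = false;
}
```

With one normal tag and one reserved tag, the error-handling path can still
obtain a tag while the only normal tag is held by an outstanding request.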


--
Jens Axboe



Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-30 Thread Christoph Hellwig
On Thu, Oct 30, 2014 at 07:32:52PM +0200, Meelis Roos wrote:
> > can you try the patch below?  It's a hack and not a proper fix, but it
> > addresses what seems to be your culprit, given that it is the only
> > place allocating a request from the error handler.
> 
> Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it 
> booted up fine.

Jens,

any idea what we could do here?  We want to lock the door again ASAP
after potentially resetting the device state as far as I can read
the code (the commit message for it is utterly meaningless).

Right now the code allocates the request from the scsi EH thread, which
already is dangerous but mostly works for the !blk-mq case, but with the
strict only allocate a request if a tag is available policy this breaks
down if we still have BLOCK_PC requests that have references on them
blocking another request queued (ATA cdroms tend to have a queue depth
of 1).

Given that this always was best effort anyway we might want to move it
to a separate workqueue to not block EH?
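
The failure mode described above can be illustrated with a toy tag pool. This
is a minimal sketch in plain C, not the blk-mq tag code: struct toy_tag_pool
and the toy_* functions are hypothetical names, and a return value of -1
marks the point where a blocking allocator would sleep until a tag is freed.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_TAGS 32

struct toy_tag_pool {
	int  depth;                  /* queue depth, e.g. 1 for an ATA CD-ROM */
	bool busy[TOY_MAX_TAGS];
};

static void toy_pool_init(struct toy_tag_pool *p, int depth)
{
	assert(depth > 0 && depth <= TOY_MAX_TAGS);
	p->depth = depth;
	for (int i = 0; i < TOY_MAX_TAGS; i++)
		p->busy[i] = false;
}

/* Returns a free tag, or -1 where a blocking allocator would have to wait. */
static int toy_tag_try_get(struct toy_tag_pool *p)
{
	for (int i = 0; i < p->depth; i++) {
		if (!p->busy[i]) {
			p->busy[i] = true;
			return i;
		}
	}
	return -1;
}

static void toy_tag_put(struct toy_tag_pool *p, int tag)
{
	assert(tag >= 0 && tag < p->depth && p->busy[tag]);
	p->busy[tag] = false;
}
```

With a depth of 1, a request that still holds the only tag makes every
further allocation fail until that tag is put back. If the waiter is the
thread that would have to complete the outstanding request, nothing ever
frees the tag.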


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-30 Thread Meelis Roos
> can you try the patch below?  It's a hack and not a proper fix, but it
> addresses what seems to be your culprit, given that it is the only
> place allocating a request from the error handler.

Applied it on top of 3.18-rc2, booted with scsi_mod.use_blk_mq=1 and it 
booted up fine.

> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index fa7b5ec..5804ea0 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -2010,6 +2010,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>  	struct scsi_device *sdev;
>  	unsigned long flags;
>  
> +#if 0
>  	/*
>  	 * If the door was locked, we need to insert a door lock request
>  	 * onto the head of the SCSI request queue for the device.  There
> @@ -2019,6 +2020,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>  		if (scsi_device_online(sdev) && sdev->locked)
>  			scsi_eh_lock_door(sdev);
>  	}
> +#endif
>  
>  	/*
>  	 * next free up anything directly waiting upon the host.  this
> 

-- 
Meelis Roos (mr...@linux.ee)


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-30 Thread Christoph Hellwig
Meelis,

can you try the patch below?  It's a hack and not a proper fix, but it
addresses what seems to be your culprit, given that it is the only
place allocating a request from the error handler.


diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index fa7b5ec..5804ea0 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2010,6 +2010,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 	struct scsi_device *sdev;
 	unsigned long flags;
 
+#if 0
 	/*
 	 * If the door was locked, we need to insert a door lock request
 	 * onto the head of the SCSI request queue for the device.  There
@@ -2019,6 +2020,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 		if (scsi_device_online(sdev) && sdev->locked)
 			scsi_eh_lock_door(sdev);
 	}
+#endif
 
 	/*
 	 * next free up anything directly waiting upon the host.  this


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-30 Thread Meelis Roos
> >> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >>>> Another test server with MPT SCSI RAID has similar problem,
> >>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> >>>> console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >>>
> >>> The first issue looks like scsi cdrom and error handling, it must be leaking
> >>> requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> >>> regardless of the scsi setting. Does the mpt box also have a libata driven
> >>> cdrom?
> >>
> >> cciss does use scsi for CDROMs and other external devices, it is a bit
> >> of a mess.
> >>
> >> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> >> was 3.18-rc2 the first kernel you tested?
> > 
> > Both machines ran 3.17 successfully. I turned on scsi-mq option as soon 
> > as it appeared in Kconfig as a new option. But I am not sure when the 
> > option appeared, before or after 3.17 release.
> 
> So just to be fully clear, you never enabled scsi-mq on 3.17? To do
> that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
> The scsi-mq kconfig option did not show up until after 3.17 release.

Re-tested DL380G3 with 3.17 and manual scsi_mod.use_blk_mq=1 option. The 
problem happens with 3.17 too with blk-mq.

-- 
Meelis Roos (mr...@linux.ee)


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Meelis Roos
> >> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >>>> Another test server with MPT SCSI RAID has similar problem,
> >>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> >>>> console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >>>
> >>> The first issue looks like scsi cdrom and error handling, it must be leaking
> >>> requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> >>> regardless of the scsi setting. Does the mpt box also have a libata driven
> >>> cdrom?
> >>
> >> cciss does use scsi for CDROMs and other external devices, it is a bit
> >> of a mess.
> >>
> >> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> >> was 3.18-rc2 the first kernel you tested?
> > 
> > Both machines ran 3.17 successfully. I turned on scsi-mq option as soon 
> > as it appeared in Kconfig as a new option. But I am not sure when the 
> > option appeared, before or after 3.17 release.
> 
> So just to be fully clear, you never enabled scsi-mq on 3.17? To do
> that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
> The scsi-mq kconfig option did not show up until after 3.17 release.

Yes, I never enabled it via command line, only noticed it when the 
question was asked during make oldconfig. Will try 3.17 with use_blk_mq 
today.

-- 
Meelis Roos (mr...@linux.ee)


RE: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Elliott, Robert (Server Storage)


> -----Original Message-----
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Meelis Roos
> Sent: Wednesday, 29 October, 2014 10:38 AM
> To: Jens Axboe
> Cc: linux-scsi@vger.kernel.org; Christoph Hellwig
> Subject: Re: blk-mq problem on proliant DL380 G3 (cciss)
> 
> > On 2014-10-29 05:46, Meelis Roos wrote:
> > > > I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3 (with HP
> > > > CCISS RAID controller). It fails late in the bootup with "task
> > > > scsi_eh_1:720 blocked for more than 120 seconds." messages.
> > > >
> > > > Booting with scsi_mod.use_blk_mq=0 fixes the problem.
> > >
> > > Another test server with MPT SCSI RAID has similar problem,
> > > scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> > > console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> >
> > The first issue looks like scsi cdrom and error handling, it must be leaking
> > requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> > regardless of the scsi setting. Does the mpt box also have a libata driven
> > cdrom?
> 
> Yes, it does.
> 
> --
> Meelis Roos (mr...@linux.ee)

In the log, the first hung-task report is for scsi_eh_1, the error handler
thread for host1, which is a PATA controller:

...
[   15.069114] scsi host1: pata_serverworks
[   15.173491] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti
[   15.184512] scsi host2: pata_serverworks
[   15.184673] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x2000 irq 14
[   15.184675] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x2008 irq 15
[   15.460452] ata1.00: ATAPI: COMPAQ  CD-ROM SN-124, N104, max PIO4
[   15.476445] ata1.00: configured for PIO4
[   15.477110] scsi 1:0:0:0: CD-ROM COMPAQ CD-ROM SN-124 N104 PQ: 0 ANSI: 5
...
[  240.704040] INFO: task scsi_eh_1:720 blocked for more than 120 seconds.
[  240.783198]   Not tainted 3.18.0-rc2-dirty #22
[  240.840485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.934172] scsi_eh_1   D c1264d7f 0   720  2 0x
[  241.010385]  f5bdbe54 0046 f5bdbde0 c1264d7f f5bdbe00 1412  2be89db4
[  241.103850]  0004 2be8b1c6 0004 c1534000 f63bca10 c10892f5 6f223d9e 0132
[  241.197335]   066087ce f5bdbe50 c10892f5 6f22478a 0132  066087ce
[  241.290803] Call Trace:
[  241.320039]  [] ? put_device+0xf/0x20
[  241.374205]  [] ? ktime_get+0x45/0x110
[  241.429416]  [] ? ktime_get+0x45/0x110
[  241.484631]  [] schedule+0x1e/0x60
[  241.535679]  [] io_schedule+0x77/0xc0
[  241.589854]  [] bt_get+0xc3/0x140
[  241.639867]  [] ? __wake_up_sync+0x20/0x20
[  241.699240]  [] blk_mq_get_tag+0x9e/0xc0
...




Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Jens Axboe
On 10/29/2014 02:06 PM, Meelis Roos wrote:
>> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
>>>> Another test server with MPT SCSI RAID has similar problem,
>>>> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
>>>> console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
>>>
>>> The first issue looks like scsi cdrom and error handling, it must be leaking
>>> requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
>>> regardless of the scsi setting. Does the mpt box also have a libata driven
>>> cdrom?
>>
>> cciss does use scsi for CDROMs and other external devices, it is a bit
>> of a mess.
>>
>> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
>> was 3.18-rc2 the first kernel you tested?
> 
> Both machines ran 3.17 successfully. I turned on scsi-mq option as soon 
> as it appeared in Kconfig as a new option. But I am not sure when the 
> option appeared, before or after 3.17 release.

So just to be fully clear, you never enabled scsi-mq on 3.17? To do
that, you would have had to add a scsi_mod.use_blk_mq=1 boot parameter.
The scsi-mq kconfig option did not show up until after 3.17 release.


-- 
Jens Axboe



Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Meelis Roos
> On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> > >Another test server with MPT SCSI RAID has similar problem,
> > >scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> > >console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> > 
> > The first issue looks like scsi cdrom and error handling, it must be leaking
> > requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> > regardless of the scsi setting. Does the mpt box also have a libata driven
> > cdrom?
> 
> cciss does use scsi for CDROMs and other external devices, it is a bit
> of a mess.
> 
> Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
> was 3.18-rc2 the first kernel you tested?

Both machines ran 3.17 successfully. I turned on scsi-mq option as soon 
as it appeared in Kconfig as a new option. But I am not sure when the 
option appeared, before or after 3.17 release.

-- 
Meelis Roos (mr...@linux.ee)


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Christoph Hellwig
On Wed, Oct 29, 2014 at 09:08:46AM -0600, Jens Axboe wrote:
> >Another test server with MPT SCSI RAID has similar problem,
> >scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> >console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> 
> The first issue looks like scsi cdrom and error handling, it must be leaking
> requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> regardless of the scsi setting. Does the mpt box also have a libata driven
> cdrom?

cciss does use scsi for CDROMs and other external devices, it is a bit
of a mess.

Meelis, did you also test scsi-mq on 3.17 and this is a regression, or
was 3.18-rc2 the first kernel you tested?


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Meelis Roos
> On 2014-10-29 05:46, Meelis Roos wrote:
> > > I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3 (with HP
> > > CCISS RAID controller). It fails late in the bootup with "task
> > > scsi_eh_1:720 blocked for more than 120 seconds." messages.
> > >
> > > Booting with scsi_mod.use_blk_mq=0 fixes the problem.
> >
> > Another test server with MPT SCSI RAID has similar problem,
> > scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> > console). 3.18.0-rc2-00043-gf7e87a4 was tested there.
> 
> The first issue looks like scsi cdrom and error handling, it must be leaking
> requests hence we hang on allocation of a new one. cciss doesn't use blk_mq
> regardless of the scsi setting. Does the mpt box also have a libata driven
> cdrom?

Yes, it does.

-- 
Meelis Roos (mr...@linux.ee)


Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Jens Axboe

On 2014-10-29 05:46, Meelis Roos wrote:
>> I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3 (with HP
>> CCISS RAID controller). It fails late in the bootup with "task
>> scsi_eh_1:720 blocked for more than 120 seconds." messages.
>>
>> Booting with scsi_mod.use_blk_mq=0 fixes the problem.
>
> Another test server with MPT SCSI RAID has similar problem,
> scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial
> console). 3.18.0-rc2-00043-gf7e87a4 was tested there.

The first issue looks like scsi cdrom and error handling, it must be
leaking requests hence we hang on allocation of a new one. cciss doesn't
use blk_mq regardless of the scsi setting. Does the mpt box also have a
libata driven cdrom?


--
Jens Axboe



Re: blk-mq problem on proliant DL380 G3 (cciss)

2014-10-29 Thread Meelis Roos
> I tried 3.18-rc2 with blk-mq default on on HP ProLiant DL380 G3 (with HP 
> CCISS RAID controller). It fails late in the bootup with "task 
> scsi_eh_1:720 blocked for more than 120 seconds." messages.
> 
> Booting with scsi_mod.use_blk_mq=0 fixes the problem.

Another test server with MPT SCSI RAID has similar problem, 
scsi_mod.use_blk_mq=0 cures it but I can not get good trace (no serial 
console). 3.18.0-rc2-00043-gf7e87a4 was tested there.

-- 
Meelis Roos (mr...@linux.ee)