Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-09-10 Thread Andrew Morton
On Thu, 30 Aug 2007 13:29:37 +0200 Arkadiusz Miskiewicz <[EMAIL PROTECTED]> wrote:

> On Wednesday 29 of August 2007, Jens Axboe wrote:
> > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > > > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > > > > I guess I should sent these here since it looks like not scsi bug
> > > > > > > anyway.
> > > > > >
> > > > > > It's stex, right? It seems to have some issues with multiple
> > > > > > completions of commands, which craps out the block layer of course.
> > > > >
> > > > > Yes, stex. I'm staying with 2.6.19 in that case since it works fine
> > > > > in that version.
> > > > >
> > > > > So scsi bug ... 8-)
> > > >
> > > > And you based that conclusion on what exactly?

Could be viewed as a scsi deficiency at least.  Is it unheard of for
"independent" queues to have shared resources?  If so, then yeah, perhaps
some driver-private locking as James suggested is appropriate.  But if
other drivers face similar problems then perhaps it is something which scsi
core should offer support for.

But whatever.  The situation is that Ed suggested a fix eight months ago,
James suggested enhancements and afaict nobody did anything more, and
machines which use this driver are still crashing.



OK, Ed's email client breaks message threading, so you need to hyperjump to
a "different" thread a few days later, in which Ed points out that qla4xxx
also has a shared tag queue.

Ed's email client proceeds to splatter the discussion all over the Jan 2007
archive.  Ed finds a possible bug in qla4xxx.  Jens proposes a block patch.
Ed disagrees, Jeff agrees with Ed, discussion dies, driver still
crashing.


> > > Isn't drivers/scsi/* handled by [EMAIL PROTECTED] (that's what I mean)
> >
> > Yep indeed, I thought you meant that it was a scsi bug (and not an stex
> > one). You could try and copy the 2.6.19 stex driver into 2.6.20 and see
> > if that works, though.
> 
> Looks like this bug is known for months :-(
> 
> Ed Lin pointed to http://lkml.org/lkml/2007/1/23/268 with possible patch
> (that unfortunately serialises access to storage devices, well...)
> 
> There is also: http://bugzilla.kernel.org/show_bug.cgi?id=7842
> 
> I'm running 2.6.22 with that patch now, did huge (few hours) rsync that 
> previously caused oopses and now everything works properly.
> 
> Can we get some form of this patch into Linus tree?

Here's Ed's patch again.  As a suboptimal driver is better than a crashing
one, perhaps we should merge it until we can sort out something better?



From: "Ed Lin" <[EMAIL PROTECTED]>

The block layer uses a lock to protect the request queue.  Every scsi device
has a unique request queue, and the queue lock is the default lock in struct
request_queue.  This is fine for normal cases.  But for a host with a shared
queue tag (e.g.  stex controllers), a queue lock per device means the
shared queue tag is not protected when multiple devices are accessed at the
same time.  This patch is a simple fix for this situation: it introduces a
host queue lock to protect the shared queue tag.  Without this patch we will
see various kernel panics (including the BUG() and kernel errors in
blk_queue_start_tag and blk_queue_end_tag of ll_rw_blk.c) when multiple
devices are accessed at the same time on SMP kernels.

Signed-off-by: Ed Lin <[EMAIL PROTECTED]>
Cc: James Bottomley <[EMAIL PROTECTED]>
Cc: Jeff Garzik <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 drivers/scsi/scsi_lib.c  |    2 +-
 drivers/scsi/stex.c      |    2 ++
 include/scsi/scsi_host.h |    3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff -puN drivers/scsi/scsi_lib.c~scsi-use-lock-per-host-instead-of-per-device-for-shared-queue-tag-host drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c~scsi-use-lock-per-host-instead-of-per-device-for-shared-queue-tag-host
+++ a/drivers/scsi/scsi_lib.c
@@ -1670,7 +1670,7 @@ struct request_queue *__scsi_alloc_queue
 {
 	struct request_queue *q;
 
-	q = blk_init_queue(request_fn, NULL);
+	q = blk_init_queue(request_fn, shost->req_q_lock);
 	if (!q)
 		return NULL;
 
diff -puN drivers/scsi/stex.c~scsi-use-lock-per-host-instead-of-per-device-for-shared-queue-tag-host drivers/scsi/stex.c
--- a/drivers/scsi/stex.c~scsi-use-lock-per-host-instead-of-per-device-for-shared-queue-tag-host
+++ a/drivers/scsi/stex.c
@@ -1234,6 +1234,8 @@ stex_probe(struct pci_dev *pdev, const s
 	if (err)
 		goto out_free_irq;
 
+	spin_lock_init(&host->__req_q_lock);
+	host->req_q_lock = &host->__req_q_lock;
 	err = scsi_init_shared_tag_map(host, host->can_queue);
 	if (err) {
 		printk(KERN_ERR DRV_NAME "(%s): init shared queue failed\n",
diff -puN 
include/scsi/scsi_host.h~scsi-use-lock-per-host-instead-of-
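
For readers following the patch, the mechanism it relies on can be modelled
in a few lines of user-space C.  This is only a sketch under stated
assumptions, not kernel code: every toy_* type and function is a
hypothetical name invented for this illustration, and the "lock" is a plain
struct rather than a spinlock_t.  It shows the one thing the scsi_lib.c hunk
changes: a queue initialised with a NULL lock falls back to its own private
lock, while queues initialised with the host's lock pointer all end up
using that single lock.

```c
#include <stddef.h>

/* Toy lock type; the kernel would use spinlock_t here. */
struct toy_lock {
	int initialised;
};

/* Toy request queue: owns a private lock but may be pointed at another. */
struct toy_queue {
	struct toy_lock own_lock;     /* per-queue default lock */
	struct toy_lock *queue_lock;  /* lock actually used for this queue */
};

/* Toy host: req_q_lock stays NULL unless the driver opts in. */
struct toy_host {
	struct toy_lock host_lock_storage;
	struct toy_lock *req_q_lock;
};

static void toy_lock_init(struct toy_lock *l)
{
	l->initialised = 1;
}

/* Use the caller's lock if one is given, else fall back to the queue's own. */
static void toy_init_queue(struct toy_queue *q, struct toy_lock *lock)
{
	if (!lock) {
		toy_lock_init(&q->own_lock);
		lock = &q->own_lock;
	}
	q->queue_lock = lock;
}

/* What a shared-tag driver would do at probe time to opt in. */
static void toy_host_use_shared_lock(struct toy_host *h)
{
	toy_lock_init(&h->host_lock_storage);
	h->req_q_lock = &h->host_lock_storage;
}

/* Returns 1 if two device queues of one host are serialised by one lock. */
int toy_queues_share_lock(int driver_opts_in)
{
	struct toy_host host = { .req_q_lock = NULL };
	struct toy_queue dev_a, dev_b;

	if (driver_opts_in)
		toy_host_use_shared_lock(&host);

	toy_init_queue(&dev_a, host.req_q_lock);
	toy_init_queue(&dev_b, host.req_q_lock);
	return dev_a.queue_lock == dev_b.queue_lock;
}
```

Called with 0, toy_queues_share_lock() reports that the two queues hold
distinct locks, which is the situation the patch description calls unsafe
for a shared queue tag.  Called with 1, both queues use the host's lock, at
the price of serialising access to every device behind that host.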

Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-30 Thread Arkadiusz Miskiewicz
On Wednesday 29 of August 2007, Jens Axboe wrote:
> On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > > > I guess I should sent these here since it looks like not scsi bug
> > > > > > anyway.
> > > > >
> > > > > It's stex, right? It seems to have some issues with multiple
> > > > > completions of commands, which craps out the block layer of course.
> > > >
> > > > Yes, stex. I'm staying with 2.6.19 in that case since it works fine
> > > > in that version.
> > > >
> > > > So scsi bug ... 8-)
> > >
> > > And you based that conclusion on what exactly?
> >
> > Isn't drivers/scsi/* handled by [EMAIL PROTECTED] (that's what I mean)
>
> Yep indeed, I thought you meant that it was a scsi bug (and not an stex
> one). You could try and copy the 2.6.19 stex driver into 2.6.20 and see
> if that works, though.

Looks like this bug is known for months :-(

Ed Lin pointed to http://lkml.org/lkml/2007/1/23/268 with possible patch (that 
unfortunately serialises access to storage devices, well...)

There is also: http://bugzilla.kernel.org/show_bug.cgi?id=7842

I'm running 2.6.22 with that patch now, did huge (few hours) rsync that 
previously caused oopses and now everything works properly.

Can we get some form of this patch into Linus tree?

-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-29 Thread Jens Axboe
On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> On Wednesday 29 of August 2007, Jens Axboe wrote:
> > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > > I guess I should sent these here since it looks like not scsi bug
> > > > > anyway.
> > > >
> > > > It's stex, right? It seems to have some issues with multiple
> > > > completions of commands, which craps out the block layer of course.
> > >
> > > Yes, stex. I'm staying with 2.6.19 in that case since it works fine in
> > > that version.
> > >
> > > So scsi bug ... 8-)
> >
> > And you based that conclusion on what exactly?
> 
> Isn't drivers/scsi/* handled by [EMAIL PROTECTED] (that's what I mean)

Yep indeed, I thought you meant that it was a scsi bug (and not an stex
one). You could try and copy the 2.6.19 stex driver into 2.6.20 and see
if that works, though.

-- 
Jens Axboe



Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-29 Thread Arkadiusz Miskiewicz
On Wednesday 29 of August 2007, Jens Axboe wrote:
> On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > On Wednesday 29 of August 2007, Jens Axboe wrote:
> > > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > > I guess I should sent these here since it looks like not scsi bug
> > > > anyway.
> > >
> > > It's stex, right? It seems to have some issues with multiple
> > > completions of commands, which craps out the block layer of course.
> >
> > Yes, stex. I'm staying with 2.6.19 in that case since it works fine in
> > that version.
> >
> > So scsi bug ... 8-)
>
> And you based that conclusion on what exactly?

Isn't drivers/scsi/* handled by [EMAIL PROTECTED] (that's what I mean)

-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/


Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-29 Thread Jens Axboe
On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> On Wednesday 29 of August 2007, Jens Axboe wrote:
> > On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > > I guess I should sent these here since it looks like not scsi bug anyway.
> >
> > It's stex, right? It seems to have some issues with multiple completions
> > of commands, which craps out the block layer of course.
> 
> Yes, stex. I'm staying with 2.6.19 in that case since it works fine in that 
> version.
> 
> So scsi bug ... 8-)

And you based that conclusion on what exactly?

-- 
Jens Axboe



Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-29 Thread Arkadiusz Miskiewicz
On Wednesday 29 of August 2007, Jens Axboe wrote:
> On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> > I guess I should sent these here since it looks like not scsi bug anyway.
>
> It's stex, right? It seems to have some issues with multiple completions
> of commands, which craps out the block layer of course.

Yes, stex. I'm staying with 2.6.19 in that case since it works fine in that 
version.

So scsi bug ... 8-)
-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/


Re: 2.6.22 oops kernel BUG at block/elevator.c:366!

2007-08-29 Thread Jens Axboe
On Wed, Aug 29 2007, Arkadiusz Miskiewicz wrote:
> 
> I guess I should sent these here since it looks like not scsi bug anyway.

It's stex, right? It seems to have some issues with multiple completions
of commands, which craps out the block layer of course.
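
To make the "multiple completions" remark concrete, here is a small
user-space sketch of why removing the same request twice is fatal for a
scheduler's bookkeeping.  It is an illustration under assumptions, not the
kernel's code: the trace below only shows the BUG firing in elv_rb_del, and
this toy swaps the kernel's rb-tree for a doubly linked list.  All toy_*
names are hypothetical.  Where the kernel would hit BUG(), the sketch
returns -1 instead.

```c
#include <stddef.h>

/* Toy request; queued is 1 only while it sits on the scheduler list. */
struct toy_request {
	struct toy_request *prev, *next;
	int queued;
};

/* Toy scheduler: circular list with a sentinel head. */
struct toy_sched {
	struct toy_request head;
	int count;
};

static void toy_sched_init(struct toy_sched *s)
{
	s->head.prev = s->head.next = &s->head;
	s->head.queued = 0;
	s->count = 0;
}

/* Append a request to the tail of the list. */
static void toy_add(struct toy_sched *s, struct toy_request *rq)
{
	rq->next = &s->head;
	rq->prev = s->head.prev;
	s->head.prev->next = rq;
	s->head.prev = rq;
	rq->queued = 1;
	s->count++;
}

/* Remove a request; refuse (-1) if it is not on the list any more. */
static int toy_del(struct toy_sched *s, struct toy_request *rq)
{
	if (!rq->queued)
		return -1;   /* the kernel would BUG() at this point */
	rq->prev->next = rq->next;
	rq->next->prev = rq->prev;
	rq->prev = rq->next = NULL;
	rq->queued = 0;
	s->count--;
	return 0;
}

/* Returns 1 if the second removal of the same request is caught. */
int toy_double_completion_demo(void)
{
	struct toy_sched s;
	struct toy_request a = { 0 }, b = { 0 };
	int first, second;

	toy_sched_init(&s);
	toy_add(&s, &a);
	toy_add(&s, &b);

	first = toy_del(&s, &a);    /* legitimate removal */
	second = toy_del(&s, &a);   /* same request handled a second time */

	return first == 0 && second == -1 && s.count == 1;
}
```

Without the queued check, the second toy_del() would dereference stale
links and corrupt the list, which is why such a check is worth having even
though it turns the bug into an immediate crash.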

> --  Forwarded Message  --
> 
> Subject: 2.6.22 oops kernel BUG at block/elevator.c:366!
> Date: Wednesday 29 of August 2007
> From: Arkadiusz Miskiewicz <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> 
> Hello,
> 
> I'm trying to get stable kernel for Promise SuperTrak 
> X16350 hardware. So far 2.6.20, 2.6.21 and 2.6.22 oopsed
> like this (while doing rsync):
> 
> kernel BUG at block/elevator.c:366!
> invalid opcode:  [1] SMP
> CPU 1
> Modules linked in: softdog sch_sfq forcedeth ext3 jbd mbcache dm_mod xfs 
> scsi_wait_scan sd_mod stex scsi_mod
> Pid: 1139:#0, comm: xfsbufd Not tainted 2.6.22.5-0.2 #1
> RIP: 0010:[]  [] elv_rb_del+0x3a/0x40
> RSP: :8100759b1c00  EFLAGS: 00010046
> RAX: 81000d1f5428 RBX: 81000d1f5428 RCX: 81007c1a1a00
> RDX:  RSI: 81000d1f53b0 RDI: 81007c102af0
> RBP: 81000d1f53b0 R08: 81004a9dab50 R09: 
> R10:  R11: 880072c0 R12: 81007c102ac0
> R13: 81007c1a1a00 R14: 0004 R15: 81007c102b18
> FS:  2ba2cafc9be0() GS:81007d0a5b40() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 2ba2cab5a158 CR3: 3c5ce000 CR4: 06e0
> Process xfsbufd (pid: 1139[#0], threadinfo 8100759b, task 
> 81007cac1040)
> Stack:  0001 81007c102ac0 81000d1f53b0 8034abe8
>  0246 81000d1f53b0 81007c1a1a00 81007c102ac0
>  81007c0f2d08 0004 81007c102b18 8034ad55
> Call Trace:
>  [] cfq_remove_request+0x78/0x1b0
>  [] cfq_dispatch_insert+0x35/0x70
>  [] cfq_dispatch_requests+0x1bf/0x3a0
>  [] elv_next_request+0x3f/0x150
>  [] lock_timer_base+0x34/0x70
>  [] :scsi_mod:scsi_request_fn+0x69/0x3d0
>  [] __make_request+0xe6/0x5d0
>  [] generic_make_request+0x18b/0x230
>  [] submit_bio+0x5a/0xf0
>  [] :xfs:_xfs_buf_ioapply+0x199/0x340
>  [] :xfs:xfs_buf_iorequest+0x29/0x80
>  [] :xfs:xfs_bdstrat_cb+0x3b/0x50
>  [] :xfs:xfsbufd+0x92/0x140
>  [] :xfs:xfsbufd+0x0/0x140
>  [] kthread+0x4b/0x80
>  [] child_rip+0xa/0x12
>  [] kthread+0x0/0x80
>  [] child_rip+0x0/0x12
> 
> 
> Code: 0f 0b eb fe 66 90 48 83 ec 08 49 89 f8 48 89 f8 31 c9 eb 09
> RIP  [] elv_rb_del+0x3a/0x40
>  RSP 
> 
> 
> I can reproduce it without bigger problem.
> 
> 
> Here are the same oopses on 2.6.20:
> http://paste.stgraber.org/3138
> 
> This is 1 x dual core athlon64 on asus m2npv mainboard, 2GB RAM.
> There is hw raid on fasttrack 16350 only (no software one).
> 
> Has anyone seen this ?
> 
> Going to try without cfq.
> 
> -- 
> Arkadiusz Miśkiewicz        PLD/Linux Team
> arekm / maven.pl            http://ftp.pld-linux.org/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> ---
> 
> --  Forwarded Message  --
> 
> Subject: Re: 2.6.22 oops kernel BUG at block/elevator.c:366!
> Date: Wednesday 29 of August 2007
> From: Arkadiusz Miskiewicz <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> 
> On Wednesday 29 of August 2007, Arkadiusz Miskiewicz wrote:
> > Hello,
> >
> > I'm trying to get stable kernel for Promise SuperTrak
> > X16350 hardware. So far 2.6.20, 2.6.21 and 2.6.22 oopsed
> > like this (while doing rsync):
> 
> With anticipatory:
> 
> berta login: ------------[ cut here ]------------
> kernel BUG at block/as-iosched.c:1084!
> invalid opcode:  [1] SMP
> CPU 1
> Modules linked in: softdog sch_sfq forcedeth ext3 jbd mbcache dm_mod xfs 
> scsi_wait_scan sd_mod stex scsi_mod
> Pid: 32:#0, comm: kblockd/1 Not tainted 2.6.22.5-0.2 #1
> RIP: 0010:[]  [] 
> as_dispatch_request+0x438/0x460
> RSP: 0018:81007d1fddc0  EFLAGS: 00010046
> RAX:  RBX: 81007c765a00 RCX: 
> RDX: 81007c765a28 RSI:  RDI: 81007c54ad08
> RBP:  R08:  R09: 81006a289d80
> R10:  R11: 0001 R12: 
> R13: 0001 R14:  R15: 81007cf85048
> FS:  2ba4421e8b00() GS:81007d0a5b40() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR