Bug#682233: mpt2sas: kernel crash under load with hanged disks

2012-09-23 Thread Jonathan Nieder
tags 682233 + upstream patch pending
quit

Hi,

George Shuklin wrote:

> I think this commit is somehow related to that problem:
>
> commit 14216561e164671ce147458653b1fea06a4ada1e
> Author: James Bottomley 
> Date:   Wed Jul 25 23:55:55 2012 +0400
>
> [SCSI] Fix 'Device not ready' issue on mpt2sas

Sounds plausible.  That patch was applied upstream as v3.2.30~126, so
please test 3.2.30-1 once it is available.

If you would rather not wait, you can build and test it yourself:

 0. prerequisites:

apt-get install git build-essential

 1. get the kernel history, if you do not already have it:

git clone \
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 2. fetch point releases:

cd linux
git remote add stable \
  git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
git fetch stable

 3. configure, build, attempt to reproduce the bug:

git checkout v3.2.29
cp /boot/config-$(uname -r) .config; # current configuration
scripts/config --disable DEBUG_INFO
make localmodconfig; # optional: minimize configuration
make deb-pkg; # optionally with -j for parallel build
dpkg -i ../<name of package>; # as root
reboot
... test test test ...
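
Before testing, it is worth confirming that the machine really booted
the freshly built kernel.  A minimal sketch, assuming the release string
printed by "uname -r" starts with the version that was checked out; the
helper name running_kernel_is is made up for illustration:

```shell
# Hypothetical helper: succeed if a kernel release string is the expected
# version, optionally followed by a non-digit suffix such as "+" or "-rc1".
running_kernel_is() {
    expected=$1
    release=${2:-$(uname -r)}   # default to the running kernel
    case "$release" in
        "$expected"|"$expected"[!0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "running_kernel_is 3.2.29 || echo 'wrong kernel booted'"
after the reboot in step 3.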

Hopefully it reproduces the bug.  So

 4. update:

cd linux
git merge stable/linux-3.2.y
make deb-pkg; # maybe with -j4
dpkg -i ../<name of package>; # as root
reboot
... test test test ...

Thanks again for your help and patience.

Sincerely,
Jonathan


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



2012-09-02 Thread Jonathan Nieder
George Shuklin wrote:

> I think the problem is specific to the LSI drivers, not to linux-raid,
> because the same tests with Adaptec (aacraid) and a few onboard HBAs
> show no signs of crashing (hung disks are just marked as 'failed' and
> all systems behave as expected).

Thanks.  Very useful.

[...]
> linux-3.0 has mpt2sas 08.100.00.02 and linux-3.2 has 10.100.00.00

Between 3.0 and 3.2.12, the mpt2sas driver had 30 patches.  That would
be an interesting test: could you try a current kernel with the
mpt2sas driver from 3.0.y?  It works like this:

 0. prerequisites:

apt-get install git build-essential

 1. get the kernel history, if you don't already have it:

git clone \
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 2. fetch point releases:

cd linux
git remote add stable \
  git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
git fetch stable

 3. configure, build, test:

git checkout origin/master
cp /boot/config-$(uname -r) .config; # current configuration
scripts/config --disable DEBUG_INFO
make localmodconfig; # optional: minimize configuration
make deb-pkg; # optionally with -j for parallel build
dpkg -i ../<name of package>; # as root
reboot
... test test test ...

Hopefully it reproduces the bug.  So

 4. try the mpt2sas driver from 3.0.y:

cd linux
git checkout stable/linux-3.0.y -- drivers/scsi/mpt2sas
make deb-pkg; # maybe with -j4
dpkg -i ../<name of package>; # as root
reboot
... test ...
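
To confirm which driver the test kernel actually carries, the version
string can be read from the source tree before building.  A sketch,
assuming the version is defined as MPT2SAS_DRIVER_VERSION in
drivers/scsi/mpt2sas/mpt2sas_base.h; the function name is illustrative:

```shell
# Print the mpt2sas driver version defined in a header file.
# With no argument, read the header relative to the top of the kernel tree.
mpt2sas_source_version() {
    header=${1:-drivers/scsi/mpt2sas/mpt2sas_base.h}
    sed -n 's/^#define[[:space:]]\{1,\}MPT2SAS_DRIVER_VERSION[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$header"
}
```

Run from the top of the kernel tree after step 4, this should print the
3.0.y driver version (08.100.00.02 according to the versions quoted
above).  After booting the new kernel, "modinfo -F version mpt2sas"
reports the version of the installed module.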

Jonathan



2012-09-02 Thread George Shuklin
I think the problem is specific to the LSI drivers, not to linux-raid,
because the same tests with Adaptec (aacraid) and a few onboard HBAs show
no signs of crashing (hung disks are just marked as 'failed' and all
systems behave as expected).


I'll try to bisect it on 3.5, but I think it is fairly easy to say where
the problem is:


linux-3.0 has mpt2sas 08.100.00.02 and linux-3.2 has 10.100.00.00

Note also that mpt2sas behaves strangely in linux-2.6.32 (version
02.100.03.00) under high load.





2012-09-02 Thread Jonathan Nieder
George Shuklin wrote:

> We've tested it with vanilla 3.2.12; the problem was the same.

Thanks for the quick feedback.  Please send a summary of symptoms to
linux-r...@vger.kernel.org, cc-ing Neil Brown and
either me or this bug log so we can track it.

Be sure to mention:

 - steps to reproduce, expected result, actual result, and how
   the difference indicates a bug (should be simple enough ---
   the summary you sent here would work fine)

 - which kernel versions you have tested and what happened with
   each

 - full "dmesg" output from booting and reproducing the bug, as
   an attachment

 - any other weird symptoms or observations

 - what you would be able to do to track it down (can you run commands
   if provided? try patches? bisect to find which commit introduced
   the regression?)
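
The mechanical parts of this checklist (kernel version, full "dmesg"
output) can be gathered with a few commands.  A minimal sketch; the
helper name, the output directory and the file names are assumptions
chosen for illustration:

```shell
# Hypothetical helper: gather basic facts for a bug report into a directory
# and print the directory name.
collect_report() {
    outdir=${1:-bug-report}
    mkdir -p "$outdir" || return 1
    uname -a > "$outdir/kernel-version.txt"
    # dmesg may need root on some systems; keep going if it fails.
    dmesg > "$outdir/dmesg.txt" 2>&1 || true
    # State of the md arrays, if the md driver is loaded.
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat > "$outdir/mdstat.txt"
    fi
    echo "$outdir"
}
```

Run it right after reproducing the bug so that dmesg.txt still contains
the boot messages and the trace, then attach the files to the report.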

If we're lucky, the symptoms will ring a bell for Neil or someone else
on-list or someone will have an idea for a test to try to track it
down further.  Otherwise, the best we can do is probably to bisect to
find which specific change introduced the bug, as described at [1].

Regards,
Jonathan

[1] http://kernel-handbook.alioth.debian.org/ch-bugs.html#s9.2.1



2012-09-02 Thread George Shuklin

We've tested it with vanilla 3.2.12; the problem was the same.





2012-09-02 Thread Jonathan Nieder
Hi George,

George Shuklin wrote:

> Tags: upstream

Which upstream version did you test?

[...]
> That bug is found in the 3.2 and 3.3 versions of the kernel, but does
> not reproduce in 3.0.
[...]
> 1) Set up a large raid10.
> 2) Start a rebuild.
> 3) Run additional I/O on the raid (dd if=/dev/md0 of=/dev/md0).
> 4) Somehow slow down I/O on two or more disks. We found that bug in
> the wild under normal load, but the following scripts allow one to see
> it in a few minutes:
[...]
> end_request: I/O error, dev sdf, sector 729088
> ------------[ cut here ]------------
> kernel BUG at [...]/linux-3.4.4/drivers/scsi/scsi_lib.c:1154!
[...]
> Pid: 343, comm: kworker/5:1 Not tainted 3.4-trunk-amd64 #1 Supermicro 
> X8DTN+-F/X8DTN+-F
[...]
> Call Trace:
>  [] ? sd_prep_fn+0x2e9/0xb8e [sd_mod]
>  [] ? cfq_dispatch_requests+0x722/0x880
>  [] ? create_io_context+0x5a/0x5a
>  [] ? blk_peek_request+0xcf/0x1ac
[...]
> Code: 85 c0 74 1d 48 8b 00 48 85 c0 74 15 48 8b 40 48 48 85 c0 74 0c 48 89 ee 
> 48 89 df ff d0 85 c0 75 44 66 83 bd e0 00 00 00 00 75 02 <0f> 0b 48 89 ee 48 
> 89 df e8 62 ec ff ff 48 85 c0 48 89 c2 74 20 
> RIP  [] scsi_setup_fs_cmnd+0x45/0x83 [scsi_mod]

Thanks for a clear report, and sorry for the slow reply.

This is "BUG_ON(!req->nr_phys_segments)".  Smells similar to [1],
which bisected to v3.1-rc1~131^2~31 and was fixed by v3.2.2~91
(md/raid1: perform bad-block tests for WriteMostly devices too,
2012-01-09), aka v3.3-rc3~3^2~2.

But that wouldn't explain triggering the same trace in a 3.4.y kernel.

Is this reproducible with 3.5.2 or newer from experimental?  Which
3.2.y kernel did you use to experience it?

Curious,
Jonathan

[1] http://thread.gmane.org/gmane.linux.raid/36732

