Bug#682233: mpt2sas: kernel crash under load with hanged disks
tags 682233 + upstream patch pending
quit

Hi,

George Shuklin wrote:

> I think this commit is somehow related to that problem:
>
> commit 14216561e164671ce147458653b1fea06a4ada1e
> Author: James Bottomley
> Date:   Wed Jul 25 23:55:55 2012 +0400
>
>     [SCSI] Fix 'Device not ready' issue on mpt2sas

Sounds plausible.  That patch was applied upstream as v3.2.30~126, so
please test 3.2.30-1 once it is available.

If impatient before then:

 0. prerequisites:

        apt-get install git build-essential

 1. get the kernel history, if you do not already have it:

        git clone \
          git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 2. fetch point releases:

        cd linux
        git remote add stable \
          git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
        git fetch stable

 3. configure, build, attempt to reproduce the bug:

        git checkout v3.2.29
        cp /boot/config-$(uname -r) .config; # current configuration
        scripts/config --disable DEBUG_INFO
        make localmodconfig; # optional: minimize configuration
        make deb-pkg; # optionally with -j for parallel build
        dpkg -i ../; # as root
        reboot
        ... test test test ...

    Hopefully it reproduces the bug.  So

 4. update:

        cd linux
        git merge stable/linux-3.2.y
        make deb-pkg; # maybe with -j4
        dpkg -i ../; # as root
        reboot
        ... test test test ...

Thanks again for your help and patience.

Sincerely,
Jonathan

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
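A quick way to confirm that the tree built in step 4 is at or past
3.2.30 is to compare version strings.  This is only a sketch: the
helper name version_at_least is made up for illustration, and it
assumes GNU sort with -V support.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if version $1 >= version $2.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example with literal versions; inside a kernel tree one could pass
# "$(make -s kernelversion)" as the first argument instead.
if version_at_least 3.2.30 3.2.30; then
    echo "3.2.30 is new enough"
fi
```

Note that "uname -r" on a Debian-packaged kernel reports the ABI name
rather than the upstream point release, so the version printed by the
source tree is the safer input here.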
Bug#682233: mpt2sas: kernel crash under load with hanged disks
George Shuklin wrote:

> I think the problem is specific to the LSI drivers, not to linux-raid,
> because the same tests with Adaptec (aacraid) and a few onboard HBAs
> show no signs of crashing (hung disks are just marked as 'failed' and
> all systems behave as expected).

Thanks.  Very useful.

[...]
> linux-3.0 has mpt2sas 08.100.00.02 and linux-3.2 has 10.100.00.00

Between 3.0 and 3.2.12, the mpt2sas driver had 30 patches.

That would be an interesting test: could you try a current kernel
with the mpt2sas driver from 3.0.y?  It works like this:

 0. prerequisites:

        apt-get install git build-essential

 1. get the kernel history, if you don't already have it:

        git clone \
          git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 2. fetch point releases:

        cd linux
        git remote add stable \
          git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
        git fetch stable

 3. configure, build, test:

        git checkout origin/master
        cp /boot/config-$(uname -r) .config; # current configuration
        scripts/config --disable DEBUG_INFO
        make localmodconfig; # optional: minimize configuration
        make deb-pkg; # optionally with -j for parallel build
        dpkg -i ../; # as root
        reboot
        ... test test test ...

    Hopefully it reproduces the bug.  So

 4. try the mpt2sas driver from 3.0.y:

        cd linux
        git checkout stable/linux-3.0.y -- drivers/scsi/mpt2sas
        make deb-pkg; # maybe with -j4
        dpkg -i ../
        reboot
        ... test ...

Jonathan
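After step 4 it may be worth double-checking that the checkout really
swapped in the older driver.  A minimal sketch, assuming the driver
version is recorded as the MPT2SAS_DRIVER_VERSION macro in
drivers/scsi/mpt2sas/mpt2sas_base.h (an assumption about the tree
layout, not something stated in this thread); the function name is
made up:

```shell
#!/bin/sh
# Hypothetical helper: print the value of MPT2SAS_DRIVER_VERSION from a header.
# The header path used below is an assumption about the kernel tree layout.
mpt2sas_version() {
    sed -n 's/^#define[[:space:]]\{1,\}MPT2SAS_DRIVER_VERSION[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$1"
}

# Run from the top of the kernel tree; does nothing if the header is absent.
header=drivers/scsi/mpt2sas/mpt2sas_base.h
if [ -r "$header" ]; then
    echo "mpt2sas driver version in this tree: $(mpt2sas_version "$header")"
fi
```

With the 3.0.y driver checked out, this would presumably report an
08.100.x version instead of 10.100.00.00.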
Bug#682233: mpt2sas: kernel crash under load with hanged disks
I think the problem is specific to the LSI drivers, not to linux-raid,
because the same tests with Adaptec (aacraid) and a few onboard HBAs
show no signs of crashing (hung disks are just marked as 'failed' and
all systems behave as expected).

I'll try to bisect it at 3.5, but I think it's fairly simple to say
where the problem is:

linux-3.0 has mpt2sas 08.100.00.02 and linux-3.2 has 10.100.00.00

And note that mpt2sas also behaves strangely in linux-2.6.32 (version
02.100.03.00) under high load.

On 03.09.2012 06:30, Jonathan Nieder wrote:
> George Shuklin wrote:
>
>> We've tested it with vanilla 3.2.12; the problem was the same.
>
> Thanks for the quick feedback.  Please send a summary of symptoms to
> linux-r...@vger.kernel.org, cc-ing Neil Brown and either me or this bug
> log so we can track it.  Be sure to mention:
>
>  - steps to reproduce, expected result, actual result, and how the
>    difference indicates a bug (should be simple enough --- the summary
>    you sent here would work fine)
>  - which kernel versions you have tested and what happened with each
>  - full "dmesg" output from booting and reproducing the bug, as an
>    attachment
>  - any other weird symptoms or observations
>  - what you would be able to do to track it down (can you run commands
>    if provided? try patches? bisect to find which commit introduced
>    the regression?)
>
> If we're lucky, the symptoms will ring a bell for Neil or someone else
> on-list or someone will have an idea for a test to try to track it
> down further.  Otherwise, the best we can do is probably to bisect to
> find which specific change introduced the bug, as described at [1].
>
> Regards,
> Jonathan
>
> [1] http://kernel-handbook.alioth.debian.org/ch-bugs.html#s9.2.1
Bug#682233: mpt2sas: kernel crash under load with hanged disks
George Shuklin wrote:

> We've tested it with vanilla 3.2.12; the problem was the same.

Thanks for the quick feedback.  Please send a summary of symptoms to
linux-r...@vger.kernel.org, cc-ing Neil Brown and either me or this bug
log so we can track it.  Be sure to mention:

 - steps to reproduce, expected result, actual result, and how the
   difference indicates a bug (should be simple enough --- the summary
   you sent here would work fine)
 - which kernel versions you have tested and what happened with each
 - full "dmesg" output from booting and reproducing the bug, as an
   attachment
 - any other weird symptoms or observations
 - what you would be able to do to track it down (can you run commands
   if provided? try patches? bisect to find which commit introduced
   the regression?)

If we're lucky, the symptoms will ring a bell for Neil or someone else
on-list or someone will have an idea for a test to try to track it
down further.  Otherwise, the best we can do is probably to bisect to
find which specific change introduced the bug, as described at [1].

Regards,
Jonathan

[1] http://kernel-handbook.alioth.debian.org/ch-bugs.html#s9.2.1
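Part of the checklist above can be gathered mechanically.  A sketch,
with a made-up function name and output file names, that saves the
running kernel version, the kernel log, and the md status into one
directory for attaching to the report; it assumes dmesg is readable
by the current user (on some systems that needs root):

```shell
#!/bin/sh
# Hypothetical helper: collect basic report data into the directory given as $1.
# Usage: collect_report ./bug682233-report
collect_report() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # Kernel version actually running during the test.
    uname -a > "$outdir/uname.txt"
    # Full kernel log; reading it may require root on some systems.
    dmesg > "$outdir/dmesg.txt" 2>&1 || echo "dmesg failed; try as root" >&2
    # State of md arrays, if the md driver is loaded.
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat > "$outdir/mdstat.txt"
    fi
    return 0
}
```

Steps to reproduce and the expected versus actual results still have
to be written by hand.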
Bug#682233: mpt2sas: kernel crash under load with hanged disks
We've tested it with vanilla 3.2.12; the problem was the same.

On 03.09.2012 06:01, Jonathan Nieder wrote:
> Hi George,
>
> George Shuklin wrote:
>
>> Tags: upstream
>
> Which upstream version did you test?
>
> [...]
>> That bug is found in the 3.2 and 3.3 versions of the kernel, but is
>> not reproducible in 3.0.
> [...]
>> 1) Set up a large raid10.
>> 2) Start a rebuild.
>> 3) Run additional I/O on the raid (dd if=/dev/md0 of=/dev/md0).
>> 4) Somehow slow down I/O on two or more disks.  We found that
>> bug in the wild with normal load, but the following scripts allow
>> seeing it in a few minutes:
> [...]
>> end_request: I/O error, dev sdf, sector 729088
>> [ cut here ]
>> kernel BUG at [...]/linux-3.4.4/drivers/scsi/scsi_lib.c:1154!
> [...]
>> Pid: 343, comm: kworker/5:1 Not tainted 3.4-trunk-amd64 #1 Supermicro
>> X8DTN+-F/X8DTN+-F
> [...]
>> Call Trace:
>>  [] ? sd_prep_fn+0x2e9/0xb8e [sd_mod]
>>  [] ? cfq_dispatch_requests+0x722/0x880
>>  [] ? create_io_context+0x5a/0x5a
>>  [] ? blk_peek_request+0xcf/0x1ac
> [...]
>> Code: 85 c0 74 1d 48 8b 00 48 85 c0 74 15 48 8b 40 48 48 85 c0 74 0c 48 89 ee
>> 48 89 df ff d0 85 c0 75 44 66 83 bd e0 00 00 00 00 75 02 <0f> 0b 48 89 ee 48
>> 89 df e8 62 ec ff ff 48 85 c0 48 89 c2 74 20
>> RIP  [] scsi_setup_fs_cmnd+0x45/0x83 [scsi_mod]
>
> Thanks for a clear report, and sorry for the slow reply.
>
> This is "BUG_ON(!req->nr_phys_segments)".  Smells similar to [1], which
> bisected to v3.1-rc1~131^2~31 and was fixed by v3.2.2~91 (md/raid1:
> perform bad-block tests for WriteMostly devices too, 2012-01-09), aka
> v3.3-rc3~3^2~2.  But that wouldn't explain triggering the same trace
> in a 3.4.y kernel.
>
> Is this reproducible with 3.5.2 or newer from experimental?  Which
> 3.2.y kernel did you use to experience it?
>
> Curious,
> Jonathan
>
> [1] http://thread.gmane.org/gmane.linux.raid/36732
Bug#682233: mpt2sas: kernel crash under load with hanged disks
Hi George,

George Shuklin wrote:

> Tags: upstream

Which upstream version did you test?

[...]
> That bug is found in the 3.2 and 3.3 versions of the kernel, but is
> not reproducible in 3.0.
[...]
> 1) Set up a large raid10.
> 2) Start a rebuild.
> 3) Run additional I/O on the raid (dd if=/dev/md0 of=/dev/md0).
> 4) Somehow slow down I/O on two or more disks.  We found that
> bug in the wild with normal load, but the following scripts allow
> seeing it in a few minutes:
[...]
> end_request: I/O error, dev sdf, sector 729088
> [ cut here ]
> kernel BUG at [...]/linux-3.4.4/drivers/scsi/scsi_lib.c:1154!
[...]
> Pid: 343, comm: kworker/5:1 Not tainted 3.4-trunk-amd64 #1 Supermicro
> X8DTN+-F/X8DTN+-F
[...]
> Call Trace:
>  [] ? sd_prep_fn+0x2e9/0xb8e [sd_mod]
>  [] ? cfq_dispatch_requests+0x722/0x880
>  [] ? create_io_context+0x5a/0x5a
>  [] ? blk_peek_request+0xcf/0x1ac
[...]
> Code: 85 c0 74 1d 48 8b 00 48 85 c0 74 15 48 8b 40 48 48 85 c0 74 0c 48 89 ee
> 48 89 df ff d0 85 c0 75 44 66 83 bd e0 00 00 00 00 75 02 <0f> 0b 48 89 ee 48
> 89 df e8 62 ec ff ff 48 85 c0 48 89 c2 74 20
> RIP  [] scsi_setup_fs_cmnd+0x45/0x83 [scsi_mod]

Thanks for a clear report, and sorry for the slow reply.

This is "BUG_ON(!req->nr_phys_segments)".  Smells similar to [1], which
bisected to v3.1-rc1~131^2~31 and was fixed by v3.2.2~91 (md/raid1:
perform bad-block tests for WriteMostly devices too, 2012-01-09), aka
v3.3-rc3~3^2~2.  But that wouldn't explain triggering the same trace
in a 3.4.y kernel.

Is this reproducible with 3.5.2 or newer from experimental?  Which
3.2.y kernel did you use to experience it?

Curious,
Jonathan

[1] http://thread.gmane.org/gmane.linux.raid/36732
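For readers wondering how a trace maps to a source line such as the
BUG_ON quoted above: the "kernel BUG at" message names a file and a
line number, which can be looked up in the matching source tree.  A
sketch with a made-up helper name; the git command in the comment
assumes a clone of the stable tree with the v3.4.4 tag fetched:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<path> <line>" for each "kernel BUG at <path>:<line>!" message.
bug_location() {
    sed -n 's/.*kernel BUG at \(.*\):\([0-9][0-9]*\)!.*/\1 \2/p'
}

# Sample line taken from the report (path left elided as in the original).
echo 'kernel BUG at [...]/linux-3.4.4/drivers/scsi/scsi_lib.c:1154!' | bug_location

# With a clone that has the v3.4.4 tag, the line could then be inspected with:
#   git show v3.4.4:drivers/scsi/scsi_lib.c | sed -n '1154p'
# (a distribution kernel may carry patches that shift line numbers).
```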