Hello,

While testing some new SAS hardware, I have encountered an issue that results 
in an "Unable to handle kernel NULL pointer dereference" message from the 
kernel. The stack trace taken from syslog output is attached.

The problem occurs when connecting then disconnecting an external cable 
between two JBOD disk boxes. The problem does not seem to occur when 
connecting and disconnecting a single disk box directly to the HBA.

To reproduce:
1. Boot with the hardware connected as pictured below. All 32 external disks
   are found and no problems are noticed.
2. Disconnect cable B. The 16 disks and enclosure target from disk box 2 are
   removed with no errors noticed. Some 'failed to synchronize cache'
   messages appear if the disks are not first removed through /sys, but the
   error occurs either way.
3. Reconnect cable B. The OS gives no indication that anything has happened,
   even after waiting for over 2 minutes after connecting the cable.
4. Disconnect cable B again and the attached messages are logged. A hard reset
   is then required to recover.
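
The /sys removal mentioned in step 2 can be sketched as follows. This is a
minimal sketch, assuming the standard SCSI sysfs "delete" attribute; the
function name remove_disk, the SYSFS_ROOT override and the disk name sdq are
illustrative placeholders, not taken from the test system.

```shell
#!/bin/sh
# Detach a SCSI disk from the OS before its cable is pulled.
# SYSFS_ROOT can be overridden for a dry run against a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

remove_disk() {
    # Writing 1 to the "delete" attribute asks the SCSI layer to remove
    # the device. $1 is a block device name such as sdq (placeholder).
    delete_attr="$SYSFS_ROOT/block/$1/device/delete"
    if [ -w "$delete_attr" ]; then
        echo 1 > "$delete_attr"
    else
        echo "skipping $1: $delete_attr is not writable" >&2
        return 1
    fi
}

# Example (as root), once per disk in disk box 2:
#   remove_disk sdq
```

Removing the disks this way only avoids the 'failed to synchronize cache'
messages; as noted in step 2, the oops is logged either way.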

+---Host w/LSI3801E HBA------------+
|  LSI1068E                        |
+-####-####------------------------+
  ||||     < Cable A
+-####--Disk box 1-----------------+
| ||||                             |
| LSISASx12A                       |
| ||||  ||\`== LSISASx12A < 8 HDDs |
| ||||  ||                         |
| ||||  \`==== LSISASx12A < 8 HDDs |
+-####-----------------------------+
  ||||     < Cable B
+-####--Disk box 2-----------------+
| ||||                             |
| LSISASx12A                       |
| ||||  ||\`== LSISASx12A < 8 HDDs |
| ||||  ||                         |
| ||||  \`==== LSISASx12A < 8 HDDs |
+-####-----------------------------+


For the attached error, the disk boxes were full of SATA disks and the system 
was running the Debian backports.org 2.6.21-1-amd64 (2.6.21-4~bpo.1) kernel. 
The problem also seems to exist with the Debian etch 2.6.18-4-amd64 kernel. 
I am happy to try any kernel versions and configs that would be useful.

The diagram represents my current understanding of the expander setup in the 
disk boxes but I could be mistaken. I can provide further details of the view 
of the hardware from the host if it is of interest.

The server also has an on-board LSI1064 connected to 4 internal SAS HDDs:
$ cat /proc/mpt/summary
ioc0: LSISAS1068E, FwRev=01120000h, Ports=1, MaxQ=511, IRQ=19
ioc1: LSISAS1064, FwRev=01102800h, Ports=1, MaxQ=511, IRQ=58

I will continue to investigate and will report any findings, but any help in 
resolving the issue would be greatly appreciated.

-- 
Alex Winawer, Unix Systems Programmer
Systems Development & Support, Oxford University Computing Services
Jul  6 09:46:05 just-read-the-instructions kernel: mptbase: ioc0: 
LogInfo(0x30050000): Originator={IOP}, Code={Task Terminated}, SubCode(0x0000)
Jul  6 09:48:09 just-read-the-instructions kernel: Unable to handle kernel NULL 
pointer dereference at 00000000000002c0 RIP: 
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff8025a762>] 
mutex_lock+0x0/0xb
Jul  6 09:48:09 just-read-the-instructions kernel: PGD 10ce07067 PUD 10ea3e067 
PMD 0 
Jul  6 09:48:09 just-read-the-instructions kernel: Oops: 0002 [1] SMP 
Jul  6 09:48:09 just-read-the-instructions kernel: CPU 0 
Jul  6 09:48:09 just-read-the-instructions kernel: Modules linked in: raid456 
xor ipv6 iptable_mangle iptable_nat nf_nat xt_tcpudp nf_conntrack_ipv4 xt_state 
nf_conntrack nfnetlink ipt_owner ipt_REJECT xt_limit ipt_LOG xt_hashlimit 
ip6_tables ipt_addrtype iptable_filter ip_tables x_tables 8021q serio_raw 
psmouse i2c_nforce2 shpchp i2c_core pci_hotplug pcspkr k8temp sg sr_mod cdrom 
joydev evdev ext3 jbd mbcache dm_mirror dm_snapshot dm_mod raid1 md_mod 
ide_generic ata_generic sata_nv libata sd_mod generic usb_storage usbhid hid 
mptsas mptscsih mptbase scsi_transport_sas amd74xx e1000 scsi_mod forcedeth 
ide_core ohci_hcd ehci_hcd thermal processor fan
Jul  6 09:48:09 just-read-the-instructions kernel: Pid: 14, comm: events/0 Not 
tainted 2.6.21-1-amd64 #1
Jul  6 09:48:09 just-read-the-instructions kernel: RIP: 
0010:[<ffffffff8025a762>]  [<ffffffff8025a762>] mutex_lock+0x0/0xb
Jul  6 09:48:09 just-read-the-instructions kernel: RSP: 0018:ffff810120201c88  
EFLAGS: 00010246
Jul  6 09:48:09 just-read-the-instructions kernel: RAX: 0000000000000000 RBX: 
ffff81011b1f3000 RCX: 0000000000000000
Jul  6 09:48:09 just-read-the-instructions kernel: RDX: ffff81011a9784c0 RSI: 
ffff81011b1f3000 RDI: 00000000000002c0
Jul  6 09:48:09 just-read-the-instructions kernel: RBP: 0000000000000004 R08: 
000000000000000c R09: ffff81011b3392a0
Jul  6 09:48:09 just-read-the-instructions kernel: R10: 00000000fffffff4 R11: 
ffff810120201ca8 R12: 0000000000000000
Jul  6 09:48:09 just-read-the-instructions kernel: R13: 00000000000002c0 R14: 
0000000000000000 R15: 00000000000005b0
Jul  6 09:48:09 just-read-the-instructions kernel: FS:  00002b86508c56d0(0000) 
GS:ffffffff804d9000(0000) knlGS:0000000000000000
Jul  6 09:48:09 just-read-the-instructions kernel: CS:  0010 DS: 0018 ES: 0018 
CR0: 000000008005003b
Jul  6 09:48:09 just-read-the-instructions kernel: CR2: 00000000000002c0 CR3: 
0000000101c9e000 CR4: 00000000000006e0
Jul  6 09:48:09 just-read-the-instructions kernel: Process events/0 (pid: 14, 
threadinfo ffff810120200000, task ffff81011c0d2100)
Jul  6 09:48:09 just-read-the-instructions kernel: Stack:  ffffffff880b8cca 
ffff81011bcb29c0 ffff81011bc8a000 ffff81011ad99d80
Jul  6 09:48:09 just-read-the-instructions kernel:  ffffffff880dc789 
ffff81011bc8a5e8 ffff810120201cb8 ffff810120201cb8
Jul  6 09:48:09 just-read-the-instructions kernel:  0000000000000000 
0000000000000000 ffff81011bc8a000 ffff81011ad99d80
Jul  6 09:48:09 just-read-the-instructions kernel: Call Trace:
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880b8cca>] 
:scsi_transport_sas:sas_port_delete_phy+0x1a/0x5e
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880dc789>] 
:mptsas:mptsas_setup_wide_ports+0x72/0x20d
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880dd097>] 
:mptsas:mptsas_probe_expander_phys+0x3d0/0x427
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880c7265>] 
:mptbase:mpt_timer_expired+0x0/0x24
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880dd969>] 
:mptsas:__mptsas_discovery_work+0x16f/0x18a
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880dd984>] 
:mptsas:mptsas_discovery_work+0x0/0x39
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff880dd9a8>] 
:mptsas:mptsas_discovery_work+0x24/0x39
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff80246fa3>] 
run_workqueue+0x8f/0x137
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff80243bcf>] 
worker_thread+0x0/0x14a
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff80243ce3>] 
worker_thread+0x114/0x14a
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff8027a990>] 
default_wake_function+0x0/0xe
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff8022f236>] 
kthread+0xd1/0x100
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff80255f38>] 
child_rip+0xa/0x12
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff8022f165>] 
kthread+0x0/0x100
Jul  6 09:48:09 just-read-the-instructions kernel:  [<ffffffff80255f2e>] 
child_rip+0x0/0x12
Jul  6 09:48:09 just-read-the-instructions kernel: 
Jul  6 09:48:09 just-read-the-instructions kernel: 
Jul  6 09:48:09 just-read-the-instructions kernel: Code: f0 ff 0f 79 05 e8 27 
01 00 00 c3 f0 ff 07 7f 05 e8 e9 00 00 
Jul  6 09:48:09 just-read-the-instructions kernel: RIP  [<ffffffff8025a762>] 
mutex_lock+0x0/0xb
Jul  6 09:48:09 just-read-the-instructions kernel:  RSP <ffff810120201c88>
Jul  6 09:48:09 just-read-the-instructions kernel: CR2: 00000000000002c0
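
The syslog lines above were wrapped in transit, so each record is split across
two or more lines. A minimal sketch for re-joining them, assuming every record
starts with a "Mon DD HH:MM:SS " timestamp and that the trailing spaces at
the wrap points have survived; the function name unwrap_syslog and the file
name oops.txt are placeholders.

```shell
#!/bin/sh
# Re-join wrapped syslog records: any line that does not start with a
# syslog timestamp is appended to the record before it.
unwrap_syslog() {
    awk '
        /^[A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
            if (line != "") print line   # flush the previous record
            line = $0
            next
        }
        { line = line $0 }               # continuation of a wrapped record
        END { if (line != "") print line }
    '
}

# Example: list only the call trace entries from a saved copy of the log.
#   unwrap_syslog < oops.txt | grep 'kernel:  \[<'
```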
