I just had a strange freeze that killed networking and made software RAID fail two of my hard disks.
At the end of this mail there are a bunch of kernel messages that I extracted from the system log after the reboot.

I hit power off in pure paranoia after the box froze and then started doing disk I/O again, right after I noticed the messages on the console saying that two of my RAID disks had failed. The network didn't recover when the hard drives suddenly started working again. I managed to connect a USB keyboard and wake the monitor from sleep so that I could see some of the messages printed on the console.

I looked through some other threads and found a mention of smartmontools, which I use too (5.37-5ubuntu2).

Kernel: 2.6.22-14-generic (Ubuntu Gutsy Gibbon 7.10)
Motherboard: Asus M2N32 WS Professional nForce 590 SLI MCP (MCP55)
CPU: Athlon64 X2 Dual-Core 5600+
RAM: 4GB (passed memtest86 just a few minutes ago)

The hard drives are four Samsung HD501LJ 500GB drives. sda and sdb have firmware CR100-10, and sdc and sdd have firmware CR100-11. The drives are just a couple of months old and well cooled, and so far S.M.A.R.T. reports nothing interesting.

Software RAID is configured like this:

  sda1,sdc1 -> md0 (raid 1)
  sdb1,sdd1 -> md1 (raid 1)

Both md0 and md1 are then encrypted with dm-crypt, and the resulting dm devices are used to form md2 (stripe).
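To make the layout above easier to reason about, here is a minimal Python sketch of which member failures the stack can survive. The mirror membership comes from the description above; the function names are made up for this illustration, and the dm-crypt layer is left out because it does not change redundancy.

```python
# Mirror membership as described in the mail; names of helpers are hypothetical.
MIRRORS = {
    "md0": {"sda1", "sdc1"},
    "md1": {"sdb1", "sdd1"},
}
# md2 stripes across the dm-crypt devices that sit on top of md0 and md1.
STRIPE = ("md0", "md1")


def mirror_alive(name, failed):
    """A RAID 1 mirror stays usable while at least one member still works."""
    return bool(MIRRORS[name] - set(failed))


def stripe_alive(failed):
    """The stripe needs every underlying mirror to be usable."""
    return all(mirror_alive(name, failed) for name in STRIPE)


if __name__ == "__main__":
    # The failure reported by mdadm in the log: one member of each mirror.
    print(stripe_alive({"sdc1", "sdd1"}))
    # A hypothetical worse case: both members of md0.
    print(stripe_alive({"sda1", "sdc1"}))
```

With this layout, losing sdc1 and sdd1 leaves each mirror with one working member, so md2 should still be usable, whereas losing both members of the same mirror would take the stripe down.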
-- 
noah

# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:08.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:09.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:09.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:09.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0a.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0c.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0d.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0e.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:12.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:14.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:15.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:16.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:17.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8400 GS (rev a1)
02:06.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
03:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
03:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
08:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE6145 SATA II PCI-E controller (rev a1)

kernel: [734344.717844] irq 21: nobody cared (try booting with the "irqpoll" option)
kernel: [734344.717866]
kernel: [734344.717866] Call Trace:
kernel: [734344.717868] <IRQ> [__report_bad_irq+30/128] __report_bad_irq+0x1e/0x80
mdadm: Fail event detected on md device /dev/md1, component device /dev/sdd1
kernel: [734344.717888] [note_interrupt+643/704] note_interrupt+0x283/0x2c0
kernel: [734344.717895] [handle_fasteoi_irq+221/272] handle_fasteoi_irq+0xdd/0x110
mdadm: Fail event detected on md device /dev/md0, component device /dev/sdc1
kernel: [734344.717901] [do_IRQ+123/256] do_IRQ+0x7b/0x100
kernel: [734344.717904] [default_idle+0/64] default_idle+0x0/0x40
kernel: [734344.717907] [ret_from_intr+0/10] ret_from_intr+0x0/0xa
kernel: [734344.717909] <EOI> [tcp_poll+0/368] tcp_poll+0x0/0x170
kernel: [734344.717918] [default_idle+41/64] default_idle+0x29/0x40
kernel: [734344.717923] [cpu_idle+112/192] cpu_idle+0x70/0xc0
kernel: [734344.717936]
kernel: [734344.717937] handlers:
kernel: [734344.717950] [_end+131265960/2130332920] (nv_generic_interrupt+0x0/0xe0 [sata_nv])
kernel: [734344.717973] [_end+131152728/2130332920] (nv_nic_irq_optimized+0x0/0x2b0 [forcedeth])
kernel: [734344.718003] Disabling IRQ #21
kernel: [734356.827155] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
kernel: [734356.827169] ata5.00: cmd 25/00:90:ff:3c:82/00:01:02:00:00/e0 tag 0 cdb 0x0 data 204800 in
kernel: [734356.827170] res 40/00:b4:cc:88:7f/40:00:02:00:00/e0 Emask 0x4 (timeout)
kernel: [734355.620185] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
kernel: [734355.620199] ata6.00: cmd c8/00:08:ef:52:a9/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
kernel: [734355.620200] res 40/00:00:02:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: [734355.731595] ata6: soft resetting port
kernel: [734355.787308] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734358.738407] ata5: port is slow to respond, please be patient (Status 0xd8)
kernel: [734360.418260] ata5: device not ready (errno=-16), forcing hardreset
kernel: [734360.418262] ata5: hard resetting port
kernel: [734360.643958] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734366.509226] ata6.00: qc timeout (cmd 0x27)
kernel: [734366.509231] ata6.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734366.509241] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734366.509250] ata6: failed to recover some devices, retrying in 5 secs
kernel: [734368.296234] ata6: hard resetting port
kernel: [734368.521935] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734371.365916] ata5.00: qc timeout (cmd 0x27)
kernel: [734371.365921] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734371.365928] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734371.365936] ata5: failed to recover some devices, retrying in 5 secs
kernel: [734373.152942] ata5: hard resetting port
kernel: [734373.378644] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734379.244112] ata6.00: qc timeout (cmd 0x27)
kernel: [734379.244118] ata6.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734379.244126] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734379.244135] ata6: limiting SATA link speed to 1.5 Gbps
kernel: [734379.244138] ata6.00: limiting speed to UDMA/133:PIO3
kernel: [734379.244140] ata6: failed to recover some devices, retrying in 5 secs
kernel: [734381.031138] ata6: hard resetting port
kernel: [734381.256840] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: [734384.095108] ata5.00: qc timeout (cmd 0x27)
kernel: [734384.095113] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734384.095120] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734384.095129] ata5: limiting SATA link speed to 1.5 Gbps
kernel: [734384.095131] ata5.00: limiting speed to UDMA/133:PIO3
kernel: [734384.095133] ata5: failed to recover some devices, retrying in 5 secs
kernel: [734385.882133] ata5: hard resetting port
kernel: [734386.107836] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: [734391.979018] ata6.00: qc timeout (cmd 0x27)
kernel: [734391.979024] ata6.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734391.979032] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734391.979041] ata6.00: disabled
kernel: [734392.159007] ata6: EH complete
kernel: [734392.159018] sd 5:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734392.159022] end_request: I/O error, dev sdd, sector 11096815
kernel: [734392.159026] raid1: sdd1: rescheduling sector 11096752
kernel: [734394.574692] sd 5:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734394.574697] end_request: I/O error, dev sdd, sector 976767935
kernel: [734394.574703] md: super_written gets error=-5, uptodate=0
kernel: [734394.574706] raid1: Disk failure on sdd1, disabling device.
kernel: [734394.574707] ^IOperation continuing on 1 devices
kernel: [734394.574960] sd 5:0:0:0: [sdd] READ CAPACITY failed
kernel: [734394.574961] sd 5:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734394.574964] sd 5:0:0:0: [sdd] Sense not available.
kernel: [734394.575233] sd 5:0:0:0: [sdd] Write Protect is off
kernel: [734394.575235] sd 5:0:0:0: [sdd] Mode Sense: 00 00 00 00
kernel: [734394.575275] sd 5:0:0:0: [sdd] Asking for cache data failed
kernel: [734394.575282] sd 5:0:0:0: [sdd] Assuming drive cache: write through
kernel: [734394.577622] RAID1 conf printout:
kernel: [734394.577624] --- wd:1 rd:2
kernel: [734394.577626] disk 0, wo:1, o:0, dev:sdd1
kernel: [734394.577627] disk 1, wo:0, o:1, dev:sdb1
kernel: [734394.583008] RAID1 conf printout:
kernel: [734394.583010] --- wd:1 rd:2
kernel: [734394.583011] disk 1, wo:0, o:1, dev:sdb1
kernel: [734394.593189] raid1: sdb1: redirecting sector 11096752 to another mirror
kernel: [734399.802574] ata5.00: qc timeout (cmd 0x27)
kernel: [734399.802580] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
kernel: [734399.802588] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734399.802606] ata5.00: disabled
kernel: [734400.306547] ata5: EH complete
kernel: [734400.306601] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.306605] end_request: I/O error, dev sdc, sector 42089727
kernel: [734400.306608] raid1: sdc1: rescheduling sector 42089664
kernel: [734400.306626] raid1: sdc1: rescheduling sector 42089672
kernel: [734400.306642] raid1: sdc1: rescheduling sector 42089680
kernel: [734400.306658] raid1: sdc1: rescheduling sector 42089688
kernel: [734400.306674] raid1: sdc1: rescheduling sector 42089696
kernel: [734400.306690] raid1: sdc1: rescheduling sector 42089704
kernel: [734400.306706] raid1: sdc1: rescheduling sector 42089712
kernel: [734400.306722] raid1: sdc1: rescheduling sector 42089720
kernel: [734400.306738] raid1: sdc1: rescheduling sector 42089728
kernel: [734400.307304] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.307307] end_request: I/O error, dev sdc, sector 976767935
kernel: [734400.307313] md: super_written gets error=-5, uptodate=0
kernel: [734400.307315] raid1: Disk failure on sdc1, disabling device.
kernel: [734400.307316] ^IOperation continuing on 1 devices
kernel: [734400.307643] sd 4:0:0:0: [sdc] READ CAPACITY failed
kernel: [734400.307645] sd 4:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.307647] sd 4:0:0:0: [sdc] Sense not available.
kernel: [734400.307767] sd 4:0:0:0: [sdc] Write Protect is off
kernel: [734400.307768] sd 4:0:0:0: [sdc] Mode Sense: 00 00 00 00
kernel: [734400.307947] sd 4:0:0:0: [sdc] Asking for cache data failed
kernel: [734400.307963] sd 4:0:0:0: [sdc] Assuming drive cache: write through
kernel: [734400.319399] RAID1 conf printout:
kernel: [734400.319401] --- wd:1 rd:2
kernel: [734400.319402] disk 0, wo:1, o:0, dev:sdc1
kernel: [734400.319404] disk 1, wo:0, o:1, dev:sda1
kernel: [734400.330537] RAID1 conf printout:
kernel: [734400.330539] --- wd:1 rd:2
kernel: [734400.330540] disk 1, wo:0, o:1, dev:sda1
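
For anyone who wants to skim a log like the one above, here is a rough Python sketch that pulls out three things: the IRQ that was reported as "nobody cared", the modules whose handlers were registered on it, and the time at which each ATA device was disabled. The regular expressions and the summarize() name are my own assumptions, written against the message formats seen in this particular log with one message per line; other kernel versions may word these messages differently.

```python
import re

# Patterns written against the messages in this log only (an assumption).
IRQ_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+irq (\d+): nobody cared")
HANDLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\)")
DISABLED_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+)\.\d+: disabled")

# A few lines copied from the log above, one message per line.
SAMPLE = """\
kernel: [734344.717844] irq 21: nobody cared (try booting with the "irqpoll" option)
kernel: [734344.717937] handlers:
kernel: [734344.717950] [_end+131265960/2130332920] (nv_generic_interrupt+0x0/0xe0 [sata_nv])
kernel: [734344.717973] [_end+131152728/2130332920] (nv_nic_irq_optimized+0x0/0x2b0 [forcedeth])
kernel: [734344.718003] Disabling IRQ #21
kernel: [734391.979041] ata6.00: disabled
kernel: [734399.802606] ata5.00: disabled
""".splitlines()


def summarize(lines):
    """Collect the unhandled IRQ, the modules sharing it, and disabled ATA ports."""
    summary = {"irq": None, "irq_time": None, "handlers": [], "disabled": []}
    for line in lines:
        m = IRQ_RE.search(line)
        if m:
            summary["irq_time"] = float(m.group(1))
            summary["irq"] = int(m.group(2))
            continue
        m = HANDLER_RE.search(line)
        if m:
            summary["handlers"].append(m.group(2))  # module name in brackets
            continue
        m = DISABLED_RE.search(line)
        if m:
            summary["disabled"].append((m.group(2), float(m.group(1))))
    return summary


if __name__ == "__main__":
    s = summarize(SAMPLE)
    print("IRQ", s["irq"], "handlers from modules:", ", ".join(s["handlers"]))
    for port, when in s["disabled"]:
        print(port, "disabled", round(when - s["irq_time"], 1), "s after the IRQ report")
```

On the sample lines this lists sata_nv and forcedeth as the modules with handlers on IRQ 21, which would be consistent with both the disks and the network going away once that IRQ was disabled.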