Hi José,

Jose Calhariz wrote:

> Version: 2.6.32-45
[...]
> The previous night during the periodic mdadm RAID check:
> - the linux kernel gave a kernel BUG,
> - tried to kick out a failed disk and
> - stopped accepting I/O to the affected raid.
>
> The affected programs were in state D. The only way to recover was to
> do a reboot. After reboot the problematic disk was replaced.
>
> This machine have 2 x RAID6 with 6 disks each, for a total of 12 disks.

Thanks for reporting. Do you have a test system you can experiment on,
or ideas for reproducing it in a VM?

[...]
> build/source_i386_none/drivers/md/raid5.c:2764!
> invalid opcode: 0000 [#1] SMP
> last sysfs file:
> /sys/devices/pci0000:00/0000:00:1c.0/0000:02:01.0/cciss0/c0d0/block/cciss!c0d0/removable
> Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs
> minix ntfs vfat msdos fat jfs reiserfs ext4 jbd2 crc16 openafs(P) lp
> parport_pc parport joydev st sd_mod crc_t10dif ext2 loop tun xt_multiport xfs
> exportfs 8021q garp stp ip6table_filter ip6_tables iptable_filter ip_tables
> x_tables ide_generic ide_gd_mod ide_cd_mod ide_core snd_pcm snd_timer hpilo
> snd soundcore snd_page_alloc hpwdt e752x_edac shpchp rng_core i6300esb
> edac_core pci_hotplug pcspkr container processor evdev button psmouse
> serio_raw ext3 jbd mbcache dm_mod raid456 md_mod async_raid6_recov async_pq
> usbhid hid raid6_pq async_xor xor async_memcpy async_tx sg sr_mod cdrom
> ata_generic thermal uhci_hcd cciss tg3 floppy ata_piix ehci_hcd libata e1000
> usbcore libphy scsi_mod nls_base thermal_sys [last unloaded: openafs]
>
> Pid: 743, comm: md2_raid6 Tainted: P (2.6.32-5-686 #1) ProLiant
> DL360 G4
> EIP: 0060:[<f818c811>] EFLAGS: 00010297 CPU: 3
> EIP is at handle_stripe+0x89d/0x173e [raid456]
> EAX: 00000005 EBX: 00000002 ECX: 00000003 EDX: 00000001
> ESI: f6394000 EDI: 00000003 EBP: f6394028 ESP: f58d5e6c
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process md2_raid6 (pid: 743, ti=f58d4000 task=f6569980 task.ti=f58d4000)
> Stack:
> e6fde3e6 c2988138 00000006 f61c8e00 00000006 0002d995 00020003 00000000
> <0> c2988138 f4cbc86c f65699ac 000f0e67 00000000 f639431c 00000005 fffffffc
> <0> f4cbc86c c1025461 00000000 00000000 00000002 00000005 00988100 c127a45c
> Call Trace:
> [<c1025461>] ? check_preempt_wakeup+0x196/0x202
> [<f818d9fb>] ? raid5d+0x349/0x389 [raid456]
> [<c103b623>] ? del_timer_sync+0xa/0x14
> [<c103b6cb>] ? process_timeout+0x0/0x5
> [<f816206e>] ? md_thread+0xe1/0xf8 [md_mod]
> [<c104433a>] ? autoremove_wake_function+0x0/0x2d
> [<f8161f8d>] ? md_thread+0x0/0xf8 [md_mod]
> [<c1044108>] ? kthread+0x61/0x66
> [<c10440a7>] ? kthread+0x0/0x66
> [<c1003d47>] ? kernel_thread_helper+0x7/0x10
> Code: e9 9b 01 00 00 83 7c 24 7c 02 74 04 0f 0b eb fe f6 46 28 10 c7 46 3c 00
> 00 00 00 0f 85 7f 01 00 00 8b 44 24 38 39 44 24 70 7d 04 <0f> 0b eb fe 83 7c
> 24 7c 02 75 20 6b 84 24 a8 00 00 00 78 ff 44
> EIP: [<f818c811>] handle_stripe+0x89d/0x173e [raid456] SS:ESP 0068:f58d5e6c

If I am reading correctly, this is

	case check_state_compute_result:
		sh->check_state = check_state_idle;

		/* check that a write has not made the stripe insync */
		if (test_bit(STRIPE_INSYNC, &sh->state))
			break;

		/* now write out any block on a failed drive,
		 * or P or Q if they were recomputed
		 */
		BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

from the call chain handle_stripe -> handle_stripe6 ->
handle_parity_checks6.

I would be happy if v3.2-rc5~4^2~8 (md/raid5: abort any pending parity
operations when array fails, 2011-11-08) or some related change had
fixed it, but alas, that patch is already in 2.6.32-40.

So I don't have many ideas yet. Please attach the kernel log, starting
from boot-up, for the same boot in which the BUG above occurred.

Hope that helps,
Jonathan

--
Archive: http://lists.debian.org/20120605035529.GA3118@burratino