Hello,

In the last couple of days one of my production servers started rebooting due to a kernel panic. I believe this could be related to something in the reiserfs file system that is causing the kernel to panic. The panic also corrupts some system files that are being heavily accessed when it occurs.

I will detail the scenario as best I can below. I was able to find and replicate what is causing the panic, but because the server is in production I have refrained from extensive testing until I can schedule an outage window. I have also refrained from trying to repair the file system errors, to avoid making an uninformed attempt that could cause more harm than good.

System Details:

Dell PowerEdge 2650 - Dual Intel Xeon 2.8GHz
PERC3di SCSI-RAID Controller using the aacraid driver on a RAID-10 raid set
Red Hat/Adaptec aacraid driver (1.1-3 Aug 4 2004 12:11:35)
Fedora Core 1
Kernel: 2.4.22-1.2199.nptlsmp

Tracking down error messages has been difficult because the system's syslog appears to fail to record the kernel error messages. I was, however, able to find some error messages in the log of a scheduled job that runs on the server and repeatably triggers the kernel panic. I was also able to take a screenshot of part of the kernel panic message using the remote access console (no serial console as of yet).

Kernel Panic Message:

EIP: 0060:[<c011cbea>] Not tainted
EFLAGS: 00010206

EIP is at do_page_fault [kernel] 0x26a (2.4.22-1.2199.nptlsmp)
eax: 00000013 ebx: 73747000 ecx: c0374888 edx: 00006912
esi: f7facca4 edi: f7ffa000 ebp: 0000000f esp: f7ffbe18
ds: 0068 es: 0068 ss: 0068
Process init (pid: 1, stackpage=f7ffb000)
Stack: c02a68af 73747069 00000000 f7ffbee8 00000000 f88630bf 00000001 1680f54c
       00000003 00000017 001b657a 00000000 00000206 c0376730 00030001 00000000
       c037667c 00000286 00000001 f1dca8c0 00000000 00000000 00000003 f1dca8c0
Call Trace:
[<f88630bf>] check_journal_end [reiserfs] 0x16f (0xf7ffbe2c)
[<c011f4bc>] schedule [kernel] 0x3fc (0xf7ffbe90)
[<c011c980>] do_page_fault [kernel] 0x0 (0xf7ffbed0)
[<c0109c18>] error_code [kernel] 0x34 (0xf7ffbed8)
[<c0163f23>] poll_freewait [kernel] 0x23 (0xf7ffbf0c)
[<c0164251>] do_select [kernel] 0x151 (0xf7ffbf24)
[<c01646ce>] sys_select [kernel] 0x34e (0xf7ffbf60)
[<c015a279>] sys_fstat64 [kernel] 0x49 (0xf7ffbfa8)
[<c0109b27>] system_call [kernel] 0x33 (0xf7ffbfc0)

Code: 8b 9c ab 00 00 00 c0 c7 04 24 c0 68 2a c0 89 5c 24 04 e8 ef
<0>Kernel panic: Attempted to kill init!

I am able to reproduce the kernel panic by running the prelinking and slocate daily cron jobs. The log for the prelinking job appears to contain some syslog messages regarding reiserfs errors. It appears that this information was concatenated with the prelinking log due to corruption, since the end of the file is filled with garbage binary data. Here are the errors listed in the prelinking log.

/usr/lib/libtiff.so.3.5 0040Aug 23 21:02:09 mail01 syslogd 1.4.1: restart.
Aug 23 21:02:10 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:02:15 mail01 last message repeated 12 times
Aug 23 21:02:16 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:02:18 mail01 last message repeated 20 times
Aug 23 21:02:22 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:02:32 mail01 last message repeated 24 times
Aug 23 21:02:33 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:02:33 mail01 last message repeated 5 times
Aug 23 21:02:35 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:02:35 mail01 last message repeated 7 times
Aug 23 21:02:36 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:02:36 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:02:39 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:02:42 mail01 last message repeated 8 times
Aug 23 21:02:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 29)
Aug 23 21:02:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 29)
Aug 23 21:02:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:02:43 mail01 last message repeated 3 times
Aug 23 21:02:44 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:02:44 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 28)
Aug 23 21:02:44 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 28)
Aug 23 21:02:44 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 29)
Aug 23 21:02:44 mail01 last message repeated 2 times
Aug 23 21:02:58 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:03:29 mail01 last message repeated 45 times
Aug 23 21:03:40 mail01 last message repeated 10 times
Aug 23 21:03:40 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 26)
Aug 23 21:03:40 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:03:40 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:03:41 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:03:41 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:03:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)
Aug 23 21:03:43 mail01 last message repeated 2 times
Aug 23 21:03:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:03:43 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)
Aug 23 21:03:44 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: stat data of object [1148 1150 0Aug 25 22:01:06 mail01 syslogd 1.4.1: restart.
[UNREADABLE BINARY DATA]

After resetting the system and telling it to do an integrity check of the local filesystems, reiserfs doesn't complain much about the filesystem. Here are the contents of the boot log from when reiserfs checks and mounts the filesystems.

Partition check:
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
reiserfs: found format "3.6" with standard journal
reiserfs: checking transaction log (device sd(8,5)) ... for (sd(8,5))
reiserfs: replayed 3 transactions in 0 seconds
sd(8,5):Using r5 hash to sort names
Freeing unused kernel memory: 168k freed
attempt to access beyond end of device
08:05: rw=0, want=4192936, limit=4192933
sd(8,5):Removing [38665 245093 0x0 SD]..done
sd(8,5):Removing [38665 245085 0x0 SD]..done
sd(8,5):There were 2 uncompleted unlinks/truncates. Completed
Adding Swap: 8385920k swap-space (priority -1)
reiserfs: found format "3.6" with standard journal
reiserfs: checking transaction log (device sd(8,2)) ... for (sd(8,2))
sd(8,2):Using r5 hash to sort names
reiserfs: found format "3.6" with standard journal
reiserfs: checking transaction log (device sd(8,6)) ... for (sd(8,6))
sd(8,6):Using r5 hash to sort names
sd(8,6):Removing [619 1807083 0x0 SD]..done
sd(8,6):There were 1 uncompleted unlinks/truncates. Completed
reiserfs: found format "3.6" with standard journal
reiserfs: checking transaction log (device sd(8,7)) ... for (sd(8,7))
sd(8,7):Using r5 hash to sort names

My main question is: could the file system corruption indicated by the reiserfs_update_sd error messages be the likely root cause of the kernel panic? The panic message seems to point to check_journal_end from journal.c in reiserfs (that is a complete layman's understanding of the panic message on my part). If it is the cause of the panic, would repairing the file system be adequate to prevent this from happening again?
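
In case it helps in judging how widespread the problem is, here is a rough Python sketch of how the reiserfs_update_sd warnings quoted above could be tallied per "pos" value, counting each syslogd "last message repeated N times" line as N more occurrences of the warning before it. The function and variable names are my own invention, and the sample is only the first four entries from the log above, so treat it as an illustration rather than a finished tool.

```python
import re
from collections import Counter

# Matches the vs-13060 warning and captures its "pos" value.
WARNING = re.compile(r"vs-13060: reiserfs_update_sd: .*not found \(pos (\d+)\)")
# Matches syslogd's compression of identical consecutive messages.
REPEAT = re.compile(r"last message repeated (\d+) times")


def tally_positions(lines):
    """Count vs-13060 warnings per pos, expanding 'repeated N times' lines."""
    counts = Counter()
    last_pos = None  # pos of the most recent warning, if any
    for line in lines:
        warning = WARNING.search(line)
        repeat = REPEAT.search(line)
        if warning:
            last_pos = int(warning.group(1))
            counts[last_pos] += 1
        elif repeat:
            # Only attribute repeats when the previous message was a warning.
            if last_pos is not None:
                counts[last_pos] += int(repeat.group(1))
        else:
            # Any unrelated line (e.g. a syslogd restart) breaks the chain.
            last_pos = None
    return counts


# First four entries from the prelinking log quoted above.
sample = [
    "Aug 23 21:02:10 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: "
    "stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 25)",
    "Aug 23 21:02:15 mail01 last message repeated 12 times",
    "Aug 23 21:02:16 mail01 kernel: sd(8,6):vs-13060: reiserfs_update_sd: "
    "stat data of object [1148 1150 0x0 SD] (nlink == 1) not found (pos 27)",
    "Aug 23 21:02:18 mail01 last message repeated 20 times",
]

for pos, count in sorted(tally_positions(sample).items()):
    print(f"pos {pos}: {count} warnings")
```

Every warning in the excerpt refers to the same object, [1148 1150 0x0 SD], on sd(8,6), so the tally only varies by position.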

Also, what is the recommended method for repairing this error? From my research, running reiserfsck --rebuild-tree appears to be the commonly recommended process; is it appropriate in this case? I assume that running --check and --fix-fixable prior to doing this is appropriate, but would --fix-fixable actually repair this problem?

Sorry for the long message; I wanted to include all the details I was able to observe. Any help or advice is greatly appreciated. If I have left out anything that would be pertinent to debugging this problem, please let me know and I will attempt to retrieve the needed information.

Take care.

-- Sean
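
P.S. To make the repair question concrete, here is a small Python sketch of the order I have in mind. It only prints the commands and runs nothing. The mapping from sd(8,6) to /dev/sda6 is my assumption based on the usual SCSI disk numbering (major 8, 16 minors per disk), and the helper names are made up for illustration. Whether the later steps are needed at all would depend on what --check reports, and as I understand it the partition has to be unmounted before any of this.

```python
def sd_to_devnode(major, minor):
    """Map a kernel sd(major,minor) pair to a /dev node name.

    Assumption: SCSI disk major 8, where each disk owns 16 minors
    (minor 0 is the whole disk sda, minor 6 is sda6, minor 16 is sdb).
    """
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return "/dev/" + name + (str(part) if part else "")


def plan_repair(device):
    """Return reiserfsck command lines in escalating order (dry run only)."""
    steps = ["--check", "--fix-fixable", "--rebuild-tree"]
    return [" ".join(["reiserfsck", step, device]) for step in steps]


# sd(8,6) is the device named in the vs-13060 warnings.
device = sd_to_devnode(8, 6)
for command in plan_repair(device):
    print(command)
```

If this ordering is wrong, or if one of the steps should be skipped for this kind of error, that is exactly the advice I am looking for.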