>>>>> "Dave" == Dave Kleikamp <[EMAIL PROTECTED]> writes:
>>
>> I know the server might have a hardware problem (is yet to be located
>> and confirmed). I also see these messages in the logs, but am uncertain
>> what they actually mean?
>>
>> Jun 27 01:04:40 boston kernel: find_entry called with index = 0
>> Jun 27 04:02:10 boston kernel: find_entry called with index = 0

> The directory contains a Btree containing the contents of the directory and
> a table needed to map each directory entry with a persistent "cookie"
> necessary to make readdirs work correctly as changes are made to the
> directory. That message indicates that some directory entries do not
> have a valid cookie.

> If you have any entries in lost+found, these are created without the cookie
> table. (We need to fix this.) If that's not the case, I'm not aware of
> any known problems resulting in this condition.

We do not have entries in lost+found, and JFS is constantly producing error
messages in our system log. Newer JFS versions produce more errors (and
crash more often) than earlier versions.

We have run a series of tests on our system, but none of them pointed to a
hardware failure. We have even run a low-level SCSI test (from the
stress-kernel package) that did not reveal any problems.

The JFS partition is located on a Mylex DAC960 RAID 5 array and shared via
NFS (SMP setup). Is anyone else using a similar configuration successfully?
For reference, ReiserFS had (still has?) trouble with NFS access. Could a
similar problem trouble JFS?

When JFS crashes and fsck is run, we lose files, often files that have not
been accessed for quite a while.

The latest crash (which occurred after a LOT of JFS log messages) was this
(2.4.18+1.0.20):

Jul 1 08:46:17 boston kernel: assert((__builtin_constant_p(COMMIT_New) ? constant_test_bit((COMMIT_New),(&(JFS_IP(ip)->cflag))) : variable_test_bit((COMMIT_New),(&(JFS_IP(ip)->cflag)))))
Jul 1 08:46:17 boston kernel: kernel BUG at jfs_txnmgr.c:2334!
Jul 1 08:46:17 boston kernel: invalid operand: 0000
Jul 1 08:46:17 boston kernel: CPU: 1
Jul 1 08:46:17 boston kernel: EIP: 0010:[eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-557248/96] Tainted: P
Jul 1 08:46:17 boston kernel: EIP: 0010:[<e0906f40>] Tainted: P
Jul 1 08:46:17 boston kernel: EFLAGS: 00010282
Jul 1 08:46:17 boston kernel: eax: 00000021 ebx: d9d84a20 ecx: c02793c0 edx: 002da98b
Jul 1 08:46:17 boston kernel: esi: 00000000 edi: e0934bb8 ebp: e090c837 esp: d78bbf64
Jul 1 08:46:17 boston kernel: ds: 0018 es: 0018 ss: 0018
Jul 1 08:46:17 boston kernel: Process jfsCommit (pid: 185, stackpage=d78bb000)
Jul 1 08:46:17 boston kernel: Stack: e090c843 0000091e 00000000 00000040 cf83a440 c0116a47 d78bbfb8 c181c000
Jul 1 08:46:17 boston kernel: 00000000 0000011a 00000001 e0905167 e09150f0 e0910de8 00000246 d78ba000
Jul 1 08:46:18 boston kernel: 00000001 e0907846 e0910de8 d78ba000 00000246 e0907a5e e0910de8 00000000
Jul 1 08:46:18 boston kernel: Call Trace: [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-534461/96] [schedule+1191/1376] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-564889/96] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554938/96] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554402/96]
Jul 1 08:46:18 boston kernel: Call Trace: [<e090c843>] [<c0116a47>] [<e0905167>] [<e0907846>] [<e0907a5e>]
Jul 1 08:46:18 boston kernel: [kernel_thread+38/48] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554768/96]
Jul 1 08:46:18 boston kernel: [<c0105876>] [<e09078f0>]
Jul 1 08:46:18 boston kernel:
Jul 1 08:46:18 boston kernel: Code: 0f 0b 59 8b 93 20 01 00 00 5f f0 0f b3 72 48 8b 54 24 40 52

Before this assertion, the two syslog entries quoted above were logged,
followed later by these:

Jun 28 01:02:58 boston kernel: find_entry called with index = 0
Jun 28 04:02:11 boston kernel: find_entry called with index = 0
Jun 28 09:18:47 boston kernel: diRead: di_nlink is zero. ino=132754
Jun 28 13:13:33 boston kernel: diRead: di_nlink is zero. ino=132754
Jun 28 14:33:38 boston kernel: diRead: di_nlink is zero. ino=132571

And then, when the backup starts (only a few are shown; there are too many
for email):

Jun 29 01:02:56 boston kernel: XT_GETPAGE: xtree page corrupt
Jun 29 01:02:56 boston kernel: xtLookup: xtSearch returned 5
Jun 29 01:03:15 boston kernel: DT_GETPAGE: dtree page corrupt
Jun 29 01:03:15 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:03:16 boston kernel: stack overrun in dtSearch!
Jun 29 01:03:16 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:03:19 boston kernel: MetaData crosses page boundary!!
Jun 29 01:03:19 boston kernel: bread failed!
Jun 29 01:03:19 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:03:27 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:04:28 boston kernel: stack overrun in dtSearch!
Jun 29 01:04:28 boston kernel: jfs_l0:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: stack overrun in dtSearch!
Jun 29 01:04:28 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:29 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:29 boston kernel: stack overrun in dtSearch!

More of the above errors (though not the "access beyond end of device"
ones) were logged before the assertion occurred.
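Since the messages are too numerous to post in full, a per-day tally by
message type may be easier to compare across kernel and JFS versions. What
follows is only a minimal sketch in Python, assuming the usual syslog line
layout seen in the excerpts above. The name summarize and the rule of
collapsing every number to "N" are illustrative choices of this sketch, not
part of JFS or of any existing tool.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host kernel: message".
LINE_RE = re.compile(r"^(\w{3}\s+\d+) [\d:]{8} \S+ kernel: (.*)$")


def summarize(lines):
    """Count kernel messages per (day, normalized message text)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel syslog line in the assumed layout
        day, msg = m.groups()
        # Collapse digit runs so that e.g. different inode numbers
        # fall into the same bucket (illustrative normalization).
        key = re.sub(r"\d+", "N", msg)
        counts[(day, key)] += 1
    return counts


# A few lines copied from the excerpts above, used as sample input.
sample = [
    "Jun 28 09:18:47 boston kernel: diRead: di_nlink is zero. ino=132754",
    "Jun 28 14:33:38 boston kernel: diRead: di_nlink is zero. ino=132571",
    "Jun 29 01:03:16 boston kernel: stack overrun in dtSearch!",
    "Jun 29 01:04:28 boston kernel: stack overrun in dtSearch!",
    "Jun 29 01:04:28 boston kernel: attempt to access beyond end of device",
]

for (day, msg), n in sorted(summarize(sample).items()):
    print(f"{day}  {n:3d}  {msg}")
```

In practice the sample list would be replaced by the lines of the real log
file; the path of that file depends on the local syslog configuration.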
Now, as a low-level SCSI test of a partition of the RAID disk didn't produce
any errors, it could be a combination of our components that triggers the
error. We have no idea how to isolate it further, though (memtest86 has been
run to verify memory). We tried to stress the kernel and the ext2 partition
with the stress-kernel package but didn't detect problems. The ext2 (system)
partition is located on the same RAID array as the JFS partition.

Bernhard Ege
-- 
Bernhard Mogens Ege, M.Sc.E.E.            E-mail: [EMAIL PROTECTED]
Dept. of Health Science and Technology    Direct call: +45 96 35 87 82
Aalborg University                        Switchboard: +45 96 35 80 80
Frederik Bajers Vej 7, D1-203             Fax: +45 98 15 40 08
DK-9220 Aalborg East                      http://www.hst.auc.dk/~bme
------------------------------------------------------------------------------
Home: Hadsund Landevej 454, DK-9260 Gistrup, Phone: +45 96365086, +45 51905086
_______________________________________________
Jfs-discussion mailing list
[EMAIL PROTECTED]
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion