>>>>> "Dave" == Dave Kleikamp <[EMAIL PROTECTED]> writes:

>> 
>> I know the server might have a hardware problem (yet to be located
>> and confirmed). I also see these messages in the logs, but am uncertain
>> what they actually mean.
>> 
>> Jun 27 01:04:40 boston kernel: find_entry called with index = 0
>> Jun 27 04:02:10 boston kernel: find_entry called with index = 0

> The directory is stored as a Btree of directory entries, plus a table that
> maps each entry to a persistent "cookie"; the cookie is needed so that
> readdir keeps working correctly while the directory is being modified.
> That message indicates that some directory entries do not have a valid
> cookie.

> If you have any entries in lost+found, these are created without the cookie
> table.  (We need to fix this.)  If that's not the case, I'm not aware of
> any known problems resulting in this condition.
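
An aside for anyone following along: below is a minimal userspace sketch of
the idea Dave describes. It is not the real JFS on-disk dtree/index format --
the flat table and the names dir_index_add()/dir_index_lookup() are made up
for illustration -- but it shows what the persistent cookie buys: each entry
gets a cookie when it is created, cookie 0 means "no valid cookie" (compare
the find_entry index = 0 message), and an entry's cookie keeps resolving
while other entries come and go, which is what readdir over NFS needs.

/* Toy model of persistent readdir cookies; NOT the real JFS structures. */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 16

struct dir_index_slot {
	unsigned long cookie;   /* persistent cookie, 0 = free / no valid cookie */
	char name[32];          /* directory entry the cookie maps to */
};

static struct dir_index_slot table[MAX_ENTRIES];
static unsigned long next_cookie = 2;   /* 0 and 1 kept reserved, as for . and .. */

static unsigned long dir_index_add(const char *name)
{
	int i;

	for (i = 0; i < MAX_ENTRIES; i++) {
		if (table[i].cookie == 0) {
			table[i].cookie = next_cookie++;
			snprintf(table[i].name, sizeof(table[i].name), "%s", name);
			return table[i].cookie;
		}
	}
	return 0;               /* table full */
}

static const char *dir_index_lookup(unsigned long cookie)
{
	int i;

	if (cookie == 0)        /* cf. "find_entry called with index = 0" */
		return NULL;
	for (i = 0; i < MAX_ENTRIES; i++)
		if (table[i].cookie == cookie)
			return table[i].name;
	return NULL;
}

int main(void)
{
	unsigned long ca = dir_index_add("fileA");
	unsigned long cb = dir_index_add("fileB");

	memset(&table[0], 0, sizeof(table[0]));  /* remove fileA; fileB keeps its cookie */

	printf("cookie %lu -> %s\n", cb, dir_index_lookup(cb));          /* still fileB */
	printf("cookie %lu -> %s\n", ca,
	       dir_index_lookup(ca) ? dir_index_lookup(ca) : "(stale)"); /* now stale */
	return 0;
}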

We do not have any entries in lost+found, and JFS is constantly producing
error messages in our system log. Newer JFS versions produce more
errors (and crash more often) than earlier versions.

We have run a series of tests on our system, but none of them have been
able to confirm a hardware failure. We have even run a low-level
SCSI test (from the stress-kernel package) that did not reveal
any problems.

The JFS partition is located on a Mylex DAC960 RAID 5 array and shared
via NFS (SMP setup). Is anyone else using a similar configuration
successfully?

For reference, ReiserFS had (still has?) trouble with NFS
access. Could a similar problem affect JFS?

When JFS crashes and fsck is run, we lose files, often files that have
not been accessed for quite a while.

The latest crash (which occurred after a LOT of JFS log messages) was
this, running 2.4.18+1.0.20; a note on the assert line follows the dump:

Jul  1 08:46:17 boston kernel: assert((__builtin_constant_p(COMMIT_New) ? constant_test_bit((COMMIT_New),(&(JFS_IP(ip)->cflag))) : variable_test_bit((COMMIT_New),(&(JFS_IP(ip)->cflag)))))
Jul  1 08:46:17 boston kernel: kernel BUG at jfs_txnmgr.c:2334!
Jul  1 08:46:17 boston kernel: invalid operand: 0000
Jul  1 08:46:17 boston kernel: CPU:    1
Jul  1 08:46:17 boston kernel: EIP:    0010:[eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-557248/96]    Tainted: P
Jul  1 08:46:17 boston kernel: EIP:    0010:[<e0906f40>]    Tainted: P
Jul  1 08:46:17 boston kernel: EFLAGS: 00010282
Jul  1 08:46:17 boston kernel: eax: 00000021   ebx: d9d84a20   ecx: c02793c0   edx: 002da98b
Jul  1 08:46:17 boston kernel: esi: 00000000   edi: e0934bb8   ebp: e090c837   esp: d78bbf64
Jul  1 08:46:17 boston kernel: ds: 0018   es: 0018   ss: 0018
Jul  1 08:46:17 boston kernel: Process jfsCommit (pid: 185, stackpage=d78bb000)
Jul  1 08:46:17 boston kernel: Stack: e090c843 0000091e 00000000 00000040 cf83a440 c0116a47 d78bbfb8 c181c000
Jul  1 08:46:17 boston kernel:        00000000 0000011a 00000001 e0905167 e09150f0 e0910de8 00000246 d78ba000
Jul  1 08:46:18 boston kernel:        00000001 e0907846 e0910de8 d78ba000 00000246 e0907a5e e0910de8 00000000
Jul  1 08:46:18 boston kernel: Call Trace: [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-534461/96] [schedule+1191/1376] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-564889/96] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554938/96] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554402/96]
Jul  1 08:46:18 boston kernel: Call Trace: [<e090c843>] [<c0116a47>] [<e0905167>] [<e0907846>] [<e0907a5e>]
Jul  1 08:46:18 boston kernel:    [kernel_thread+38/48] [eepro100:__insmod_eepro100_O/lib/modules/2.4.18/kernel/drivers/net/e+-554768/96]
Jul  1 08:46:18 boston kernel:    [<c0105876>] [<e09078f0>]
Jul  1 08:46:18 boston kernel:
Jul  1 08:46:18 boston kernel: Code: 0f 0b 59 8b 93 20 01 00 00 5f f0 0f b3 72 48 8b 54 24 40 52
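
For what it's worth, the assert text in that dump is just the i386 test_bit()
macro after preprocessing (__builtin_constant_p() selects between the constant
and variable bit-test helpers at compile time), so the failing line at
jfs_txnmgr.c:2334 presumably reads something like

	assert(test_bit(COMMIT_New, &JFS_IP(ip)->cflag));

i.e. the jfsCommit thread hit an inode whose COMMIT_New commit flag was not
set where the code expected it to be.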

Before this assertion the two syslog entries quoted above were shown, and
later these:

Jun 28 01:02:58 boston kernel: find_entry called with index = 0
Jun 28 04:02:11 boston kernel: find_entry called with index = 0
Jun 28 09:18:47 boston kernel: diRead: di_nlink is zero. ino=132754
Jun 28 13:13:33 boston kernel: diRead: di_nlink is zero. ino=132754
Jun 28 14:33:38 boston kernel: diRead: di_nlink is zero. ino=132571

And then, when the backup kicks in (only a few shown; there are too many
for email):

Jun 29 01:02:56 boston kernel: XT_GETPAGE: xtree page corrupt
Jun 29 01:02:56 boston kernel: xtLookup: xtSearch returned 5

Jun 29 01:03:15 boston kernel: DT_GETPAGE: dtree page corrupt
Jun 29 01:03:15 boston kernel: jfs_lookup: dtSearch returned 5

Jun 29 01:03:16 boston kernel: stack overrun in dtSearch!
Jun 29 01:03:16 boston kernel: jfs_lookup: dtSearch returned 5

Jun 29 01:03:19 boston kernel: MetaData crosses page boundary!!
Jun 29 01:03:19 boston kernel: bread failed!
Jun 29 01:03:19 boston kernel: jfs_lookup: dtSearch returned 5

Jun 29 01:03:27 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:04:28 boston kernel: stack overrun in dtSearch!
Jun 29 01:04:28 boston kernel: jfs_l0:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: stack overrun in dtSearch!
Jun 29 01:04:28 boston kernel: jfs_lookup: dtSearch returned 5
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:28 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:28 boston kernel: attempt to access beyond end of device
Jun 29 01:04:29 boston kernel: 30:06: rw=0, want=297009428, limit=137058516
Jun 29 01:04:29 boston kernel: stack overrun in dtSearch!
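
As a sanity check on those numbers: in 2.4 the want/limit values in that
message are, as far as I can tell, 1 KB block counts, so the filesystem asked
for data about 297009428 KB (~283 GiB) into a device whose size is
137058516 KB (~131 GiB) -- roughly 2.2 times the size of the device. That
looks like a garbage block address coming out of the corrupt xtree rather
than any sane request. (30:06 is hex for major 48, minor 6, i.e. a partition
on the DAC960 array, which matches the setup described above.)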

More of the above errors (though not the "attempt to access beyond end of
device" ones) were shown before the assertion occurred.

Now, since a low-level SCSI test of a partition on the RAID disk didn't
produce any errors, it could be a combination of our components that
triggers the problem. We have no idea how to isolate it further, though
(memtest86 has been run to verify the memory). We also stressed the
kernel and an ext2 partition with the stress-kernel package and didn't
detect any problems. The ext2 (system) partition is located on the same
RAID array as the JFS partition.

Bernhard Ege

-- 
 Bernhard Mogens Ege, M.Sc.E.E.                E-mail:        [EMAIL PROTECTED]
 Dept. of Health Science and Technology        Direct call:   +45 96 35 87 82
 Aalborg University                            Switchboard:   +45 96 35 80 80
 Frederik Bajers Vej 7, D1-203                 Fax:           +45 98 15 40 08
 DK-9220 Aalborg East                            http://www.hst.auc.dk/~bme
------------------------------------------------------------------------------
Home: Hadsund Landevej 454, DK-9260 Gistrup, Phone: +45 96365086, +45 51905086
