Re: [SMP] 2.4.5-ac13 memory corruption/deadlock?

2001-06-19 Thread Rico Tudor

Are you sure about bad memory?

Single-bit errors will be corrected; double-bit errors will generate NMI.
You can also find memory errors with an exerciser.  Unfortunately,
trusty memtest86 bombs on my ServerWorks machine.  Instead I use

http://www.qcc.sk.ca/~charlesc/software/memtester/

which runs in user-mode.  I diagnosed thermal problems by running
this utility.  Within 3 minutes of cold start, it raised main memory
temperature sufficiently to induce a hard error, which was detected
simultaneously by it and the hardware (NMI taken by kernel).
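
For readers unfamiliar with user-mode exercisers, here is a minimal sketch of the general idea: write a known pattern across a buffer, read it back, and count mismatches. This is not memtester's actual algorithm. The function names, the pattern set, and the buffer size are assumptions chosen for illustration, and a real exerciser would also lock the pages in memory and loop for much longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill buf with a pattern mixed with the word index, read it back,
 * and return the number of words that do not match.
 * (Hypothetical helper for this sketch.) */
static size_t pattern_pass(volatile uint32_t *buf, size_t words, uint32_t pattern)
{
    size_t bad = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern ^ (uint32_t)i;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != (pattern ^ (uint32_t)i))
            bad++;
    return bad;
}

/* Run a few patterns over a freshly allocated buffer of `bytes` bytes.
 * Returns the total number of mismatching words, or (size_t)-1 if the
 * allocation failed. */
size_t exercise_memory(size_t bytes)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xffffffffu, 0xaaaaaaaau, 0x55555555u
    };
    size_t words = bytes / sizeof(uint32_t);
    uint32_t *buf = malloc(words * sizeof(uint32_t));
    size_t bad = 0;

    if (buf == NULL)
        return (size_t)-1;
    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++)
        bad += pattern_pass(buf, words, patterns[p]);
    free(buf);
    return bad;
}
```

On healthy memory exercise_memory() should return 0 for any size that can be allocated. A nonzero count from user space is only a hint, since the process cannot choose which physical pages it gets.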

Can you recommend one of your (shorter) tests for me to try?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [SMP] 2.4.5-ac13 memory corruption/deadlock?

2001-06-19 Thread Bob Glamm

> > I've got a strange situation, and I'm looking for a little direction.
> > Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> > ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> > 512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
> > 6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
> > FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
> > are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
> > identical machines display these problems.
> 
> please add nmi_watchdog=1 as a kernel parameter (append=... in lilo) and try
> again.

Done, and I managed to get it to lock solid in under three hours.  Two
oopses from the syslog follow.  It looks like memory corruption: the
BUG() called from spin_lock() and spin_unlock() tests whether the
spinlock at the given address has the proper magic; apparently it no
longer does.  In this case the lock that has gotten mangled is
dcache_lock.
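
To make the magic check concrete, here is a small user-space sketch of the idea, not the kernel's spinlock.h.  Every name here (demo_lock, DEMO_LOCK_MAGIC and so on) is hypothetical, the magic value is only illustrative, and there is no real locking: the point is just that a sentinel word stored next to the lock state lets lock and unlock notice when something has scribbled over the structure.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sentinel value for this sketch only. */
#define DEMO_LOCK_MAGIC 0xdead4eadu

struct demo_lock {
    unsigned int locked;  /* 0 = free, 1 = held */
    unsigned int magic;   /* must always equal DEMO_LOCK_MAGIC */
};

#define DEMO_LOCK_INIT { 0, DEMO_LOCK_MAGIC }

/* True if the sentinel is intact, i.e. nothing has overwritten the lock. */
bool demo_lock_intact(const struct demo_lock *l)
{
    return l->magic == DEMO_LOCK_MAGIC;
}

/* Rough analogue of hitting BUG(): abort if the sentinel is damaged. */
void demo_lock_check_or_die(const struct demo_lock *l)
{
    assert(demo_lock_intact(l));
}

/* Single-threaded stand-ins for lock/unlock that refuse to touch a
 * corrupted lock and report failure instead of aborting. */
bool demo_lock_acquire(struct demo_lock *l)
{
    if (!demo_lock_intact(l))
        return false;
    l->locked = 1;
    return true;
}

bool demo_lock_release(struct demo_lock *l)
{
    if (!demo_lock_intact(l) || !l->locked)
        return false;
    l->locked = 0;
    return true;
}
```

A single flipped bit in the magic field is enough to make both calls fail, which is consistent with the two CPUs tripping over the same lock in the oopses below.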

Unfortunately, I don't think that this particular lockup is repeatable,
but I'm going to try again anyway to see if the same pattern of memory
corruption occurs.

-Bob

kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:113!
invalid operand: 
CPU:0
EIP:0010:[d_alloc+413/504]
EFLAGS: 00010286
eax: 0044   ebx: de9f811c   ecx: c027c088   edx: 869b
esi: c6e3fed1   edi: c190a14c   ebp: c6e3fee8   esp: c6e3fe94
ds: 0018   es: 0018   ss: 0018
Process sshd (pid: 565, stackpage=c6e3f000)
Stack: c0238840 0071 de38bd04 c7f5ee94 c6e3fed2 0004 c01dae97 c190a11c
   c6e3fee8 de38bd04 c7975f14 c6e3ff14 bca8 3532325b 35373138 c01d005d
   bca8 c6e3ff14 0010 de38bd04 c7975f14 c6e3fec8 0009 002274ff
Call Trace: [sock_map_fd+211/532] [mark_rdev_faulty+17/60] [sys_accept+197/252] 
[__free_pages+27/28] [free_pages+33/36]
   [poll_freewait+58/68] [do_select+523/548] [select_bits_free+10/16] 
[sys_select+1135/1148] [sys_socketcall+180/512] [system_call+51/56]

Code: 0f 0b 83 c4 08 8d b6 00 00 00 00 a0 c0 e2 27 c0 84 c0 7e 17
 eip: c0152f37 (d_lookup)
kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!
invalid operand: 
CPU:1
EIP:0010:[d_lookup+121/476]
EFLAGS: 00010282
eax: 0044   ebx: dffe9f68   ecx: c027c088   edx: 8a07
esi:    edi: c1933824   ebp: b818   esp: dffe9f04
ds: 0018   es: 0018   ss: 0018
Process init (pid: 1, stackpage=dffe9000)
Stack: c0238840 0065 dffe9f68  c1933824 b818 dff40a20 d228d001
   0023ee05 0003 c014850c c1932de4 dffe9f68 dffe9f68 c0148d09 c1932de4
   dffe9f68 0004 d228d000  dffe9fa4 b818 c01480ca 0009
Call Trace: [cached_lookup+16/84] [path_walk+889/3104] [getname+90/152] 
[__user_walk+60/88] [sys_stat64+22/120]
   [system_call+51/56]
eip: c021f2f4 (atomic_dec_and_lock)
kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!

Code: 0f 0b 83 c4 08 f0 fe 0d c0 e2 27 c0 0f 88 22 17 0d 00 8b 54
 invalid operand: 
Kernel panic: Attempted to kill init!

> > I've seen three variations of symptoms:
> >
> >   1) Almost complete lockout - machine responds to interrupts (indeed,
> >  it can even complete a TCP connection) but no userspace code gets
> >  executed.  Alt-SysRq-* still works, console scrollback does not;
> >   2) Partial lockout - lock_kernel() seems to be getting called without
> >  a corresponding unlock_kernel().  This manifested as programs such
> >  as 'ps' and 'top' getting stuck in kernel space;
> >   3) Unkillable programs - a test program that allocates 512M of memory
> >  and touches every page; running two copies of this simultaneously
> >  repeatedly results in at least one of the copies getting stuck
> >  in 'raid1_alloc_r1bh'.
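
For anyone trying to reproduce (3), the following is a rough sketch of what such a memory eater might look like; the original test program was not posted, so the function name, the use of sysconf() for the page size, and the one-byte-per-page write are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` bytes and write one byte into every page so that each
 * page is actually faulted in.  On success stores the buffer in *out and
 * returns the number of pages touched; on allocation failure stores NULL
 * and returns 0.  The caller frees the buffer. */
size_t eat_memory(size_t bytes, unsigned char **out)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page, touched = 0;
    unsigned char *buf;

    assert(ps > 0);
    page = (size_t)ps;

    buf = malloc(bytes);
    if (buf == NULL) {
        *out = NULL;
        return 0;
    }
    for (size_t off = 0; off < bytes; off += page) {
        buf[off] = (unsigned char)(off / page);
        touched++;
    }
    *out = buf;
    return touched;
}
```

A reproduction attempt would presumably call this with 512M and run two copies at once, as described above, so that the box is pushed into swap on the RAID 1 device.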
> >
> > Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> > were observed under 2.4.5-ac13 only.  I never get any PANICs, only
> > this variety of deadlocks.  A reboot is the only way to resolve the
> > problem.
> >
> > There seem to be two ways to trigger the problem.  As alluded to in
> > (3), running two copies of the memory eater simultaneously along with
> > calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
> > or two).  The other way is to run multiple copies of this script
> > (I run 10 simultaneous copies):
> >
> >   #!/bin/sh
> >
> >   while /bin/true; do
> >     ssh remote-machine 'sleep 1'
> >   done
> >
> > This script causes (1) in about a day or two.
> >
> > If anyone has any suggestions about how to proceed to figure out what
> > the problem is (or if there is already a fix), please let me know.
> > I would be more than willing to provide a wide range of cooperation on
> > this problem.  I don't have a feel for where to go from here, and I'm
> > hoping that someone with more experience can give me some
> > assistance.
> >
> > -Bob
