[SMP] 2.4.5-ac13 through ac18 dcache/NFS conflict

2001-06-29 Thread Bob Glamm

Just to follow up on this situation, I think I've tracked it down
to a problem arising from a combination of SMP, the directory entry
cache, and NFS client code.  After several 24-hour runs of 10 copies
of

  'find /nfs-mounted-directory -print > /dev/null' 

running simultaneously, the kernel hangs or dies in fs/dcache.c,
in dput() or d_lookup(); on one occasion it triggered the BUG() on
line 129.

Performing the same 10 finds on a locally mounted ext2 filesystem
produces no lockups or hangs.
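
For anyone who wants to try reproducing this, the test above can be
sketched roughly as follows.  This is only a minimal sketch: the
function name run_parallel_finds and its two arguments are made up
for illustration, and it performs a single pass rather than a
24-hour run.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): start several
# concurrent copies of find over one directory tree and wait for all
# of them to finish.
#   $1  directory to walk (e.g. an NFS mount point)
#   $2  number of simultaneous copies, defaulting to 10
run_parallel_finds() {
    target=$1
    copies=${2:-10}
    i=0
    while [ "$i" -lt "$copies" ]; do
        # Each find runs in the background; its output is discarded.
        find "$target" -print > /dev/null &
        i=$((i + 1))
    done
    # Block until every background find has exited.
    wait
}
```

Calling "run_parallel_finds /nfs-mounted-directory 10" inside a
loop approximates the long runs; pointing it at a local ext2
directory gives the comparison case.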

-Bob

> I've got a strange situation, and I'm looking for a little direction.
> Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> 512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
> 6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
> FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
> are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
> identical machines display these problems.
> 
> I've seen three variations of symptoms:
> 
>   1) Almost complete lockout - machine responds to interrupts (indeed,
>      it can even complete a TCP connection) but no userspace code gets
>      executed.  Alt-SysRq-* still works, console scrollback does not;
>   2) Partial lockout - lock_kernel() seems to be getting called without
>      a corresponding unlock_kernel().  This manifested as programs such
>      as 'ps' and 'top' getting stuck in kernel space;
>   3) Unkillable programs - a test program that allocates 512M of memory
>      and touches every page; running two copies of this simultaneously
>      repeatedly results in at least one of the copies getting stuck
>      in 'raid1_alloc_r1bh'.
> 
> Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> were observed under 2.4.5-ac13 only.  I never get any PANICs, only
> this variety of deadlocks.  A reboot is the only way to resolve the
> problem.
> 
> There seem to be two ways to reproduce the problem.  As alluded to in
> (3), running two copies of the memory eater simultaneously along with
> calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
> or two).  The other way to reproduce the problem is to run multiple
> copies of this script (I run 10 simultaneous copies):
> 
>   #!/bin/sh
> 
>   while /bin/true; do
>     ssh remote-machine 'sleep 1'
>   done
> 
> This script causes (1) within a day or two.
> 
> If anyone has any suggestions about how to proceed to figure out what
> the problem is (or if there is already a fix), please let me know.
> I would be more than willing to provide a wide range of cooperation on
> this problem.  I don't have a feel for where to go from here, and I'm
> hoping that someone with more experience can give me some
> assistance.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


