Just to follow up on this situation, I think I've tracked it down
to a problem arising from a combination of SMP, the directory entry
cache, and NFS client code.  After several 24-hour runs of 10 copies
of

  'find /nfs-mounted-directory -print > /dev/null' 

running simultaneously, the kernel locks up or dies in fs/dcache.c
(in dput() or d_lookup(); on one occasion it triggered the BUG() on
line 129).

Performing the same 10 finds on a locally mounted ext2 filesystem
produces no lockups or hangs.
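
For anyone who wants to try reproducing this, here is a minimal sketch
of the test as a shell function.  The name run_finds and its arguments
are only illustrative, and it makes a single pass over the tree, so the
call would need to be wrapped in a loop to approximate a 24-hour run.

```shell
#!/bin/sh
# Illustrative helper (the name and arguments are assumptions, not part
# of the original test setup): start COPIES parallel 'find' processes
# over DIR and wait for all of them to finish.
#
# usage: run_finds DIR [COPIES]      (COPIES defaults to 10)
run_finds() {
    dir=$1
    copies=${2:-10}
    i=0
    while [ "$i" -lt "$copies" ]; do
        # Each find walks the whole tree; output is discarded.
        find "$dir" -print > /dev/null 2>&1 &
        i=$((i + 1))
    done
    # Block until every background find has exited.
    wait
}

# Example: run_finds /nfs-mounted-directory 10
```

Pointing the same function at a local ext2 directory gives the
comparison case described above.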

-Bob

> I've got a strange situation, and I'm looking for a little direction.
> Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> 512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
> 6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
> FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
> are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
> identical machines display these problems.
> 
> I've seen three variations of symptoms:
> 
>   1) Almost complete lockout - machine responds to interrupts (indeed,
>      it can even complete a TCP connection) but no userspace code gets
>      executed.  Alt-SysRq-* still works, console scrollback does not;
>   2) Partial lockout - lock_kernel() seems to be getting called without
>      a corresponding unlock_kernel().  This manifested as programs such
>      as 'ps' and 'top' getting stuck in kernel space;
>   3) Unkillable programs - a test program that allocates 512M of memory
>      and touches every page; running two copies of this simultaneously
>      repeatedly results in at least one of the copies getting stuck
>      in 'raid1_alloc_r1bh'.
> 
> Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> were observed under 2.4.5-ac13 only.  I never get any PANICs, only
> this variety of deadlocks.  A reboot is the only way to resolve the
> problem.
> 
> There seem to be two ways to trigger the problem.  As alluded to in
> (3), running two copies of the memory eater simultaneously along with
> calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
> or two).  The other way to trigger the problem is to run multiple
> copies of this script (I run 10 simultaneous copies):
> 
>   #!/bin/sh
> 
>   while /bin/true; do
>     ssh remote-machine 'sleep 1'
>   done
> 
> This script causes (1) in about a day or two.
> 
> If anyone has any suggestions about how to proceed to figure out what
> the problem is (or if there is already a fix), please let me know.
> I would be more than willing to provide a wide range of cooperation on
> this problem.  I don't have a feel for where to go from here, and I'm
> hoping that someone with more experience can give me some
> assistance.