Hello all,

We've come across a rather annoying behavior of systems hanging due to SCSI 
activity on BusLogic and Adaptec cards.  I originally thought it was excessive 
heat, but now I'm not so sure.  The systems are dual 400 MHz PII on the Intel 
NightShade board running the 2.0.36 kernel.  Has anyone else seen this problem 
and come up with a solution?  See the forwarded email below for more info.

I've also run into the problem of not being able to disable devices via the 
SSU (Intel's System Setup Utility), nor can I effectively change the 
IRQs of devices with it.  Has anyone had success with this 
thing?

Thanks for the info,
Dan
___________________________________________________________________________
Dan Yocum                       | Phone:  (630) 840-8525
Linux/Unix System Administrator | Fax:    (630) 840-6345
Computing Division  OSS/FSS     | email:  [EMAIL PROTECTED]            .~.   L
Fermi National Accelerator Lab  | WWW:    www-oss.fnal.gov/~yocum/  /V\   I
P.O. Box 500                    |                                  // \\  N
Batavia, IL  60510              |      "TANSTAAFL"                /(   )\ U
________________________________|_________________________________ ^`~'^__X_


------- Forwarded Message

Return-Path: [EMAIL PROTECTED]
Received: from FNAL.FNAL.Gov (fnal.fnal.gov [131.225.9.8])
        by sapphire.fnal.gov (8.8.7/8.8.7) with ESMTP id KAA21682
        for <[EMAIL PROTECTED]>; Wed, 7 Apr 1999 10:23:55 -0500
Received: from fndaub.fnal.gov ("port 13509"@fndaub.fnal.gov)
 by FNAL.FNAL.GOV (PMDF V5.1-12 #3998)
 with ESMTP id <[EMAIL PROTECTED]> for [EMAIL PROTECTED];
 Wed, 7 Apr 1999 10:23:53 -0500 CDT
Received: (from djholm@localhost)
 by fndaub.fnal.gov (980427.SGI.8.8.8/970903.SGI.AUTOCF) id KAA10098; Wed,
 07 Apr 1999 10:23:52 -0500 (CDT)
Date: Wed, 07 Apr 1999 10:23:52 -0500 (CDT)
From: Don Holmgren <[EMAIL PROTECTED]>
Subject: Re: Dual CPU Comark systems hanging in e831...
In-reply-to: <[EMAIL PROTECTED]>
To: "Harry W. K. Cheung" <[EMAIL PROTECTED]>
Cc: Dan Yocum <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Message-id: <[EMAIL PROTECTED]>
MIME-version: 1.0
Content-type: TEXT/PLAIN; charset=US-ASCII


Hi -

We've been using systems based on the Nightshade motherboard fairly 
heavily to read/write tape drives on the EMASS robot and have run into 
some hangs as well.  I originally suspected the Buslogic BT958D cards we 
were using, but now we're seeing the same hangs with Adaptec 2944 cards.

We have also used external disks on an Adaptec 2940U2W card, but have 
seen no hangs.  I'll exercise these heavily and see if I can provoke a 
problem.

Before seeing your mails I was fairly certain the problem with our tapes 
was related to the various SCSI layers in the kernel.  I suppose there 
could be hardware problems with these systems, and note that our common 
experience is problems on external SCSI buses.

After the next hang, could you do the following:
  'cat /proc/scsi/Buslogic/2'
(that "2" might be another number) and send me the results?  What I see 
on our systems in this section:

                           DATA TRANSFER STATISTICS

Target  Tagged Queuing  Queue Depth  Active  Attempted  Completed
======  ==============  ===========  ======  =========  =========
   3    Not Supported         3         0        42181      42181
   4    Not Supported         3         0        27513      27513
   5    Not Supported         3         0        21025      21025

is that the "Attempted" numbers are 1 greater than the "Completed" for 
one or more targets - that is, the interface, the kernel, or an external 
device is hanging on a command.  Again, since I get similar behaviour on 
Adaptec cards (without the nice reports in /proc/scsi), I suspect a bug 
in the kernel SCSI layers above the driver.
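
If it saves some eyeballing, the Attempted/Completed comparison can be 
scripted.  What follows is only a rough sketch: the function name 
stuck_targets is made up for illustration, and the row layout is assumed 
from the excerpt above, so other driver versions may print the table 
differently.

```python
import re

# One table row, e.g. "   3    Not Supported         3         0        42181      42181".
# Column layout is an assumption taken from the excerpt in this mail.
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def stuck_targets(text):
    """Return (target, attempted, completed) for rows where attempted > completed."""
    stuck = []
    in_table = False
    for line in text.splitlines():
        # The table starts right after the header naming both columns.
        if "Attempted" in line and "Completed" in line:
            in_table = True
            continue
        if not in_table:
            continue
        # Skip the "======" separator under the header.
        if line.lstrip().startswith("="):
            continue
        m = ROW.match(line)
        if m is None:
            break  # first non-row line ends this table
        target = int(m.group(1))
        attempted = int(m.group(5))
        completed = int(m.group(6))
        if attempted > completed:
            stuck.append((target, attempted, completed))
    return stuck
```

Feed it the text from the /proc/scsi file (for example the contents of 
open(path).read()); an empty list means every target has completed all the 
commands it attempted.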

You should also check the temperatures of your systems.  Fetch the 
following file:
  ftp://linux-rep.fnal.gov/pub/ipmi/sdrread
and run it as root (chmod +x first).  It prints out a bunch of 
information, but at the bottom will be a section like:

sdr 0: -12V: -12.21  C0
sdr 1: Proc1-VID: 4.83 
sdr 2: Proc2-VID: 4.83 
sdr 3: BB Temp1: 41.00  C0
sdr 4: BB Temp2: 38.00  C0
sdr 5: CPU1 Temp: 41.00  C0
sdr 6: CPU2 Temp: 39.00  C0
sdr 7: CPU Fan1: 4650.00  C0
sdr 8: CPU Fan2: 4800.00  C0
sdr 9: 5V: 5.24  C0
sdr 10: -5V: -5.05  C0
sdr 11: 12V: 11.90  C0
sdr 12: 3.3V: 3.37  C0
sdr 13: CPU1 Voltage: 2.02  C0
sdr 14: CPU2 Voltage: 2.03  C0
sdr 15: 2.5V: 2.53  C0
sdr 16: 1.5V: 1.51  C0
sdr 17: SCWA Term1: 2.88  C0
sdr 18: SCWA Term2: 2.87  C0
sdr 19: SCWA Term3: 2.85  C0
sdr 20: SCNA Term1: 2.85  C0
sdr 21: 5V_stndby: 5.00  C0
sdr 22: Proc1 Stat:  80 80
sdr 23: BMC-FP-NMI:  00 80
sdr 24: BMC-Watchdog:  00 80
sdr 25: Proc2 Stat:  80 80
sdr 26: DIMM1 Pres:  C1
sdr 27: DIMM2 Pres:  C0
sdr 28: DIMM3 Pres:  C0
sdr 29: DIMM4 Pres:  C0
sdr 30: Post Error:  C0
sdr 31: Chasis Intruid:  C1
sdr 32: not a sensor

The BB and CPU temperatures should be reasonable - not over 70 certainly 
(I'll look up the accepted range for Intel).  We've had some systems 
under load with higher temperatures - reseating the CPU fans fixed these.
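
To check a batch of machines, the temperature lines can be picked out of 
the sdrread output with a few lines of Python.  Again this is just a 
sketch: hot_sensors is a hypothetical name, the line format is assumed 
from the listing above, and the default limit of 70 is the rough ceiling 
I mentioned, not an official Intel figure.

```python
import re

# One sensor line, e.g. "sdr 5: CPU1 Temp: 41.00  C0".
# Format is an assumption based on the listing in this mail.
SDR_LINE = re.compile(r"^sdr\s+\d+:\s+(.+?):\s+(-?\d+(?:\.\d+)?)\b")


def hot_sensors(text, limit=70.0):
    """Return (name, value) for sensors with "Temp" in the name reading above limit."""
    hot = []
    for line in text.splitlines():
        m = SDR_LINE.match(line.strip())
        # Lines without a numeric reading ("DIMM1 Pres:  C1", "not a sensor") do not match.
        if m and "Temp" in m.group(1):
            value = float(m.group(2))
            if value > limit:
                hot.append((m.group(1), value))
    return hot
```

Save the sdrread output to a file and pass its contents to hot_sensors(); 
an empty list means none of the BB or CPU temperatures is over the limit.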

Don Holmgren





On Tue, 6 Apr 1999, Harry W. K. Cheung wrote:
...
> 
> 
> The problem is that the PC will hang with an external SCSI disk's
> activity light on. One of the systems, which has an internal SCSI 8mm
> Eliant 820 tape drive, has also hung with the SCSI activity light lit
> on the tape drive. In this case I can get into a console
> and log on as root to the IDE drive. However, I cannot shut down the system,
> as it tries to do a SCSI bus reset, fails, aborts, and keeps going
> like that. In other hangs with the external drive the PC does not respond
> at all.
> 
> All three systems ran okay for a long time with just Monte Carlo jobs that
> do not access the disk much. Currently we are running data analysis jobs
> that access the disk more, across the network (NFS) and locally. Sometimes
> we spool data off tape using the tape drive across the network. It doesn't
> have to have many jobs running nor is the disk access that heavy when the
> systems hang. It's random in that I cannot predictably cause them to hang.
> Two of the systems each hung once today (so far!), so it does happen
> often.
> 
> I tried changing SCSI cables to much shorter ones in two of the systems.
> The third system has only one external SCSI disk on a 2 foot cable. All
> systems still hung as before. In case it helps the systems are given below:
> 
> all three systems have:
> 
> BXN440BX Night shade Motherboard with dual 400 MHz PII
>    supposedly including heatsink and active fan though I haven't looked inside
> 80mm DC fan
> st38641A internal Seagate IDE drive
> qm309100td-s internal Quantum SCSI drive
> xm6201b-s internal SCSI CDROM drive
> bt-958 Buslogic wide SCSI controller
> 3c900-com-in 3COM 3c900 combo ethernet card
> 
> one of the systems has an internal 8mm SCSI drive plus 3 external SCSI disks
> the second system has an external SCSI disk and a SCSI (narrow) scanner
> and the third system has an external SCSI disk only.

------- End of Forwarded Message



-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]
