a memory + SCSI error??

Susan G. Kleinmann 5 Aug 1996 20:32:30 -0000

My system has become increasingly flaky in the last 36 hours, and I
need whatever advice anyone might have to offer quickly!


I wouldn't bring this to the list, except that I believe that whatever 
problem I'm having may well affect others.

History:
On Saturday I got carried away with housecleaning:  I started with the
study, but then decided to clean the computers, and then I decided to
clean the _insides_ of the computers.  That now looks like a big mistake.

(The system has an Intel Plato motherboard with an NCR SCSI controller,
two 1 GByte SCSI disks, 1 SCSI CD-ROM, and (externally) a Micropolis 
9 GByte disk drive and an Exabyte tape drive.  This particular Micropolis
has been well behaved, but I had another Micropolis 9 GByte which died
quickly, so now I think of them as delicate disk drives.  The system 
has 48 MBytes of RAM, though it is rare that I ever see more than 
2 MBytes of RAM free when I run vmstat -- an issue I'd been meaning to 
investigate for months.  This seems odd to me since I am almost
always the only user.)

It's been a while since I've backed up the system.

After I cleaned it, the system seemed to boot up normally.
But the next morning, I saw a message:
   KERNEL PANIC, and something about a SCSI I/O error.  
I tried to log out so I could reboot normally,
but couldn't do that.  So I shut down the system, and when I tried to
reboot, I was of course greeted by a message that said I'd have to run
fsck manually.  When I did that, I was asked to agree to a bunch of
actions which I didn't understand, but agreed to anyway.  Then the report
came up that a file system (/dev/sdc1) had been changed.

I was unable to boot after the fsck process.  Using emergency boot disks
(made with Bruce's boot-floppies package), I was able to boot enough
to remove /dev/sdc1 from /etc/fstab, and then ran mke2fs on /dev/sdc1,
which allowed me to boot normally again.

Having come this close to a crisis, I decided to back up the system. (!)
This has resulted in more and worse failures.
1.  I used tob with the command:
   bash /sbin/tob -rc /etc/tob/tob.rc.afioz -full all 

    I left the room, came back after a couple of hours, and saw that the
    process seemed to have made some progress, but then just stopped.
    The afio process was marked 'D' in the ps -ax output.

    Again, I tried to shut down, couldn't, had to turn off the machine,
    went through another fsck exercise, and finally rebooted.

2.  I tried tob again with the command:
   bash /sbin/tob -rc /etc/tob/tob.rc.afioz.TEST -full all 

    The process seemed to start up normally, so I let it go for a few hours.

    Same result.  I took notes this time:
  PID TTY STAT  TIME COMMAND
  375  p1 S    0:00 -bash 
  484  p1 S    0:00 bash /sbin/tob -rc /etc/tob/tob.rc.afioz.TEST -full all 
  515  p1 D    2:50 afio -Zvo /tmp/tob.out 

    I also noted that the root file system (on which /tmp was mounted)
    was full.  So I removed some files that I thought might alleviate
    the problem, but I was never able to reawaken tob.

3.  In desperation, I tried plain old tar, but now that seems to have 
    done exactly the same thing:  i.e., it started out well, the disks
    and tape whirred a lot, but now the process seems to have hung.
    Here's the ps -ax output:

  PID TTY STAT  TIME COMMAND
22671  p2 D    0:24 tar -cvf /dev/st0 /home 


    Now one of my xterms seems to have some very weird settings -- the
    characters have all changed from a-z to some graphics set, and 
    'stty sane' doesn't help.  

    The root file system is not full, and vmstat reports about 1.8 MBytes
    free (which is maddening, but typical).

    I do not believe that I can reboot the system now without serious
    loss of data.
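
For reference, the hung processes in the notes above are the ones whose
STAT column reads 'D' (uninterruptible sleep, normally a process waiting
on I/O).  A minimal sketch for picking them out of a listing, assuming a
Bourne-style shell with awk and the same column layout as the ps -ax
output above; the helper name list_d_state is made up for illustration:

```shell
# list_d_state: hypothetical helper, not part of ps or tob.
# Reads "ps -ax" style output on stdin and prints only the rows whose
# third column (STAT) begins with 'D'.  Assumes the PID/TTY/STAT/TIME/COMMAND
# column order shown in the notes above.
list_d_state() {
    awk 'NR > 1 && $3 ~ /^D/ { print }'
}

# Sample rows copied from the notes above.
sample='  PID TTY STAT  TIME COMMAND
  375  p1 S    0:00 -bash
  484  p1 S    0:00 bash /sbin/tob -rc /etc/tob/tob.rc.afioz.TEST -full all
  515  p1 D    2:50 afio -Zvo /tmp/tob.out'

printf '%s\n' "$sample" | list_d_state
```

On a live system the same filter could be fed from ps directly, e.g.
ps -ax | list_d_state, provided the columns line up as assumed.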

I'm sorry it's taken so long to relay my story.  I'd very much appreciate 
any advice on how to get rid of all the sick processes now on this 
system (see process table below), and how I might safely back up some
files and reboot.  Meantime, as I mentioned above, I think this is all 
a sign of some peculiar memory problems.  I suspect tob is not the
source of the problem.

Regards,
Susan Kleinmann
