Re: a memory + SCSI error??

1996-08-08 Thread Susan G. Kleinmann
I hesitate to burden the list with an email that's more reminiscent
of Days of our Lives than a bug report, but having suggested a couple of
days ago that there might be problems with Linux's handling of memory or 
SCSI interfaces, I thought it might be constructive to report that the
problems I encountered had nothing to do with Linux, Debian, or any of the
utilities I was using, and everything to do with cleaning.  I also thought
it might be useful to summarize 'lessons learned'.

In order to clean the inside of the box that had the 9 GB drive on it,
the outer casing of the box had to be slid off.  The box hadn't been
constructed all that perfectly, so when the lid was finally removed, 
the drive heads got a real physical jolt.  Basically, I was over-optimistic
about the ruggedness of the Micropolis drive (which is now 2 years old).  
I suspect that the heads were misaligned when the box casing was removed, 
and that caused the errors when I rebooted.

(The answer to questions about what I used for cleaning is: a nearly dry 
damp cloth.  I gave the box lots of opportunity to dry after I cleaned it,
so I don't think the two drops of water in the cloth were a problem.
In fact I suspect that cleaning without a little water is sometimes more 
of a problem, due to static.)

Lessons learned:
-- Running fsck once on a broken partition isn't enough.  It must be run
more than once, especially if fsck ever asks:  ... Ignore<y>?

-- Use 'reboot -n' after running fsck (suggestion from Sherwood Botsford).   

-- (of course) back up early and often.  Reading and then applying
the documentation on the 'tob' backup script is not only a lot less painful 
than going through the trauma that I went through for the last two days; 
it is actually pleasurable reading, a real rarity!  
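
The first lesson above (repeat fsck until a pass comes back clean) can be
sketched as a small shell helper.  This is only an illustration under
stated assumptions: run_until_clean is a hypothetical name, the limit of
5 passes is arbitrary, and the example device is simply the one from this
thread.  It relies on the documented fsck convention that exit status 0
means no errors were found and that values of 8 and above indicate
operational or usage errors.

```shell
#!/bin/sh
# Sketch: repeat a filesystem check until one pass reports no errors.
# Usage: run_until_clean MAX_PASSES COMMAND [ARGS...]
# run_until_clean is a hypothetical helper, not part of fsck or e2fsprogs.
run_until_clean() {
    max=$1
    shift
    pass=1
    while [ "$pass" -le "$max" ]; do
        "$@"
        status=$?
        echo "pass $pass: exit status $status"
        # Exit status 0 means the checker found no errors on this pass.
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        # 8 and above signal operational or usage errors; retrying will not help.
        if [ "$status" -ge 8 ]; then
            return "$status"
        fi
        pass=$((pass + 1))
    done
    # Still reporting errors after MAX_PASSES attempts.
    return 1
}

# Example, assuming /dev/sdc1 is unmounted and e2fsck is installed:
#   run_until_clean 5 e2fsck -f /dev/sdc1
```

If the helper returns non-zero, the partition is probably still damaged
and should not be mounted read-write.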


Thanks to all for sympathy and suggestions while I was in the pits.
Susan Kleinmann



Re: a memory + SCSI error??

1996-08-06 Thread Mike Taylor
I recently had a memory problem that I solved by reseating the SIMMs.
I didn't remove them.  I just wiggled them.  If you try it, be sure to
protect against static discharge.

Mike



Re: a memory + SCSI error??

1996-08-06 Thread Rob Browning

Sounds to me like you might have unseated one of the cards, most
likely SCSI, or loosened a SCSI cable.  You might want to check that.

Also make sure all your termination switches (if any) are right.

--
Rob



a memory + SCSI error??

1996-08-05 Thread Susan G. Kleinmann
My system has become increasingly flaky in the last 36 hours, and I
need whatever advice anyone might have to offer quickly!

I wouldn't bring this to the list, except that I believe that whatever 
problem I'm having may well affect others.

History:
On Saturday I got carried away with housecleaning:  I started with the
study, but then decided to clean the computers, and then I decided to
clean the _insides_ of the computers.  That now looks like a big mistake.

(The system has an Intel Plato motherboard with an NCR SCSI controller,
two 1 GByte SCSI disks, 1 SCSI CD-ROM, and (externally) a Micropolis 
9 GByte disk drive and an Exabyte tape drive.  This particular Micropolis
has been well behaved, but I had another Micropolis 9 GByte which died
quickly, so now I think of them as delicate disk drives.  The system 
has 48 MBytes of RAM, though it is rare that I ever see more than 
2 MBytes of RAM free when I run vmstat -- an issue I'd been meaning to 
investigate for months.  This seems odd to me since I am almost
always the only user.)

It's been a while since I've backed up the system.

After I cleaned it, the system seemed to boot up normally.
But the next morning, I saw a message:
   KERNEL PANIC, and something about a SCSI I/O error.  
I tried to log out so I could reboot normally,
but couldn't do that.  So I shut down the system, and when I tried to
reboot, I was of course greeted by a message that said I'd have to run
fsck manually.  When I did that, I was asked to agree to a bunch of
actions which I didn't understand, but agreed to anyway.  Then the report
came up that a file system (/dev/sdc1) had been changed.

I was unable to boot after the fsck process.  Using emergency boot disks
(made with Bruce's boot-floppies package), I was able to boot enough
to remove /dev/sdc1 from /etc/fstab, and then run mke2fs on /dev/sdc1, which
allowed me to boot normally again.

Having come this close to a crisis, I decided to back up the system. (!)
This has resulted in more and worse failures.
1.  I used tob, with the command,
   bash /sbin/tob -rc /etc/tob/tob.rc.afioz -full all 

I left the room, came back after a couple of hours, and saw that the
process seemed to have made some progress, but then just stopped.
The afio process was marked 'D' in the ps -ax output.

Again, I tried to shut down, couldn't, had to turn off the machine,
went through another fsck exercise, and finally rebooted.

2.  I tried tob again with the command:
   bash /sbin/tob -rc /etc/tob/tob.rc.afioz.TEST -full all 

The process seemed to start up normally, so I let it go for a few hours.

Same result.  I took notes this time:
  PID TTY STAT  TIME COMMAND
  375  p1 S     0:00 -bash
  484  p1 S     0:00 bash /sbin/tob -rc /etc/tob/tob.rc.afioz.TEST -full all
  515  p1 D     2:50 afio -Zvo /tmp/tob.out
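
A 'D' in the STAT column marks a process in uninterruptible sleep,
usually waiting on I/O, which is why such a process generally cannot be
killed until the I/O completes.  As a rough sketch, processes in that
state can be picked out of ps output with awk.  list_blocked is a
hypothetical helper name, and the sketch assumes the five-column layout
shown above (PID TTY STAT TIME COMMAND).

```shell
#!/bin/sh
# Sketch: read `ps` output on stdin and print the PID and command name of
# every process whose STAT field starts with D (uninterruptible sleep).
# Assumes the column order PID TTY STAT TIME COMMAND, with a header line.
list_blocked() {
    awk 'NR > 1 && $3 ~ /^D/ { print $1, $5 }'
}

# Example:
#   ps ax | list_blocked
```

A process that stays in this list for minutes at a time is probably
waiting on a device that is not responding.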

I also noted that the root file system (on which /tmp was mounted)
was full.  So I removed some files that I thought might alleviate
the problem, but I was never able to reawaken tob.
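
Since the archive here was being written to /tmp/tob.out on the root
file system, one way to catch this earlier is to check the available
space before starting the backup.  The following is a minimal sketch,
not anything tob itself does: avail_kb and has_room are hypothetical
helpers, and the 500000 KB threshold in the example is an arbitrary
figure.  It assumes a df that supports the POSIX -P and -k options.

```shell
#!/bin/sh
# Sketch: read `df -Pk` output on stdin and print the Available column
# (in 1024-byte blocks) of the first file system listed.
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# has_room NEED_KB DIR
# Succeeds if the file system holding DIR has at least NEED_KB available.
has_room() {
    need=$1
    have=$(df -Pk "$2" | avail_kb)
    [ "$have" -ge "$need" ]
}

# Example, with an arbitrary threshold:
#   has_room 500000 /tmp || echo "not enough space in /tmp for the archive"
```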

3.  In desperation, I tried plain old tar, but now that seems to have 
done exactly the same thing:  i.e., it started out well, the disks
and tape whirred a lot, but now the process seems to have hung.
Here's the ps -ax output:

  PID TTY STAT  TIME COMMAND
22671  p2 D     0:24 tar -cvf /dev/st0 /home


Now one of my xterms seems to have some very weird settings -- the
characters have all changed from a-z to some graphics set, and 
'stty sane' doesn't help.  

 The root file system is not full, vmstat reports about 1.8 MBytes
 free (which is maddening, but typical).

 I do not believe that I can reboot the system now without serious
 loss of data.

I'm sorry it's taken so long to relay my story.  I'd very much appreciate 
any advice on how to get rid of all the sick processes now on this 
system (see process table below), and how I might safely back up some
files and reboot.  Meantime, as I mentioned above, I think this is all 
a sign of some peculiar memory problems.  I suspect tob is not the
source of the problem.

Regards,
Susan Kleinmann