Some complex issues - need a few Gurus

Ted Hilts Tue, 27 Mar 2001 12:33:08 -0800
Context: I have a specially assembled database server with a 10 gig
system drive (boot section and linux OS), a 15 gig drive for temporary
determinations, and an IDE-RAID configuration of 4 x 20 gig drives, all
software based from the kernel.  It has a heavy duty power supply.  The
system also has an HP CD-writer, a DVD-ROM, a tape drive and a floppy.

What happened:  I had the computer running for about 3 days (3x24 hrs)
when it suddenly crashed.  The error was a kernel panic and the inability
to access some list. This happened again when booting up, right in
the middle of fsck. I tried a floppy based restore (kernel and system
files in RAM with no hard disk utilized, or at least only at my
discretion) and the system again failed.  So I swapped the two 256 meg
memory cards and tried again with the regular linux boot (thus bringing
up the full linux system), as I could no longer get a failure with the
rescue system.  Crashed again, same error!!! No big deal, that was just
the first thing I looked at.  So I went back to the rescue system and
tried to create a heavy load condition but could not get it to crash.
So back to full system mode (but with the problem of a contaminated disk
which I finally sorted out, as the system tried and gave up; I won't get
into that). I put the system in maintenance mode and left it running for
a very long time with no failure.  This was after the software repair
episode with the 4 disk raid array.

By this time the tower is open on both sides and I am checking for free
air space for ventilation (to ensure the ambient temperature is not an
issue).  It looked (with all the cables in the way) as if the CPU fan
was working intermittently or was recycling a trapped air pocket; I
could not determine which, but every time I checked the CPU heatsink fan
and the heatsink fins, things seemed okay from that perspective.
Nevertheless, I took a powerful fan and placed it near this mess and
pushed the air through everything to ensure that ambient temperature was
not crawling up in an air pocket and affecting the CPU, on board chips,
etc.

After all of this, while running full linux I got a SEGMENT error and
the system crashed.

Now the only thing so far I have not mentioned is that the crashes NOW
only occur when running something like "tar -czvf
for-some-big-file.tar.gz on-some-big-directory".  The system seems
stable if I break up this kind of activity into smaller portions that
do not run so long.  Also, I noticed that when I do a "tar -ztf
on-some-new-portion-of-the-big-directory.tar.gz" it runs without any
errors.  But later when I try it again I get CRC checksum compression
errors.  NOW ALL THESE OPERATIONS ARE ON THE RAID ARRAY.  Of course I am
gun shy, as I never want to go through repairing a raid array again when
the system says it cannot resolve the damage issues due to a non
graceful shutdown.  So I have scheduled this thing into a computer
repair outlet where they assure me that they have floppy based software
diagnostic stuff that will interrogate the parts of the system to
determine where the failure is most likely occurring.
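
The pattern described above (a clean "tar -ztf" listing at first, CRC
errors on a later pass over the same archive) can be probed by reading
one file several times and comparing checksums.  What follows is only a
rough sketch, not a definitive diagnostic; the function name
count_distinct_sums and the example path are made up for illustration.
If every pass gives the same checksum, the data is at least being read
consistently.  If the checksum changes between passes, the corruption is
happening in transit (memory, cabling, controller) rather than sitting
at a fixed spot on the disk.  Note that for a file smaller than RAM the
later passes may be served from the page cache instead of the disk.

```shell
#!/bin/sh
# Hypothetical helper: read FILE several times and print how many
# distinct MD5 checksums were seen across the passes.
#   1           -> every read returned identical data
#   more than 1 -> the data changed between reads
count_distinct_sums() {
    file=$1
    passes=${2:-5}          # number of read passes, default 5
    i=0
    while [ "$i" -lt "$passes" ]; do
        # Read via stdin so only the hash varies, never the file name.
        md5sum < "$file" | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

Example use, with a made-up path: "count_distinct_sums
/raid/some-big-file.tar.gz 5".  Separately, "gzip -t" on the same
archive checks the stored CRC without going through tar.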

But through all of this I feel that I have failed.  I have gone from
suspecting physical memory, to the CPU, to fans that are not doing the
job (one definitely is in trouble, thus possibly leading to fan load on
the power supply causing a spike and affecting CPU board operation), to
ground conditions because it is part of a network (but that's not the
problem).  That leaves two possibilities in my mind.  First, the CPU is
not what I ordered and is overclocked, causing it to miss a beat
somewhere and get confused (or it has degraded to this because of
initial heat problems).  Second, there is a disk surface error that over
time causes a resident file on the raid array to degrade (due to surface
bit change), resulting in a CRC compression error when I do
"tar -ztf ...".  So now, based on this horrid experience, I have a few
questions.

Will a CPU running linux automatically recover from a "panic"?  If so,
how long does it take and what is the outcome?  I just assumed that no
system response means a dead machine, especially when no disk lights
were flickering in the usual manner.

For linux, how does one do a disk surface check?  I understand that if
there are sections of damage on a disk surface it is not unusual to set
up the drive to recognize and bypass those areas.  Is there a way I can
check this???
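
For reference, one common tool for this kind of check is badblocks from
e2fsprogs, which by default does a read-only scan of a device.  The
wrapper below is a minimal sketch under assumptions: the function name
surface_check is made up, and the device name in the usage note is
hypothetical and would have to be replaced with the real member disk.
It refuses to run on anything that is not a block device.

```shell
#!/bin/sh
# Hypothetical wrapper around a read-only badblocks scan.
# Returns 2 without scanning if the argument is not a block device.
surface_check() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 2
    fi
    # -s shows progress, -v is verbose; with no -n or -w the scan is
    # read-only and does not modify the disk.
    badblocks -sv "$dev"
}
```

Example use as root, with a hypothetical device name: "surface_check
/dev/hde".  On an ext2 filesystem, "e2fsck -c" runs the same read-only
scan and records any bad blocks it finds so the filesystem avoids them.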

Third question.  There are numerous mail error messages that indicate
the CPU encountered a 1 point increase over the 5 point maximum CPU load
figure. This is a new one to me (for linux).  Can this not be controlled
like on other systems where offending processes are limited in the
amount of CPU resources utilized?   What will a CPU overload of this
sort do?  I thought it would just slow down the machine?????
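
The load figure in those messages is presumably the load average that
the kernel exposes in /proc/loadavg, roughly the average number of
processes running or waiting to run (including those waiting on disk
I/O).  That is an assumption, since the messages themselves are not
quoted here.  A small sketch for checking the 1-minute figure against a
threshold follows; the function name load_exceeds is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the 1-minute load
# average is above THRESHOLD.  The second argument lets another file
# stand in for /proc/loadavg.
load_exceeds() {
    threshold=$1
    loadavg_file=${2:-/proc/loadavg}
    # File format: 1-min 5-min 15-min running/total last-pid
    read -r one five fifteen rest < "$loadavg_file"
    # awk does the floating point comparison that plain sh cannot.
    awk -v l="$one" -v t="$threshold" 'BEGIN { exit !(l > t) }'
}
```

Example use: "load_exceeds 5 && echo load is above 5".  A high value by
itself measures queueing, not a fault.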

My fourth question is kind of dumb but let me try anyway.  What would
any of you suppose would cause a fatal condition, which occurred
frequently for 2 days, to diminish in frequency?  I can now boot up
without a problem, run without a problem, etc., as long as I don't ask
the CPU to do the really heavy task I explained above.  And the CRC
errors I mentioned don't cause a crash; maybe they just indicate some
other serious matter.

Also, I think the problem has something to do with the RAID array, since
the array runs the UDMA 66 protocol and does not appear to have special
cables.  But I don't know how to tell the cables apart (the special 66
type versus regular IDE to disk drive ribbon cables).  I think maybe the
software IDE-RAID operation is driving things at the 66 rate and maybe
there is an impedance mismatch because of the cables; they look like
normal IDE cables.  BTW, there are two PROMISE IDE controllers in
addition to what is on the CPU board, which is an EPoX Mainboard.  If
there is a cable impedance problem and the drives don't sense this (they
are supposed to sense this and drop down to the 33 rate) then that might
explain the problem when a lot of very fast data exchanges are occurring
on those drives.  But if this were the case then one would have expected
the problem to have shown up when using those routines before the
problem occurred.  In other words, I used to do these operations before
without any problem.  Anybody got any ideas?  I'm worried that the
techie that gets this problem is going to do a replace and try it again
process, or not find any problem, and I pay a fat fee, take it home, go
to do the heavy duty routines, and away it crashes again.  See my
problem?????  If I was just a bit smarter.


Any ideas would be welcome.  I've got about 24 hours before I send it
in.

Bye-thanks _TED



_______________________________________________
Redhat-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/redhat-list