I can't thank you all enough for bearing with me as I stumble my way through this.

I now understand the logic behind running memtest uninterrupted for a long period (>24hr) and will do that.

I have to take back my comment about kmod-nvidia. I repeatedly messed up /etc/selinux/config while trying to disable SELinux, and that is what was causing the kernel panics. I suppose that's a sign I'm not paying enough attention.

The purpose of running from the LiveCD is not necessarily to find a hardware problem but to remove the hard disks and the installed software from the equation. The idea is that IF I got one of these rare and random failures while running that way, I could rule out insidious package conflicts, mangled configurations and the system disk as the cause.

As far as finding a computer repair professional whom I would go to for a problem like this, well all I can say is I've been living in this town for 32 years working in computing, I do have an outstanding doctor, a great car mechanic, an exceptional plumber... but I haven't found a computer guy better than me at this. That is not to imply that I am any good at it.

I am now up and running with SL6.4 on a spinning disk (to remove the SSD and a bunch of useful and needed packages from the equation). I'll try to get some work done today and see if it crashes.

My next step is to swap memory and GPU with another box and see if the problem follows.

I hope I'm not posting too much useless (to others) information to the list.

Joe

On 04/24/2013 09:10 AM, Yasha Karant wrote:
A small comment: stress testing is cumulative only if the underlying system has no recovery mechanism. (An understanding of this in detail requires non-equilibrium statistical mechanics but can be summarized with non-equilibrium "thermodynamics"). My experience with failing electronics and magnetics -- depending upon the exact failure mode -- is that non-interrupted stress testing is better than interrupted in terms of finding failures. A simple example: suppose a failure mode is temperature dependent, and temperature depends upon the amount of work being done. An interrupted but cumulative stress test might never reach the "critical" temperature, whereas a continued stress test might.
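As a toy illustration of the temperature example above, here is a minimal Python sketch. It assumes a simple first-order heating model, and every number in it (ambient temperature, temperature rise under load, thermal time constant, critical temperature, the 30-minute on/off pattern) is invented for illustration and is not a measurement of any real machine. The function name peak_temperature is likewise hypothetical.

```python
import math

def peak_temperature(schedule, ambient=25.0, load_rise=40.0, tau=60.0, dt=1.0):
    """Peak temperature (deg C) over a list of (minutes, under_load) phases.

    Toy first-order model: the temperature relaxes exponentially, with
    time constant tau (minutes), toward ambient + load_rise while under
    load and toward ambient while idle. All defaults are made-up values.
    """
    temp = ambient
    peak = temp
    blend = 1.0 - math.exp(-dt / tau)  # fraction of the gap closed per step
    for minutes, under_load in schedule:
        target = ambient + load_rise if under_load else ambient
        for _ in range(int(minutes / dt)):
            temp += (target - temp) * blend
            peak = max(peak, temp)
    return peak

CRITICAL = 60.0  # hypothetical temperature at which the fault shows up

continuous = [(240, True)]                   # 240 minutes of load in one run
interrupted = [(30, True), (30, False)] * 8  # the same 240 minutes, with breaks

for name, schedule in (("continuous", continuous), ("interrupted", interrupted)):
    peak = peak_temperature(schedule)
    verdict = "reaches" if peak >= CRITICAL else "never reaches"
    print(f"{name}: peak {peak:.1f} C, {verdict} the critical temperature")
```

Both schedules apply the same total time under load, but in this model only the continuous run gets hot enough to trigger the hypothetical fault. How closely a real system behaves like this depends on its actual thermal time constant and on how long each interrupted session lasts.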

Yasha Karant

On 04/24/2013 08:03 AM, Joseph Areeda wrote:
Thanks for the tips Konstantin,

I assume that your recommendation for 24 hrs of memtest is cumulative
and I can probably see the same results starting it each night when I
quit for the day.

When I mentioned SMART I was talking about the self tests not the status
that comes up.  I've also copied large files around and checked their
md5sums.
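A copy-and-compare check of this kind can be sketched in Python roughly as follows. This is only an illustration using the standard library; the helper names md5_of and copy_and_verify are made up for the example and are not the commands that were actually run.

```python
import hashlib
import shutil

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(src, dst):
    """Copy src to dst and report whether the two checksums match."""
    shutil.copyfile(src, dst)
    return md5_of(src) == md5_of(dst)
```

One caveat: right after a copy, the read-back may be served from the kernel's page cache rather than from the disk itself, so a matching checksum is weaker evidence about the drive than it looks unless the file is much larger than RAM or the caches are dropped first.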

I played with LiveCD for 4 or 5 hours today, much of it was trying to
install it on a different spinning hard drive.

I did see one time when the SSD was shown in the disk utility but all
the partitions were zero length.  That's where my root directory used to be.

I also found that the nvidia drivers in ELREPO don't seem to work with
6.4.  I seem to be able to run fine (at least for a while) unless I
install kmod-nvidia, then I get a kernel panic on the next reboot (3
times until I tracked it down).  It says something like "not syncing
attempt xxx(can't read my writing) PID 1 comm init not tainted
2.6.32.258.2.1".  That's another problem I think.

Right now I suspect not necessarily in order:

  * Bad SSD.  Run time is reported as 1.8 years.  I did have /usr
    /usr/local /tmp swap and /home on spinning media but...
  * Bad memory:  still a good possibility
  * Some insidious incompatibility with all packages from multiple
    repos.  I really hope it's not that, I don't load much I don't need.

And as for finding a real computer repairman, let me know if you have
one in Los Angeles.  This is similar to a problem I had with an iMac.
It took three trips to convince the geniuses at the store that something
was wrong, and that was after about an hour each time with the phone
support people.  That one turned out to be a flaky memory DIMM that
passed all the quick diagnostics.

Oh well, the saga continues.  It's nice to have a group to go to for ideas.
Thank you all.

Joe


On 04/23/2013 04:20 PM, Konstantin Olchanski wrote:
On Tue, Apr 23, 2013 at 11:44:22AM -0700, Joseph Areeda wrote:
I'm having this strange behavior that I think is a hardware problem ...
* System freezes, mouse and keyboard dead, sshd unresponsive sometimes

First action is to run memtest86 (Q: which one? google finds several. A: all of them).

Run memtest86 for 24 hours at least - if it reports memory errors, hangs, freezes or machine turns off, you definitely have a hardware problem. Suspect parts are in this order: RAM, power supply, CPU socket (bent pins), mobo, CPU.

If memtest86 runs fine for 24 hours and more, there *still* could be a hardware
problem. (memtest86 does not test the video, the disk, the network
and the usb interfaces).

disk utility show ... SMART [is] fine.

SMART "health report" is useless. I had dead disks report "SMART OK" and perfectly functional disks report "SMART Failure, replace your disk now".

This is free advice. For advice that would actually get your computer
working again, you would want to hire a proper computer repairman.

