Re: [gentoo-user] spontaneous reboots.. what to look for
On Sun, 15 Feb 2009 17:42:44 -0600, Harry Putnam wrote:
> 2) checking how hot the cpu is getting (Doesn't appear to be a
> problem) But now running a cron job recording temperatures every 10
> minutes. So that may turn up something.

You could also check disk temperatures with app-admin/hddtemp. I've had random crashes due to an overheating drive before. I'd also run smartctl (emerge smartmontools) on the drive, just to be sure.

memtest is a must, since bad RAM can easily cause crashes. And take Volker's advice on PSUs.

--
Neil Bothwick

What if there were no hypothetical situations?
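Harry's 10-minute cron job and Neil's hddtemp suggestion can feed one log, so CPU and disk temperatures can later be lined up against reboot times. A minimal sketch, assuming lm_sensors (`sensors`) and hddtemp are installed and the script runs from root's crontab (hddtemp usually needs root). The script name, log path and disk names are hypothetical:

```shell
#!/bin/sh
# log-temps.sh (hypothetical name): append a timestamped snapshot of
# lm_sensors output plus an hddtemp reading for each disk given.
#
# Example crontab entry (every 10 minutes; paths are assumptions):
#   */10 * * * * /usr/local/bin/log-temps.sh /var/log/temps.log /dev/sda

# log_temps LOGFILE [DISK...]
log_temps() {
    logfile=$1
    shift
    {
        date '+%Y-%m-%d %H:%M:%S'
        sensors                  # CPU and motherboard temperatures
        for disk in "$@"; do
            hddtemp "$disk"      # one line per disk, e.g. "/dev/sda: <model>: 38°C"
        done
        echo                     # blank line between snapshots
    } >> "$logfile" 2>&1        # keep error messages in the log too
}

# Only log when invoked with arguments (e.g. from cron).
if [ $# -gt 0 ]; then
    # cron's default PATH may lack the sbin directories these tools can live in
    PATH=/usr/sbin:/sbin:$PATH
    log_temps "$@"
fi
```

For the SMART side, `smartctl -H /dev/sda` prints the drive's overall health assessment, and `smartctl -a /dev/sda` prints everything, including the drive's own error log.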
Re: [gentoo-user] spontaneous reboots.. what to look for
Mark Knecht wrote:
> On Sun, Feb 15, 2009 at 3:42 PM, Harry Putnam wrote:
>> I've been experiencing spontaneous reboots on one gentoo machine
>> lately. [...]
>
> Reseat memory and PCI cards, etc. [...] Run memtest86 for a few days if
> you can spare the machine. Run spinrite, etc., to look for drive
> problems. [...]

To add another test: I had this issue once before, and it turned out to be a faulty driver for my hard drives. I ran a command like this to test mine:

hdparm -Tt /dev/hda && hdparm -Tt /dev/hda && hdparm -Tt /dev/hda && hdparm -Tt /dev/hda && hdparm -Tt /dev/hda

If the drives pass that, they should be all right and you can look elsewhere. Mine would only fail when the drives were very busy, and that test should keep them busy pretty well.

Hope that helps.

Dale

:-) :-)
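Dale's `&&` chain runs the same disk benchmark five times and stops at the first failure. A minimal sketch of the same idea as a reusable loop; the function name `repeat_test` is hypothetical, and it simply repeats whatever command it is given:

```shell
# repeat_test COUNT COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND up to COUNT times in a row and stops at the first failure,
# like Dale's chain of "hdparm -Tt /dev/hda && hdparm -Tt /dev/hda && ...".
# Returns 0 if every run succeeded, else the failing run's exit status.
repeat_test() {
    count=$1
    shift
    run=1
    while [ "$run" -le "$count" ]; do
        echo "run $run of $count: $*"
        "$@" || {
            status=$?
            echo "run $run failed (exit status $status)" >&2
            return "$status"
        }
        run=$((run + 1))
    done
    return 0
}

# Usage (as root; adjust the device name for your system):
#   repeat_test 5 hdparm -Tt /dev/hda
```

The `&&` form already stops at the first failing run; the loop only makes the repeat count a parameter and reports which run failed.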
Re: [gentoo-user] spontaneous reboots.. what to look for
On Feb 15, 2009, at 7:16 PM, Volker Armin Hemmann wrote:
> So the problem started recently. That means it is either: a cap going
> bad. oxidized contacts. dust clogging the fans. PSU is going bad.
> something obscure. [...]

I had a similar issue even when not running X. To be honest, I can't say I have a concrete idea of exactly what caused it. I simply went security-nuts and began wondering whether someone was just toying with me, so I hardened my sshd config and installed denyhosts to monitor failed logins. That was a month ago, and my uptime has been perfect since, with no restarts.
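denyhosts automates watching sshd's failed logins and blocks repeat offenders through /etc/hosts.deny. For a quick manual look at the same data, a rough sketch like this counts failed password attempts per source address. The function name is hypothetical, and both the log location and the message format (`Failed password for ... from ADDR port ...`) are assumptions to check against your own logs; IPv6 addresses are not handled:

```shell
# failed_ssh_by_ip LOGFILE   (hypothetical helper)
# Counts sshd "Failed password" lines per source IPv4 address,
# most frequent first. On many Gentoo setups sshd logs to
# /var/log/messages; adjust the path for your syslog configuration.
failed_ssh_by_ip() {
    grep 'sshd.*Failed password' "$1" |
        sed -n 's/.* from \([0-9.]*\) port .*/\1/p' |   # keep only the address
        sort | uniq -c | sort -rn                       # count, busiest first
}

# Usage:
#   failed_ssh_by_ip /var/log/messages
```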
Re: [gentoo-user] spontaneous reboots.. what to look for
So the problem started recently. That means it is one of:

- a cap going bad
- oxidized contacts
- dust clogging the fans
- the PSU going bad
- something obscure

Do the easy thing first. Clean your case, reseat all cards and memory modules, and check all caps while doing so. Are any of them deformed? Is the 'head' bulging up? Strange stuff around their feet? Congratulations, you need new hardware.

If you don't find a bad cap and the problem persists, get a new PSU. A good one. Not a big one (most PSUs are oversized), but good quality. AnandTech has PSU reviews, and so does Tom's Hardware (most of their tests are rubbish, but their PSU tests are OK).

If the problem goes away, congratulations! If not, well, then report back ;)
Re: [gentoo-user] spontaneous reboots.. what to look for
On Sun, Feb 15, 2009 at 3:42 PM, Harry Putnam wrote:
> I've been experiencing spontaneous reboots on one gentoo machine
> lately. Looking thru /var/log/messages... I see the restarts but
> looking above that... I'm not seeing anything I recognize as being a
> culprit.
> [...]
> 3) checking for overfilled disks. (none show in df -h)

Reseat memory and PCI cards, etc. Consider temporarily removing any hardware that isn't absolutely necessary while you debug the problem (e.g. a second video card, extra disk drives, extra network adapters). Run memtest86 for a few days if you can spare the machine. Run SpinRite or a similar tool to look for drive problems. Open the box up and set a fan blowing extra air into it for additional cooling.

Good luck,
Mark
[gentoo-user] spontaneous reboots.. what to look for
I've been experiencing spontaneous reboots on one gentoo machine lately. Looking through /var/log/messages, I see the restarts, but in the entries just above them I'm not seeing anything I recognize as a culprit.

It's been happening for a few weeks, but I've been busy and am only now digging into it (the machine is not any kind of server).

It appears to happen only in X (I'm using xfce4), and I've only noticed it since I started running 2.6.28 kernels, although I couldn't say that it seems directly related. I mean, I didn't boot into 2.6.28 and suddenly notice spontaneous rebooting.

It does not appear to be heat related, but I am only now using lm_sensors to keep an accurate record and see whether there is a relationship.

I've had two today, so either it's happening more often or I'm just spending more time on that machine. Today may also be only the first or second time it has happened while I was actually right at the keyboard.

I'm sorry to be so vague about it, but in truth I've been pretty lazy about it, since no real harm comes of an unexpected reboot on that machine (so far, anyway). But it's clearly something that has to be figured out.

The only things I've checked so far:

1) Browsing through /var/log/messages. I'm having trouble recognizing anything that looks suspicious.

   I have noticed what appears to be a time/date anomaly where the progression of time is suddenly irregular. That is, an earlier time shows up amongst some later times.

   It appears to have been me running visudo through sudo, and apparently having /etc/sudoers open long enough that the entry for closing it ended up earlier than other events taking place.

   Again, I'm not really sure exactly what happened there, but it does not appear to coincide with a reboot anyway.

2) Checking how hot the CPU is getting (doesn't appear to be a problem). But I'm now running a cron job that records temperatures every 10 minutes, so that may turn up something.

3) Checking for overfilled disks (none show in df -h).
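For the log and disk checks described above, a few small helpers may save some scrolling. This is a sketch under stated assumptions: log lines use the traditional syslog `Mon DD HH:MM:SS host prog: message` layout, a boot shows up as a `kernel: Linux version` line, and disk usage comes from POSIX `df -P` output. All function names are hypothetical:

```shell
# reboot_context LOGFILE [N]: print the N lines (default 20) logged just
# before each boot, followed by the boot line itself.
reboot_context() {
    awk -v n="${2:-20}" '
        /kernel: Linux version/ {
            print "==== boot at line " NR " ===="
            start = (NR > n) ? NR - n : 1
            for (i = start; i < NR; i++) print buf[i % n]
            print
        }
        { buf[NR % n] = $0 }    # ring buffer of the last n lines
    ' "$1"
}

# check_time_order LOGFILE: report lines whose timestamp is earlier than
# the line before them (like the visudo entry). Syslog lines carry no
# year, so a Dec -> Jan rollover will also be reported.
check_time_order() {
    awk '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (i = 1; i <= 12; i++) mon[names[i]] = i
        }
        ($1 in mon) {
            # fixed-width "MMDD HH:MM:SS" key, so string order is time order
            key = sprintf("%02d%02d %s", mon[$1], $2, $3)
            if (prev != "" && key < prev)
                print "line " NR " goes back in time: " $0
            prev = key
        }
    ' "$1"
}

# fs_over_limit PERCENT: read "df -P" output on stdin and print the mount
# point and usage of every filesystem at or above PERCENT full.
# (Mount points containing spaces are not handled.)
fs_over_limit() {
    awk -v lim="$1" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)
            if (use + 0 >= lim + 0) print $6 " " $5
        }
    '
}

# Usage:
#   reboot_context /var/log/messages 30
#   check_time_order /var/log/messages
#   df -P | fs_over_limit 90
```

The lines just before each boot are where a kernel oops or hardware error would appear, but only if it reached the disk before the reset; an instant hardware reset often leaves no trace at all.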