Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Brian Lavender
Check the system event logs in the motherboard bios. Sometimes listed
under SEL. Otherwise, I would stress test the machine. I used to run ctcs
to burn in systems for a cluster I worked on for LLNL. It does memory,
io, and cpu stress tests.

http://sourceforge.net/projects/va-ctcs/

You could also try lm-sensors to monitor the hardware.

http://ubuntuforums.org/showthread.php?t=2780

brian

On Wed, Dec 08, 2010 at 07:21:14AM -0800, Cam Ellison wrote:
 Recently I upgraded to this MB, with an AMD Phenom II.  The latest 
 Kubuntu (10.10) is loaded onto it.  Twice it has halted suddenly: no 
 activity, no output, and has required a reboot.  This is not an external 
 power problem: its power comes from an APC 3000.  There is nothing in 
 the logs: everything runs normally and then suddenly nothing does.
 
 In both instances, the stoppage has occurred in the middle of a set of 
 cron.daily jobs (in the middle of the night), so I am exploring that 
 avenue.  The problem is that the machine has been in place for 7 weeks, 
 and the two stoppages are about 5 weeks apart - there's not much to go on.
 
 I'm looking for any ideas about how to track this down: is there a 
 utility that might give me more insight?  More to the point, does anyone 
 in the group have this combination and a comparable experience?
 
 TIA
 
 Cam Ellison
 
 
 ___
 vox-tech mailing list
 vox-tech@lists.lugod.org
 http://lists.lugod.org/mailman/listinfo/vox-tech

-- 
Brian Lavender
http://www.brie.com/brian/

Program testing can be used to show the presence of bugs, but never to
show their absence!

Professor Edsger Dijkstra
1972 Turing award recipient
___
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Brian Lavender
You might also want to try sar. Here is an interesting article.

http://www.linux.com/archive/feature/114224
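
For anyone following along, here is a minimal sketch of how sar's history could be queried for the window around a hang. It is only an illustration, not something taken from the article. It assumes the sysstat collector is enabled and writes daily files named saDD under /var/log/sysstat (the usual Debian/Ubuntu location; other distributions tend to use /var/log/sa). The function name, the day of month and the time of day are hypothetical. The sar options used (-u, -f, -s, -e) are standard sysstat ones.

```python
from datetime import datetime, timedelta


def sar_command(day_of_month, around, minutes=30, report="-u",
                log_dir="/var/log/sysstat"):
    """Build a sar argument list covering a window around a suspected hang.

    day_of_month: which daily data file to read (saDD).
    around:       suspected time of the hang, as "HH:MM:SS".
    report:       "-u" for CPU, "-r" for memory, "-q" for load averages.
    """
    centre = datetime.strptime(around, "%H:%M:%S")
    midnight = centre.replace(hour=0, minute=0, second=0)
    last_second = centre.replace(hour=23, minute=59, second=59)
    # Clamp the window to the same calendar day, since each file covers one day.
    start = max(centre - timedelta(minutes=minutes), midnight)
    end = min(centre + timedelta(minutes=minutes), last_second)
    data_file = f"{log_dir}/sa{day_of_month:02d}"
    return ["sar", report, "-f", data_file,
            "-s", start.strftime("%H:%M:%S"),
            "-e", end.strftime("%H:%M:%S")]
```

For example, sar_command(7, "03:10:00") gives the argument list for "sar -u -f /var/log/sysstat/sa07 -s 02:40:00 -e 03:40:00", which can be handed to subprocess.run(). Passing report="-r" or report="-q" asks for memory or load-average figures instead.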

brian




Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Rick Moen
Quoting Brian Lavender (br...@brie.com):

 Check the system event logs in the motherboard bios. Sometimes listed
 under SEL. Otherwise, I would stress test the machine. I used to run ctcs
 to burn in systems for a cluster I worked on for LLNL. It does memory,
 io, and cpu stress tests.
 
 http://sourceforge.net/projects/va-ctcs/

No longer maintained.  See the successor fork:
http://sourceforge.net/projects/ctcs2/



Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Cam Ellison
On 10-12-08 10:40 AM, Brian Lavender wrote:
 You might also want to try sar. Here is an interesting article.

 http://www.linux.com/archive/feature/114224
I am not familiar with it, so I've downloaded it and started it, and 
will get ksar as well (so I can get a better handle on the output), and 
have a go.  On *Ubuntu it's part of a package called sysstat.  It does 
look quite interesting, not to mention comprehensive - the man page goes 
on forever.
 brian

 On Wed, Dec 08, 2010 at 10:00:32AM -0800, Brian Lavender wrote:
 Check the system event logs in the motherboard bios. Sometimes listed
 under SEL. Otherwise, I would stress test the machine. I used to run ctcs
 to burn in systems for a cluster I worked on for LLNL. It does memory,
 io, and cpu stress tests.

 http://sourceforge.net/projects/va-ctcs/
This is my only machine, and it's a production machine, so I'm not sure 
about taking it out of service to run ctcs2 (thanks Rick!).  It may be 
worth a trial, nonetheless, in the wee hours of a weekend morning.  As to 
the system event log, I just ran dmidecode, and it shows no errors.  
Mind you, this is 32 hours later, with a reboot in between, so anything 
that was current then may have been over-written.
 You could also try lm-sensors to monitor the hardware.

 http://ubuntuforums.org/showthread.php?t=2780

I have lm-sensors installed.  The only thing I can access on this MB is 
one temperature setting.  Mind you, I've only relied on gkrellm to find 
them, though with other MBs it's been pretty good at sussing them out.  
I'll run the setup utility and see what I can find.  Voltage variability 
might be the culprit, I suppose.
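
A rough sketch of logging whatever the kernel's hwmon interface exposes, so that the last readings before a halt survive on disk. It assumes the sensor driver publishes files such as temp1_input (millidegrees Celsius), in0_input (millivolts) and fan1_input (RPM) directly under /sys/class/hwmon/hwmonN. On some older kernels they sit under hwmonN/device instead, and a board whose sensor chip has no loaded driver will show nothing here either. The function names and the log path are made up for illustration.

```python
import os
from datetime import datetime
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return {"chip/sensor": (value, unit)} for temp, voltage and fan inputs."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("*_input")):
            kind = sensor.name.split("_")[0]      # e.g. temp1, in0, fan1
            try:
                raw = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue                           # unreadable sensor: skip it
            if kind.startswith("temp"):
                readings[f"{chip_name}/{kind}"] = (raw / 1000.0, "C")
            elif kind.startswith("in"):
                readings[f"{chip_name}/{kind}"] = (raw / 1000.0, "V")
            elif kind.startswith("fan"):
                readings[f"{chip_name}/{kind}"] = (float(raw), "RPM")
    return readings


def append_sample(log_path, readings, now=None):
    """Append one timestamped line and force it to disk before returning."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    fields = " ".join(f"{key}={value:g}{unit}"
                      for key, (value, unit) in sorted(readings.items()))
    line = f"{stamp} {fields}".rstrip()
    with open(log_path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())   # do not leave the sample in the page cache
    return line
```

Calling append_sample("/var/log/hwmon-samples.log", read_hwmon()) once a minute from cron (a hypothetical setup) would leave a trail of temperatures and voltages leading up to the next halt. The fsync matters here, since a hard lock-up would otherwise take the most recent lines with it.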

I still wonder if it's a software issue with the various cron jobs that 
run at that time, and I'm still working through them.
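
One way to narrow down the cron angle, sketched under assumptions: write a marker to disk before and after each daily job, so that after a halt the log shows which job was in flight. This is a stand-in for illustration, not how run-parts itself works, and it does not replicate run-parts' file-name filtering. The function names and the marker format ("START name" / "END name") are invented here.

```python
import os
import subprocess


def mark(log_path, event, job):
    """Append 'EVENT job' and force it to disk so a hard hang cannot lose it."""
    with open(log_path, "a") as log:
        log.write(f"{event} {job}\n")
        log.flush()
        os.fsync(log.fileno())


def run_daily(directory, log_path):
    """Run each executable in directory in name order, bracketed by markers."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        mark(log_path, "START", name)
        subprocess.run([path], check=False)   # a failing job should not stop the rest
        mark(log_path, "END", name)


def unfinished_jobs(log_lines):
    """Return the jobs that have a START marker but no matching END."""
    running = []
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue
        event, job = parts[0], parts[1].strip()
        if event == "START":
            running.append(job)
        elif event == "END" and job in running:
            running.remove(job)
    return running
```

After the next halt, feeding the marker file to unfinished_jobs() names whichever job started and never finished. If it is the same job both times, that points toward software (or toward the hardware that job exercises); if it differs, the cron timing is more likely a coincidence of load.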

Anyway, thank you very much for these ideas - they're a considerable help.

Cam





Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Rick Moen
Quoting Cam Ellison (c...@ellisonet.ca):

 This is my only machine, and it's a production machine, so I'm not sure 
 about taking it out of service to run ctcs2 (thanks Rick!).

You're very welcome.  I have notes here, which I recommend, because
Cerberus is rather peculiar software that takes a little getting used
to, and has some quirks.

'Burn-in' on http://linuxmafia.com/kb/Hardware

(We used to put all new or repaired machines at VA Linux Systems through
at least 48 hours of Cerberus / ctcs testing, to catch problems.)



Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Cam Ellison
On 10-12-08 01:43 PM, Rick Moen wrote:
 Quoting Cam Ellison (c...@ellisonet.ca):

 This is my only machine, and it's a production machine, so I'm not sure
 about taking it out of service to run ctcs2 (thanks Rick!).
 You're very welcome.  I have notes here, which I recommend, because
 Cerberus is rather peculiar software that takes a little getting used
 to, and has some quirks.

 'Burn-in' on http://linuxmafia.com/kb/Hardware

 (We used to put all new or repaired machines at VA Linux Systems through
 at least 48 hours of Cerberus / ctcs testing, to catch problems.)


That looks very useful.  I'll give it a try.

On another list that I frequent, the two responses thus far both 
suggested replacing or swapping out the PS.  I have to admit the idea 
has merit, though it's an Antec Signature 650, came new with the rest of 
the system, and over $200 here including the taxes.  I'm a little leery 
of ending up with a good, but effectively useless, PS.  Which leads to 
another question: how do you test a PS?  Is it possible?

Thanks again

Cam



Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Brian Lavender

On Wed, Dec 08, 2010 at 02:37:35PM -0800, Cam Ellison wrote:
 On another list that I frequent, the two responses thus far both 
 suggested replacing or swapping out the PS.  I have to admit the idea 
 has merit, though it's an Antec Signature 650, came new with the rest of 
 the system, and over $200 here including the taxes.  I'm a little leery 
 of ending up with a good, but effectively useless, PS.  Which leads to 
 another question: how do you test a PS?  Is it possible?

The burn-in process would probably reveal the fault, as it loads the machine,
drawing more power and generating more heat.



Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Rick Moen
Quoting Cam Ellison (c...@ellisonet.ca):

 On another list that I frequent, the two responses thus far both 
 suggested replacing or swapping out the PS.  I have to admit the idea 
 has merit, though it's an Antec Signature 650, came new with the rest of 
 the system, and over $200 here including the taxes.  I'm a little leery 
 of ending up with a good, but effectively useless, PS.  Which leads to 
 another question: how do you test a PS?  Is it possible?

I'm sure it's possible (at least in theory), but I never have tried.
I've always just tried to keep around at least one of each major type
with a piece of masking tape on it labelled 'known good as of [date]', 
and swap those into systems where I suspect the PSU.

If the PSU is generally functional, then in my experience the usual
question is whether it is too weak for the current draw asked of it.
(In a perfect world, you would be able to believe manufacturer ratings,
but of course they lie and exaggerate, and also doubtless some PSUs
achieve their claimed ratings better loaded with some impedance types
than others.)

Antec PSUs are on the short list of ones I have faith in, generally.


I have a confession to make:  I really didn't pay much attention to this
thread until I saw Brian mention CTCS (Cerberus), with which I have a
great deal of experience.  I've just now re-read your original posting
to get the context for all this.

That having been done, I think the suggestion of a (say, overnight)
Cerberus run has a lot to recommend it.  Cerberus puts a system under
very, very serious load, which is the rationale for its use to
stress-test newly constructed systems on the VA Linux Systems production
line:  It exposes most hardware flaws through thrashing the hell out of 
pretty nearly every hardware subsystem in the host.

Your description (halted suddenly, no output, coldboot required) doesn't
sound a-priori like a RAM problem.  It's conceivable that it's a
software problem, but my instinct says hardware is more likely.  That
instinct says it's likely to be something with either the motherboard +
CPU or with the PSU.


One avenue towards diagnosis (generally speaking and probably _not_
useful for your situation; this is just for general knowledge of
troubleshooting) is to simplify the hardware situation temporarily for
diagnostic purposes, to attempt to isolate the problem.  That is, open
up your system and look at what's plugged into what.  Do you have
expansion cards that can be disconnected and the system is still able to
produce video?  Remove them.  A miniPCI-format wireless card?  Unplug
it.  Non-boot hard drives?  Unplug and detach them.  Optical drives?
Unplug and detach them.  Get as close as you can to just motherboard +
PSU and still have the system be functional enough to run and expose the
syndrome if it's still present.

That method is useful primarily for symptoms that express strongly and
constantly, like 'System doesn't even beep or produce video'.  In those
cases, you detach every non-essential subsystem and see if the remaining
hardware then beeps and does video.  If it does, then the root cause
lies in one of the subsystems you detached -or- in the 100%-wired-up 
system trying to draw too much current from a borderline PSU.  If it 
doesn't, then the problem may be in the system core (motherboard, PSU,
CPU, RAM).

The latter case is of course tough to narrow down.  If you have multiple 
sticks of RAM, and the motherboard northbridge can function with fewer
than all of them, try with half the RAM, then with the other half,
seeing if bootup beep + video reappears and correlates with one bank of
RAM but not the other.


Getting back to steps more likely relevant to _your_ problem, the other
general class of diagnostic techniques involve swapping in known-good
components, and seeing if the problem suddenly vanishes with one such
swap-in.  The pain-in-the-ass requisite is, of course, having a bunch of
known-good parts sitting around for this purpose, which one only rarely
has.  Sorry, I don't know any easy way around that.

-- 
Rick Moen Told my friend she shouldn't smoke weed while she's 
r...@linuxmafia.com   pregnant because her baby's never going to want to 
McQ!  (4x80)  come out.   -- Kelly Oxford


Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Cam Ellison
On 10-12-08 03:26 PM, Rick Moen wrote:
 That having been done, I think the suggestion of a (say, overnight)
 Cerberus run has a lot to recommend it.  Cerberus puts a system under
 very, very serious load, which is the rationale for its use to
 stress-test newly constructed systems on the VA Linux Systems production
 line:  It exposes most hardware flaws through thrashing the hell out of
 pretty nearly every hardware subsystem in the host.
That sounds like the way to go.  I've downloaded and unzipped it.  Now 
to grab a new kernel (this is a Kubuntu box, and there are only header 
files) and set things for this weekend, maybe.
 Your description (halted suddenly, no output, coldboot required) doesn't
 sound a-priori like a RAM problem.  It's conceivable that it's a
 software problem, but my instinct says hardware is more likely.  That
 instinct says it's likely to be something with either the motherboard +
 CPU or with the PSU.
Fortunately, they're within warranty.  Unfortunately, enough time has 
passed that it will mean shipping to the manufacturer.  Too bad I didn't 
know about CTCS earlier - I guess that's for next time, if there is one. :-p

With regard to the rest of your email (snipped out), I'll try that if 
nothing comes from CTCS.  Two halts five weeks apart doesn't give me 
much to work with.

I did try dmidecode on the PS, but drew a blank, perhaps not 
surprisingly.  On the basis of your instincts, plus my own suspicions 
and previous experience (now that I think about it), I'm beginning to 
suspect the PSU.  Time for some negotiation with the supplier, I think.

Thanks again

Cam






Re: [vox-tech] Problem with Gigabyte 890FX, Phenom II, and Kubuntu

2010-12-08 Thread Rick Moen
Quoting Cam Ellison (c...@ellisonet.ca):

 With regard to the rest of your email (snipped out), I'll try that if 
 nothing comes from CTCS.  Two halts five weeks apart doesn't give me 
 much to work with.

Yes.  One of the Bad Words that one doesn't really want to hear when
performing diagnosis of any kind, including on computer hardware, is
'intermittent'.

 I did try dmidecode on the PS, but drew a blank, perhaps not 
 surprisingly.  On the basis of your instincts, plus my own suspicions 
 and previous experience (now that I think about it), I'm beginning to 
 suspect the PSU.

Could be.  FWIW, this would be the very first time I'd heard of an Antec
PSU being the root cause of a system problem.  They're really good.
However, there's always a first time.
