Re: random system hang

2005-06-21 Thread Olleg Samoylov

Ron Farrer wrote:


> Hardware list (to look for any possible common connections):
> 2 x Opteron 252
> Tyan S2875
> 2GB DDR400 (2 x 1GB)
> SATA (one seagate drive)
> IDE (one sony DVDRW)
> EVGA Nvidia Geforce 6800 Ultra 256MB (AGP 8x, FW, and SBA enabled)
> no PCI devices installed


I have a Tyan S2875 with one Opteron 240 and DDR400. I can give some advice.
1. I had problems with a SATA drive. Update the BIOS to the newest version.
The BIOS for the S2875 is very buggy, especially for SATA (maybe your swap
is on the SATA drive?).
2. I had hangs with DDR400. memtest86+ showed a buggy DIMM. Try this useful
test.
3. I very often get hangs with mplayer. But mplayer is not in Debian,
so I can't comment on this.


--
Olleg Samoylov





Re: random system hang

2005-06-21 Thread Thomas Steffen
On 6/18/05, Ron Farrer [EMAIL PROTECTED] wrote:
> I guess I didn't knock on wood quick enough or something. After I sent
> my reply I left for lunch and upon returning (1 hour) the thing was locked
> up!
>
> I went over to another machine, logged in via ssh. I looked at the CPU
> usage and X was pegging one of the CPUs out to 100% usage.

It might just be the X server going crazy. Since it runs as root, it
can take the whole system down if it crashes. You could try compiling
Magic SysRq key support into the kernel, so that you can kill the server
using the keyboard.
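
(Editorial sketch, not part of the original message.) On kernels built with Magic SysRq support (the CONFIG_MAGIC_SYSRQ option), the setting is normally exposed at /proc/sys/kernel/sysrq, and Alt+SysRq+K kills every process on the current virtual console, which includes the X server. The Python sketch below only checks whether the feature looks enabled. The function names are made up for this example, and treating an unreadable file as "not available" is an assumption.

```python
from pathlib import Path

def sysrq_status(raw):
    """Interpret the text of /proc/sys/kernel/sysrq.

    0 means the key is off. Any other value means it is at least
    partly on (newer kernels treat the number as a bitmask).
    """
    return "disabled" if int(raw.strip()) == 0 else "enabled"

def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Return 'enabled' or 'disabled', or None if the file cannot be read.

    Assumption: a missing file means the kernel was built without
    Magic SysRq support.
    """
    try:
        return sysrq_status(Path(path).read_text())
    except OSError:
        return None

if __name__ == "__main__":
    status = read_sysrq()
    print("Magic SysRq:", status if status else "not available on this kernel")
```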

But what I would really recommend is to try X.org 6.8 from Ubuntu. It
has solved many, many problems for me (ATI card), and my machine runs
fine now.

> So I tried to run less
> /var/log/messages and it hung.

Hm... I had some issues with my SATA drive. The connection was not
reliable, so the drive would unplug itself. A new cable fixed that.
Is /var/log/messages on the SATA drive?
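
(Editorial sketch, not part of the original message.) One way to answer that question is to find the mount point that covers the path and see which device backs it. The Python sketch below does a longest-prefix match against text in the format of /proc/mounts. The function name and the sample mount table are hypothetical, and the sketch does not resolve symlinks. On kernels of that era, SATA disks driven by libata typically show up as /dev/sd* and IDE disks as /dev/hd*.

```python
import os

def mount_for(path, mounts_text):
    """Return (device, mount_point) for the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point = fields[0], fields[1]
        if mount_point == "/":
            matches = True
        else:
            prefix = mount_point.rstrip("/")
            matches = path == prefix or path.startswith(prefix + "/")
        # Prefer the longest mount point; on ties the later entry wins,
        # since it was mounted on top of the earlier one.
        if matches and (best is None or len(mount_point) >= len(best[1])):
            best = (device, mount_point)
    return best

# Hypothetical mount table, made up for this example.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/sda1 / ext3 rw 0 0
proc /proc proc rw 0 0
/dev/hdb1 /var ext3 rw 0 0
"""
```

On a live system the second argument would be the contents of /proc/mounts, for example mount_for("/var/log/messages", open("/proc/mounts").read()).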

Thomas



Re: random system hang

2005-06-21 Thread Manuel Capinha
Yet another "me too" message.

I've seen this happen a lot on a Tyan mobo with 2 Opterons (can't
recall the exact model but I could look it up). For us, disabling
xscreensaver solved it. I can't pin it on one specific
screensaver (we even stress tested a lot of them) but _almost_
every time it crashed it was running xscreensaver. The times it crashed
without xscreensaver, I believe the screensaver could have been just
starting.

Our first guess was the amd64 Java package from Sun, but after
removing all the Java apps from there it kept crashing. It kept
crashing when we started using the x86 Java, so... we then removed the
screensaver and it has not crashed since. We're using the DPMS
stuff to turn off the monitor and nothing else.
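
(Editorial sketch, not part of the original message.) A DPMS-only setup like this can be configured with the standard xset utility. The Python sketch below builds the xset commands that turn off the X server's built-in screen blanker, enable DPMS, and set the standby, suspend and off timeouts in seconds. The function names and the timeout values are assumptions chosen for illustration, and the xscreensaver daemon itself would simply not be started.

```python
import subprocess

def dpms_commands(standby=600, suspend=900, off=1200):
    """Build xset invocations for a DPMS-only setup (timeouts in seconds)."""
    return [
        ["xset", "s", "off"],   # disable the X server's built-in screen saver
        ["xset", "+dpms"],      # enable DPMS power management
        ["xset", "dpms", str(standby), str(suspend), str(off)],
    ]

def apply(commands):
    """Run each command; this needs a running X session with DISPLAY set."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling apply(dpms_commands()) from inside an X session would apply the settings until the X server restarts.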

Since you're all seeing problems in X, maybe this is related. YMMV :)

Cheers,
Manuel





Re: random system hang

2005-06-21 Thread Charles Leggett

I'm using a Tyan Thunder K8W S2885 mobo, but have no SATA drives
attached.

Hmmm, most of the time, the machine hangs when I'm away, so the
screensaver is running, though it's only using the blank screen.
Sometimes, however, it has hung while I was in the middle of using
the system.

I moved to the 2.6.12 kernel yesterday. Let's see if that makes
any difference. Haven't tried X.org.

Charles.






Re: random system hang

2005-06-21 Thread Ron Farrer

On Tue, June 21, 2005 0:09, Olleg Samoylov said:

> I have a Tyan S2875 with one Opteron 240 and DDR400. I can give some advice.
> 1. I had problems with a SATA drive. Update the BIOS to the newest version.
> The BIOS for the S2875 is very buggy, especially for SATA (maybe your swap
> is on the SATA drive?).

I am running the latest BIOS (3.00) because it is required to get this
board to POST with Opteron 252 processors (found that out the hard way).

> 2. I had hangs with DDR400. memtest86+ showed a buggy DIMM. Try this useful
> test.

I let memtest run overnight (I know, not that long...) and there were no
problems.

> 3. I very often get hangs with mplayer. But mplayer is not in Debian,
> so I can't comment on this.

I can't comment here as I don't use mplayer. I do use xine and it runs
perfectly from a chroot (for w32 codecs). XMMS runs fine natively.

Ron


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: random system hang

2005-06-21 Thread Ron Farrer

On Tue, June 21, 2005 2:40, Thomas Steffen said:

> It might just be the X server going crazy. Since it runs as root, it
> can take the whole system down if it crashes. You could try compiling
> Magic SysRq key support into the kernel, so that you can kill the server
> using the keyboard.

That's good advice. Although with only one exception I have been able to
ssh in and restart X or reboot the machine.

> But what I would really recommend is to try X.org 6.8 from Ubuntu. It
> has solved many, many problems for me (ATI card), and my machine runs
> fine now.

I'd like to try X.org, but I think I'll wait until it's in sid.

> Hm... I had some issues with my SATA drive. The connection was not
> reliable, so the drive would unplug itself. A new cable fixed that.
> Is /var/log/messages on the SATA drive?

This is possible, but I have done a LOT of heavy I/O to/from the disk
without even a hiccup. I even (recently) copied 200GB from an ATA/100
drive to a SATA drive without issue.

Based on some off-list discussion I decided to try recompiling the nvidia
kernel driver with gcc 3.4.4 (instead of 3.3.6) and booting the kernel with
acpi=off. The results so far show much better stability in X,
and the kernel no longer complains about losing ticks. I'll keep an eye
on it, but ut2004 (amd64 binary) no longer randomly crashes. I had not
noticed those crashes before because I didn't play the game long enough on
this machine (which is why I said it was fine in a previous post). However, it
hasn't been that long (a day and a half) and only time will tell if either
change fixed anything...
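
(Editorial sketch, not part of the original message.) When comparing notes across machines, it helps to confirm which boot options the running kernel actually received. The Python sketch below splits a kernel command line, in the format of /proc/cmdline, into a dictionary. The function name and the sample command line are hypothetical.

```python
def boot_options(cmdline):
    """Parse a kernel command line into {option: value}.

    Bare flags such as "ro" map to None. If an option is repeated,
    the last occurrence wins in this sketch.
    """
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options

# Hypothetical command line, made up for this example.
SAMPLE_CMDLINE = "root=/dev/sda1 ro acpi=off"

if __name__ == "__main__":
    opts = boot_options(SAMPLE_CMDLINE)
    print("ACPI disabled at boot:", opts.get("acpi") == "off")
```

On a live system the input would be the contents of /proc/cmdline instead of the sample string.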

Ron





Re: random system hang

2005-06-21 Thread Ron Farrer

I also forgot to mention that xscreensaver seems better behaved. The
OpenGL screensavers no longer bleed onto the screen in preview mode
(from KDE) and most of them seem to work when quickly scrolling through
them.

Ron

On Tue, June 21, 2005 17:23, Ron Farrer said:
> Based on some off-list discussion I decided to try recompiling the nvidia
> kernel driver with gcc 3.4.4 (instead of 3.3.6) and booting the kernel with
> acpi=off. The results so far show much better stability in X,
> and the kernel no longer complains about losing ticks. I'll keep an eye
> on it, but ut2004 (amd64 binary) no longer randomly crashes. I had not
> noticed those crashes before because I didn't play the game long enough on
> this machine (which is why I said it was fine in a previous post). However, it
> hasn't been that long (a day and a half) and only time will tell if either
> change fixed anything...





Re: random system hang

2005-06-17 Thread Ron Farrer

On Thu, June 16, 2005 15:56, Charles Leggett said:

> My dual Opteron system hangs at random intervals. Sometimes it's stable
> for a week, sometimes it hangs after just a few hours. The symptoms
> are always the same - NONE. Carefully scanning the system logs shows
> absolutely nothing occurred to cause a hang. No kernel oopses, no error
> messages. It's been this way ever since I installed Debian 6 months ago -
> before that I was running CentOS, and it never died then, so I'm pretty
> sure it's not a hardware problem.

I don't really have a solution; this is more of a "me too" reply.

I have also been seeing seemingly random lockups on a dual Opteron system.
The screen is frozen and there is no keyboard response. Once it has locked
up, however, I was able to log in via ssh and do some looking around. Most
commands take about 1-2 minutes to complete. Just typing vim somefile
can take as long as 2 minutes before it completes. Logging in via ssh
sometimes takes 30 seconds or more.

On one of the lockups I started killing off processes and finally
determined that X was not stopping when given the HUP and SEGV signals.
Running (I use KDE) /etc/init.d/kdm stop would end in an error about the
X server not responding. Luckily (for me) X would stop with a KILL signal
(-9), and I was able to restart X with /etc/init.d/kdm restart. That
returned the local console and the frozen screen to normal, and
the machine operated normally from that point on.
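
(Editorial sketch, not part of the original message.) The "try a gentle signal first, fall back to KILL" procedure described above can be scripted. The Python sketch below sends a sequence of signals to a PID and reports which one the process finally went away on. The function names, the default grace period, and the HUP, TERM, KILL order are assumptions for illustration; the message itself used HUP and SEGV before KILL. A process stuck in uninterruptible sleep (state D, for example blocked on I/O) will not exit even on KILL until the blocking operation returns.

```python
import os
import signal
import time

def pid_exists(pid):
    """Probe with signal 0, which checks for existence without delivering anything.

    Note: a child of this script that has exited but not been reaped
    still counts as existing; pass a different is_alive in that case.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def stop_with_escalation(pid,
                         signals=(signal.SIGHUP, signal.SIGTERM, signal.SIGKILL),
                         grace=5.0,
                         is_alive=pid_exists):
    """Send each signal in order, waiting up to `grace` seconds after each.

    Returns the signal after which the process was gone, or None if it
    was already gone or never seen to exit.
    """
    for sig in signals:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return None  # already gone before this signal could be sent
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return sig
            time.sleep(0.1)
    return None
```

Sending signals to a process owned by root, such as the X server, requires running the script as root.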

There doesn't seem to be any visible connection between the lockups
outside X and friends. The interval between them can be anywhere from a
few hours (rare) to a week. So far most of the lockups were while I wasn't
even in the office - one was in the middle of the night and another was
during the day when I was away.

I have been having weird behavior from xscreensaver, which could be the
root of the problem: sometimes it doesn't want to start, some screensavers
(especially OpenGL ones) bleed onto the screen in preview mode (forcing me
to restart X to regain control), and sometimes (rarely) it doesn't want to
stop on keyboard or mouse activity. I have not yet tried running
without xscreensaver, and if the lockups continue I may try stopping it. I
configured xscreensaver to use the slide show screensaver and I've only
seen one lockup so far (knock on wood). The machine has never locked up
while in use, only when idle.

This system has been otherwise rock solid. It runs everything extremely
well, including games like ut2004 (amd64) and doom3 (i386 chroot). When a
lockup happens there is nothing in any of the logs. I am running sid and
keep it up-to-date. This machine has been heavily tested with a huge range
of tasks (games, compiling, benchmarks, etc.) and none have shown signs of
any problems. I like to compile large packages, both for comparison to other
machines and to look for any signs of instability. After compiling
xserver-xfree86 (33 minutes) there were no errors or unusual behavior
(although I did not actually try using the compiled packages). Other
compiled packages have built and run fine (although none are as large as
X), so I'm leaning towards a problem with X or xscreensaver.

Hardware list (to look for any possible common connections):
2 x Opteron 252
Tyan S2875
2GB DDR400 (2 x 1GB)
SATA (one seagate drive)
IDE (one sony DVDRW)
EVGA Nvidia Geforce 6800 Ultra 256MB (AGP 8x, FW, and SBA enabled)
no PCI devices installed

Right now the machine has been up 7 days without a lockup and I'll
continue to track it - but narrowing the problem down is difficult when
the lockups only happen about a week or two apart...

Regards,
Ron





Re: random system hang

2005-06-17 Thread Ron Farrer

On Fri, June 17, 2005 13:09, Ron Farrer said:
<lots of snipping>
> On one of the lockups I started killing off processes and finally
> determined that X was not stopping when given the HUP and SEGV signals.
> Running (I use KDE) /etc/init.d/kdm stop would end in an error about the
> X server not responding. Luckily (for me) X would stop with a KILL signal
> (-9), and I was able to restart X with /etc/init.d/kdm restart. That
> returned the local console and the frozen screen to normal, and
> the machine operated normally from that point on.

I guess I didn't knock on wood quick enough or something. After I sent
my reply I left for lunch and upon returning (1 hour) the thing was locked
up!

I went over to another machine, logged in via ssh. I looked at the CPU
usage and X was pegging one of the CPUs out to 100% usage. xscreensaver
appeared to have died (it was not in the output of ps aux) but it was
clearly the last thing on the frozen screen. I noted the system load was
2.95 (the machine was idle when I left it). I killed X (with -9) and
restarted kdm. I opened (still over ssh) a file in vim (a text file that I
had been editing before I left for lunch) and vim hung. So I opened
another ssh session and ran ps aux, and it hung before finishing the list
(it got probably 3/4 of the way through). So I tried to run less
/var/log/messages and it hung. OK, so I opened yet another ssh session
and tried to run top, and the whole system stopped responding (first time
ever). At this point I could no longer ssh in. Left with no other choice, I
pressed the reset button and the box came back up fine. Just a bit of FYI in
case anyone has an idea about the cause.
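
(Editorial sketch, not part of the original message.) The two checks described above, reading the system load and finding the process that is pegging a CPU, can be captured in a small script so the numbers are recorded the moment a lockup is noticed. The Python sketch below parses text in the format of /proc/loadavg and of ps aux output. The function names and the sample data are hypothetical; only the load figure of 2.95 comes from the message.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def busiest_process(ps_aux_output):
    """Return (pid, cpu_percent, command) for the row with the highest %CPU.

    Expects the column layout printed by `ps aux`:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    best = None
    for line in ps_aux_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, cpu, command = int(fields[1]), float(fields[2]), fields[10]
        if best is None or cpu > best[1]:
            best = (pid, cpu, command)
    return best

# Hypothetical sample data, made up for this example.
SAMPLE_LOADAVG = "2.95 1.80 0.90 3/120 4321"
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   1584   512 ?        S    09:00   0:01 init [2]
root      4242 99.9  3.1 123456 65432 tty7     R    09:05  61:02 /usr/bin/X :0
ron       4300  0.3  1.2  45678 23456 ?        S    09:06   0:12 kdeinit: kwin
"""

if __name__ == "__main__":
    print(parse_loadavg(SAMPLE_LOADAVG))
    print(busiest_process(SAMPLE_PS))
```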

BTW I'm running Debian kernel 2.6.11-9-amd64-k8-smp

Ron

