RE: Is this kernel related (signal 11)?

2001-01-24 Thread Rainer Mager

Hi all,

Well, I upgraded my system to glibc 2.2.1 with few problems. Unfortunately,
there are no improvements in my stability problems. X still dies.


So, I ask again, how can I debug this? How can I determine if this is a
kernel problem or not?


Thanks,

--Rainer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



RE: Is this kernel related (signal 11)?

2001-01-23 Thread Rainer Mager

As per Russell King's suggestion, I ran memtest86 on my system for about 12
hours last night. I found no memory errors. Note that the tests did not
complete because I had to stop them this morning. I'll continue them tonight.
They got through test 9 of 11.


As per David Ford's suggestion, I am looking into upgrading to glibc 2.2.1.
Can someone please give hints on doing this. I tried to upgrade to 2.2 a few
weeks ago and after the 'make install' and then reboot my system was very
broken and I had to reinstall the RedHat glibc RPM from CD to recover. I
found a howto but it seems pretty old. How do other people do this?


I've also done a strace on X. Now what do I do with this 4 MB log file?
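
One way to start on a log that size is to read only its tail: the last system calls made before the fatal signal. A minimal sketch in Python, assuming the log was written with 'strace -o' and that the signal shows up as a line containing "--- SIGSEGV" (the exact line format differs between strace versions). The function name and the log file name are hypothetical.

```python
from collections import deque


def calls_before_segv(lines, context=20):
    """Return the last `context` lines seen before the first SIGSEGV marker.

    `lines` is any iterable of strace output lines (an open file works).
    Returns None if no SIGSEGV marker line is found.
    """
    recent = deque(maxlen=context)  # sliding window, so the whole log is never held in memory
    for line in lines:
        if "--- SIGSEGV" in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return None


# Usage (hypothetical file name):
#     with open("X.strace", errors="replace") as log:
#         for call in calls_before_segv(log) or []:
#             print(call)
```

The calls just before the signal (an ioctl on the video device, an mmap, a read from the mouse) are usually the part worth posting, rather than all 4 MB.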


Thanks,

--Rainer




RE: Is this kernel related (signal 11)?

2001-01-23 Thread Rainer Mager

Thanks for all the info, comments below:

First, I ran X in gdb and got the following via 'bt' after X died. This is
my first experience with gdb so if I should do anything in particular,
please tell me.

#0  0x401addeb in __sigsuspend (set=0xb930)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x80495a4 in startServer ()
#2  0x804922c in main ()
#3  0x401a79cb in __libc_start_main (main=0x8048ee0 <main>, argc=5,
    argv=0xbacc, init=0x8048a64 <_init>, fini=0x8049a44 <_fini>,
    rtld_fini=0x4000ae60 <_dl_fini>, stack_end=0xbac4)
    at ../sysdeps/generic/libc-start.c:92


> David Ford:
>
> Upgrade -past- 2.2, get 2.2.1.  2.2 causes numerous segfaults,
> notably sendmail
> and apache stop working.

I'm willing. Are there any good how-tos on doing this without killing your
system? The last time I manually upgraded libc was about 5 years ago.


> Russell King:
>
>
> In answer to the original poster's question, the first step would be
> to grab a copy of memtest86 (iirc it's a program that is run from floppy
> disk) and run that on your system.  That /should/ (and I stress should
> there) detect any RAM problems you have.

I'll try this.



> Barry K. Nathan:
>
>
> Does it always happen when you are moving the mouse over a button or
> windowbar or some other on-screen object like that?

Nope. If anything I'd say it happens during blitting (scrolling, screen
refreshing, etc). Also, I'm not overclocking anything.




Re: Is this kernel related (signal 11)?

2001-01-22 Thread David Ford

Rainer Mager wrote:

> > Would this be an SMP IA32 box with glibc 2.2? I have two such boxen
> > showing exactly the same behaviour, although I can't reproduce it at will.
>
> Close, it is actually an SMP IA32 box with glibc 2.1.3. But you've now
> convinced me to not upgrade glibc yet  ;-)

Upgrade -past- 2.2, get 2.2.1.  2.2 causes numerous segfaults, notably sendmail
and apache stop working.

-d

--
  There is a natural aristocracy among men. The grounds of this are virtue and 
talents. Thomas Jefferson
  The good thing about standards is that there are so many to choose from. Andrew S. 
Tanenbaum






Re: Is this kernel related (signal 11)?

2001-01-22 Thread Paul Jakma

On Mon, 22 Jan 2001, Russell King wrote:

> Evidence: I recently had a bad 128MB SDRAM which *always* failed at byte
> address 0x220068,

and X is likely to be the biggest process by far on a box, so
statistically will be the process that hits this bad byte the most.
no?
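
As a back-of-the-envelope version of that argument, here is a small sketch. It assumes a single bad page is equally likely to sit under any resident page, which is a simplification, and the process sizes are made up purely for illustration.

```python
def share_of_bad_page(resident_kb):
    """Map each process to the fraction of resident memory it holds.

    Under the uniform assumption above, this is also the chance that one
    bad page, if a process is using it at all, belongs to that process.
    """
    total = sum(resident_kb.values())
    return {name: size / total for name, size in resident_kb.items()}


# Hypothetical resident set sizes in kB, not measurements from any real box.
example = {"X": 40000, "sendmail": 2000, "bash": 1000, "syslogd": 1000}
shares = share_of_bad_page(example)
likeliest = max(shares, key=shares.get)
```

With those numbers X holds roughly nine tenths of the resident pages, so it would be the likeliest victim by a wide margin.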

regards,
-- 
Paul Jakma  [EMAIL PROTECTED]   [EMAIL PROTECTED]
PGP5 key: http://www.clubi.ie/jakma/publickey.txt
---
Fortune:
The bomb will never go off.  I speak as an expert in explosives.
-- Admiral William Leahy, U.S. Atomic Bomb Project




Re: Is this kernel related (signal 11)?

2001-01-22 Thread Russell King

Rogier Wolff writes:
> Hardware problems are normally not reproducible. Can you attach a
> debugger to your X server, and catch it when things go bad? (And
> give the XFree86 people a backtrace)

Bad RAM can be extremely reproducible though, and can certainly produce
SEGVs.

Evidence: I recently had a bad 128MB SDRAM which *always* failed at byte
address 0x220068, which was the middle of the mem_map array.  All I
needed to do was 'dd if=/dev/hda of=/dev/null' and the machine would
die within 5 minutes due to an invalid buffer_head pointer.

The SDRAM naturally passed each and every single memory test I could
throw at it.  However, a new SDRAM fixed the problem.

It is quite common for SDRAMs to fail in this way - think about the
failure mode.  Some of the silicon in the SDRAM is damaged.  This isn't
going to move about, so it's going to be in a fixed position.  A fixed
position means a specific set of transistors, gate, and therefore
memory location.

In answer to the original poster's question, the first step would be
to grab a copy of memtest86 (iirc it's a program that is run from floppy
disk) and run that on your system.  That /should/ (and I stress should
there) detect any RAM problems you have.
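
To illustrate why a fixed defect is reproducible, here is a toy simulation in Python. It is not how memtest86 works internally. The class, the function, the memory size and the fault location are all hypothetical, and a real marginal SDRAM can be far subtler than a cleanly stuck bit, which is how a bad part can pass such tests.

```python
class FaultyMemory:
    """Simulated RAM in which one bit at one address is stuck at 0."""

    def __init__(self, size, bad_addr, stuck_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.mask = ~(1 << stuck_bit) & 0xFF  # clears only the stuck bit

    def write(self, addr, value):
        if addr == self.bad_addr:
            value &= self.mask  # the damaged cell cannot hold a 1 in this bit
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, return failing addresses."""
    failures = set()
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            if mem.read(addr) != pattern:
                failures.add(addr)
    return sorted(failures)


mem = FaultyMemory(size=1024, bad_addr=0x68, stuck_bit=3)
first_run = pattern_test(mem, 1024)
second_run = pattern_test(mem, 1024)
```

Both runs report the same single address, because the damage does not move. Note that the 0x00 and 0x55 patterns alone would miss this fault, since neither sets bit 3; that is one reason a test needs several complementary patterns.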

--
Russell King ([EMAIL PROTECTED])  The developer of ARM Linux
 http://www.arm.linux.org.uk/personal/aboutme.html




Re: Is this kernel related (signal 11)?

2001-01-22 Thread Barry K. Nathan

Rainer Mager wrote:
> particular problem still exists. In brief, X windows dies with signal 11. I
[snip]

Does it always happen when you are moving the mouse over a button or
windowbar or some other on-screen object like that?

Usually, when I have that happen, it's because I'm overclocking the
machine too much... I have no idea if that helps, but I thought I'd go
ahead and throw in my two cents, just in case it does.

-Barry K. Nathan <[EMAIL PROTECTED]>



RE: Is this kernel related (signal 11)?

2001-01-21 Thread Rainer Mager

> Would this be an SMP IA32 box with glibc 2.2? I have two such boxen
> showing exactly the same behaviour, although I can't reproduce it at will.

Close, it is actually an SMP IA32 box with glibc 2.1.3. But you've now
convinced me to not upgrade glibc yet  ;-)

--Rainer




Re: Is this kernel related (signal 11)?

2001-01-21 Thread Rogier Wolff

Rainer Mager wrote:

> that it is likely a hardware or kernel problem. So, my question is,
> how can I pin point the problem? Is this likely to be a kernel
> issue?

No, not hardware. No, not kernel.

Hardware problems are normally not reproducible. Can you attach a
debugger to your X server, and catch it when things go bad? (And
give the XFree86 people a backtrace)

Roger. 

-- 
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 



Re: Is this kernel related (signal 11)?

2001-01-21 Thread David Woodhouse

On Mon, 22 Jan 2001, Rainer Mager wrote:

>   I brought up this issue last month and had some response but as
> of yet my particular problem still exists. In brief, X windows dies
> with signal 11. I have done quite a bit of testing and this does not
> seem to be a hardware issue. Also, I have never managed to get a
> signal 11 error when not running X.

Would this be an SMP IA32 box with glibc 2.2? I have two such boxen 
showing exactly the same behaviour, although I can't reproduce it at will.

It happens even when I use the same kernel and XFree86 binaries which were
working perfectly before the upgrade. The LDT handling fixes which were
added between 2.4.0-prerelease and the real 2.4.0 appeared to make this
_slightly_ less frequent, but I still rarely have an X server uptime of
more than a few days.

-- 
dwmw2




