Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-07-02 Thread Tha Phlash

I've also had a problem with signal 11. Here's a great page explaining the aspects of 
the signal 11 error from gcc (http://www.bitwizard.nl/sig11/).

Signal 11 is usually a hardware problem, as the article points out. I found a sloppy 
solution by playing with my BIOS settings: it turns out there was an option called "Memory 
Hole at 15Mb Addr." I enabled it and got no more sig11; however, when I boot up, 
Linux only recognizes about 13MB of my 64MB of RAM. 

Anyway, those are my 2 cents.

Luis 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-07-01 Thread H. Peter Anvin

Riley Williams wrote:

> Hi Peter.
> 
>  >> Wasn't 2.2.12 the kernel that included the `lock halt` bug patch?
> 
>  > Perhaps, but it has absolutely nothing to do with the rest of
>  > this discussion.
> 
> The `lock halt` bug patch was specific to the Cyrix processors (not to
> be confused with the `lock registers` patch for the Intel processors),
> and I noted that the processor in question was a Cyrix one, hence the
> comment.
> 


Oh.  Sorry, I don't know about "lock halt" and its effects.  However, if 
it refers to the instruction sequence LOCK HLT I find it hard to believe 
it would have the symptoms described.

-hpa





Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-07-01 Thread H. Peter Anvin

Riley Williams wrote:

> 
> Wasn't 2.2.12 the kernel that included the `lock halt` bug patch?
> 


Perhaps, but it has absolutely nothing to do with the rest of this 
discussion.

-hpa





Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-07-01 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:szonyi calin <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> Almost always ?
> It seems like gcc is THE ONLY program which gets
> signal 11
> Why the X server doesn't get signal 11 ?
> Why others programs don't get signal 11 ?
> 

gcc happens to be one of the best memory testers known to man -- much
better than most other programs.  A big reason for that is that it
accesses lots of memory in funny patterns, *AND* any corrupted access is
likely to be fatal.

It is just the way it is.  gcc doing the signal 11 is HIGHLY
correlated with the hardware you are running on, which means it's
*usually* hardware-related.
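
To make the "memory tester" point concrete, here is a minimal sketch in Python of the general idea: touch a lot of memory in a scattered order and check that every cell still holds what was written. The function names, the buffer size and the byte formula are all made up for illustration. A script like this, running under an interpreter in user space, is nowhere near a real memory test (caches and virtual memory hide a lot), so treat it as a picture of the concept only.

```python
import random


def expected_byte(offset, pass_no):
    """Value a cell should hold: depends on its offset and the pass number."""
    return (offset * 31 + pass_no) & 0xFF


def write_pattern(buf, order, pass_no):
    """Write the expected byte to every offset, visiting offsets in `order`."""
    for i in order:
        buf[i] = expected_byte(i, pass_no)


def check_pattern(buf, order, pass_no):
    """Return (offset, expected, actual) for every cell that lost its value."""
    return [
        (i, expected_byte(i, pass_no), buf[i])
        for i in order
        if buf[i] != expected_byte(i, pass_no)
    ]


def scattered_test(size=1 << 16, passes=2, seed=1):
    """Fill and verify a buffer several times, each time in a shuffled order."""
    buf = bytearray(size)
    rng = random.Random(seed)
    bad = []
    for p in range(passes):
        order = list(range(size))
        rng.shuffle(order)  # non-sequential access, unlike a simple copy loop
        write_pattern(buf, order, p)
        bad.extend(check_pattern(buf, order, p))
    return bad
```

On healthy hardware `scattered_test()` should come back with an empty list; any entry in the result names a cell that did not hold its value.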

> [... Lots of M$ flames ignored ...]

> Some time ago I installed Linux (Redhat 6.0) on my pc (Cx486 8M RAM)
> and gcc had a lot of signal 11 (a couple every hour) I was upgrading
> the kernel every time there was a new kernel and from 2.2.12(or 14)
> no more signal 11 (very rare) Is this still a hardware problem ?
> Was a bug in kernel ?
> 
> I think the last answer is more obvious.(or the gcc
> had a bug and the kernel -- a workaround).

Most likely is that your *hardware* had a bug and the new kernel a
workaround (this is quite common), but without more detail it is very
hard to know.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Albert D. Cahalan

> Almost always ?
> It seems like gcc is THE ONLY program which gets
> signal 11
> Why the X server doesn't get signal 11 ?
> Why others programs don't get signal 11 ?
...
> Some time ago I installed Linux (Redhat 6.0) on my 
> pc (Cx486 8M RAM) and gcc had a lot of signal 11 (a
> couple every hour) I was upgrading
> the kernel every time there was a new kernel and
> from 2.2.12(or 14) no more signal 11 (very rare)
> Is this still a hardware problem ?

It could be. One possible way:

1. your system is clogged with dust
2. gcc runs the CPU hard, generating lots of heat
3. the heat causes crashes
4. you install a new Linux version that sets a Cyrix-specific power-saving mode
5. your heat problems go away, and so do the crashes

Another possible way:

1. you have buggy motherboard or disk hardware
2. when you swap, gcc gets corrupted by the hardware
3. you get a new Linux kernel that has a bug work-around
4. your problems go away

Yet another way:

1. your room is hot, your computer is near a huge motor...
2. you upgrade to Linux 2.2.12 and move your computer
3. soon you realize that the crashes are gone
4. you credit the kernel, but location was the problem



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Jesse Pollard

> --- Jesse Pollard <[EMAIL PROTECTED]> wrote:
> > > "This is almost always the result of flakiness in your hardware - either
> > > RAM (most likely), or motherboard (less likely)."
> > >
> > > I cannot understand this. There are many other
> > > things that I compiled with gcc without any problem. Again compilation is only
> > > an application. It only parses and generates object files. How can RAM or
> > > motherboard makes different
> >
> > It's most likely flaky memory.
> >
> > Remember: a single bit that drops can cause the signal 11. It doesn't have
> > to happen consistently either. I had the same problem until I slowed down
> > memory access (that seemed to cover the borderline chip).
> >
> > The compiler uses different amounts of memory depending on the source file,
> > number of symbols defined (via include headers). When the multiple passes
> > occur simultaneously, there is higher memory pressure, and more of the
> > free space used. One of the pages may flake out. Compiling the kernel
> > puts more pressure on memory than compiling most applications.
> 
> Almost always ?
> It seems like gcc is THE ONLY program which gets
> signal 11
> Why the X server doesn't get signal 11 ?
> Why others programs don't get signal 11 ?

Load the system down with lots of processes/large image windows.
Unless the bit in question is in a pointer, or in data used in pointer
arithmetic or a function call, it won't segfault. Applications (if an
instruction page gets hit) may get an illegal instruction.
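
A toy model of the above, sketched in Python under loud assumptions: the "mapped region" bounds and the example pointer value are invented numbers, and a real address space has many mappings rather than one. It only shows that a flipped bit in plain data gives a quietly wrong value, while a flipped bit in a pointer can land outside mapped memory, which is the case the kernel answers with signal 11.

```python
# Hypothetical bounds of a single mapped region (64 KiB); not real addresses.
VALID_START = 0x0804_8000
VALID_END = 0x0805_8000  # exclusive


def flip_bit(value, bit):
    """Return `value` with one bit inverted, as a dropped or stuck bit would."""
    return value ^ (1 << bit)


def classify_pointer(addr):
    """'mapped' if the address is inside the region, otherwise 'segfault'."""
    return "mapped" if VALID_START <= addr < VALID_END else "segfault"


def pointer_flip_outcomes(pointer, width=32):
    """For each bit position, say what dereferencing the damaged pointer does."""
    return {bit: classify_pointer(flip_bit(pointer, bit)) for bit in range(width)}


def data_flip(value, bit):
    """A damaged data word: no fault at all, just a different number."""
    return flip_bit(value, bit)
```

In this model a flip in a low bit of the pointer still lands in mapped memory (so the program reads the wrong thing instead of crashing), while a flip in a high bit leaves the region entirely. `data_flip` never faults, which matches the remark that wrong results may show up instead.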

> I remember that once Bill Gates was asked about
> crashes in windows and he said: It's a hardware
> problem.
> It was also a joke on that subject:
> Winerr xxx: Hardware problem (it's not our fault, it's
> not, it's not, it's not, it's not...)

Yup - because it crashed VERY frequently when it was obviously a
software bug.

> Seems to me like Micro$oft way of handling problems.
> 
> We must agree that gcc is full of bugs (xanim does not
> 
> run corectly if it is compiled with gcc 2.95.3 
> and other programs which use floating point
> calculations do the same (spice 3f5))

Generating wrong code is different than a segfault.

Currently I'm using egcs-2.91.66 on a 486, without problems.
(I don't do floating point on a 486... too slow).

> Some time ago I installed Linux (Redhat 6.0) on my 
> pc (Cx486 8M RAM) and gcc had a lot of signal 11 (a
> couple every hour) I was upgrading
> the kernel every time there was a new kernel and
> from 2.2.12(or 14) no more signal 11 (very rare)
> Is this still a hardware problem ?
> Was a bug in kernel ?

Not likely - It could just depend on whether all of available memory
was used. If the physical page with the problem doesn't get used
very often, it won't show up. If the bit in question is not part
of a pointer, or used in pointer arithmetic, again it won't show
up (actually, any operation on addresses). Wrong, or slightly wrong
results MAY show up.

> I think the last answer is more obvious.(or the gcc
> had a bug and the kernel -- a workaround).
> 
> Sorry for bothering you but in every piece of linux
> documentation signal 11 seems to be __identic__ with
> hardware problem.
> Bye

Only when it appears in random locations.

GCC is a fairly well debugged program and doesn't segfault
unless you run out of memory or have flaky memory.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



Re: gcc: internal compiler error: program cc1 got fatal signal 11

2001-06-29 Thread David Relson

At 10:20 AM 6/29/01, you wrote:

>Almost always ?
>It seems like gcc is THE ONLY program which gets
>signal 11
>Why the X server doesn't get signal 11 ?
>Why others programs don't get signal 11 ?
>
>I remember that once Bill Gates was asked about
>crashes in windows and he said: It's a hardware
>problem.
>It was also a joke on that subject:
>Winerr xxx: Hardware problem (it's not our fault, it's
>not, it's not, it's not, it's not...)
>
>
>Seems to me like Micro$oft way of handling problems.
>
>We must agree that gcc is full of bugs (xanim does not
>run corectly if it is compiled with gcc 2.95.3
>and other programs which use floating point
>calculations do the same (spice 3f5))

All versions of gcc have bugs.  They generally show up as incorrect 
complaints about the source code, as generated code that is less than 
optimal or that is flat out wrong.  With this kind of bug, if you compile 
the program twice you'll get the same (buggy) result.

Sig 11 is a bit different.  With a compiler bug causing the sig 11, the 
problem will happen EVERY time you compile the given file - because the 
compiler is busted.  This kind of problem is detected early in the 
compiler's life cycle and gets fixed.

Then there are the intermittent sig 11 errors.  If the software was broken, 
the sig 11 would happen whenever you do the same thing.  Being able to 
compile a bunch of files, get a sig 11, compile a bunch more, sig 11, a 
bunch more ... is a sign that the problem isn't the compiler.  People's 
experience over the years has shown that symptoms of this type are caused by 
(intermittent) hardware problems.
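
The "same file, same result?" check described here can be sketched as a small Python script. This is only a heuristic under assumptions: the function names are invented, any non-zero exit status is counted as a failure, and the compile command in the usage note refers to a hypothetical file. A failure on every run can also come from running out of memory, so the labels are hints, not verdicts.

```python
import subprocess


def run_compile(cmd):
    """Run one compile command and return its exit status (0 means success)."""
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode


def classify(results):
    """Label a list of exit statuses from compiling the SAME file repeatedly."""
    failures = sum(1 for status in results if status != 0)
    if failures == 0:
        return "no failure observed"
    if failures == len(results):
        return "reproducible: suspect the compiler or the source"
    return "intermittent: suspect hardware"


def repeat_compile(cmd, runs=10, runner=run_compile):
    """Compile the same input `runs` times and classify the outcomes."""
    return classify([runner(cmd) for _ in range(runs)])
```

A run might look like `repeat_compile(["gcc", "-c", "foo.c", "-o", "/dev/null"])`, where `foo.c` stands for whichever file triggered the sig 11. The `runner` parameter exists only so the logic can be tried without a compiler.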

Why does this affect gcc more than other programs?  Because gcc uses 
gazillions of pointers and bad memory causes bad pointers causes sig 11.

Hope this helps.

David

P.S.  Years ago, installing OS/2 on an apparently 100% working system would 
show similar problems.  OS/2 was the first widely used 32 bit operating 
system on Intel hardware.  It exercised the hardware differently from DOS, 
Windows, etc. and flaky memory would make itself known.  The usual reaction 
was "But my system worked fine before OS/2."  The response was 
"different software exercises the hardware differently and may reveal 
unsuspected problems".

David Relson   Osage Software Systems, Inc.
[EMAIL PROTECTED]   Ann Arbor, MI 48103
www.osagesoftware.com  tel:  734.821.8800




Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread szonyi calin


--- Jesse Pollard <[EMAIL PROTECTED]> wrote:
> > "This is almost always the result of flakiness in your hardware - either
> > RAM (most likely), or motherboard (less likely)."
> >
> > I cannot understand this. There are many other
> > things that I compiled with gcc without any problem. Again compilation is only
> > an application. It only parses and generates object files. How can RAM or
> > motherboard makes different
>
> It's most likely flaky memory.
>
> Remember: a single bit that drops can cause the signal 11. It doesn't have
> to happen consistently either. I had the same problem until I slowed down
> memory access (that seemed to cover the borderline chip).
>
> The compiler uses different amounts of memory depending on the source file,
> number of symbols defined (via include headers). When the multiple passes
> occur simultaneously, there is higher memory pressure, and more of the
> free space used. One of the pages may flake out. Compiling the kernel
> puts more pressure on memory than compiling most applications.

Almost always?
It seems like gcc is THE ONLY program which gets
signal 11.
Why doesn't the X server get signal 11?
Why don't other programs get signal 11?

I remember that Bill Gates was once asked about
crashes in Windows and he said: It's a hardware
problem.
There was also a joke on that subject:
Winerr xxx: Hardware problem (it's not our fault, it's
not, it's not, it's not, it's not...)

Seems to me like the Micro$oft way of handling problems.

We must agree that gcc is full of bugs (xanim does not
run correctly if it is compiled with gcc 2.95.3,
and other programs which use floating point
calculations do the same (spice 3f5)).

Some time ago I installed Linux (Redhat 6.0) on my 
PC (Cx486, 8M RAM) and gcc had a lot of signal 11 (a
couple every hour). I was upgrading
the kernel every time there was a new kernel, and
from 2.2.12 (or 14) on, no more signal 11 (very rare).
Is this still a hardware problem?
Was it a bug in the kernel?

I think the last answer is more obvious (or gcc
had a bug and the kernel a workaround).

Sorry for bothering you, but in every piece of Linux
documentation signal 11 seems to be __identical__ with
a hardware problem.
Bye




Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Jesse Pollard

> 
> 
> "This is almost always the result of flakiness in your hardware - either
> RAM (most likely), or motherboard (less likely)."
>  
>   I cannot understand this. There are many other
> things that I compiled with gcc without any problem. Again compilation is only
> an application. It only parses and generates object files. How can RAM or
> motherboard makes different

It's most likely flaky memory.

Remember: a single bit that drops can cause the signal 11. It doesn't have
to happen consistently either. I had the same problem until I slowed down
memory access (that seemed to cover the borderline chip).

The compiler uses different amounts of memory depending on the source file,
number of symbols defined (via include headers). When the multiple passes
occur simultaneously, there is higher memory pressure, and more of the
free space used. One of the pages may flake out. Compiling the kernel
puts more pressure on memory than compiling most applications.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Erik Mouw

On Thu, Jun 28, 2001 at 11:23:37PM -0600, Blesson Paul wrote:
> 
> "This is almost always the result of flakiness in your hardware - either
> RAM (most likely), or motherboard (less likely)."
>  
>   I cannot understand this. There are many other
> things that I compiled with gcc without any problem. Again compilation is only
> an application. It only parses and generates object files. How can RAM or
> motherboard makes different

Please read the complete Sig11 FAQ (http://www.bitwizard.nl/sig11/);
your question is discussed in it as well.


Erik

-- 
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031,  2600 GA Delft, The Netherlands
Phone: +31-15-2783635  Fax: +31-15-2781843  Email: [EMAIL PROTECTED]
WWW: http://www-ict.its.tudelft.nl/~erik/



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Jesse Pollard

-  Received message begins Here  -

> --- Jesse Pollard [EMAIL PROTECTED] wrote:
> > > This is almost always the result of flakiness in your hardware -
> > > either RAM (most likely), or motherboard (less likely).
> > >
> > > I cannot understand this. There are many other things that I
> > > compiled with gcc without any problem. Again, compilation is only
> > > an application. It only parses and generates object files. How can
> > > RAM or the motherboard make a difference?
> >
> > It's most likely flaky memory.
> >
> > Remember - a single bit that drops can cause the signal 11. It
> > doesn't have to happen consistently either. I had the same problem
> > until I slowed down memory access (that seemed to cover the
> > borderline chip).
> >
> > The compiler uses different amounts of memory depending on the
> > source file and the number of symbols defined (via include headers).
> > When the multiple passes occur simultaneously, there is higher
> > memory pressure, and more of the free space is used. One of the
> > pages may flake out. Compiling the kernel puts more pressure on
> > memory than compiling most applications.
> >
> > Jesse I Pollard, II
> > Email: [EMAIL PROTECTED]
> >
> > Any opinions expressed are solely my own.
>
> Almost always?
> It seems like gcc is THE ONLY program which gets signal 11.
> Why doesn't the X server get signal 11?
> Why don't other programs get signal 11?

Load the system down with lots of processes/large image windows.
Unless the bit in question is in a pointer, or in data used in pointer
arithmetic or a function call, it won't segfault. Applications (if an
instruction page gets hit) may get an illegal instruction instead.

> I remember that once Bill Gates was asked about crashes in Windows
> and he said: "It's a hardware problem."
> There was also a joke on that subject:
> Winerr xxx: Hardware problem (it's not our fault, it's
> not, it's not, it's not, it's not...)

Yup - because it crashed VERY frequently when it was obviously a
software bug.

> Seems to me like the Micro$oft way of handling problems.
>
> We must agree that gcc is full of bugs (xanim does not run correctly
> if it is compiled with gcc 2.95.3, and other programs which use
> floating point calculations do the same (spice 3f5)).

Generating wrong code is different than a segfault.

Currently I'm using egcs-2.91.66 on a 486, without problems.
(I don't do floating point on a 486... too slow).

> Some time ago I installed Linux (Redhat 6.0) on my PC (Cx486, 8M RAM)
> and gcc had a lot of signal 11s (a couple every hour). I was upgrading
> the kernel every time there was a new kernel, and from 2.2.12 (or 14)
> on there were no more signal 11s (very rare).
> Is this still a hardware problem?
> Was it a bug in the kernel?

Not likely - it could just depend on whether all of available memory
was used. If the physical page with the problem doesn't get used
very often, it won't show up. If the bit in question is not part
of a pointer, or used in pointer arithmetic (actually, any operation
on addresses), again it won't show up. Wrong, or slightly wrong,
results MAY show up.
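
The distinction between a flipped bit in a pointer and a flipped bit in
plain data can be illustrated with a toy model. This is only a sketch: the
mapped address range and the sample values are invented for illustration
and are not taken from any machine in this thread. It flips single bits in
a value and checks whether the result, if used as an address, would still
fall inside the mapped region.

```python
# Toy model only: MAPPED_START and MAPPED_END are hypothetical values.
MAPPED_START = 0x08048000
MAPPED_END = 0x08148000  # 1 MiB of mapped address space


def flip_bit(value, bit):
    """Return value with one bit inverted, as a bad RAM cell might do."""
    return value ^ (1 << bit)


def is_mapped(address):
    """True if the address falls inside the toy mapped region."""
    return MAPPED_START <= address < MAPPED_END


def count_faults(pointer):
    """Count how many of the 32 single-bit flips leave the mapped region."""
    return sum(1 for bit in range(32) if not is_mapped(flip_bit(pointer, bit)))


pointer = 0x080C0000  # a valid address inside the mapped region
counter = 1000        # plain data that is never dereferenced

bad_pointer = flip_bit(pointer, 28)
print(hex(bad_pointer), "mapped:", is_mapped(bad_pointer))
print("counter after a flip:", flip_bit(counter, 3))
print(count_faults(pointer), "of 32 single-bit flips would fault")
```

In this model a flipped bit in the counter only gives a wrong number, while
a flipped high bit in the pointer sends the access outside the mapping,
which is the case that would be reported as signal 11.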

> I think the last answer is more obvious (or gcc had a bug and the
> kernel a workaround).
>
> Sorry for bothering you, but in every piece of Linux documentation
> signal 11 seems to be __identical__ with a hardware problem.
> Bye

Only when it appears in random locations.

GCC is a fairly well debugged program and doesn't segfault
unless you run out of memory or have flaky memory.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-29 Thread Albert D. Cahalan

> Almost always?
> It seems like gcc is THE ONLY program which gets signal 11.
> Why doesn't the X server get signal 11?
> Why don't other programs get signal 11?
...
> Some time ago I installed Linux (Redhat 6.0) on my PC (Cx486, 8M RAM)
> and gcc had a lot of signal 11s (a couple every hour). I was upgrading
> the kernel every time there was a new kernel, and from 2.2.12 (or 14)
> on there were no more signal 11s (very rare).
> Is this still a hardware problem?

It could be. One possible way:

1. your system is clogged with dust
2. gcc runs the CPU hard, generating lots of heat
3. the heat causes crashes
4. you install a new Linux version that sets a Cyrix-specific power-saving mode
5. your heat problems go away, and so do the crashes

Another possible way:

1. you have buggy motherboard or disk hardware
2. when you swap, gcc gets corrupted by the hardware
3. you get a new Linux kernel that has a bug work-around
4. your problems go away

Yet another way:

1. your room is hot, your computer is near a huge motor...
2. you upgrade to Linux 2.2.12 and move your computer
3. soon you realize that the crashes are gone
4. you credit the kernel, but location was the problem



Re: [Re: gcc: internal compiler error: program cc1 got fatal signal 11]

2001-06-28 Thread Blesson Paul


"This is almost always the result of flakiness in your hardware - either
RAM (most likely), or motherboard (less likely).  "
 
  I cannot understand this. There are many other things that I
compiled with gcc without any problem. Again, compilation is only
an application. It only parses and generates object files. How can RAM or
the motherboard make a difference?
  




gcc: internal compiler error: program cc1 got fatal signal 11

2001-06-28 Thread Blesson Paul

Hi,
  I am trying to compile the kernel 2.4.5 source code. 
Presently I have kernel 2.2.14 and Redhat 6.2. I have egcs 1.2.2.  Now when I
compile I get the following error:
 gcc: Internal compiler error: program cc1 got fatal signal 11
 make Error 1
 Leaving directory ...
 ..
 .
 Assembler messages 
 Warning: end of file not at end of file: newline inserted 
 cpp: output pipe has been closed 
  Error: suffix or operands invalid for mov   
The confusing part is that when I recompile, the same part where this
error occurred compiles perfectly. But after some more compilation, the
same error shows up in some other place. The last line of the error output
may be different the second time.
 
   Moreover, my CPU info is given below. I have selected
processor i486. Is there any particular choice that should be made to compile
the kernel source code?
 
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 5
model   : 8
model name  : AMD-K6(tm) 3D processor
stepping: 12
cpu MHz : 400.921117
fdiv_bug: no
hlt_bug : no
sep_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr mce cx8 sep mtrr pge mmx 3dnow
bogomips: 799.54




RE: Is this kernel related (signal 11)?

2001-01-24 Thread Rainer Mager

Hi all,

Well, I upgraded my system to glibc 2.2.1 with few problems. Unfortunately,
there are no improvements in my stability problems. X still dies.


So, I ask again, how can I debug this? How can I determine if this is a
kernel problem or not?


Thanks,

--Rainer




RE: Is this kernel related (signal 11)?

2001-01-23 Thread Rainer Mager

As per Russell King's suggestion, I ran memtest86 on my system for about 12
hours last night. I found no memory errors. Note that the tests did not
complete because I had to stop them this morning. I'll continue them tonight.
They got through test 9 of 11.


As per David Ford's suggestion, I am looking into upgrading to glibc 2.2.1.
Can someone please give hints on doing this. I tried to upgrade to 2.2 a few
weeks ago and after the 'make install' and then reboot my system was very
broken and I had to reinstall the RedHat glibc RPM from CD to recover. I
found a howto but it seems pretty old. How do other people do this?


I've also done a strace on X. Now what do I do with this 4 MB log file?
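
One way to make a log of that size manageable is to look only at the
system calls leading up to the point where the signal is delivered. The
following is a minimal sketch, assuming the log contains a line that
mentions SIGSEGV at the point of the crash; the function name and the
sample lines are made up for illustration, and a real log would be read
from the file with open(...).readlines().

```python
def lines_before_signal(log_lines, signal_name="SIGSEGV", context=20):
    """Return up to `context` lines leading to the first mention of the signal.

    The matching line itself is included as the last element.
    """
    for index, line in enumerate(log_lines):
        if signal_name in line:
            start = max(0, index - context)
            return log_lines[start:index + 1]
    return []


# Hypothetical sample lines, not taken from the actual 4 MB log.
sample_log = [
    'read(3, "...", 4096) = 4096',
    'ioctl(5, 0x1234, 0) = 0',
    'write(4, "...", 32) = 32',
    '--- SIGSEGV (Segmentation fault) ---',
    '+++ killed by SIGSEGV +++',
]

for line in lines_before_signal(sample_log, context=2):
    print(line)
```

The last few calls before the signal show what the server was doing when
it died, which is usually more useful to a reader than the whole trace.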


Thanks,

--Rainer




RE: Is this kernel related (signal 11)?

2001-01-23 Thread Rainer Mager

Thanks for all the info, comments below:

First, I ran X in gdb and got the following via 'bt' after X died. This is
my first experience with gdb so if I should do anything in particular,
please tell me.

#0  0x401addeb in __sigsuspend (set=0xb930)
at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x80495a4 in startServer ()
#2  0x804922c in main ()
#3  0x401a79cb in __libc_start_main (main=0x8048ee0 <main>, argc=5,
argv=0xbacc, init=0x8048a64 <_init>, fini=0x8049a44 <_fini>,
rtld_fini=0x4000ae60 <_dl_fini>, stack_end=0xbac4)
at ../sysdeps/generic/libc-start.c:92


> David Ford:
>
> Upgrade -past- 2.2, get 2.2.1.  2.2 causes numerous segfaults,
> notably sendmail
> and apache stop working.

I'm willing. Are there any good how-tos on doing this without killing your
system? The last time I manually upgraded libc was about 5 years ago.


> Russell King:
>
>
> In answer to the original posters question, the first step would be
> to grab a copy of memtest86 (iirc its a program that is run from floppy
> disk) and run that on your system.  That /should/ (and I stress should
> there) detect any RAM problems you have.

I'll try this.



> Barry K. Nathan:
>
>
> Does it always happen when you are moving the mouse over a button or
> windowbar or some other on-screen object like that?

Nope. If anything I'd say it happens during blitting (scrolling, screen
refreshing, etc). Also, I'm not overclocking anything.




Re: oops, signal 11

2001-01-22 Thread Ralf Baechle

On Sat, Jan 20, 2001 at 01:46:50PM +0100, [EMAIL PROTECTED] wrote:

> I know that signal 11 with gcc is a sign of bad hardware; however  it
> strikes me that I don't get random oopses - a whole bunch of them is appended.

The compiler tends to hammer harder on the memory than the kernel; this
is a sign of the great effort which was taken to optimize the kernel's
cache usage.

  Ralf



Re: Is this kernel related (signal 11)?

2001-01-22 Thread David Ford

Rainer Mager wrote:

> > Would this be an SMP IA32 box with glibc 2.2? I have two such boxen
> > showing exactly the same behaviour, although I can't reproduce it at will.
>
> Close, it is actually an SMP IA32 box with glibc 2.1.3. But you've now
> convinced me to not upgrade glibc yet  ;-)

Upgrade -past- 2.2, get 2.2.1.  2.2 causes numerous segfaults, notably sendmail
and apache stop working.

-d

--
  There is a natural aristocracy among men. The grounds of this are virtue and 
talents. Thomas Jefferson
  The good thing about standards is that there are so many to choose from. Andrew S. 
Tanenbaum






Re: Is this kernel related (signal 11)?

2001-01-22 Thread Paul Jakma

On Mon, 22 Jan 2001, Russell King wrote:

> Evidence: I recently had a bad 128MB SDRAM which *always* failed at byte
> address 0x220068,

and X is likely to be the biggest process by far on a box, so
statistically will be the process that hits this bad byte the most.
no?

regards,
-- 
Paul Jakma  [EMAIL PROTECTED]   [EMAIL PROTECTED]
PGP5 key: http://www.clubi.ie/jakma/publickey.txt
---
Fortune:
The bomb will never go off.  I speak as an expert in explosives.
-- Admiral William Leahy, U.S. Atomic Bomb Project




Re: Is this kernel related (signal 11)?

2001-01-22 Thread Russell King

Rogier Wolff writes:
> Harware problems are normally not reproducable. Can you attach a
> debugger to your X server, and catch it when things go bad? (And
> give the Xfree86 people a backtrace)

Bad RAM can be extremely reproducible though, and can certainly produce
SEGVs.

Evidence: I recently had a bad 128MB SDRAM which *always* failed at byte
address 0x220068, which was the middle of the mem_map array.  All I
needed to do was 'dd if=/dev/hda of=/dev/null' and the machine would
die within 5 minutes due to an invalid buffer_head pointer.

The SDRAM naturally passed each and every single memory test I could
throw at it.  However, a new SDRAM fixed the problem.

It is quite common for SDRAMs to fail in this way - think about the
failure mode.  Some of the silicon in the SDRAM is damaged.  This isn't
going to move about, so it's going to be in a fixed position.  A fixed
position means a specific set of transistors, gates, and therefore
memory location.

In answer to the original poster's question, the first step would be
to grab a copy of memtest86 (iirc it's a program that is run from floppy
disk) and run that on your system.  That /should/ (and I stress should
there) detect any RAM problems you have.
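
The idea of a fixed bad location and a pattern test that hunts for it can
be sketched with a simulation. This is not how memtest86 itself is
implemented; the FlakyMemory class, the stuck-at-0 fault model, and the
address and bit chosen below are all assumptions made for illustration.
The test writes each single-bit pattern to every cell and reads it back.

```python
class FlakyMemory:
    """Toy RAM in which one cell has a bit stuck at 0 (hypothetical fault)."""

    def __init__(self, size, bad_address=None, stuck_bit=0):
        self.cells = bytearray(size)
        self.bad_address = bad_address
        self.stuck_mask = 1 << stuck_bit

    def write(self, address, value):
        if address == self.bad_address:
            # The damaged bit can never store a 1.
            value &= ~self.stuck_mask & 0xFF
        self.cells[address] = value

    def read(self, address):
        return self.cells[address]


def walking_ones_test(memory, size):
    """Write each single-bit pattern to every cell, then read it back."""
    failures = set()
    for bit in range(8):
        pattern = 1 << bit
        for address in range(size):
            memory.write(address, pattern)
        for address in range(size):
            if memory.read(address) != pattern:
                failures.add(address)
    return sorted(failures)


SIZE = 4096
bad_ram = FlakyMemory(SIZE, bad_address=0x68, stuck_bit=5)
good_ram = FlakyMemory(SIZE)

print([hex(a) for a in walking_ones_test(bad_ram, SIZE)])
print(walking_ones_test(good_ram, SIZE))
```

A simple stuck bit like this one is found at the same address on every
run. Faults that depend on timing, temperature, or neighbouring access
patterns may not show up in such a test, which fits the stress on "should"
above.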

--
Russell King ([EMAIL PROTECTED])The developer of ARM Linux
 http://www.arm.linux.org.uk/personal/aboutme.html




Re: Is this kernel related (signal 11)?

2001-01-22 Thread Barry K. Nathan

Rainer Mager wrote:
> particular problem still exists. In brief, X windows dies with signal 11. I
[snip]

Does it always happen when you are moving the mouse over a button or
windowbar or some other on-screen object like that?

Usually, when I have that happen, it's because I'm overclocking the
machine too much... I have no idea if that helps, but I thought I'd go
ahead and throw in my two cents, just in case it does.

-Barry K. Nathan <[EMAIL PROTECTED]>



RE: Is this kernel related (signal 11)?

2001-01-21 Thread Rainer Mager

> Would this be an SMP IA32 box with glibc 2.2? I have two such boxen
> showing exactly the same behaviour, although I can't reproduce it at will.

Close, it is actually an SMP IA32 box with glibc 2.1.3. But you've now
convinced me to not upgrade glibc yet  ;-)

--Rainer




Re: Is this kernel related (signal 11)?

2001-01-21 Thread Rogier Wolff

Rainer Mager wrote:

> that it is likely a hardware or kernel problem. So, my question is,
> how can I pin point the problem? Is this likely to be a kernel
> issue?

No, not hardware. No not kernel. 

Hardware problems are normally not reproducible. Can you attach a
debugger to your X server, and catch it when things go bad? (And
give the Xfree86 people a backtrace)

Roger. 

-- 
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 



Re: Is this kernel related (signal 11)?

2001-01-21 Thread David Woodhouse

On Mon, 22 Jan 2001, Rainer Mager wrote:

>   I brought up this issue last month and had some response but as
> of yet my particular problem still exists. In brief, X windows dies
> with signal 11. I have done quite a bit of testing and this does not
> seem to be a hardware issue. Also, I have never managed to get a
> signal 11 error when not running X.

Would this be an SMP IA32 box with glibc 2.2? I have two such boxen 
showing exactly the same behaviour, although I can't reproduce it at will.

It happens even when I use the same kernel and XFree86 binaries which were
working perfectly before the upgrade. The LDT handling fixes which were
added between 2.4.0-prerelease and the real 2.4.0 appeared to make this
_slightly_ less frequent, but I still rarely have an X server uptime of
more than a few days.

-- 
dwmw2





Is this kernel related (signal 11)?

2001-01-21 Thread Rainer Mager

Hi all,

I brought up this issue last month and had some response but as of yet my
particular problem still exists. In brief, X windows dies with signal 11. I
have done quite a bit of testing and this does not seem to be a hardware
issue. Also, I have never managed to get a signal 11 error when not running
X.
I posted on the X Free86 mailing lists and the consensus there seems to be
that it is likely a hardware or kernel problem. So, my question is, how can
I pin point the problem? Is this likely to be a kernel issue?

Recently I have been able to reproduce the problem reliably in a few ways.
First, if I use an app that uses ncurses (like 'make menuconfig' on the
Linux kernel) from within Gnome-terminal then X dies instantly. For now I
have gone to using only xterm.
I can also cause the error from xmms by scrolling the playlist repeatedly.
This will happen within a few seconds but not instantly like above.
I have also seen the error in other cases but none that I am yet able to
reproduce on demand.


PLEASE, any suggestions?


--Rainer




oops, signal 11

2001-01-20 Thread mkloppstech

I know that signal 11 with gcc is a sign of bad hardware; however  it
strikes me that I don't get random oopses - a whole bunch of them is appended.

I used 2.4.0 with alsa, kmp3player running and an endless loop compiling the
kernel.

Mirko Kloppstech


ksymoops 2.3.7 on i686 2.4.0.  Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.4.0/ (default)
 -m /boot/System.map (specified)

Stack: 27ad 08140b88  bfffe03c c125abd8  cff7ea64 0001 
    0001 cca70824 cca70780 c01245f3 ccaafcc0 ccaafce0 cb09df90 
   c0124530 ffea ccaafcc0 27ad 1000 17ad 08141b88  
Call Trace: [<c01245f3>] [<c0124530>] [<c013029e>] [<c0108f27>] 
Code: 39 7b 08 75 f0 8b 74 24 24 39 73 0c 75 e7 53 e8 4d 4d 00 00 
Using defaults from ksymoops -t elf32-i386 -a i386

Trace; c01245f3 
Trace; c0124530 
Trace; c013029e 
Trace; c0108f27 
Code;  00000000 Before first symbol
 <_EIP>:
Code;  00000000 Before first symbol
   0:   39 7b 08          cmpl   %edi,0x8(%ebx)
Code;  00000003 Before first symbol
   3:   75 f0             jne    fffffff5 <_EIP+0xfffffff5> fffffff5
Code;  00000005 Before first symbol
   5:   8b 74 24 24       movl   0x24(%esp,1),%esi
Code;  00000009 Before first symbol
   9:   39 73 0c          cmpl   %esi,0xc(%ebx)
Code;  0000000c Before first symbol
   c:   75 e7             jne    fffffff5 <_EIP+0xfffffff5> fffffff5
Code;  0000000e Before first symbol
   e:   53                pushl  %ebx
Code;  0000000f Before first symbol
   f:   e8 4d 4d 00 00    call   4d61 <_EIP+0x4d61> 00004d61 Before first symbol

Unable to handle kernel paging request at virtual address 3640
c012414f
*pde = 
Oops: 
CPU:0
EIP:    0010:[<c012414f>]
EFLAGS: 00010202
eax: cff4   ebx: 3638   ecx: 0010   edx: cff7ea64
esi: cca70780   edi: cca70824   ebp: 1000   esp: cad1ff40
ds: 0018   es: 0018   ss: 0018
Process cpp (pid: 15018, stackpage=cad1f000)
Stack: 27ad 0809ab20  bfffd900 c125abd8  cff7ea64 0001 
    0001 cca70824 cca70780 c01245f3 cbe42340 cbe42360 cad1ff90 
   c0124530 ffea cbe42340 27ad 1000 17ad 0809bb20  
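Call Trace: [<c01245f3>] [<c0124530>] [<c013029e>] [<c0108f27>]
Code: 39 7b 08 75 f0 8b 74 24 24 39 73 0c 75 e7 53 e8 4d 4d 00 00

>>EIP; c012414f <do_generic_file_read+1af/590>   <=====
Trace; c01245f3 <generic_file_read+63/80>
Trace; c0124530 <file_read_actor+0/60>
Trace; c013029e <sys_read+8e/d0>
Trace; c0108f27 <system_call+33/38>
Code;  c012414f <do_generic_file_read+1af/590>
00000000 <_EIP>:
Code;  c012414f <do_generic_file_read+1af/590>   <=====
   0:   39 7b 08                  cmpl   %edi,0x8(%ebx)   <=====
Code;  c0124152 <do_generic_file_read+1b2/590>
   3:   75 f0                     jne    fffffff5 <_EIP+0xfffffff5> c0124144 <do_generic_file_read+1a4/590>
Code;  c0124154 <do_generic_file_read+1b4/590>
   5:   8b 74 24 24               movl   0x24(%esp,1),%esi
Code;  c0124158 <do_generic_file_read+1b8/590>
   9:   39 73 0c                  cmpl   %esi,0xc(%ebx)
Code;  c012415b <do_generic_file_read+1bb/590>
   c:   75 e7                     jne    fffffff5 <_EIP+0xfffffff5> c0124144 <do_generic_file_read+1a4/590>
Code;  c012415d <do_generic_file_read+1bd/590>
   e:   53                        pushl  %ebx
Code;  c012415e <do_generic_file_read+1be/590>
   f:   e8 4d 4d 00 00            call   4d61 <_EIP+0x4d61> c0128eb0 <age_page_up+0/30>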

Unable to handle kernel paging request at virtual address 3659
c012414f
*pde = 
Oops: 
CPU:0
EIP:    0010:[<c012414f>]
EFLAGS: 00010202
eax: cff4   ebx: 3651   ecx: 0010   edx: cff7ea64
esi: cca70780   edi: cca70824   ebp: 1000   esp: ca31df40
ds: 0018   es: 0018   ss: 0018
Process cpp (pid: 15039, stackpage=ca31d000)
Stack: 27ad 08140b88  bfffe03c c125abd8  cff7ea64 0001 
    0001 cca70824 cca70780 c01245f3 cc5fed40 cc5fed60 ca31df90 
   c0124530 ffea cc5fed40 27ad 1000 17ad 08141b88  
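Call Trace: [<c01245f3>] [<c0124530>] [<c013029e>] [<c0108f27>]
Code: 39 7b 08 75 f0 8b 74 24 24 39 73 0c 75 e7 53 e8 4d 4d 00 00

>>EIP; c012414f <do_generic_file_read+1af/590>   <=====
Trace; c01245f3 <generic_file_read+63/80>
Trace; c0124530 <file_read_actor+0/60>
Trace; c013029e <sys_read+8e/d0>
Trace; c0108f27 <system_call+33/38>
Code;  c012414f <do_generic_file_read+1af/590>
00000000 <_EIP>:
Code;  c012414f <do_generic_file_read+1af/590>   <=====
   0:   39 7b 08                  cmpl   %edi,0x8(%ebx)   <=====
Code;  c0124152 <do_generic_file_read+1b2/590>
   3:   75 f0                     jne    fffffff5 <_EIP+0xfffffff5> c0124144 <do_generic_file_read+1a4/590>
Code;  c0124154 <do_generic_file_read+1b4/590>
   5:   8b 74 24 24               movl   0x24(%esp,1),%esi
Code;  c0124158 <do_generic_file_read+1b8/590>
   9:   39 73 0c                  cmpl   %esi,0xc(%ebx)
Code;  c012415b <do_generic_file_read+1bb/590>
   c:   75 e7                     jne    fffffff5 <_EIP+0xfffffff5> c0124144 <do_generic_file_read+1a4/590>
Code;  c012415d <do_generic_file_read+1bd/590>
   e:   53                        pushl  %ebx
Code;  c012415e <do_generic_file_read+1be/590>
   f:   e8 4d 4d 00 00            call   4d61 <_EIP+0x4d61> c0128eb0 <age_page_up+0/30>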

Unable to handle kernel paging request at virtual address 3663
c012414f
*pde = 
Oops: 
CPU:0
EIP:    0010:[<c012414f>]
EFLAGS: 00010202
eax: cff4   ebx: 365b   ecx: 0010   edx: cff7ea64
esi: cca70780   edi: cca70824   ebp: 1000   esp: cb09df40
ds: 0018   es: 0018   ss: 0018
Process cpp (pid: 15089, stackpage=cb09d000)
Stack: 27ad 0809ab20  bfffd900 c125abd8  cff7ea64 0001 
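
[Archive note] The faulting addresses in the three oopses above follow from
the register dumps: the instruction at EIP is cmpl %edi,0x8(%ebx), so the CPU
touches ebx + 8. The Python sketch below only illustrates that arithmetic. It
is not part of the original post; fault_address() is a made-up helper, and it
assumes the archive dropped leading zeros from the register values (so
"ebx: 3638" is read as 0x00003638). All three oopses report the same EIP with
a small, nearly NULL value in ebx, which fits the poster's remark that the
oopses do not look random.

```python
# Editor's sketch: values copied from the three oops reports above.
# fault_address() is a hypothetical helper, not part of ksymoops.

DISPLACEMENT = 0x8  # from "cmpl %edi,0x8(%ebx)"


def fault_address(ebx, displacement=DISPLACEMENT):
    """Effective address of the memory operand disp(%ebx)."""
    return ebx + displacement


# (ebx, reported faulting virtual address, EIP) for each oops,
# assuming leading zeros were stripped by the archive.
oopses = [
    (0x3638, 0x3640, 0xC012414F),
    (0x3651, 0x3659, 0xC012414F),
    (0x365B, 0x3663, 0xC012414F),
]

for ebx, reported, eip in oopses:
    computed = fault_address(ebx)
    print(f"ebx={ebx:#x} -> fault at {computed:#x} "
          f"(reported {reported:#x}, EIP {eip:#x})")

# True when every oops happened at the same instruction.
same_eip = len({eip for _, _, eip in oopses}) == 1
```

In each case the computed address matches the "virtual address" line of the
corresponding oops.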
   


Signal 11 - revisited

2000-12-17 Thread Rainer Mager

I was wondering if anyone had any new info/suggestions for the Signal 11
problem.

I think I last reported that I had tried 2.4.0test12 with AGPGART and DRM
turned off. This seemed a bit more stable but I did have X crash with
Signal 11 after about 1.5 days.

I'd really appreciate any advice on how to diagnose this.


Thanks,

--Rainer




Re: Signal 11

2000-12-15 Thread Dan Egli

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> Yes. 
> 
> And I realize that somebody inside RedHat really wanted to use a snapshot
> in order to get some C++ code to compile right.
> 
> But it at the same time threw C stability out the window, by using a
> not-very-widely-tested snapshot for a major new release. 
> 
> Are you seriously saying that you think it was a good trade-off? Or are
> you just ashamed of admitting that RH did something stupid?
> 
Pardon the poking in here, but I must say I agree here. RH did a VERY dumb
thing. 

> I have a report from a Sony VAIO user that couldn't compile the CVS X at
> all on his picturebook (and you need to compile the CVS tree in order to
> get required fixes for the ATI Rage Mobility in that machine). I don't
> know the details, but they were apparently due to RH 7 issues. 

It's not in the X tree or anything, but here's a personal example.
Machine: Dual P3 550
HDD: Dual Ultra2Wide Seagate 18GB Hdd
OS: RedHat 7
Compile Target: Linux Kernel 2.2.17
Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
arch tree)
Result with compat-egcs-62: Success on the first try.





Re: Signal 11

2000-12-15 Thread Theodore Y. Ts'o

   Date: Fri, 15 Dec 2000 01:09:29 +0000 (GMT)
   From: Alan Cox <[EMAIL PROTECTED]>

   > > o   We tell vendors to build RPMv3 , glibc 2.1.x
   > Curious HOW do you tell vendors??

   When they ask. More usefully Dan Quinlan and most vendors put together a
   recommended set of things to build with and use. It warns about library
   pitfalls, kernel changes and what packaging is supported. It is far from
   perfect and nothing like the LSB goals but it's a start and following it does
   give you applications that with a bit of care run on everything.

In the interests of making sure everyone understands the history:

The Linux Development Platform Specification (LDPS) was started as a
result of an informal evening post-LSB-meeting gathering in June --- to
which by the way Red Hat didn't send any representatives(*) --- the
discussion at the restaurant started along the lines of "Oh, my *GOD*
RedHat is about to do something stupid --- they're releasing Red Hat 7.0
with beta/snapshots of just about every single critical system component
except the kernel --- and products from vendors who fall into the trap of
developing against Red Hat 7.0 won't work with any other distribution.  This is
going to be *bad* for Linux."

So yes, the reason why LDPS was formed was to recommend to vendors what
they should build and use --- but while Alan gave comments about the
LDPS once it was announced that a group of people were working on the
LDPS , there is no way that the LDPS could even vaguely be considered a
Red Hat initiative.  (The LDPS is a separate work group which is part of
the FSG, so it is a sister group to the LSB effort.)

- Ted

(*) Ever since Jim Kingdon left Red Hat (he was at VA Linux for a while,
and is now at SGI), as far as I know no one at Red Hat is actively
participating in the LSB activities --- they haven't sent anyone to the
physical LSB meetings, or participated in the bi-weekly phone
conferences, or taken work items to help finish the LSB.  Alan does
participate on the mailing lists, and makes quite helpful comments, but
as far as I know that's about the limit to Red Hat's participation to
either the LSB or the LDPS specification work.  Speaking as someone who
has been contributing time and effort to the LSB, it would be great if
Red Hat were to become more fully involved in the LSB; I (and I'm sure
all the other LSB volunteers) would welcome a greater level of
participation by Red Hat.




Re: Signal 11

2000-12-14 Thread Alan Cox

> > o   We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals but it's a start and following it does
give you applications that with a bit of care run on everything.

> > o   Vendors not being stupid understand that they have a bigger market
> > share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan




Re: Signal 11

2000-12-14 Thread Michael Peddemors

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o We tell vendors to build RPMv3 , glibc 2.1.x

Curious HOW do you tell vendors??

> o Vendors not being stupid understand that they have a bigger market
>   share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

-- 

Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com

(604) 589-0037 Beautiful British Columbia, Canada




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article <[EMAIL PROTECTED]>,
Alan Cox  <[EMAIL PROTECTED]> wrote:
>> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
>> And since redhat is _the_ distro that commercial entities use to
>> release software for, this was very arguably a bad move.
>
>Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with a comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.



Re: Signal 11

2000-12-14 Thread Alan Cox

> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o   Someone else moved to 2.95, not RH. In fact some of us felt 2.95 wasn't
fit to ship at the time.

o   We tell vendors to build RPMv3 , glibc 2.1.x

o   Vendors not being stupid understand that they have a bigger market
share if they do that.

Alan




Re: Signal 11

2000-12-14 Thread lamont


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs.  There were multiple different symptoms,
including just sig11s on compiles, corrupted input (leading to syntax
error) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it).  Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks-months later.

I saw a bizarre problem once in a Tyan dual-processor PIII/500 box with
2x256MB ECC RAM: one of the ECC RAM sticks was bad, and repeated
kernel compiles would hang after about 24 hours.  It was a strange problem,
but in troubleshooting it I found that the problem followed this stick of RAM
around to different machines.  I blamed the RAM but don't understand what
the underlying problem was...

On Fri, 8 Dec 2000 [EMAIL PROTECTED] wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > affected other than X, SSH also gets spurious signal 11's now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> 
> 
> 
> I've begun to get a bit paranoid about my K6-2 500 box.
> 
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
> 
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
> 
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
> 
> *boggle*
> 
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
> 
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
> 
> regards,
> 
> Davej.
> 
> 
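
[Archive note] Both reports above describe failures that show up only now and
then under load. One way to put numbers on that is to repeat the same job many
times and count how each run ends. A job that fails identically every time
suggests a deterministic software bug, while a job that mostly succeeds and
occasionally dies on unchanged input is harder to explain that way. The Python
sketch below is an illustration only, not something the posters used;
describe() and tally_runs() are hypothetical helper names, and the command in
the demo is a harmless placeholder for a real build.

```python
import signal
import subprocess
import sys
from collections import Counter


def describe(returncode):
    """Human-readable form of a subprocess return code."""
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit {returncode}"


def tally_runs(cmd, runs):
    """Run cmd `runs` times and count how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[describe(result.returncode)] += 1
    return outcomes


if __name__ == "__main__":
    # Placeholder command; substitute the real workload (e.g. a kernel build).
    print(tally_runs([sys.executable, "-c", "pass"], runs=3))
```

A tally cannot say which component is at fault; it only shows whether the
failures are repeatable.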




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article <[EMAIL PROTECTED]>,
Bernhard Rosenkraenzer  <[EMAIL PROTECTED]> wrote:
>The same thing is true of *any* gcc release.
>For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
>_and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.



Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
> 
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not have C++ binary compatible with
> other distributions).

Yes. 

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release. 

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN.  It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries). 
> 
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user that couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues. 

> So far I was aware about X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

Linus




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not have C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN.  It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries). 

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware about X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.

Jakub



Re: Signal 11

2000-12-14 Thread Alan Cox

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So 
egcs is a different format to all others
2.95 is a different format to all others
2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it we sell packages and support based around
that and our preparedness to support it. 

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. That's the same
guideline the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan
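
[Archive note] "Type punning" here means reading the bytes of an object of one
type as if they were another type. In C this is often done through a cast
pointer, and that is the pattern -fstrict-aliasing allows the compiler to
assume never happens; copying the bytes instead is well defined. Python has no
aliasing rule, so the sketch below shows only the byte-copy reinterpretation
itself, not the compiler behaviour discussed in the thread. float_bits() is a
made-up name for illustration.

```python
import struct


def float_bits(x):
    """Reinterpret the 4 bytes of an IEEE-754 single as an unsigned 32-bit int."""
    raw = struct.pack("<f", x)          # float -> 4 little-endian bytes
    return struct.unpack("<I", raw)[0]  # the same 4 bytes read as uint32


print(hex(float_bits(1.0)))  # 0x3f800000
```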




Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Bernhard Rosenkraenzer wrote:
> >
> > gcc-2.95.2 is at least a real release, from a branch that is actively
> > maintained
> 
> Not very actively.
> Please take the time to compare the activity in gcc_2_95_branch with the
> patches in the current "2.96" version in rawhide.

Take a look at the differences in linux-2.2.x and linux-2.3.x.

linux-2.3.x was a h*ll of a lot more "actively maintained".

But nobody really considers that to be an argument for RedHat (or anybody
else) to install a 2.3.x kernel by default. Sure, most distributions have a
"hacker kernel", but it's NOT installed by default, and it is clearly
marked as experimental.

Your arguments make no sense.

The compiler is often _more_ important to system stability than the
kernel. A "real release" implies that it at least had testing, and that
people know what the problem spots tend to be.

Note that the "know what the problem spots tend to be" is important.

> > As to X compile problems - neither egcs nor 2.95.2 appears to have any
> > trouble with the CVS tree.
> 
> Neither does 2.96-68.

Good. Maybe you'd make it clearer to everybody who installed from your
CD's that they had better upgrade. Pronto.

Linus




Re: Signal 11

2000-12-14 Thread Bernhard Rosenkraenzer

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

The same thing is true of *any* gcc release.
For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
_and_ the upcoming 3.0 release.

> > Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> > buggier than before, but that the bugs are in different places. egcs and gcc295
> > both caused X compile problems too.
>
> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained

Not very actively.
Please take the time to compare the activity in gcc_2_95_branch with the
patches in the current "2.96" version in rawhide.

> - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

It will be incompatible with any non-2.95.x-version, and I don't think
2.96-68 is any more buggy than the current 2.95 branch.
The initial 2.96 "release" did have some odd bugs; all the known ones have
been fixed.

> Or just stay at 2.91.66 (egcs).

This may be good for the kernel, but it's not acceptable for C++.
Also, there's no support for some of the platforms we have to work with,
such as ia64 and S/390 - using different compilers for different
architectures isn't a real solution either.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree.

Neither does 2.96-68.

LLaP
bero





Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Alan Cox wrote:
> 
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> 
> Wrong - the C++ vtable format change is part of the intended progression of the
> compiler and needed to meet standards compliance. gcc 295 also changed the
> internal formats. Unfortunately the gcc295 and 296 formats are both probably
> not the final format. The compiler folks are not willing to guarantee anything
> until gcc 3.0, which may actually be out by the time 2.4 is stable.

If you ask any gcc folks, the main reason they think this was a really
stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
with the 2.95.x release _and_ the upcoming 3.0 release.

Nobody asked the people who knew this, apparently.

> > unusable as a development platform, and I hope RH downgrades their
> > compiler to something that works better RSN.  It apparently has problems
> 
> Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> buggier than before, but that the bugs are in different places. egcs and gcc295
> both caused X compile problems too.

gcc-2.95.2 is at least a real release, from a branch that is actively
maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
many problems as possible _without_ being incompatible like the snapshots
are.

Or just stay at 2.91.66 (egcs).

As to X compile problems - neither egcs nor 2.95.2 appears to have any
trouble with the CVS tree. Possibly because they got fixed, because, after
all, at least those were real releases.

I'd applaud RedHat for making snapshots available, but they should be
marked as SNAPSHOTS, and not as the main compiler with no way to fix the
damn problems it causes.

As it is, anybody doing development is probably better off at RH-6.2.
That is doubly true if they intend to release binaries.

Linus




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 04:42:03AM -0800, Clayton Weaver wrote:
> There has been a thread on the teTeX mailing list the last few days
> about a (RedHat, but probably more general than just their rpms)
> gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile
> 
> unsigned varname; /* "unsigned int varname;" is ok */
> 
> (no problem at -O or no optimization at all, and doesn't happen if teTeX
> is compiled with kgcc).

That one has already been fixed for some time; it was a bug in loop unrolling
(the patch is still pending review for the mainline CVS, though).

Jakub



Re: Signal 11

2000-12-14 Thread Alan Cox

> I don't know why RH decided to do their idiotic gcc-2.96 release (it
> certainly wasn't approved by any technical gcc people - the gcc people

Every single patch in that release, barring I believe two, was accepted into
the main tree. So they liked the code. The naming did upset people and was
unfortunate, but it was done in consultation with the compiler folks at Red Hat,
with the best of intentions behind it. If we had called it 'Red Hat cc' I think
people would have been even more offended at the way they had been discredited.

I do understand why they got peeved, I do understand why they feel no urge
to support the 296 codebase (nor would I want them to). I hit 'd' when I 
see 'I have 2.2.18 patched with [reiserfs|ext3|bigmem|lfs]' for the same
reasons.

> They included another (non-broken) compiler, and called it "kgcc". 
> "kgcc" stands for "kernel gcc", apparently because (a) they realised

kgcc is a convention invented a long time ago by Conectiva. Debian also used
to have gcc272. It is done because

gcc272 is useless at C++, has lots of bugs
egcs is no better at C++ and has lots of bugs
gcc295 is a little better at C++ and is _Crawling_ with bugs
gcc296(redhat) is a lot better at C++ and doesn't appear to be any buggier.

In fact gcc296 is the first compiler that can compile 2.2.16 correctly. All
the previous compilers miscompile the strstr() inline in some cases. That's
why I had to hack the 2.2 kernel tree to make it work. (And the cases where
you got compile time errors gcc was right to moan about - like using (...)
in traditional

> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the

Wrong - the C++ vtable format change is part of the intended progression of the
compiler and needed to meet standards compliance. gcc 295 also changed the
internal formats. Unfortunately the gcc295 and 296 formats are both probably
not the final format. The compiler folks are not willing to guarantee anything
until gcc 3.0, which may actually be out by the time 2.4 is stable.

> unusable as a development platform, and I hope RH downgrades their
> compiler to something that works better RSN.  It apparently has problems

Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
buggier than before, but that the bugs are in different places. egcs and gcc295
both caused X compile problems too.

I still advise people: Use egcs-1.1.2 for Linux 2.2.x. You can build 2.2.18 with
gcc 2.96 but I personally wouldn't be running production systems on a kernel
built that way - but NOT because gcc296 is buggier but because the bugs are
going to be in different places and I firmly believe production system people
should let the loons find them ;)

Alan




Re: Signal 11

2000-12-14 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Clayton Weaver  <[EMAIL PROTECTED]> wrote:
>
>There has been a thread on the teTeX mailing list the last few days
>about a (RedHat, but probably more general than just their rpms)
>gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

Quite frankly, anybody who uses RedHat 7.0 and their broken compiler for
_anything_ is going to have trouble.

I don't know why RH decided to do their idiotic gcc-2.96 release (it
certainly wasn't approved by any technical gcc people - the gcc people
were upset about it too), and I find it even more surprising that they
apparently KNEW that the compiler they were using was completely broken. 
They included another (non-broken) compiler, and called it "kgcc". 

"kgcc" stands for "kernel gcc", apparently because (a) they realised
that a miscompiled kernel is even worse than miscompiling some random
user applications and (b) gcc-2.96 is so broken that it requires special
libraries for C++ vtable chunks handling that is different, so the
_working_ gcc can only be used with programs that do not need such
library support.  Namely the kernel. 

In case it wasn't obvious yet, I consider RedHat-7.0 to be basically
unusable as a development platform, and I hope RH downgrades their
compiler to something that works better RSN.  It apparently has problems
compiling stuff like the CVS snapshots of X etc too (and obviously,
anything you compile under gcc-2.96 is not likely to work anywhere else
except with the broken libraries). 

Linus



Re: Signal 11

2000-12-14 Thread Clayton Weaver

This is unrelated to the signal 11 problem, but something to consider
for "random crashes and segfaults", i.e. whether you are using this compiler
and glibc version combination.

There has been a thread on the teTeX mailing list the last few days
about a (RedHat, but probably more general than just their rpms)
gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

unsigned varname; /* "unsigned int varname;" is ok */

(no problem at -O or no optimization at all, and doesn't happen if teTeX
is compiled with kgcc).

Showed up in the kpathsea library (which began to split paths on
'-' as well as '/' after a user upgraded compiler and libc and
recompiled teTeX).

Regards,

Clayton Weaver
<mailto:[EMAIL PROTECTED]>
(Seattle)

"Everybody's ignorant, just in different subjects."  Will Rogers






Re: Signal 11

2000-12-14 Thread Alan Cox

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So 
egcs is a different format to all others
2.95 is a different format to all others
2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it that we sell packages and support based
around that and our preparedness to support it.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at, Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. Those are the same
guidelines the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not be C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN.  It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries).

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware of X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.

Jakub
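
For readers unfamiliar with the -fstrict-aliasing issue mentioned above, here
is a minimal C sketch of the general pattern. It is not the XFree86 code in
question (that code is not quoted in this thread); the function name
float_bits is made up for illustration, and the sketch assumes a 32-bit
IEEE-754 float. Reading an object through a pointer of an unrelated type is
what the optimizer is entitled to break; copying the bytes with memcpy() is
well defined at any optimization level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * The pattern that -fstrict-aliasing is allowed to break (shown only as a
 * comment, because it is undefined behaviour):
 *
 *     uint32_t bits = *(uint32_t *)&f;    // float object read as uint32_t
 *
 * The portable form below copies the object representation instead.
 * float_bits is a hypothetical name used only for this illustration.
 */
uint32_t float_bits(float f)
{
    uint32_t bits;

    /* Assumption of this sketch: float and uint32_t have the same size. */
    assert(sizeof bits == sizeof f);

    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a platform matching those assumptions, float_bits(1.0f) yields 0x3f800000.
Code that cannot easily be converted can still be built with
-fno-strict-aliasing.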



Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
> 
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not be C++ binary compatible with
> other distributions).

Yes. 

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release. 

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN.  It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries).
> 
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user who couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues.

> So far I was aware of X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

Linus




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article [EMAIL PROTECTED],
Bernhard Rosenkraenzer  [EMAIL PROTECTED] wrote:
>The same thing is true of *any* gcc release.
>For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
>_and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distros.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.



Re: Signal 11

2000-12-14 Thread lamont


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs.  There were multiple different symptoms,
including just sig11s on compiles, corrupted input (leading to syntax
error) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it).  Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks-months later.

I once saw a bizarre problem in a Tyan dual-processor PIII/500 box with
2x256MB ECC RAM: one of the ECC RAM sticks was bad, and repeated
kernel compiles would hang after about 24 hours.  Strange problem, but
I found in troubleshooting it that the problem followed this stick of RAM
around to different machines.  I blamed the RAM but don't understand what
the underlying problem was...

On Fri, 8 Dec 2000 [EMAIL PROTECTED] wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > affected other than X, SSH also gets spurious signal 11's now and again
> > with 2.4 and glibc = 2.1 and it does not occur on 2.2.
> 
> AOL
> 
> I've begun to get a bit paranoid about my K6-2 500 box.
> 
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
> 
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
> 
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
> 
> *boggle*
> 
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
> 
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
> 
> regards,
> 
> Davej.




Re: Signal 11

2000-12-14 Thread Alan Cox

> Yes, but 2.96 is also binary incompatible with all non-redhat distros.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o   Someone else moved to 2.95, not RH. In fact some of us felt 2.95 wasn't
    fit to ship at the time.

o   We tell vendors to build RPMv3, glibc 2.1.x

o   Vendors, not being stupid, understand that they have a bigger market
    share if they do that.

Alan




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article [EMAIL PROTECTED],
Alan Cox  [EMAIL PROTECTED] wrote:
>> Yes, but 2.96 is also binary incompatible with all non-redhat distros.
>> And since redhat is _the_ distro that commercial entities use to
>> release software for, this was very arguably a bad move.
>
>Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with a comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.



Re: Signal 11

2000-12-14 Thread Michael Peddemors

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distros.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o We tell vendors to build RPMv3, glibc 2.1.x

Curious HOW do you tell vendors??

> o Vendors not being stupid understand that they have a bigger market
>   share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

-- 

Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com

(604) 589-0037 Beautiful British Columbia, Canada




Re: Signal 11

2000-12-14 Thread Alan Cox

> > o   We tell vendors to build RPMv3, glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully, Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals, but it's a start, and following it does
give you applications that with a bit of care run on everything.

> > o   Vendors not being stupid understand that they have a bigger market
> >     share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Hint: "ptep_mkdirty()".

 rather obvious oopsie.. once spotted.

> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
> 
> One: we didn't mark the page table entry dirty like we were supposed to.
> 
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
> 
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.

The terminal OOM problem is now gone and I haven't seen a SIGSEGV yet
running virgin source.

IOU 5 bogo$$

-Mike

(I still see something with IKD that _could_ be timing related troubles.
There are a couple of grubby fingerprints I need to wipe off, and some
churn/burn hours to be sure)




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Mike Galbraith wrote:
> > 
> > Not in my test tree.  Same fault, and same trace leading up to it. no
> 
> Ok.
> 
> It definitely looks like a swapoff() problem.
> 
> Have you ever seen the behaviour without running swapoff?

No.

> Also, can you re-create it without running swapon() (if it's something
> like a lost dirty bit, it should be possible to trigger even without the
> swapon, and I'd like to hear if that can happen - if it only happens with
> swapon() and you can't trigger it with just a swapoff() it might be a
> question of re-using some swap file stuff and delaying the writeout or
> whatever).

I'll try loading up swap, swapoff and then doing jobs that fit in ram.

(hmm.. what about inactive_clean list when you do swapoff.. might there
be pages sitting there that are [were] swap cache? reclaim_page=kaboom?)

-Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Err, for those of us who aren't up to our elbows in the kernel code, is
there a patch for this? Presumably this will be rolled into 2.4.0-test13, but
I'd like to try it out now. Also, can someone summarize the fix in English along
with the expected, improved behavior (e.g. Linux will never have a signal 11
again and will never, ever crash ;-) )?

Finally, as soon as there is a patch, can other people who have seen this
problem test it? My problem is so random that I'd need at least a few days
to gain some confidence that this is fixed.


Thanks all.

--Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> Sent: Thursday, December 14, 2000 5:19 AM
> To: Mike Galbraith
> Cc: Kernel Mailing List
> Subject: Re: Signal 11 - the continuing saga
>
>
> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".
>
> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.
>
>   Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Gérard Roudier



On Wed, 13 Dec 2000, Linus Torvalds wrote:

> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that

Poor European Gérard as slim as 1.84 meter - 78 Kg these days.
What about old days poor European Linus versus these days American Linus
on these points ? ;-)

> this explains it.

Really ? :o)

>   Linus

  Gérard.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 11:35:57AM -0800, Linus Torvalds wrote:
> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that
> this explains it.
> 
>   Linus

Good.  Sounds like you guys have a handle on it now.

:-)

Jeff




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Hint: "ptep_mkdirty()".

In case you wonder why the bug was so insidious, what this caused was two
separate problems, both of them able to cause SIGSEGVs.

One: we didn't mark the page table entry dirty like we were supposed to.

Two: by making it writable, we also made the page shared, even if it
wasn't supposed to be shared (so when the next process wrote to the page,
if the swap page was shared with somebody else, the changes would show up
even in the process that _didn't_ write to it).

And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
this. Which was why it hadn't been immediately obvious that anything was
broken.

Linus
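
[Editor's sketch, not part of the original message.] The two symptoms
described above can be illustrated with a toy model. It assumes, as the
message describes, that the helper made the entry writable when it should
have marked it dirty. The toy_* names and the toy_pte_t type are
hypothetical; the bit values are borrowed from the i386 page-table layout
for illustration only. This is not the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Toy page-table entry. The bit values mirror i386 (_PAGE_RW, _PAGE_DIRTY),
 * but every name below is hypothetical; this is not kernel code. */
typedef uint32_t toy_pte_t;
#define TOY_PTE_RW    0x002u
#define TOY_PTE_DIRTY 0x040u

/* Behaviour the message describes: the entry becomes writable, not dirty. */
toy_pte_t toy_mkdirty_buggy(toy_pte_t pte) { return pte | TOY_PTE_RW; }

/* Intended behaviour: only the dirty bit is set. */
toy_pte_t toy_mkdirty_fixed(toy_pte_t pte) { return pte | TOY_PTE_DIRTY; }

int toy_pte_dirty(toy_pte_t pte)    { return (pte & TOY_PTE_DIRTY) != 0; }
int toy_pte_writable(toy_pte_t pte) { return (pte & TOY_PTE_RW) != 0; }

/* Checks both symptoms on a clean, read-only entry. */
void toy_self_check(void)
{
    toy_pte_t clean_ro = 0;

    /* Problem one: the dirty bit is never set. */
    assert(!toy_pte_dirty(toy_mkdirty_buggy(clean_ro)));
    /* Problem two: a read-only (possibly shared) page becomes writable. */
    assert(toy_pte_writable(toy_mkdirty_buggy(clean_ro)));

    /* The intended helper sets dirty and leaves write permission alone. */
    assert(toy_pte_dirty(toy_mkdirty_fixed(clean_ro)));
    assert(!toy_pte_writable(toy_mkdirty_fixed(clean_ro)));
}
```

In the model, a write through the wrongly writable entry would skip the
copy-on-write fault, which is how changes could appear in a process that
never wrote to the page.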




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Mike Galbraith wrote:
> 
> Not in my test tree.  Same fault, and same trace leading up to it. no

Ok.

It definitely looks like a swapoff() problem.

Have you ever seen the behaviour without running swapoff?

Also, can you re-create it without running swapon() (if it's something
like a lost dirty bit, it should be possible to trigger even without the
swapon, and I'd like to hear if that can happen - if it only happens with
swapon() and you can't trigger it with just a swapoff() it might be a
question of re-using some swap file stuff and delaying the writeout or
whatever).

Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Looking at "swapoff()", it could easily be something like
> > 
> >  - swapoff walks through the processes, marking the pages dirty
> >    (correctly)
> >  - swapoff goes on to the next swap entry, and because it needs memory for
> >    this, the VM layer will swap out old entries by marking them dirty in
> >    the "struct page".
> >  - final stages of swapoff() removes the swap cache entry, never minding
> >    the fact that it is marked dirty again in "struct page", and clean in
> >    various VM page tables.
> > 
> > Ho humm.. I don't think that is it exactly, but something along those
> > lines.
> 
> Actually, having thought about it for five more minutes, I actually think
> that that _is_ it.
> 
> If so, the fix looks like it could be really simple. The whole problem
> arises from the fact that we remove the page from the swap cache only
> _after_ we've walked the page-tables to look at it. It looks like the
> fairly trivial fix is simply to remove it from the swap cache before,
> getting rid of all such races in swapoff().
> 
> Mind trying out this patch?
> 
> NOTE! It's untested. It might not work. It might trigger some sanity-test
> somewhere else. But it looks like it should do the right thing (the page
> might be moved to _another_ swap device early, if there are multiple swap
> areas, but even that should be fine - the unuse_process() stuff doesn't
> care about what swapcache this actually is any more).
> 
> Does this patch make a difference (I moved the delete seven lines upwards,
> and removed the test - the test looks extraneous).

Not in my test tree.  Same fault, and same trace leading up to it.
I'll run virgin source hard tomorrow to be sure. (No message means
no change)

-Mike




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Looking at "swapoff()", it could easily be something like
> 
>  - swapoff walks through the processes, marking the pages dirty
>    (correctly)
>  - swapoff goes on to the next swap entry, and because it needs memory for
>    this, the VM layer will swap out old entries by marking them dirty in
>    the "struct page".
>  - final stages of swapoff() removes the swap cache entry, never minding
>    the fact that it is marked dirty again in "struct page", and clean in
>    various VM page tables.
> 
> Ho humm.. I don't think that is it exactly, but something along those
> lines.

Actually, having thought about it for five more minutes, I actually think
that that _is_ it.

If so, the fix looks like it could be really simple. The whole problem
arises from the fact that we remove the page from the swap cache only
_after_ we've walked the page-tables to look at it. It looks like the
fairly trivial fix is simply to remove it from the swap cache before,
getting rid of all such races in swapoff().

Mind trying out this patch?

NOTE! It's untested. It might not work. It might trigger some sanity-test
somewhere else. But it looks like it should do the right thing (the page
might be moved to _another_ swap device early, if there are multiple swap
areas, but even that should be fine - the unuse_process() stuff doesn't
care about what swapcache this actually is any more).

Does this patch make a difference (I moved the delete seven lines upwards,
and removed the test - the test looks extraneous).

Linus


--- v2.4.0-test12/linux/mm/swapfile.c   Tue Oct 31 12:42:27 2000
+++ linux/mm/swapfile.c Wed Dec 13 09:17:51 2000
@@ -370,6 +370,7 @@
                         swap_free(entry);
                         return -ENOMEM;
                 }
+                delete_from_swap_cache(page);
                 read_lock(&tasklist_lock);
                 for_each_task(p)
                         unuse_process(p->mm, entry, page);
@@ -377,8 +378,6 @@
                 shm_unuse(entry, page);
                 /* Now get rid of the extra reference to the temporary
                    page we've been using. */
-                if (PageSwapCache(page))
-                        delete_from_swap_cache(page);
                 page_cache_release(page);
                 /*
                  * Check for and clear any overflowed swap map counts.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Tue, Dec 12, 2000 at 07:17:41PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
> >On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >>I have a tiny bash script that launches a Java swing app. If I run my
> >> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> >> If, however, I try to launch it from my gnome taskbar's menu then it dies
> >> with signal 11 (the Java log is available upon request). This seems to be
> >> 100% consistent, since I noticed it yesterday, even across reboots.
> >> Interestingly, the same behavior occurs if I try to run the program from
> >> within JBuilder 4.
> >>So, is this related to the larger signal 11 problems?
> >
> >There's a corruption bug in the page cache somewhere, and it's 100%
> >reproducible.  Finding it will be tough
> 
> Unlikely. If the actual program data was corrupted, it would SIGSEGV
> regardless of how it's executed.
> 
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
> 
>   Linus

Linus,

I agree that there may be some problem in the code above -- the question is
what has changed to make this behavior emerge?  I see it with a host of
programs (ssh, make, netscape) -- true, all are userspace.  Time permitting,
I may attempt to track this down in ssh and make in jobserver mode.  It
may be related to some interaction that changed underneath.

Jeff





RE: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Mike et al,
> 
>   I have no idea what IKD is and I don't know what to do with any results I
> might find BUT I'd be happy to do this if it will help. Please pass on the
> info with the instructions. Who should I report the results to?

IKD is a debugging toolkit.  The trap I have set up freezes the kernel
trace buffer at SIGSEGV time.  From there you have to read it backward
looking for problems. (which isn't particularly easy).  I was thinking
you wanted to roll your shirt sleeves up and maybe this would help ;-)  

If you want it, and do a trace, I'd be very interested in the last
couple of schedules to compare to my traces.  It's not something you
can just run and report though.

-Mike




Re: Signal 11 - the continuing saga

2000-12-13 Thread CMA

>> From: CMA [mailto:[EMAIL PROTECTED]]
>> Did you already try to selectively disable L1 and L2 caches (if
>> your box has both) and see what happens?
>
>Anyone know how to do this?

If you own a P6-class machine (sorry, but I didn't find your hardware specs in
previous messages), you should be able to enter the BIOS setup and disable L1
and/or L2, usually under "advanced setup".
If you disable L1, the machine will be *much* slower.
If you disable L2, you will notice it under heavy load.
Most of the time sig 11 is due to the (on-chip) L1 cache overheating. Just
checking that the CPU cooling fan is properly seated and spinning solves
the problem.
Regards.
Dr. Eng. Mauro Tassinari
www.c-m-a.it




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Mike et al,

I have no idea what IKD is and I don't know what to do with any results I
might find BUT I'd be happy to do this if it will help. Please pass on the
info with the instructions. Who should I report the results to?



--Rainer

> [mailto:[EMAIL PROTECTED]]On Behalf Of Mike Galbraith
> If you want, I can extract IKD.. which happens to have a trap in place
> for this (because I have a 100% reproducible swap related SIGSEGV that
> I'm trying to figure out).
>
> If you're interested, let me know and I'll extract it (quite large) and
> send it along with instructions on how to do the trap.
>
>   -Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Give that man a cigar... it was an env var (not LOCALE but LANG). I'd
actually checked this but I didn't think that made a difference in my case.
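
[Editor's sketch, not part of the original message.] A launcher could
detect this kind of difference by checking the environment before starting
the program. The following is a minimal sketch in C, assuming the usual
POSIX lookup order (LC_ALL first, then the category variable, here
LC_CTYPE, then LANG). The function names are made up for illustration and
nothing here comes from the Java app or the GNOME menu in question.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() and unsetenv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the first of LC_ALL, LC_CTYPE, LANG that is set and non-empty,
 * or NULL if none is. The name is hypothetical. */
const char *locale_from_env(void)
{
    static const char *const names[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}

/* Exercises the lookup; note that it modifies the process environment. */
void locale_env_self_check(void)
{
    unsetenv("LC_ALL");
    unsetenv("LC_CTYPE");
    unsetenv("LANG");
    assert(locale_from_env() == NULL);      /* like the menu launch */

    setenv("LANG", "en_US", 1);
    assert(strcmp(locale_from_env(), "en_US") == 0);  /* like the xterm */
}
```

A wrapper script could do the same check with a shell test on the LANG
variable and export a default before launching the app.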

Thanks Linus, now can you fix the larger signal 11 problem?

--Rainer


> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
>   Linus


