Re: Signal 11

2000-12-15 Thread Dan Egli

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> Yes. 
> 
> And I realize that somebody inside RedHat really wanted to use a snapshot
> in order to get some C++ code to compile right.
> 
> But it at the same time threw C stability out the window, by using a
> not-very-widely-tested snapshot for a major new release. 
> 
> Are you seriously saying that you think it was a good trade-off? Or are
> you just ashamed of admitting that RH did something stupid?
> 
Pardon the poking in here, but I must say I agree here. RH did a VERY dumb
thing. 

> I have a report from a Sony VAIO user that couldn't compile the CVS X at
> all on his picturebook (and you need to compile the CVS tree in order to
> get required fixes for the ATI Rage Mobility in that machine). I don't
> know the details, but they were apparently due to RH 7 issues. 

It's not in the X tree or anything, but here's a personal example.
Machine: Dual P3 550
HDD: Dual Ultra2Wide Seagate 18GB Hdd
OS: RedHat 7
Compile Target: Linux Kernel 2.2.17
Result with gcc 2.96: Failure (syntax errors in the i386 branch of the
arch tree)
Result with compat-egcs-62: Success on the first try.
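
For anyone who wants to confirm which compiler actually built a given
file, a minimal sketch in C follows. It relies only on the predefined
macros __GNUC__ and __GNUC_MINOR__. The helper names is_gcc_296 and
describe_compiler are made up for illustration, and the sketch assumes
the Red Hat snapshot reports itself as version 2.96 through those macros.

```c
#include <stdio.h>

/* Returns 1 if the given version numbers match the gcc "2.96" snapshot
 * discussed in this thread, 0 otherwise. */
int is_gcc_296(int major, int minor)
{
    return major == 2 && minor == 96;
}

/* Writes a one-line description of the compiler building this file
 * into buf and returns the snprintf() result. */
int describe_compiler(char *buf, size_t len)
{
#if defined(__GNUC__) && defined(__GNUC_MINOR__)
    return snprintf(buf, len, "gcc-compatible %d.%d%s",
                    __GNUC__, __GNUC_MINOR__,
                    is_gcc_296(__GNUC__, __GNUC_MINOR__)
                        ? " (2.96 snapshot)" : "");
#else
    return snprintf(buf, len, "not a gcc-compatible compiler");
#endif
}
```

Printing the result of describe_compiler(buf, sizeof buf) from a small
main() should show at a glance whether a build picked up the default gcc
or the compat compiler.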


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Signal 11

2000-12-15 Thread Theodore Y. Ts'o

   Date: Fri, 15 Dec 2000 01:09:29 +0000 (GMT)
   From: Alan Cox <[EMAIL PROTECTED]>

   > > o We tell vendors to build RPMv3, glibc 2.1.x
   > Curious HOW do you tell vendors??

   When they ask. More usefully Dan Quinlan and most vendors put together a
   recommended set of things to build with and use. It warns about library
   pitfalls, kernel changes and what packaging is supported. It is far from
   perfect and nothing like the LSB goals but it's a start and following it does
   give you applications that with a bit of care run on everything.

In the interests of making sure everyone understands the history:

The Linux Development Platform Specification (LDPS) was started as a
result of an informal evening post-LSB-meeting gathering in June --- to
which by the way Red Hat didn't send any representatives(*) --- the
discussion at the restaurant started along the lines of "Oh, my *GOD*
RedHat is about to do something stupid --- they're releasing Red Hat 7.0
with beta/snapshots of just about every single critical system component
except the kernel --- and vendors who fall into the trap of developing
against Red Hat 7.0 won't work with any other distribution.  This is
going to be *bad* for Linux."

So yes, the reason why LDPS was formed was to recommend to vendors what
they should build and use --- but while Alan gave comments about the
LDPS once it was announced that a group of people were working on the
LDPS , there is no way that the LDPS could even vaguely be considered a
Red Hat initiative.  (The LDPS is a separate work group which is part of
the FSG, so it is a sister group to the LSB effort.)

- Ted

(*) Ever since Jim Kingdon left Red Hat (he was at VA Linux for a while,
and is now at SGI), as far as I know no one at Red Hat is actively
participating in the LSB activities --- they haven't sent anyone to the
physical LSB meetings, or participated in the bi-weekly phone
conferences, or taken work items to help finish the LSB.  Alan does
participate on the mailing lists, and makes quite helpful comments, but
as far as I know that's about the limit of Red Hat's participation in
either the LSB or the LDPS specification work.  Speaking as someone who
has been contributing time and effort to the LSB, it would be great if
Red Hat were to become more fully involved in the LSB; I (and I'm sure
all the other LSB volunteers) would welcome a greater level of
participation by Red Hat.




Re: Signal 11

2000-12-14 Thread Alan Cox

> > o   We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals but it's a start and following it does
give you applications that with a bit of care run on everything.

> > o   Vendors not being stupid understand that they have a bigger market
> > share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan




Re: Signal 11

2000-12-14 Thread Michael Peddemors

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o We tell vendors to build RPMv3 , glibc 2.1.x

Curious HOW do you tell vendors??

> o Vendors not being stupid understand that they have a bigger market
>   share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

-- 

Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com

(604) 589-0037 Beautiful British Columbia, Canada




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article <[EMAIL PROTECTED]>,
Alan Cox  <[EMAIL PROTECTED]> wrote:
>> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
>> And since redhat is _the_ distro that commercial entities use to
>> release software for, this was very arguably a bad move.
>
>Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with a comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.



Re: Signal 11

2000-12-14 Thread Alan Cox

> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o   Someone else moved to 2.95, not RH. In fact some of us felt 2.95 wasn't
fit to ship at the time.

o   We tell vendors to build RPMv3 , glibc 2.1.x

o   Vendors not being stupid understand that they have a bigger market
share if they do that.

Alan




Re: Signal 11

2000-12-14 Thread lamont


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs.  There were multiple different symptoms,
including just sig11s on compiles, corrupted input (leading to syntax
error) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it).  Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks-months later.
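
One cheap way to look for the kind of corrupted-input symptom described
above is to read the same file several times and compare checksums. The
sketch below is only an illustration: the function names are
hypothetical, the checksum is a simple non-cryptographic one, and because
repeated reads are normally served from the buffer cache, agreement
proves little, while a mismatch is a strong hint that memory (or
something else in the data path) is misbehaving.

```c
#include <stdio.h>

/* Computes a simple 32-bit checksum (not cryptographic) over a file.
 * Returns 0 on success and stores the sum in *out, -1 on I/O error. */
int file_checksum(const char *path, unsigned long *out)
{
    unsigned char buf[4096];
    unsigned long sum = 0;
    size_t n, i;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        for (i = 0; i < n; i++)
            sum = (sum * 31 + buf[i]) & 0xffffffffUL;
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = sum;
    return 0;
}

/* Reads the file `passes` times. Returns 0 if every pass produced the
 * same checksum, 1 if any pass differed from the first, -1 on error. */
int compare_repeated_reads(const char *path, int passes)
{
    unsigned long first, cur;
    int i;

    if (file_checksum(path, &first) != 0)
        return -1;
    for (i = 1; i < passes; i++) {
        if (file_checksum(path, &cur) != 0)
            return -1;
        if (cur != first)
            return 1;
    }
    return 0;
}
```

A return value of 1 from compare_repeated_reads() on a file that nothing
is writing to would point at hardware rather than at the compiler.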

I saw a bizarre problem once in a Tyan dual-processor PIII/500 box with
2x256MB ECC RAM: one of the ECC RAM sticks was bad, and repeated
kernel compiles would hang after about 24 hours.  Strange problem, but
in troubleshooting it I found that the problem followed this stick of RAM
around to different machines.  Blamed the RAM but don't understand what
the underlying problem was...

On Fri, 8 Dec 2000 [EMAIL PROTECTED] wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > affected other than X, SSH also gets spurious signal 11's now and again
> > with 2.4 and glibc <= 2.1 and it does not occur on 2.2.
> 
> 
> 
> I've begun to get a bit paranoid about my K6-2 500 box.
> 
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
> 
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
> 
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
> 
> *boggle*
> 
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
> 
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
> 
> regards,
> 
> Davej.
> 
> 




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article <[EMAIL PROTECTED]>,
Bernhard Rosenkraenzer  <[EMAIL PROTECTED]> wrote:
>The same thing is true of *any* gcc release.
>For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
>_and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.



Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
> 
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not have C++ binary compatible with
> other distributions).

Yes. 

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release. 

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN.  It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries). 
> 
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user that couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues. 

> So far I was aware about X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

Linus




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not have C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN.  It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries). 

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware about X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.
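
To make the -fstrict-aliasing point concrete, here is a small C sketch
of the kind of type punning at issue and a well-defined alternative. It
is not taken from the X sources; the function name float_bits is
invented for the example, and it assumes float and unsigned int have the
same size (true on i386).

```c
#include <string.h>

/*
 * The aliasing-unsafe idiom looks like this:
 *
 *     unsigned int bits = *(unsigned int *)&f;
 *
 * It reads a float object through an unsigned int lvalue, which the C
 * aliasing rules do not allow, so a compiler optimising under
 * -fstrict-aliasing is free to assume the two never overlap.
 */

/* Well-defined alternative: copy the object representation.
 * Assumes sizeof(float) == sizeof(unsigned int). */
unsigned int float_bits(float f)
{
    unsigned int bits;

    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Code that cannot be changed right away can instead be built with
-fno-strict-aliasing, which restores the gcc 2.95.2 default mentioned
above.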

Jakub



Re: Signal 11

2000-12-14 Thread Alan Cox

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So 
egcs is a different format to all others
2.95 is a different format to all others
2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it we sell packages and support based around
that and our preparedness to support it. 

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. That's the same
guideline the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan




Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Bernhard Rosenkraenzer wrote:
> >
> > gcc-2.95.2 is at least a real release, from a branch that is actively
> > maintained
> 
> Not very actively.
> Please take the time to compare the activity in gcc_2_95_branch with the
> patches in the current "2.96" version in rawhide.

Take a look at the differences in linux-2.2.x and linux-2.3.x.

linux-2.3.x was a h*ll of a lot more "actively maintained".

But nobody really considers that to be an argument for RedHat (or anybody
else) to install a 2.3.x kernel by default. Sure, most distributions have a
"hacker kernel", but it's NOT installed by default, and it is clearly
marked as experimental.

Your arguments make no sense.

The compiler is often _more_ important to system stability than the
kernel. A "real release" implies that it at least had testing, and that
people know what the problem spots tend to be.

Note that the "know what the problem spots tend to be" is important.

> > As to X compile problems - neither egcs nor 2.95.2 appears to have any
> > trouble with the CVS tree.
> 
> Neither does 2.96-68.

Good. Maybe you'd make it clearer to everybody who installed from your
CD's that they had better upgrade. Pronto.

Linus




Re: Signal 11

2000-12-14 Thread Bernhard Rosenkraenzer

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

The same thing is true of *any* gcc release.
For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
_and_ the upcoming 3.0 release.

> > Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> > buggier than before, but that the bugs are in different places. egcs and gcc295
> > both caused X compile problems too.
>
> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained

Not very actively.
Please take the time to compare the activity in gcc_2_95_branch with the
patches in the current "2.96" version in rawhide.

> - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

It will be incompatible with any non-2.95.x-version, and I don't think
2.96-68 is any more buggy than the current 2.95 branch.
The initial 2.96 "release" did have some odd bugs; all the known ones have
been fixed.

> Or just stay at 2.91.66 (egcs).

This may be good for the kernel, but it's not acceptable for C++.
Also, there's no support for some of the platforms we have to work with,
such as ia64 and S/390 - using different compilers for different
architectures isn't a real solution either.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree.

Neither does 2.96-68.

LLaP
bero





Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Alan Cox wrote:
> 
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> 
> Wrong - the C++ vtable format change is part of the intended progression of the
> compiler and needed to meet standards compliance. gcc 295 also changed the
> internal formats. Unfortunately the gcc295 and 296 formats are both probably
> not the final format. The compiler folks are not willing to guarantee anything
> until gcc 3.0, which may actually be out by the time 2.4 is stable.

If you ask any gcc folks, the main reason they think this was a really
stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
with the 2.95.x release _and_ the upcoming 3.0 release.

Nobody asked the people who knew this, apparently.

> > unusable as a development platform, and I hope RH downgrades their
> > compiler to something that works better RSN.  It apparently has problems
> 
> Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> buggier than before, but that the bugs are in different places. egcs and gcc295
> both caused X compile problems too.

gcc-2.95.2 is at least a real release, from a branch that is actively
maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
many problems as possible _without_ being incompatible like the snapshots
are.

Or just stay at 2.91.66 (egcs).

As to X compile problems - neither egcs nor 2.95.2 appears to have any
trouble with the CVS tree. Possibly because they got fixed, because, after
all, at least those were real releases.

I'd applaud RedHat for making snapshots available, but they should be
marked as SNAPSHOTS, and not as the main compiler with no way to fix the
damn problems it causes.

As it is, anybody doing development is probably better off at RH-6.2.
That is doubly true if they intend to release binaries.

Linus




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 04:42:03AM -0800, Clayton Weaver wrote:
> There has been a thread on the teTeX mailing list the last few days
> about a (RedHat, but probably more general than just their rpms)
> gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile
> 
> unsigned varname; /* "unsigned int varname;" is ok */
> 
> (no problem at -O or no optimization at all, and doesn't happen if teTeX
> is compiled with kgcc).

That one is fixed already for some time, it was a bug in loop unrolling
(that patch is still pending review for the mainline CVS though).

Jakub



Re: Signal 11

2000-12-14 Thread Alan Cox

> I don't know why RH decided to do their idiotic gcc-2.96 release (it
> certainly wasn't approved by any technical gcc people - the gcc people

Every single patch in that release barring I believe 2 was accepted into
the main tree. So they liked the code. The naming did upset people and was
unfortunate, but done talking to the compiler folks at Red Hat with the
best of intentions behind it. If we had called it 'Red Hat cc' I think people
would have been even more offended at the way they had been discredited.

I do understand why they got peeved, I do understand why they feel no urge
to support the 296 codebase (nor would I want them to). I hit 'd' when I 
see 'I have 2.2.18 patched with [reiserfs|ext3|bigmem|lfs]' for the same
reasons.

> They included another (non-broken) compiler, and called it "kgcc". 
> "kgcc" stands for "kernel gcc", apparently because (a) they realised

kgcc is a convention invented a long time ago by Conectiva. Debian also used
to have gcc272. It is done because

gcc272 is useless at C++, has lots of bugs
egcs is no better at C++ and has lots of bugs
gcc295 is a little better at C++ and is _Crawling_ with bugs
gcc296(redhat) is a lot better at C++ and doesn't appear to be any buggier.

In fact gcc296 is the first compiler that can compile 2.2.16 correctly. All
the previous compilers miscompile the strstr() inline in some cases. That's
why I had to hack the 2.2 kernel tree to make it work. (And the cases where
you got compile time errors gcc was right to moan about - like using (...)
in traditional

> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the

Wrong - the C++ vtable format change is part of the intended progression of the
compiler and needed to meet standards compliance. gcc 295 also changed the
internal formats. Unfortunately the gcc295 and 296 formats are both probably
not the final format. The compiler folks are not willing to guarantee anything
until gcc 3.0, which may actually be out by the time 2.4 is stable.

> unusable as a development platform, and I hope RH downgrades their
> compiler to something that works better RSN.  It apparently has problems

Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
buggier than before, but that the bugs are in different places. egcs and gcc295
both caused X compile problems too.

I still advise people: Use egcs-1.1.2 for Linux 2.2.x. You can build 2.2.18 with
gcc 2.96 but I personally wouldn't be running production systems on a kernel
built that way - but NOT because gcc296 is buggier but because the bugs are
going to be in different places and I firmly believe production system people
should let the loons find them ;)

Alan




Re: Signal 11

2000-12-14 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Clayton Weaver  <[EMAIL PROTECTED]> wrote:
>
>There has been a thread on the teTeX mailing list the last few days
>about a (RedHat, but probably more general than just their rpms)
>gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

Quite frankly, anybody who uses RedHat 7.0 and their broken compiler for
_anything_ is going to have trouble.

I don't know why RH decided to do their idiotic gcc-2.96 release (it
certainly wasn't approved by any technical gcc people - the gcc people
were upset about it too), and I find it even more surprising that they
apparently KNEW that the compiler they were using was completely broken. 
They included another (non-broken) compiler, and called it "kgcc". 

"kgcc" stands for "kernel gcc", apparently because (a) they realised
that a miscompiled kernel is even worse than miscompiling some random
user applications and (b) gcc-2.96 is so broken that it requires special
libraries for C++ vtable chunks handling that is different, so the
_working_ gcc can only be used with programs that do not need such
library support.  Namely the kernel. 

In case it wasn't obvious yet, I consider RedHat-7.0 to be basically
unusable as a development platform, and I hope RH downgrades their
compiler to something that works better RSN.  It apparently has problems
compiling stuff like the CVS snapshots of X etc too (and obviously,
anything you compile under gcc-2.96 is not likely to work anywhere else
except with the broken libraries). 

Linus



Re: Signal 11

2000-12-14 Thread Clayton Weaver

This is unrelated to the signal 11 problem, but something to consider
for "random crashes and segfaults", i.e. are you using this compiler
and glibc version combination.

There has been a thread on the teTeX mailing list the last few days
about a (RedHat, but probably more general than just their rpms)
gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile

unsigned varname; /* "unsigned int varname;" is ok */

(no problem at -O or no optimization at all, and doesn't happen if teTeX
is compiled with kgcc).

Showed up in the kpathsea library (which began to split paths on
'-' as well as '/' after a user upgraded compiler and libc and
recompiled teTeX).
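
A self-check along the following lines could help tell a miscompile from
a source bug: build the same function at -O and at -O2 and compare the
results. The actual kpathsea code is not quoted in this thread, so
count_components below is a purely hypothetical stand-in that splits
only on '/' and uses a plain "unsigned" counter, the declaration style
mentioned above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a path splitter: counts the components of
 * `path`, treating only '/' as a separator. A '-' inside a component
 * must never start a new one. */
unsigned count_components(const char *path)
{
    unsigned count = 0;
    size_t i = 0;

    while (path[i] != '\0') {
        /* Skip any run of separators. */
        while (path[i] == '/')
            i++;
        if (path[i] == '\0')
            break;
        count++;
        /* Consume one component up to the next separator. */
        while (path[i] != '\0' && path[i] != '/')
            i++;
    }
    return count;
}
```

If a path such as "/usr/share/texmf-dist/tex" yields 4 at one
optimisation level and something else at another, the compiler is the
more likely suspect.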

Regards,

Clayton Weaver

(Seattle)

"Everybody's ignorant, just in different subjects."  Will Rogers






Re: Signal 11

2000-12-14 Thread Alan Cox

> I don't know why RH decided to do their idiotic gcc-2.96 release (it
> certainly wasn't approved by any technical gcc people - the gcc people

Every single patch in that release barring I believe 2 was accepted into
the main tree. So they liked the code. The naming did upset people and was
unfortunate, but done talking to the compiler folks at Red Hat with the
best of intentions behind it. If we had called it 'Red Hat cc' I think people
would have been even more offended at the way they had been discredited.

I do understand why they got peeved, I do understand why they feel no urge
to support the 296 codebase (nor would I want them to). I hit 'd' when I 
see 'I have 2.2.18 patched with [reiserfs|ext3|bigmem|lfs]' for the same
reasons.

> They included another (non-broken) compiler, and called it "kgcc".
> "kgcc" stands for "kernel gcc", apparently because (a) they realised

kgcc is a convention invented a long time ago by Conectiva. Debian also used
to have gcc272. It is done because

gcc272 is useless at C++, has lots of bugs
egcs is no better at C++ and has lots of bugs
gcc295 is a little better at C++ and is _Crawling_ with bugs
gcc296(redhat) is a lot better at C++ and doesn't appear to be any buggier.

In fact gcc296 is the first compiler that can compile 2.2.16 correctly. All
the previous compilers miscompile the strstr() inline in some cases. That's
why I had to hack the 2.2 kernel tree to make it work. (And the cases where
you got compile time errors gcc was right to moan about - like using (...)
in traditional

> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the

Wrong - the C++ vtable format change is part of the intended progression of the
compiler and needed to meet standards compliance. gcc 295 also changed the
internal formats. Unfortunately the gcc295 and 296 formats are both probably
not the final format. The compiler folks are not willing to guarantee anything
until gcc 3.0, which may actually be out by the time 2.4 is stable.

> unusable as a development platform, and I hope RH downgrades their
> compiler to something that works better RSN.  It apparently has problems

Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
buggier than before, but that the bugs are in different places. egcs and gcc295
both caused X compile problems too.

I still advise people: Use egcs-1.1.2 for Linux 2.2.x. You can build 2.2.18 with
gcc 2.96 but I personally wouldn't be running production systems on a kernel
built that way - but NOT because gcc296 is buggier but because the bugs are
going to be in different places and I firmly believe production system people
should let the loons find them ;)

Alan




Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Alan Cox wrote:
 
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
>
> Wrong - the C++ vtable format change is part of the intended progression of the
> compiler and needed to meet standards compliance. gcc 295 also changed the
> internal formats. Unfortunately the gcc295 and 296 formats are both probably
> not the final format. The compiler folks are not willing to guarantee anything
> until gcc 3.0, which may actually be out by the time 2.4 is stable.

If you ask any gcc folks, the main reason they think this was a really
stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
with the 2.95.x release _and_ the upcoming 3.0 release.

Nobody asked the people who knew this, apparently.

> > unusable as a development platform, and I hope RH downgrades their
> > compiler to something that works better RSN.  It apparently has problems
>
> Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> buggier than before, but that the bugs are in different places. egcs and gcc295
> both caused X compile problems too.

gcc-2.95.2 is at least a real release, from a branch that is actively
maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
many problems as possible _without_ being incompatible like the snapshots
are.

Or just stay at 2.91.66 (egcs).

As to X compile problems - neither egcs nor 2.95.2 appears to have any
trouble with the CVS tree. Possibly because they got fixed, because, after
all, at least those were real releases.

I'd applaud RedHat for making snapshots available, but they should be
marked as SNAPSHOTS, and not as the main compiler with no way to fix the
damn problems it causes.

As it is, anybody doing development is probably better off at RH-6.2.
That is doubly true if they intend to release binaries.

Linus




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 04:42:03AM -0800, Clayton Weaver wrote:
> There has been a thread on the teTeX mailing list the last few days
> about a (RedHat, but probably more general than just their rpms)
> gcc-2.96 w/glibc-2.2.x bug. At -O2, it can miscompile
>
> unsigned varname; /* "unsigned int varname;" is ok */
>
> (no problem at -O or no optimization at all, and doesn't happen if teTeX
> is compiled with kgcc).

That one is fixed already for some time, it was a bug in loop unrolling
(that patch is still pending review for the mainline CVS though).

Jakub



Re: Signal 11

2000-12-14 Thread Bernhard Rosenkraenzer

On Thu, 14 Dec 2000, Linus Torvalds wrote:

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

The same thing is true of *any* gcc release.
For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
_and_ the upcoming 3.0 release.

> > Like what - gcc 2.5.8 ? The problem is not in general that the snapshot is any
> > buggier than before, but that the bugs are in different places. egcs and gcc295
> > both caused X compile problems too.

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained

Not very actively.
Please take the time to compare the activity in gcc_2_95_branch with the
patches in the current "2.96" version in rawhide.

> - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

It will be incompatible with any non-2.95.x-version, and I don't think
2.96-68 is any more buggy than the current 2.95 branch.
The initial 2.96 "release" did have some odd bugs; all the known ones have
been fixed.

> Or just stay at 2.91.66 (egcs).

This may be good for the kernel, but it's not acceptable for C++.
Also, there's no support for some of the platforms we have to work with,
such as ia64 and S/390 - using different compilers for different
architectures isn't a real solution either.

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree.

Neither does 2.96-68.

LLaP
bero





Re: Signal 11

2000-12-14 Thread Alan Cox

> If you ask any gcc folks, the main reason they think this was a really
> stupid thing to do was exactly that the 2.96 thing is incompatible BOTH
> with the 2.95.x release _and_ the upcoming 3.0 release.

And with egcs 1.1.2. So 
egcs is a different format to all others
2.95 is a different format to all others
2.96 is a different format to all others

and 2.96 is a C++ compiler

> gcc-2.95.2 is at least a real release, from a branch that is actively
> maintained - so a 2.95.3 is likely to happen reasonably soon, fixing as
> many problems as possible _without_ being incompatible like the snapshots
> are.

The 2.96 tree is maintained actively. Updates for the Red Hat 7 packages
are being worked on and CygnusHat people are working on both that maintenance
and on feeding all they find back to the core gcc team.

In fact we have sufficient faith in it we sell packages and support based around
that and our preparedness to support it. 

> As to X compile problems - neither egcs nor 2.95.2 appears to have any
> trouble with the CVS tree. Possibly because they got fixed, because, after
> all, at least those were real releases.

I asked Jakub. He's confused as to your report. As far as he is aware the only
X problems in the CVS tree were related to XFree86 source code bugs misusing
type punning. If you have a case to look at, Jakub would love to hear about it
and fix either X or gcc.

> I'd applaud RedHat for making snapshots available, but they should be
> marked as SNAPSHOTS, and not as the main compiler with no way to fix the
> damn problems it causes.

That it was confusing and mistaken by some as an official GNU group release
is something we never intended and have already apologised for. It was done
without malice or ill intent.

> As it is, anybody doing development is probably better off at RH-6.2.
> That is doubly true if they intend to release binaries.

We strongly recommend that people use 6.2 for developing binaries for general
release unless they have specific requirements for glibc 2.2. Those are the same
guidelines the LSB 'oops we haven't finished yet here is a quickie for now'
documentation recommends.

Similarly RPM packaging using RPMv3 is recommended.

Alan




Re: Signal 11

2000-12-14 Thread Jakub Jelinek

On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> user applications and (b) gcc-2.96 is so broken that it requires special
> libraries for C++ vtable chunks handling that is different, so the
> _working_ gcc can only be used with programs that do not need such
> library support.

Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
even if we used g++ 2.95.2 we would not have C++ binary compatible with
other distributions).
This will change once 3.0 is out, but it will still take some time.

> compiler to something that works better RSN.  It apparently has problems
> compiling stuff like the CVS snapshots of X etc too (and obviously,
> anything you compile under gcc-2.96 is not likely to work anywhere else
> except with the broken libraries).

Can you point to things in X which were actually miscompiled because of bugs
in gcc 2.96? So far I was aware about X bugs (already fixed in X CVS) which
were triggered with -fstrict-aliasing which is now the default while
gcc 2.95.2 had -fstrict-aliasing disabled by default.
That is not to say there were not bugs in the gcc we shipped, but the bugs
which were reported against it have been fixed already.

Jakub



Re: Signal 11

2000-12-14 Thread Linus Torvalds



On Thu, 14 Dec 2000, Jakub Jelinek wrote:

> On Thu, Dec 14, 2000 at 11:11:28AM -0800, Linus Torvalds wrote:
> > user applications and (b) gcc-2.96 is so broken that it requires special
> > libraries for C++ vtable chunks handling that is different, so the
> > _working_ gcc can only be used with programs that do not need such
> > library support.
>
> Every major g++ release had incompatible libstdc++, even g++ 2.95.2 if
> bootstrapped under glibc 2.1.x is binary incompatible with g++ 2.95.2
> bootstrapped under glibc 2.2.x (libstdc++ uses different soname then;
> even if we used g++ 2.95.2 we would not have C++ binary compatible with
> other distributions).

Yes. 

And I realize that somebody inside RedHat really wanted to use a snapshot
in order to get some C++ code to compile right.

But it at the same time threw C stability out the window, by using a
not-very-widely-tested snapshot for a major new release. 

Are you seriously saying that you think it was a good trade-off? Or are
you just ashamed of admitting that RH did something stupid?

> > compiler to something that works better RSN.  It apparently has problems
> > compiling stuff like the CVS snapshots of X etc too (and obviously,
> > anything you compile under gcc-2.96 is not likely to work anywhere else
> > except with the broken libraries).
>
> Can you point to things in X which were actually miscompiled because of bugs
> in gcc 2.96?

I have a report from a Sony VAIO user that couldn't compile the CVS X at
all on his picturebook (and you need to compile the CVS tree in order to
get required fixes for the ATI Rage Mobility in that machine). I don't
know the details, but they were apparently due to RH 7 issues. 

> So far I was aware about X bugs (already fixed in X CVS) which
> were triggered with -fstrict-aliasing which is now the default while
> gcc 2.95.2 had -fstrict-aliasing disabled by default.

I hope that's another thing that the gcc people fix by the time they do a
_real_ release. Anybody who thinks that "-fstrict-aliasing" being on by
default is a good idea is probably a compiler person who hasn't seen real
code.

> That is not to say there were not bugs in the gcc we shipped, but the bugs
> which were reported against it have been fixed already.

That's good.

It's even better if you don't play quite as fast-and-loose with your
shipping compiler.

Linus




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article [EMAIL PROTECTED],
Bernhard Rosenkraenzer  [EMAIL PROTECTED] wrote:
> The same thing is true of *any* gcc release.
> For example, C++-ABI wise, 2.95.x is incompatible BOTH with egcs 1.1.x
> _and_ the upcoming 3.0 release.

Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
And since redhat is _the_ distro that commercial entities use to
release software for, this was very arguably a bad move.

There's simply no excuse. It's too obvious.

Mike.



Re: Signal 11

2000-12-14 Thread lamont


I had tons of problems with K6III/450s in ASUS P5A motherboards with
various kinds of 128MB SIMMs.  There were multiple different symptoms,
including just sig11s on compiles, corrupted input (leading to syntax
error) in compiles, and corrupted input in the buffer cache (same crash
over and over, but dd if=/dev/hda of=/dev/null bs=1024k count=128 fixed
it).  Swapping the memory would sometimes get rid of the problem, but then
it would come back weeks-months later.

I saw a bizarre problem once in a Tyan dual proc PIII/500 box with
2x256MB ECC RAM that one of the ECC RAM sticks was bad and that repeated
kernel compiles would hang after about 24 hours.  Strange problem, but
found that in troubleshooting it, the problem followed this stick of RAM
around to different machines.  Blamed the RAM but don't understand what
the underlying problem was...

On Fri, 8 Dec 2000 [EMAIL PROTECTED] wrote:
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
>
> > It's related to some change in 2.4 vs. 2.2.  There are other programs
> > affected other than X, SSH also get's spurious signal 11's now and again
> > with 2.4 and glibc = 2.1 and it does not occur on 2.2.
>
> AOL
>
> I've begun to get a bit paranoid about my K6-2 500 box.
>
> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.
>
> I replaced the CPU with a brand new K6-2. Problem remained.
> Next suspect was faulty RAM. Despite having passed a memtest, I
> swapped out the DIMMs for some known good ones.
> Suspecting cooling problems, I added some case fans.
> Next came a bigger power supply. Still the problems.
> The latest last ditch attempt to make this box stable has been
> to attach the biggest fan I could find that would fit a socket 7 CPU.
>
> And still the problems are there.
> The only remaining suspect would be a flaky motherboard.
> But then comes the real killer : This box is rock solid under 2.2
>
> *boggle*
>
> I'm not sure exactly when this started, but I think I first noticed
> it around test5 or so, but didn't suspect the kernel at the time.
>
> I've tried kernels compiled with everything from 2.91.66 when this
> was a Redhat box, to gcc 2.95.2 (from Debian woody) when I installed
> debian on it.  If this is a compiler bug, it's one that no compiler
> I've tried seems to be immune from.
>
> regards,
>
> Davej.
 
 




Re: Signal 11

2000-12-14 Thread Alan Cox

> Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> And since redhat is _the_ distro that commercial entities use to
> release software for, this was very arguably a bad move.

Except you conveniently ignore a few facts

o   Someone else moved to 2.95 not RH . In fact some of us felt 2.95 wasn't
fit to ship at the time. 

o   We tell vendors to build RPMv3 , glibc 2.1.x

o   Vendors not being stupid understand that they have a bigger market
share if they do that.

Alan




Re: Signal 11

2000-12-14 Thread Miquel van Smoorenburg

In article [EMAIL PROTECTED],
Alan Cox  [EMAIL PROTECTED] wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.
>
> Except you conveniently ignore a few facts

Doesn't everyone. I should have included a smiley with a comment
that I was only half-joking. Anyway this is the kernel list, and
as such this is becoming off-topic.

Mike.



Re: Signal 11

2000-12-14 Thread Michael Peddemors

Sticking my nose where it doesn't belong...

On Thu, 14 Dec 2000, Alan Cox wrote:
> > Yes, but 2.96 is also binary incompatible with all non-redhat distro's.
> > And since redhat is _the_ distro that commercial entities use to
> > release software for, this was very arguably a bad move.

> o We tell vendors to build RPMv3 , glibc 2.1.x

Curious HOW do you tell vendors??

> o Vendors not being stupid understand that they have a bigger market
>   share if they do that.

Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

-- 

Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com

(604) 589-0037 Beautiful British Columbia, Canada




Re: Signal 11

2000-12-14 Thread Alan Cox

> > o   We tell vendors to build RPMv3 , glibc 2.1.x
> Curious HOW do you tell vendors??

When they ask. More usefully, Dan Quinlan and most vendors put together a
recommended set of things to build with and use. It warns about library
pitfalls, kernel changes and what packaging is supported. It is far from
perfect and nothing like the LSB goals but it's a start, and following it does
give you applications that with a bit of care run on everything.

> > o   Vendors not being stupid understand that they have a bigger market
> > share if they do that.
> Ummm.. I remember Oracle's first release... wasn't it JUST redhat??

I believe so, and Adabas was SuSE only, and I doubt either vendor wanted it
that way. Both actually ran fine on the other but were not supported.

Alan




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Hint: "ptep_mkdirty()".

 rather obvious oopsie.. once spotted.

> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
> 
> One: we didn't mark the page table entry dirty like we were supposed to.
> 
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
> 
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.

The terminal OOM problem is now gone and I haven't seen a SIGSEGV yet
running virgin source.

IOU 5 bogo$$

-Mike

(I still see something with IKD that _could_ be timing related troubles.
There are a couple of grubby fingerprints I need to wipe off, and some
churn/burn hours to be sure)




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Mike Galbraith wrote:
> > 
> > Not in my test tree.  Same fault, and same trace leading up to it. no
> 
> Ok.
> 
> It definitely looks like a swapoff() problem.
> 
> Have you ever seen the behaviour without running swapoff?

No.

> Also, can you re-create it without running swapon() (if it's something
> like a lost dirty bit, it should be possible to trigger even without the
> swapon, and I'd like to hear if that can happen - if it only happens with
> swapon() and you can't trigger it with just a swapoff() it might be a
> question of re-using some swap file stuff and delaying the writeout or
> whatever).

I'll try loading up swap, swapoff and then doing jobs that fit in ram.

(hmm.. what about inactive_clean list when you do swapoff.. might there
be pages sitting there that are [were] swap cache? reclaim_page=kaboom?)

-Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Err, for those of us who aren't up to our elbows in the kernel code, is
there a patch for this? Presumably this will be rolled into 2.4.0test13 but
I'd like to try it out. Also, can someone summarize the fix in English along
with the expected, improved behavior (e.g. Linux will never have a signal 11
again and will never, ever crash ;-)

Finally, as soon as there is a patch, can other people who have seen this
problem test it? My problem is so random that I'd need at least a few days
to gain some confidence this is fixed.


Thanks all.

--Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> Sent: Thursday, December 14, 2000 5:19 AM
> To: Mike Galbraith
> Cc: Kernel Mailing List
> Subject: Re: Signal 11 - the continuing saga
>
>
> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".
>
> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.
>
>   Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Gérard Roudier



On Wed, 13 Dec 2000, Linus Torvalds wrote:

> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that

Poor European Gérard as slim as 1.84 meter - 78 Kg these days.
What about old days poor European Linus versus these days American Linus
on these points ? ;-)

> this explains it.

Really ? :o)

>   Linus

  Gérard.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 11:35:57AM -0800, Linus Torvalds wrote:
> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that
> this explains it.
> 
>   Linus

Good.  Sounds like you guys have a handle on it now.

:-)

Jeff




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Hint: "ptep_mkdirty()".

In case you wonder why the bug was so insidious, what this caused was two
separate problems, both of them able to cause SIGSEGVs.

One: we didn't mark the page table entry dirty like we were supposed to.

Two: by making it writable, we also made the page shared, even if it
wasn't supposed to be shared (so when the next process wrote to the page,
if the swap page was shared with somebody else, the changes would show up
even in the process that _didn't_ write to it).

And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
this. Which was why it hadn't been immediately obvious that anything was
broken.

Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Mike Galbraith wrote:
> 
> Not in my test tree.  Same fault, and same trace leading up to it. no

Ok.

It definitely looks like a swapoff() problem.

Have you ever seen the behaviour without running swapoff?

Also, can you re-create it without running swapon() (if it's something
like a lost dirty bit, it should be possible to trigger even without the
swapon, and I'd like to hear if that can happen - if it only happens with
swapon() and you can't trigger it with just a swapoff() it might be a
question of re-using some swap file stuff and delaying the writeout or
whatever).

Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Looking at "swapoff()", it could easily be something like
> > 
> >  - swapoff walks through the processes, marking the pages dirty
> >    (correctly)
> >  - swapoff goes on to the next swap entry, and because it needs memory for
> >    this, the VM layer will swap out old entries by marking them dirty in
> >    the "struct page".
> >  - final stages of swapoff() removes the swap cache entry, never minding
> >    the fact that it is marked dirty again in "struct page", and clean in
> >    various VM page tables.
> > 
> > Ho humm.. I don't think that is it exactly, but something along those
> > lines.
> 
> Actually, having thought about it for five more minutes, I actually think
> that that _is_ it.
> 
> If so, the fix looks like it could be really simple. The whole problem
> arises from the fact that we remove the page from the swap cache only
> _after_ we've walked the page-tables to look at it. It looks like the
> fairly trivial fix is simply to remove it from the swap cache before,
> getting rid of all such races in swapoff().
> 
> Mind trying out this patch?
> 
> NOTE! It's untested. It might not work. It might trigger some sanity-test
> somewhere else. But it looks like it should do the right thing (the page
> might be moved to _another_ swap device early, if there are multiple swap
> areas, but even that should be fine - the unuse_process() stuff doesn't
> care about what swapcache this actually is any more).
> 
> Does this patch make a difference (I moved the delete seven lines upwards,
> and removed the test - the test looks extraneous).

Not in my test tree.  Same fault, and same trace leading up to it.
I'll run virgin source hard tomorrow to be sure. (No message means
no change)

-Mike




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Looking at "swapoff()", it could easily be something like
> 
>  - swapoff walks through the processes, marking the pages dirty
>    (correctly)
>  - swapoff goes on to the next swap entry, and because it needs memory for
>    this, the VM layer will swap out old entries by marking them dirty in
>    the "struct page".
>  - final stages of swapoff() removes the swap cache entry, never minding
>    the fact that it is marked dirty again in "struct page", and clean in
>    various VM page tables.
> 
> Ho humm.. I don't think that is it exactly, but something along those
> lines.

Actually, having thought about it for five more minutes, I actually think
that that _is_ it.

If so, the fix looks like it could be really simple. The whole problem
arises from the fact that we remove the page from the swap cache only
_after_ we've walked the page-tables to look at it. It looks like the
fairly trivial fix is simply to remove it from the swap cache before,
getting rid of all such races in swapoff().

Mind trying out this patch?

NOTE! It's untested. It might not work. It might trigger some sanity-test
somewhere else. But it looks like it should do the right thing (the page
might be moved to _another_ swap device early, if there are multiple swap
areas, but even that should be fine - the unuse_process() stuff doesn't
care about what swapcache this actually is any more).

Does this patch make a difference (I moved the delete seven lines upwards,
and removed the test - the test looks extraneous).

Linus


--- v2.4.0-test12/linux/mm/swapfile.c   Tue Oct 31 12:42:27 2000
+++ linux/mm/swapfile.c Wed Dec 13 09:17:51 2000
@@ -370,6 +370,7 @@
 			swap_free(entry);
 			return -ENOMEM;
 		}
+		delete_from_swap_cache(page);
 		read_lock(&tasklist_lock);
 		for_each_task(p)
 			unuse_process(p->mm, entry, page);
@@ -377,8 +378,6 @@
 		shm_unuse(entry, page);
 		/* Now get rid of the extra reference to the temporary
 		   page we've been using. */
-		if (PageSwapCache(page))
-			delete_from_swap_cache(page);
 		page_cache_release(page);
 		/*
 		 * Check for and clear any overflowed swap map counts.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Tue, Dec 12, 2000 at 07:17:41PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
> >On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >> I have a tiny bash script that launches a Java swing app. If I run my
> >> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> >> If, however, I try to launch it from my gnome taskbar's menu then it dies
> >> with signal 11 (the Java log is available upon request). This seems to be
> >> 100% consistent, since I noticed it yesterday, even across reboots.
> >> Interestingly, the same behavior occurs if I try to run the program from
> >> within JBuilder 4.
> >> So, is this related to the larger signal 11 problems?
> >
> >There's a corruption bug in the page cache somewhere, and it's 100%
> >reproducible.  Finding it will be tough
> 
> Unlikely. If the actual program data was corrupted, it would SIGSEGV
> regardless of how it's executed.
> 
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
> 
>   Linus

Linus,

I agree that there may be some problem in the code above -- the question is
what has changed to make this behavior emerge?  I see it with a host of
programs (ssh, make, netscape) -- true, all are userspace.  Time permitting,
I may attempt to track this down in ssh and make in jobserver mode.  It
may be related to some interaction that changed underneath.

Jeff





RE: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Mike et al,
> 
>   I have no idea what IKD is and I don't know what to do with any results I
> might find BUT I'd be happy to do this if it will help. Please pass on the
> info with the instructions. Who should I report the results to?

IKD is a debugging toolkit.  The trap I have set up freezes the kernel
trace buffer at SIGSEGV time.  From there you have to read it backward
looking for problems. (which isn't particularly easy).  I was thinking
you wanted to roll your shirt sleeves up and maybe this would help ;-)  

If you want it, and do a trace, I'd be very interested in the last
couple of schedules to compare to my traces.  It's not something you
can just run and report though.

-Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Mike et al,

I have no idea what IKD is and I don't know what to do with any results I
might find BUT I'd be happy to do this if it will help. Please pass on the
info with the instructions. Who should I report the results to?



--Rainer

> [mailto:[EMAIL PROTECTED]]On Behalf Of Mike Galbraith
> If you want, I can extract IKD.. which happens to have a trap in place
> for this (because I have a 100% reproducible swap related SIGSEGV that
> I'm trying to figure out).
>
> If you're interested, let me know and I'll extract it (quite large) and
> send it along with instructions on how to do the trap.
>
>   -Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Give that man a cigar... it was an env var (not LOCALE but LANG). I'd
actually checked this but I didn't think that made a difference in my case.

Thanks Linus, now can you fix the larger signal 11 problem?

--Rainer


> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
>   Linus



Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Hint: "ptep_mkdirty()".

In case you wonder why the bug was so insidious, what this caused was two
separate problems, both of them able to cause SIGSEGVs.

One: we didn't mark the page table entry dirty like we were supposed to.

Two: by making it writable, we also made the page shared, even if it
wasn't supposed to be shared (so when the next process wrote to the page,
if the swap page was shared with somebody else, the changes would show up
even in the process that _didn't_ write to it).

And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
this. Which was why it hadn't been immediately obvious that anything was
broken.

Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 11:35:57AM -0800, Linus Torvalds wrote:
> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that
> this explains it.
> 
>   Linus

Good.  Sounds like you guys have a handle on it now.

:-)

Jeff

 



Re: Signal 11 - the continuing saga

2000-12-13 Thread Gérard Roudier



On Wed, 13 Dec 2000, Linus Torvalds wrote:

> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that

Poor European Gérard as slim as 1.84 meter - 78 Kg these days.
What about old days poor European Linus versus these days American Linus
on these points ? ;-)

> this explains it.

Really ? :o)

>   Linus

  Gérard.




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Err, for those of us who aren't up to our elbows in the kernel code, is
there a patch for this? Presumably this will be rolled into 2.4.0test13 but
I'd like to try it out. Also, can someone summarize the fix in English along
with the expected, improved behavior (e.g. Linux will never have a signal 11
again and will never, ever crash ;-)

Finally, as soon as there is a patch, can other people who have seen this
problem test it? My problem is so random that I'd need at least a few days
to gain some confidence this is fixed.


Thanks all.

--Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> Sent: Thursday, December 14, 2000 5:19 AM
> To: Mike Galbraith
> Cc: Kernel Mailing List
> Subject: Re: Signal 11 - the continuing saga
>
>
> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".
>
> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.
>
>   Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Mike Galbraith wrote:
> > 
> > Not in my test tree.  Same fault, and same trace leading up to it. no
> 
> Ok.
> 
> It definitely looks like a swapoff() problem.
> 
> Have you ever seen the behaviour without running swapoff?

No.

> Also, can you re-create it without running swapon() (if it's something
> like a lost dirty bit, it should be possible to trigger even without the
> swapon, and I'd like to hear if that can happen - if it only happens with
> swapon() and you can't trigger it with just a swapoff() it might be a
> question of re-using some swap file stuff and delaying the writeout or
> whatever).

I'll try loading up swap, swapoff and then doing jobs that fit in ram.

(hmm.. what about inactive_clean list when you do swapoff.. might there
be pages sitting there that are [were] swap cache? reclaim_page=kaboom?)

-Mike




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Hint: "ptep_mkdirty()".

<g> rather obvious oopsie.. once spotted.

> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
> 
> One: we didn't mark the page table entry dirty like we were supposed to.
> 
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
> 
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.

The terminal OOM problem is now gone and I haven't seen a SIGSEGV yet
running virgin source.

IOU 5 bogo$$

-Mike

(I still see something with IKD that _could_ be timing related troubles.
There are a couple of grubby fingerprints I need to wipe off, and some
churn/burn hours to be sure)




RE: Signal 11 - the continuing saga

2000-12-12 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Thanks for the info...
> 
> > [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > >   So, is this related to the larger signal 11 problems?
> >
> > There's a corruption bug in the page cache somewhere, and it's 100%
> > reproducible.  Finding it will be tough
> 
> Ok, granted this will be tough but is anyone even actively working on it?
> What can I do to help?

If you want, I can extract IKD.. which happens to have a trap in place
for this (because I have a 100% reproducable swap related SIGSEGV that
I'm trying to figure out). 

If you're interested, let me know and I'll extract it (quite large) and
send it along with instructions on how to do the trap.

-Mike




Re: Signal 11 - the continuing saga

2000-12-12 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
>On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
>>  I have a tiny bash script that launches a Java swing app. If I run my
>> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
>> If, however, I try to launch it from my gnome taskbar's menu then it dies
>> with signal 11 (the Java log is available upon request). This seems to be
>> 100% consistent, since I noticed it yesterday, even across reboots.
>> Interestingly, the same behavior occurs if I try to run the program from
>> within JBuilder 4.
>>  So, is this related to the larger signal 11 problems?
>
>There's a corruption bug in the page cache somewhere, and it's 100%
>reproducible.  Finding it will be tough

Unlikely. If the actual program data was corrupted, it would SIGSEGV
regardless of how it's executed.

I'd guess that the program has a bug, and depending on the arguments and
environment (especially the latter will be different), it shows up or
not. Things like not having a LOCALE set in either case or similar.

Linus
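
The kind of environment-dependent crash guessed at above is easy to produce in C. The thread does not say what the Java runtime actually did with LANG, so the following is only a sketch of the general failure mode: getenv() returns NULL for an unset variable, and code that uses the result without checking faults only when launched from an environment where the variable is missing. The helpers lang_length() and lang_or_default() are hypothetical names for the illustration.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv()/unsetenv() in callers */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length of $LANG, treating an unset variable as the empty string.
 * The unguarded form, strlen(getenv("LANG")), hands NULL to strlen()
 * when LANG is unset; that is undefined behaviour and in practice
 * usually a SIGSEGV. */
size_t lang_length(void)
{
    const char *lang = getenv("LANG");   /* NULL if LANG is not set */
    return lang != NULL ? strlen(lang) : 0;
}

/* Value of $LANG, or the caller's fallback if it is unset or empty. */
const char *lang_or_default(const char *fallback)
{
    const char *lang = getenv("LANG");
    assert(fallback != NULL);
    return (lang != NULL && lang[0] != '\0') ? lang : fallback;
}
```

A shell started from an xterm normally inherits LANG from the login environment, while a process started from a desktop menu may not, which would be enough to make the same binary behave differently in the two cases.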



RE: Signal 11 - the continuing saga

2000-12-12 Thread Rainer Mager

Thanks for the info...

> [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible.  Finding it will be tough

Ok, granted this will be tough but is anyone even actively working on it?
What can I do to help?



> > Anyone know how to do [disable L1 and L2 caches]?
>
> Usually this is performed in the BIOS setup.  You can also disable L1
> with a sequence of instructions that write to the CR0 register on Intel
> and flip a bit, but in doing this you have to execute a WBINVD (write
> back and invalidate) instruction to flush out the cache.  BIOS setup is
> probably simpler.  Disabling L1 will make the machine slower
> than molasses, BTW, and if this bug is race related (they always
> are) it won't help much in running it down.

Aha, just as I suspected. My BIOS doesn't appear to support this. You seem
to be saying that doing so won't really contribute anything anyway so I will
hold off for now.



--Rainer




Re: Signal 11 - the continuing saga

2000-12-12 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> Hi again,
> 
>   Ok, I just upgraded to 2.4.0test12 (although I don't think there was any
> work in 12 that directly addresses this signal 11 problem). When compiling
> the new kernel I chose to disable AGPGart and RDM as suggested by
> [EMAIL PROTECTED] I will report later if this makes any difference.
> 
>   On another, possibly related note, I'm getting some really weird behavior
> with a Java program. The only reason I mention it here is because it dies
> with our old friend Signal 11. Anyway, please bear with the description
> below.
>   I have a tiny bash script that launches a Java swing app. If I run my
> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> If, however, I try to launch it from my gnome taskbar's menu then it dies
> with signal 11 (the Java log is available upon request). This seems to be
> 100% consistent, since I noticed it yesterday, even across reboots.
> Interestingly, the same behavior occurs if I try to run the program from
> within JBuilder 4.
>   So, is this related to the larger signal 11 problems?

There's a corruption bug in the page cache somewhere, and it's 100%
reproducible.  Finding it will be tough

> 
> 
>   What else can I do regarding these issues to help fix it? Would a core dump
> help anyone? I'd really like to contribute somehow but I need some
> direction.
> 
> 
> --Rainer
> 
> > From: CMA [mailto:[EMAIL PROTECTED]]
> > Did you already try to selectively disable L1 and L2 caches (if
> > your box has both) and see what happens?
> 
> Anyone know how to do this?

Usually this is performed in the BIOS setup.  You can also disable L1
with a sequence of instructions that write to the CR0 register on Intel
and flip a bit, but in doing this you have to execute a WBINVD (write
back and invalidate) instruction to flush out the cache.  BIOS setup is
probably simpler.  Disabling L1 will make the machine slower
than molasses, BTW, and if this bug is race related (they always
are) it won't help much in running it down.

Jeff
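
The "flip a bit" step described above comes down to two flags in CR0: CD (cache disable, bit 30) and NW (not write-through, bit 29). The sketch below only computes the value that would be written back; reading and writing CR0 and executing WBINVD are privileged operations that work only in ring 0, so they are not performed here. The helper names cr0_with_caches_off() and cr0_caches_off() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CR0 flag positions on x86: NW is bit 29, CD is bit 30. */
#define CR0_NW (UINT32_C(1) << 29)  /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30)  /* Cache Disable */

/* Hypothetical helper: the CR0 value to write back to stop cache fills.
 * Sets CD and clears NW; every other bit is left as it was. */
uint32_t cr0_with_caches_off(uint32_t cr0)
{
    uint32_t out = (cr0 | CR0_CD) & ~CR0_NW;
    assert((out & CR0_CD) != 0 && (out & CR0_NW) == 0);
    return out;
}

/* Hypothetical helper: does this CR0 value have cache fills disabled? */
int cr0_caches_off(uint32_t cr0)
{
    return (cr0 & CR0_CD) != 0;
}
```

In kernel code the new value would be loaded into CR0 and followed by WBINVD, so that lines already held in the cache are written back and invalidated instead of continuing to be served.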

> 



RE: Signal 11 - the continuing saga

2000-12-12 Thread Rainer Mager

Hi again,

Ok, I just upgraded to 2.4.0test12 (although I don't think there was any
work in 12 that directly addresses this signal 11 problem). When compiling
the new kernel I chose to disable AGPGart and RDM as suggested by
[EMAIL PROTECTED] I will report later if this makes any difference.

On another, possibly related note, I'm getting some really weird behavior
with a Java program. The only reason I mention it here is because it dies
with our old friend Signal 11. Anyway, please bear with the description
below.
I have a tiny bash script that launches a Java swing app. If I run my
script from an xterm (or gnome-terminal or whatever) then it starts up fine.
If, however, I try to launch it from my gnome taskbar's menu then it dies
with signal 11 (the Java log is available upon request). This seems to be
100% consistent, since I noticed it yesterday, even across reboots.
Interestingly, the same behavior occurs if I try to run the program from
within JBuilder 4.
So, is this related to the larger signal 11 problems?


What else can I do regarding these issues to help fix it? Would a core dump
help anyone? I'd really like to contribute somehow but I need some
direction.


--Rainer

> From: CMA [mailto:[EMAIL PROTECTED]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

Anyone know how to do this?




RE: Signal 11 - the continuing saga

2000-12-12 Thread Rainer Mager

Thanks for the info...

> [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible.  Finding it will be tough

Ok, granted this will be tough but is anyone even actively working on it?
What can I do to help?



> > Anyone know how to do [disable L1 and L2 caches]?
>
> Usually this is performed in the BIOS setup.  You can also disable L1
> with a sequence of instructions that write to the CR0 register on Intel
> and flip a bit, but in doing this you have to execute a WBINVD (write
> back and invalidate) instruction to flush out the cache.  BIOS setup is
> probably simpler.  Disabling L1 will make the machine slower
> than molasses, BTW, and if this bug is race related (they always
> are) it won't help much in running it down.

Aha, just as I suspected. My BIOS doesn't appear to support this. You seem
to be saying that doing so won't really contribute anything anyway so I will
hold off for now.



--Rainer




Re: Signal 11 - the continuing saga

2000-12-12 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
> On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >   I have a tiny bash script that launches a Java swing app. If I run my
> > script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> > If, however, I try to launch it from my gnome taskbar's menu then it dies
> > with signal 11 (the Java log is available upon request). This seems to be
> > 100% consistent, since I noticed it yesterday, even across reboots.
> > Interestingly, the same behavior occurs if I try to run the program from
> > within JBuilder 4.
> >   So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible.  Finding it will be tough

Unlikely. If the actual program data was corrupted, it would SIGSEGV
regardless of how it's executed.

I'd guess that the program has a bug, and depending on the arguments and
environment (especially the latter will be different), it shows up or
not. Things like not having a LOCALE set in either case or similar.
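
One way to test the environment theory is to capture the environment
under both launch paths and diff the results. The sketch below is only an
illustration: the helper names and output paths are hypothetical, and it
assumes you can run a small wrapper from both the terminal and the menu
entry before the real program starts.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Copy the pointers of a NULL-terminated environment array into a newly
 * allocated, sorted array so that two dumps diff cleanly.
 * Returns the entry count; the caller frees *out (NULL on failure). */
static size_t sorted_env(char *const *env, const char ***out)
{
    size_t n = 0;

    assert(env != NULL && out != NULL);
    while (env[n] != NULL)
        n++;

    *out = malloc((n + 1) * sizeof **out);
    if (*out == NULL)
        return 0;
    for (size_t i = 0; i < n; i++)
        (*out)[i] = env[i];
    (*out)[n] = NULL;
    qsort(*out, n, sizeof **out, cmp_str);
    return n;
}

/* Write the sorted environment to `path`, one VAR=value per line.
 * Returns 0 on success, -1 on failure. */
static int dump_env(char *const *env, const char *path)
{
    const char **sorted = NULL;
    size_t n = sorted_env(env, &sorted);
    FILE *f;

    if (sorted == NULL)
        return -1;
    f = fopen(path, "w");
    if (f == NULL) {
        free(sorted);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        fprintf(f, "%s\n", sorted[i]);
    free(sorted);
    return fclose(f) == 0 ? 0 : -1;
}
```

A wrapper would declare `extern char **environ;` and call, say,
`dump_env(environ, "/tmp/env.xterm")` from the terminal and
`dump_env(environ, "/tmp/env.menu")` from the menu entry. Any variable
that shows up in only one file is a candidate for the difference in
behavior.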

Linus



RE: Signal 11 - the continuing saga

2000-12-12 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Thanks for the info...
>
> > [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > > So, is this related to the larger signal 11 problems?
> >
> > There's a corruption bug in the page cache somewhere, and it's 100%
> > reproducible.  Finding it will be tough
>
> Ok, granted this will be tough but is anyone even actively working on it?
> What can I do to help?

If you want, I can extract IKD.. which happens to have a trap in place
for this (because I have a 100% reproducible swap related SIGSEGV that
I'm trying to figure out).

If you're interested, let me know and I'll extract it (quite large) and
send it along with instructions on how to do the trap.

-Mike




RE: Signal 11

2000-12-11 Thread Rainer Mager

(This message contains a number of related replies.)

> From: Mike Galbraith [mailto:[EMAIL PROTECTED]]
> Is init permanently running after you see a couple of these?

No, that is, after 23 hours up time it has used only 6 seconds CPU time
(according to top).

That reminds me that I should repeat that my signal 11 problem has (so far)
only caused X to die. The OS remains up and stable.


> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> My troublesome box finally seems to be stable.[...]I disabled DRM
> & AGPGart. With them both disabled, I get no problems at all.
> No Sig11's, No Sig4's, No lockups.
>
> This box has a Voodoo3 3000 AGP..

I suppose I can try this too. My box has a Matrox G400. BTW, what is DRM?
Direct Rendering something?


> From: CMA [mailto:[EMAIL PROTECTED]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

I'll look into this as well. Anyone have any pointers on how to do this? I
have a Tyan Tiger 133 with Award BIOS if this helps/matters.

Even if this setting does make a difference, what does this tell me/us? I
don't consider running the box with disabled cache(s) a viable solution.



Thanks all and keep those suggestions coming.

--Rainer




RE: Signal 11

2000-12-11 Thread davej

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

My troublesome box finally seems to be stable. It's been up for the
last two days whilst under quite heavy loads without problems.
Previously, it would be lucky to last an hour.
The change? I disabled DRM & AGPGart.
With them both disabled, I get no problems at all. No Sig11's,
No Sig4's, No lockups.

This box has a Voodoo3 3000 AGP..

01:00.0 VGA compatible controller: 3Dfx Interactive, Inc. Voodoo 3 (rev 01)

And is running on an MVP3 chipset

00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]

This box does display the same problem with IRQ routing that I've
got on my Athlon box...

PCI: Using IRQ router VIA [1106/0586] at 00:07.0
PCI: Assigned IRQ 11 for device 00:08.0
PCI: The same IRQ used for device 01:00.0
IRQ routing conflict in pirq table! Try 'pci=autoirq'

(00:08.0 is an SBLive)

A related problem ?
As I mentioned in an earlier mail `autoirq' is an unknown option.

The Athlon box has similar messages, but it happens with even
more devices..

They both do the same with the various PCI options 'nobios' etc,
and changing PnP OS in the BIOS makes no difference either.

regards,

Davej.

-- 
| Dave Jones <[EMAIL PROTECTED]>  http://www.suse.de/~davej
| SuSE Labs




RE: Signal 11

2000-12-11 Thread Mike Galbraith

On Mon, 11 Dec 2000, Rainer Mager wrote:

> Well, I just had a Signal 11 even with the patch. What can I do to help
> figure this out?

Is init permanently running after you see a couple of these?

-Mike




RE: Signal 11

2000-12-11 Thread Rainer Mager

Well, I just had a Signal 11 even with the patch. What can I do to help
figure this out?


Thanks,

--Rainer

-Original Message-
From: Alan Cox [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; [EMAIL PROTECTED]; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan




RE: Signal 11

2000-12-10 Thread Rainer Mager

I just applied the said patch and will report my results. Note that I have
never been able to reliably, on-demand reproduce this so give me a few days
to see what happens.

--Rainer


-Original Message-
From: Alan Cox [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 08, 2000 11:07 PM
To: David Woodhouse
Cc: Andi Kleen; Rainer Mager; [EMAIL PROTECTED]; Mark Vojkovich
Subject: Re: Signal 11


> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
>
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan




Re: Signal 11

2000-12-09 Thread Matthew Vanecek

[EMAIL PROTECTED] wrote:
> 
> On Sat, 9 Dec 2000, Matthew Vanecek wrote:
> 
> > > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > > table updating race help ?
> > > Alan
> >
> > Where are his fixes at?  I don't seem to see any of his posts in the
> > archives.
> 
> dwmw2 posted one such patch earlier this week :-
> 
> http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html
> 
> regards,
> 

I saw that.  I thought it was a patch to try to "reproduce it", as
opposed to fixing it.  Is it truly a fix, and is it applicable for UP
kernels?
-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...



Re: Signal 11

2000-12-09 Thread davej

On Sat, 9 Dec 2000, Matthew Vanecek wrote:

> > Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> > table updating race help ?
> > Alan
> 
> Where are his fixes at?  I don't seem to see any of his posts in the
> archives.

dwmw2 posted one such patch earlier this week :-

http://www.lib.uaa.alaska.edu/linux-kernel/archive/2000-Week-49/0856.html

regards,

Davej.

-- 
| Dave Jones <[EMAIL PROTECTED]>  http://www.suse.de/~davej
| SuSE Labs




Re: Signal 11

2000-12-09 Thread Matthew Vanecek

Alan Cox wrote:
> 
> > > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > > would say that this is definitely a kernel problem.
> >
> > XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> > kernels - even on my BP6¹. The random crashes started to happen when I
> > upgraded my distribution² - and are only seen by people using 2.4. So I
> > suspect that it's the combination of glibc and kernel which is triggering
> > it.
> 
> Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
> table updating race help ?
> 
> Alan

Where are his fixes at?  I don't seem to see any of his posts in the
archives.
-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...



Re: Signal 11

2000-12-08 Thread davej


David Woodhouse ([EMAIL PROTECTED]) wrote...

> Can you reproduce it with bcrl's patch below: 

Did nothing for me. gcc still got a sig11 after a while.
Took three runs of 'make bzImage' before it completed.

I wondered if I'd been unlucky enough to have been sent a
replacement K6-2 which was also screwed, but as I mentioned
earlier, this box runs fine under 2.2

btw, I was unsubscribed from all lists at vger yesterday,
for reasons currently unknown to me. Did this happen to anyone
else, or did my mail setup break something?

regards,

Davej.

-- 
| Dave Jones <[EMAIL PROTECTED]>  http://www.suse.de/~davej
| SuSE Labs




Re: Signal 11

2000-12-08 Thread Jeff V. Merkey



I'll try.

Jeff


On Fri, Dec 08, 2000 at 10:24:55PM +0000, David Woodhouse wrote:
> On Fri, 8 Dec 2000, Jeff V. Merkey wrote:
> 
> > I have not seen it on UP systems either.  I only see it on SMP systems.
> > After trying very hard last night, I was able to get my 4 x PPro system to
> > do it with 2.4.0-12.  It seems related to loading in some way.  If you
> > have more than two processors, the loading is less since there's more
> > processors, and for whatever reason, it makes it harder to produce
> > whatever race condition is causing it.  I can get it to happen
> > pretty easily on a 2 x PII system.
> 
> Can you reproduce it with bcrl's patch below:
> 
> Index: mm/memory.c
> ===
> RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
> retrieving revision 1.2.2.40
> diff -u -r1.2.2.40 memory.c
> --- mm/memory.c   2000/12/05 13:33:39 1.2.2.40
> +++ mm/memory.c   2000/12/08 22:24:09
> @@ -860,6 +860,7 @@
>   /*
>* Ok, we need to copy. Oh, well..
>*/
> + set_pte(page_table, pte);
>   spin_unlock(&mm->page_table_lock);
>   new_page = page_cache_alloc();
>   if (!new_page)
> @@ -870,6 +871,12 @@
>* Re-check the pte - we dropped the lock
>*/
>   if (pte_same(*page_table, pte)) {
> + /* We are changing the pte, so get rid of the old
> +  * one to avoid races with the hardware, this really
> +  * only affects the accessed bit here.
> +  */
> + pte = ptep_get_and_clear(page_table);
> +
>   if (PageReserved(old_page))
>   ++mm->rss;
>   break_cow(vma, old_page, new_page, address, page_table);
> @@ -1216,12 +1223,14 @@
>   return do_swap_page(mm, vma, address, pte,
> pte_to_swp_entry(entry), write_access);
>   }
> 
> + entry = ptep_get_and_clear(pte);
>   if (write_access) {
>   if (!pte_write(entry))
>   return do_wp_page(mm, vma, address, pte, entry);
> 
>   entry = pte_mkdirty(entry);
>   }
> +
>   entry = pte_mkyoung(entry);
>   establish_pte(vma, address, pte, entry);
>   spin_unlock(&mm->page_table_lock);
> 
> 
> -- 
> dwmw2
> 



Re: Signal 11

2000-12-08 Thread Horst von Brand

David Woodhouse <[EMAIL PROTECTED]> said:

[...]

> I quote from the X devel list, which perhaps I shouldn't do but this is
> hardly NDA'd stuff:

> On Mon 20 Nov 2000, [EMAIL PROTECTED] said:
> >   I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> > GX boards (Intel).  XFree86 core dumps indicate that it happens in
> > random places, in old as dirt software rendering code that has nothing
> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem. 

> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

I get regular segfaults and random lockups trying to build CVS GCCs and
kernels since I updated RH 7 to glibc-2.2-5. P3, sr440bx mobo (UP),
2.2.18preX kernels; previously rock solid. Might be that the mains voltage
here tends to be out of whack, but I doubt it.
-- 
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile   +56 32 672616




Re: Signal 11

2000-12-08 Thread Jeff V. Merkey

On Fri, Dec 08, 2000 at 11:34:51AM -0800, Mark Vojkovich wrote:
> 
> 
> On Fri, 8 Dec 2000, David Woodhouse wrote:
> 
>Some additional data points.  It goes away on UP 2.4 kernels.
> Also, I can't recall seeing this problem on IA64.  Maybe it's still
> there on IA64 and I just haven't been trying hard enough to crash
> it, but my current impression is that the problem doesn't exist on IA64.
> 
>   Hmmm...  IA64 is a static server.  I don't hear of people having
> problems on 3.3.6 servers either.  I'm wondering if a non-loader
> 4.0 server would have problems on IA32 with a 2.4 kernel.  That's
> something for people to try.
> 
> 
>   Mark.


I have not seen it on UP systems either.  I only see it on SMP systems.  
After trying very hard last night, I was able to get my 4 x PPro system to 
do it with 2.4.0-12.  It seems related to loading in some way.  If you 
have more than two processors, the loading is less since there's more 
processors, and for whatever reason, it makes it harder to produce
whatever race condition is causing it.  I can get it to happen 
pretty easily on a 2 x PII system.

:-)

Jeff



> 
> >
> > --
> > dwmw2
> >
> > ¹ And the BP6 still falls over less frequently than the dual P3 I use at
> > work.
> > ² RH7. Don't start.
> >
> >
> >
> 



Re: Signal 11

2000-12-08 Thread David Woodhouse

On Fri, 8 Dec 2000, Jeff V. Merkey wrote:

> I have not seen it on UP systems either.  I only see it on SMP systems.
> After trying very hard last night, I was able to get my 4 x PPro system to
> do it with 2.4.0-12.  It seems related to loading in some way.  If you
> have more than two processors, the loading is less since there's more
> processors, and for whatever reason, it makes it harder to produce
> whatever race condition is causing it.  I can get it to happen
> pretty easily on a 2 x PII system.

Can you reproduce it with bcrl's patch below:

Index: mm/memory.c
===
RCS file: /net/passion/inst/cvs/linux/mm/memory.c,v
retrieving revision 1.2.2.40
diff -u -r1.2.2.40 memory.c
--- mm/memory.c 2000/12/05 13:33:39 1.2.2.40
+++ mm/memory.c 2000/12/08 22:24:09
@@ -860,6 +860,7 @@
/*
 * Ok, we need to copy. Oh, well..
 */
+   set_pte(page_table, pte);
spin_unlock(&mm->page_table_lock);
new_page = page_cache_alloc();
if (!new_page)
@@ -870,6 +871,12 @@
 * Re-check the pte - we dropped the lock
 */
if (pte_same(*page_table, pte)) {
+   /* We are changing the pte, so get rid of the old
+* one to avoid races with the hardware, this really
+* only affects the accessed bit here.
+*/
+   pte = ptep_get_and_clear(page_table);
+
if (PageReserved(old_page))
++mm->rss;
break_cow(vma, old_page, new_page, address, page_table);
@@ -1216,12 +1223,14 @@
return do_swap_page(mm, vma, address, pte,
pte_to_swp_entry(entry), write_access);
}

+   entry = ptep_get_and_clear(pte);
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address, pte, entry);

entry = pte_mkdirty(entry);
}
+
entry = pte_mkyoung(entry);
establish_pte(vma, address, pte, entry);
spin_unlock(&mm->page_table_lock);
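
To make the idea behind the patch concrete: the point of ptep_get_and_clear() is to read the old entry and clear it in a single atomic step, so that a bit the hardware sets at the same moment (accessed or dirty) cannot be lost between a separate read and write. The following is only a user-space sketch of that idea using C11 atomics. It is not kernel code: toy_pte_t, the toy_* helpers and the TOY_PTE_* constants are names invented for this illustration, and a plain 32-bit word stands in for a real page table entry.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative flag bits, modelled loosely on the i386 layout. */
#define TOY_PTE_PRESENT  0x001u
#define TOY_PTE_ACCESSED 0x020u
#define TOY_PTE_DIRTY    0x040u

/* Hypothetical stand-in for a page table entry. */
typedef _Atomic uint32_t toy_pte_t;

/*
 * Fetch the current entry and replace it with zero in one atomic
 * exchange. Whatever bits were set at that instant are returned to
 * the caller, so none of them can be silently dropped.
 */
uint32_t toy_ptep_get_and_clear(toy_pte_t *ptep)
{
    return atomic_exchange(ptep, 0u);
}

/* Install a new value for the entry. */
void toy_set_pte(toy_pte_t *ptep, uint32_t val)
{
    atomic_store(ptep, val);
}

/*
 * Same shape as the last hunk of the patch: take the entry out
 * atomically, adjust the private copy, then put it back.
 */
uint32_t toy_mark_young_and_dirty(toy_pte_t *ptep)
{
    uint32_t entry = toy_ptep_get_and_clear(ptep);

    entry |= TOY_PTE_ACCESSED | TOY_PTE_DIRTY;
    toy_set_pte(ptep, entry);
    return entry;
}
```

While the entry is cleared it is not present, so in the real kernel the hardware cannot update it behind the software's back. The sketch only models the atomic read-and-clear, not the MMU.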


-- 
dwmw2





Re: Signal 11

2000-12-08 Thread Dr. Kelsey Hudson

On Fri, 8 Dec 2000 [EMAIL PROTECTED] wrote:

> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
> > BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs.  Looks like some
> > type of mapping problem.  I reported it three months ago, but it was
> > pretty much ignored.
> > 
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
> 
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.

Just to add some input and insight on here, I loaded the system down with
some FFT algorithms, and then ran an 8-way kernel compile. The machine in
question is a dual P3/600 with 512MB RAM, 2.4.0-test11. The load
skyrocketed to a mere 13.6. xmms was still running, didn't skip even
once. The FFT algorithms didn't bitch at all. Neither did the kernel
compile. In fact, it compiled without a hitch...

I dunno what to say about these boxes that segfault all the
time... Probably just bad hardware somewhere along the lines.

 Kelsey Hudson   [EMAIL PROTECTED] 
 Software Engineer
 Compendium Technologies, Inc   (619) 725-0771
--- 




Re: Signal 11

2000-12-08 Thread Dr. Kelsey Hudson

On Thu, 7 Dec 2000, Peter Samuelson wrote:

> 
> [Dick Johnson]
> > Do:
> > 
> > char main[]={0xff,0xff,0xff,0xff};
> 
> Oh come on, at least pick an *interesting* invalid opcode:
> 
>   char main[]={0xf0,0x0f,0xc0,0xc8};  /* try also on NT (: */

What's funny is that this actually executes on SPARC hardware, but
immediately segfaults. On Intel hardware, though, you get a message similar
to:

zsh: illegal hardware instruction (core dumped)  a.out

I wrote much the same program in college. It exploits the F0 0F bug
found in early Pentium hardware.

 Kelsey Hudson   [EMAIL PROTECTED] 
 Software Engineer
 Compendium Technologies, Inc   (619) 725-0771
--- 




Re: Signal 11

2000-12-08 Thread Peter Samuelson


[Dick Johnson]
> > >   char main[]={0xf0,0x0f,0xc0,0xc8};/* try also on NT (: */
> > me2v@reliant DRFDecoder $ ./op
> > Illegal instruction (core dumped)
> 
> Yep. And on early Pentiums, the ones with the "f00f" bug, it would
> lock the machine tighter than a witch's crotch. Ooops, not
> politically correct. It would allow user-mode code to halt the
> machine.

...Until Linux 2.0.34 or so (can't remember the exact version number)
which had the workaround for this bug, about a week after the bug was
discovered.

And I was reminded in private mail that the correct lockup sequence is
actually

  char main[]={0xf0,0x0f,0xc7,0xc8};

where the 0xc8 can be anything from 0xc8 to 0xcf.
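
For illustration, the byte pattern described above can be checked with a few lines of C. This is a minimal sketch and not anything taken from the kernel workaround: the name is_f00f_sequence is made up here, and the code only inspects bytes, it never executes them. The pattern is a lock prefix (0xf0) followed by cmpxchg8b (0x0f 0xc7) with a register operand (0xc8 through 0xcf).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the buffer starts with the four-byte lockup sequence
 * F0 0F C7 xx, where xx is anywhere from 0xc8 to 0xcf.
 * Hypothetical helper, for illustration only.
 */
bool is_f00f_sequence(const unsigned char *p, size_t len)
{
    if (p == NULL || len < 4)
        return false;

    return p[0] == 0xf0 &&
           p[1] == 0x0f &&
           p[2] == 0xc7 &&
           p[3] >= 0xc8 && p[3] <= 0xcf;
}
```

By this check the earlier example with 0xc0 in the third byte does not match, which fits the reports in this thread that it merely raises an illegal instruction trap.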

Peter



Re: Signal 11

2000-12-08 Thread Richard B. Johnson

On Fri, 8 Dec 2000, Matthew Vanecek wrote:

> Peter Samuelson wrote:
> > 
> > [Dick Johnson]
> > > Do:
> > >
> > > char main[]={0xff,0xff,0xff,0xff};
> > 
> > Oh come on, at least pick an *interesting* invalid opcode:
> > 
> >   char main[]={0xf0,0x0f,0xc0,0xc8};/* try also on NT (: */
> > 
> 
> me2v@reliant DRFDecoder $ ./op
> Illegal instruction (core dumped)
> 
> Is that the expected behavior?

Yep. And on early Pentiums, the ones with the "f00f" bug, it
would lock the machine tighter than a witch's crotch. Ooops,
not politically correct. It would allow user-mode code
to halt the machine.

Here is code that just quietly returns to the runtime code
that called it:

char main[]={0x90, 0x90, 0xc3};

FYI, if the .data section was not executable, you couldn't do
this. You would have to use some __asm__ stuff to put it in
the .text section. But, this is an interesting example of
how you can create code that the compiler refuses to generate.

It's easier to use assembly, though.
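
As a rough sketch of the "__asm__ stuff" mentioned above, the same three bytes (nop, nop, ret) can be placed in the .text section with a top-level asm statement and then called like any other function. This assumes gcc or clang producing ELF on x86 Linux; on any other target the sketch falls back to an empty C function. The names quiet_return and call_quiet_return are invented for this example.

```c
/*
 * Assumption: GNU-style toolchain, ELF output, x86 or x86-64 Linux.
 * Elsewhere an ordinary empty function is used instead.
 */
#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
__asm__(
    ".pushsection .text\n"              /* emit into .text, not .data */
    ".globl quiet_return\n"
    ".type quiet_return, @function\n"
    "quiet_return:\n"
    "    .byte 0x90, 0x90, 0xc3\n"      /* nop; nop; ret */
    ".size quiet_return, . - quiet_return\n"
    ".popsection\n"                     /* restore the previous section */
);
void quiet_return(void);                /* defined by the asm above */
#else
void quiet_return(void) { }
#endif

/* Call the hand-assembled routine and report that control came back. */
int call_quiet_return(void)
{
    quiet_return();
    return 1;
}
```

Because the bytes live in .text, this does not depend on the data section being executable, which is the limitation the char main[] trick runs into.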

Cheers,
Dick Johnson

Penguin : Linux version 2.4.0 on an i686 machine (799.54 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: Signal 11

2000-12-08 Thread Matthew Vanecek

[EMAIL PROTECTED] wrote:
> 
> On Thu, 7 Dec 2000, Jeff V. Merkey wrote:
> 
> > I think there may be a case when a process forks, that the MMU or some
> > other subsystem is either not setting the page bits correctly, or
> > mapping in a bad page.  It's a LEVEL I bug in 2.4 if this is the case,
> > BTW.  In core dumps (I've looked at 2 of them from SSH) it barfs right
> > after executing fork() or one of the exec functions and at some places
> > in the code where there's not any obvious coding bugs.  Looks like some
> > type of mapping problem.  I reported it three months ago, but it was
> > pretty much ignored.
> >
> > Linus needs to add this one to the pre-12 list -- looks like some type
> > of mapping bug.
> 
> Now that you mention it, every app that has bombed has been the type
> that forks a lot. MpegTV, gtv, and make spring to mind. All apps drive
> the CPU load up quite a lot, which was why I initially suspected
> overheating. I don't see it on my other 2.4 boxes though which is
> suspicious. But they don't get as much of a beating as this, which was
> up until last week my main workstation.
> 
> regards,
> 
> Dave.
> 

I've noticed the same problem, and it occasionally happens with XFree86
4.0.1, as well.  Hopefully we've established that this is not the
hardware issue that gcc people are so fond of blaming sig 11s on (even
in the face of overwhelming evidence to the contrary).  It would be good
to have this put on a current to-do list and looked into.

-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...



Re: Signal 11

2000-12-08 Thread Matthew Vanecek

Peter Samuelson wrote:
> 
> [Dick Johnson]
> > Do:
> >
> > char main[]={0xff,0xff,0xff,0xff};
> 
> Oh come on, at least pick an *interesting* invalid opcode:
> 
>   char main[]={0xf0,0x0f,0xc0,0xc8};/* try also on NT (: */
> 

me2v@reliant DRFDecoder $ ./op
Illegal instruction (core dumped)

Is that the expected behavior?

-- 
Matthew Vanecek
perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

For 93 million miles, there is nothing between the sun and my shadow
except me.
I'm always getting in the way of something...



Re: Signal 11

2000-12-08 Thread Alan Cox

> > wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> > would say that this is definitely a kernel problem.
> 
> XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
> kernels - even on my BP6¹. The random crashes started to happen when I
> upgraded my distribution² - and are only seen by people using 2.4. So I
> suspect that it's the combination of glibc and kernel which is triggering
> it.

Have any of the folks seeing it checked if Ben LaHaise's fixes for the page
table updating race help ?

Alan




Re: Signal 11

2000-12-08 Thread Alan Cox

> Various processes have been getting random signals after heavy CPU usage.
> Playing an MPEG movie, kernel compile, or even just some small apps
> compiling sometimes. Just for the record, this isn't an OOM situation,
> I've watched this box with half its memory free or in buffers left
> unattended, and suddenly a compile will just die.

This is consistent with page cache corruption in memory. We definitely had
that in older 2.4test kernels. I saw this building stuff on Linux parisc
and it was because some page of gcc had randomly decided to become something
different. Since that was test6 I didn't figure it important 8)



Re: Signal 11

2000-12-08 Thread David Woodhouse


[EMAIL PROTECTED] said:
>  Sounds like a X Server bug. You should probably contact XFree86, not
> linux-kernel

I quote from the X devel list, which perhaps I shouldn't do but this is hardly 
NDA'd stuff:

On Mon 20 Nov 2000, [EMAIL PROTECTED] said:
>   I have seen random crashes on dual P3 BX boards (Tyan) and dual Xeon
> GX boards (Intel).  XFree86 core dumps indicate that it happens in
> random places, in old as dirt software rendering code that has nothing
> wrong with it.  I've only seen this under 2.3.x/2.4 SMP kernels.  I
> would say that this is definitely a kernel problem. 

XFree86 3.9 and XFree86 4 were rock solid for a _long_ time on 2.[34]
kernels - even on my BP6¹. The random crashes started to happen when I
upgraded my distribution² - and are only seen by people using 2.4. So I
suspect that it's the combination of glibc and kernel which is triggering
it.

--
dwmw2

¹ And the BP6 still falls over less frequently than the dual P3 I use at 
work.
² RH7. Don't start.




