Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Michal Jaegermann

On Tue, Apr 24, 2001 at 06:56:32PM +0200, Christian Ehrhardt wrote:
> On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> > ptrace only operates on processes that are stopped. So there are no
> > locking issues - we've synchronized on a much higher level than a
> > spinlock or semaphore.
> 
> This is only true for requests other than PTRACE_ATTACH and
> PTRACE_ATTACH is exactly what I'm worried about.

May I remind everybody that at the beginning of this thread I posted
another example, from an SMP Alpha, of FPU problems.  It certainly
was not exactly like the one under discussion, but it looked as if
it had a similar "smell" to it.

It looks like reproducing this Alpha example requires processors
with a rather fast clock, and this hardware version is not yet very
widely available.

  Michal
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox


> Alan Cox <[EMAIL PROTECTED]> writes:
> > The preferable one for performance is certainly to backport the 2.4 changes
> 
> Is it any more substantial than changing all uses of the ptrace flags
> to the new variable?

It affects asm blocks and offsets on some ports. It's not too bad, though.


Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy


Alan Cox <[EMAIL PROTECTED]> writes:

> The preferable one for performance is certainly to backport the 2.4 changes

Is it any more substantial than changing all uses of the ptrace flags
to the new variable?

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox


> > child->flags |= PF_PTRACED; 
> > 
> > without waiting for the child to have stopped. 
> 
> I can see how this could cause PF_USEDFPU to be cleared inadvertently,
> but I do not have any ideas for testing this.  Is it clear that this
> is the source of the problem?

There is no guarantee that |= is implemented atomically - in fact it's quite
likely to compile to

get child->flags
or PF_PTRACED
write child->flags

and a PF_USEDFPU set on another processor at the same instant -would- end up
being lost.
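
The three-step sequence above can be replayed deterministically in user space. What follows is only a minimal sketch, not kernel code: the DEMO_ flag values and the function name are made up for illustration, and the two "CPUs" are simply statements interleaved by hand inside one function.

```c
#include <assert.h>

/* Illustrative stand-ins; not copied from any kernel header. */
#define DEMO_PF_PTRACED 0x00000010UL
#define DEMO_PF_USEDFPU 0x00100000UL

/*
 * Replay one unlucky interleaving of two non-atomic read-modify-write
 * sequences on the same word and return the final value of that word.
 */
unsigned long replay_lost_update(unsigned long initial)
{
    unsigned long flags = initial;   /* plays the role of child->flags */

    /* CPU A (tracer): "get child->flags" */
    unsigned long a_tmp = flags;

    /* CPU B (the child): sets PF_USEDFPU and finishes its whole update */
    unsigned long b_tmp = flags;
    b_tmp |= DEMO_PF_USEDFPU;
    flags = b_tmp;
    assert(flags & DEMO_PF_USEDFPU); /* B's update is visible right now */

    /* CPU A: "or PF_PTRACED", then "write child->flags" from its stale copy */
    a_tmp |= DEMO_PF_PTRACED;
    flags = a_tmp;

    return flags;
}
```

Starting from a word of zero, the returned value has the tracer's bit set but no longer carries the bit the child set in between, which is the lost update described above.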

There are two fixes

1.  Make all the ops atomic (foo_bit())
2.  Split the flags

The preferable one for performance is certainly to backport the 2.4 changes
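
For comparison, the first option (atomic operations) can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel's foo_bit() implementation: the DEMO_ values and the function name are hypothetical, and a compare-exchange loop is used for the tracer's update only to make the retry visible.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-ins; not copied from any kernel header. */
#define DEMO_PF_PTRACED 0x00000010UL
#define DEMO_PF_USEDFPU 0x00100000UL

/*
 * The tracer takes a snapshot, the child updates the word in between,
 * and the tracer then tries to publish snapshot | PF_PTRACED.
 * Returns the final word; *retries counts failed publish attempts.
 */
unsigned long replay_with_atomics(unsigned long initial, int *retries)
{
    atomic_ulong flags;

    assert(retries != 0);
    atomic_init(&flags, initial);
    *retries = 0;

    /* CPU A (tracer): snapshot of the shared word */
    unsigned long a_seen = atomic_load(&flags);

    /* CPU B (child): one indivisible OR */
    atomic_fetch_or(&flags, DEMO_PF_USEDFPU);

    /*
     * CPU A: the exchange fails if the word changed since the snapshot;
     * on failure a_seen is refreshed with the current value and we retry.
     */
    while (!atomic_compare_exchange_strong(&flags, &a_seen,
                                           a_seen | DEMO_PF_PTRACED))
        (*retries)++;

    return atomic_load(&flags);
}
```

Starting from zero, both bits survive and the tracer needs exactly one retry. A plain atomic_fetch_or() on the tracer side would give the same final word without the explicit loop. The cost is an atomic operation on every flag update, which is the performance concern behind preferring the split.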


Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy

Linus Torvalds writes:
> Ahh.. This actually _does_ look like a race on "current->flags": 
> PTRACE_ATTACH will do a 
> 
> child->flags |= PF_PTRACED; 
> 
> without waiting for the child to have stopped. 

I can see how this could cause PF_USEDFPU to be cleared inadvertently,
but I do not have any ideas for testing this.  Is it clear that this
is the source of the problem?

What would be involved in backporting the split ptrace flags to 2.2?
Are there other solutions?

Vic

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy


"Christian Ehrhardt" <[EMAIL PROTECTED]> writes:
> Victor: Could you try to reproduce the system wide corruption if you
> add an explicit call to stts(); at the very end of __switch_to?
> This should prevent the FPU corruption from spreading.

After adding this call, I cannot reproduce the global corruption.
There is still occasional local corruption of individual pi processes
while pt is running.

Vic





Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt


On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

This is only true for requests other than PTRACE_ATTACH and
PTRACE_ATTACH is exactly what I'm worried about.

   regards   Christian

-- 
THAT'S ALL FOLKS!

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt

On Tue, Apr 24, 2001 at 08:05:15AM -0500, Victor Zandy wrote:
> 
> He found that PF_USEDFPU was always set before the machine was broken.
> After he found that it was set about 70% of the time.

If I'm not mistaken this actually can cause GLOBAL FPU corruption.
Here's why:

Assume for a moment that we lose the PF_USEDFPU flag of one
process. This not only means that the current process won't have its
state saved, it also means that the next process won't have the TS bit
set. This in turn means that this new process won't get PF_USEDFPU set
and suddenly we have a second process with a corrupted FPU state.

Victor: Could you try to reproduce the system wide corruption if you
add an explicit call to stts(); at the very end of __switch_to?
This should prevent the FPU corruption from spreading.

NOTE: This is just to prove my theory; it is not, and isn't meant
to be, a fix for the actual problem.
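
The chain of events can be modelled with a toy user-space simulation. This is a simplified sketch under stated assumptions, not the 2.2 code: one double stands in for the FPU registers, one bool for the TS bit, and every demo_ name and DEMO_ value is invented for illustration. The force_stts parameter models the experimental stts() at the end of __switch_to.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PF_USEDFPU 0x00100000UL  /* illustrative value */

struct demo_task {
    unsigned long flags;
    double saved;        /* stand-in for the per-task FPU save area */
};

struct demo_cpu {
    bool ts;             /* models TS: next FP instruction traps if set */
    double fpu_reg;      /* stand-in for the live FPU registers */
};

struct demo_result {
    double a_seen;       /* what task A reads back at the end */
    double b_seen;       /* what task B reads after adding 10.0 */
};

/* One FP instruction: trap and restore own state if TS is set, then add. */
static double demo_fp_add(struct demo_cpu *cpu, struct demo_task *t, double x)
{
    if (cpu->ts) {
        cpu->ts = false;
        cpu->fpu_reg = t->saved;
        t->flags |= DEMO_PF_USEDFPU;
    }
    cpu->fpu_reg += x;
    return cpu->fpu_reg;
}

/* Switch away from prev: save and set TS only if prev is marked as FPU user. */
static void demo_switch_from(struct demo_cpu *cpu, struct demo_task *prev,
                             bool force_stts)
{
    if (prev->flags & DEMO_PF_USEDFPU) {
        prev->saved = cpu->fpu_reg;          /* "fnsave" */
        prev->flags &= ~DEMO_PF_USEDFPU;
        cpu->ts = true;                      /* "stts" */
    }
    if (force_stts)
        cpu->ts = true;                      /* the experimental extra stts() */
}

/* A computes, optionally loses its flag, B runs, then A runs again. */
struct demo_result demo_scenario(bool lose_flag, bool force_stts)
{
    struct demo_cpu cpu = { .ts = true, .fpu_reg = 0.0 };
    struct demo_task a = { 0, 0.0 }, b = { 0, 0.0 };
    struct demo_result r;

    demo_fp_add(&cpu, &a, 1.0);              /* A traps, then holds 1.0 */
    assert(a.flags & DEMO_PF_USEDFPU);
    if (lose_flag)
        a.flags &= ~DEMO_PF_USEDFPU;         /* the flag is lost to the race */

    demo_switch_from(&cpu, &a, force_stts);
    r.b_seen = demo_fp_add(&cpu, &b, 10.0);  /* B expects 10.0 */
    demo_switch_from(&cpu, &b, force_stts);
    r.a_seen = demo_fp_add(&cpu, &a, 0.0);   /* A expects 1.0 */
    return r;
}
```

In this model, demo_scenario(true, false) leaves both tasks reading 11.0: B inherits A's live registers and never gets its own flag set, so the damage carries on. With demo_scenario(true, true), B reads the expected 10.0 while A still loses its unsaved state. The corruption stays local to the task that lost the flag.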

   regards   Christian Ehrhardt

-- 

THAT'S ALL FOLKS!

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Linus Torvalds

[ Alan, I'm lazy and only have 2.2.14 sources on-line. Maybe this has
  been fixed already and there's something else going on. Worth a look ]

In article <[EMAIL PROTECTED]>,
Victor Zandy  <[EMAIL PROTECTED]> wrote:
>
>Someone else here traced the process flags of a FP-intensive program
>on a machine before and after it is put in the faulty FPU state.  He
>periodically sampled /proc/pid/stat while the program was running.
>
>He found that PF_USEDFPU was always set before the machine was broken.
>After he found that it was set about 70% of the time.

[ Looks closer at the ptrace synchronization ]

Ahh.. This actually _does_ look like a race on "current->flags":
PTRACE_ATTACH will do a

child->flags |= PF_PTRACED;

without waiting for the child to have stopped.

(Aside: thinking more about the stopping logic - I'm not actually sure
the ptrace synchronization is complete wrt scheduling, as there will be
a window when the process has set the task state to TASK_STOPPED but
hasn't actually yet scheduled away. Oh, well).

All other ptrace operations (not counting killing the child) will check
that the child is quiescent.  But PTRACE_ATTACH will not, as we're just
setting up the stopping.

In 2.4.x, this bug doesn't happen because "flags" was split up into
"current->ptrace" and "current->flags".  Exactly because of locking
concerns.
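
A rough user-space sketch of why the split helps: once the tracer and the task write to different words, even a non-atomic read-modify-write cannot clobber the other side's bits. The struct layout, the DEMO_ values and the function name below are illustrative assumptions, not the 2.4 definitions.

```c
#include <assert.h>

/* Illustrative stand-ins; not copied from any kernel header. */
#define DEMO_PF_USEDFPU 0x00100000UL
#define DEMO_PT_PTRACED 0x00000001UL

struct demo_task {
    unsigned long flags;    /* only ever written by the task itself */
    unsigned long ptrace;   /* written by the tracer on attach */
};

/*
 * Interleave the tracer's non-atomic update of t->ptrace with the
 * child's non-atomic update of t->flags.
 */
void replay_split_fields(struct demo_task *t)
{
    assert(t != 0);

    /* CPU A (tracer): read its own word */
    unsigned long a_tmp = t->ptrace;

    /* CPU B (child): complete update of the other word */
    unsigned long b_tmp = t->flags;
    b_tmp |= DEMO_PF_USEDFPU;
    t->flags = b_tmp;

    /* CPU A: write back its (stale) copy, which only covers t->ptrace */
    a_tmp |= DEMO_PT_PTRACED;
    t->ptrace = a_tmp;
}
```

After the call on a zeroed struct, flags holds the child's bit and ptrace holds the tracer's bit, with no atomic instruction on either path.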

Linus

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox


> >1.) If I'm not mistaken switch_to changes current->flags without
> >atomic operations and without any locks and sys_ptrace changes
> >child->flags only protected by the big kernel lock.
> 
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

In the 2.2 case the ptrace flags themselves are in the same flag set as
the PF_ flags. In 2.4 that was fixed. That means there are some bizarre cases
where current->flags might not be handled perfectly.


Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Christian Ehrhardt <[EMAIL PROTECTED]> wrote:
>
>1.) If I'm not mistaken switch_to changes current->flags without
>atomic operations and without any locks and sys_ptrace changes
>child->flags only protected by the big kernel lock.

ptrace only operates on processes that are stopped. So there are no
locking issues - we've synchronized on a much higher level than a
spinlock or semaphore.

That said, it does look like 2.2.x has a real bug, and maybe the ptrace
task stopping synchronization is broken.

Linus

Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy



Someone else here traced the process flags of a FP-intensive program
on a machine before and after it is put in the faulty FPU state.  He
periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken.
After he found that it was set about 70% of the time.
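
A sampler along these lines could look like the sketch below. It assumes that the flags word is the ninth field of /proc/<pid>/stat and that PF_USEDFPU is bit 0x00100000; both should be checked against the kernel actually under test, and the demo_ names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_PF_USEDFPU 0x00100000UL  /* assumed bit value; check sched.h */

/*
 * Extract the flags field from one line of /proc/<pid>/stat.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
int demo_parse_stat_flags(const char *line, unsigned long *flags)
{
    assert(line != NULL && flags != NULL);

    /* comm sits in parentheses and may contain spaces or ')' itself,
     * so resume parsing after the LAST closing parenthesis. */
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* Skip state, ppid, pgrp, session, tty_nr, tpgid; then read flags. */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %lu", flags) != 1)
        return -1;
    return 0;
}

/* Returns 1 if the bit is set for pid, 0 if clear, -1 on error. */
int demo_pid_used_fpu(int pid)
{
    char path[64], line[1024];
    unsigned long flags = 0;

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL || demo_parse_stat_flags(line, &flags) != 0)
        return -1;
    return (flags & DEMO_PF_USEDFPU) != 0;
}
```

Calling demo_pid_used_fpu() in a loop with a short sleep and counting how often it returns 1 gives the kind of percentage quoted above. The same bit may mean something else on other kernel versions.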

Vic




Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread alad

Hi,
 I want to look into this problem. It seems to be very interesting. But I
was not following the thread from the beginning (and I mistakenly deleted all
these mails :( ). I hope you won't mind answering the following questions...

1) Are you doing this on an MP or a uniprocessor machine?
2) How are you calling sys_ptrace(attach) and sys_ptrace(detach)?
Is it something like the following?

  for (;;) {
      sys_ptrace(attach to process);
      sys_wait4();
      sys_ptrace(detach from process);
  }

In short, what sequence of system calls are you using to attach to and
detach from the process?

3) Have you tried doing attach and detach only once? If not, can you please
try this and let me know whether doing attach and detach a single time also
results in global FPU corruption. Please do not fork in the above process.
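
For reference, the attach/wait/detach sequence from question 2 might look like this from user space. This is only a sketch of the pattern being asked about, not the pt program from the original report; the demo_ function names are made up, and error handling is kept minimal.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Attach to pid, wait for the resulting stop, detach; repeat.
 * Returns the number of completed rounds, or -1 on the first failure.
 */
int demo_attach_detach(pid_t pid, int rounds)
{
    assert(rounds >= 0);

    for (int i = 0; i < rounds; i++) {
        int status = 0;

        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
            return -1;

        /* PTRACE_ATTACH only sends SIGSTOP; wait until the stop is reported. */
        if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status)) {
            ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* best effort */
            return -1;
        }

        if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1)
            return -1;
    }
    return rounds;
}

/* Helper for experiments: fork a child that just sleeps in pause(). */
pid_t demo_spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid;
}
```

To try it, spawn a target with demo_spawn_idle_child(), call demo_attach_detach(child, n), then kill and reap the child. An idle target exercises only the system call sequence; reproducing the reported behavior would need a target that is busy with floating point work.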

-

Whenever process A calls sys_ptrace(attach) on process B, sys_ptrace sends
SIGSTOP to process B.
Process B, in do_signal, then checks that it is being traced and does the
following:
 current->state = TASK_STOPPED;
 notify_parent(current, SIGCHLD);
 schedule();

So in schedule() --> __switch_to --> unlazy_fpu() we do the following:
 if (current->flags & PF_USEDFPU)
  save_fpu();

In save_fpu() we do the following:
 fnsave current->tss.i387
 fwait;

I want to ask a question: is it possible that 'somehow' we were not able to
save the complete floating point state with fnsave, i.e. that
current->tss.i387 is 'invalid' after
 fnsave current->tss.i387
 fwait;

Thanks
Amol

David Konerding <[EMAIL PROTECTED]> on 04/23/2001 01:09:27 AM

To:   Ulrich Drepper <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2

Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first.  The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for user-space
programs, does anybody have any comments on the original bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior.  The first
>program, pi, repeatedly spawns 10 parallel computations of pi.  When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process.  Run pt against the root pi process until the output
>of pi begins to look wrong.  Then kill everything and run pi by itself
>again.  It will no longer produce good results.  We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
described.  If it is a bug in the linux kernel (I can see nothing wrong with
the source code provided), I would suspect problems with SMP and ptrace,
somehow causing the wrong FP registers to be returned to a process after the
scheduler restarted it.  It's very interesting that the PI program works fine
until you run PT, but after you run PT, PI is screwed until reboot.


Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread alad









Erik Paulson <[EMAIL PROTECTED]> on 04/24/2001 01:14:27 AM

To:   Christian Ehrhardt <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2




On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> >
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
>
<...>
>
> 3.) It might be interesting to know if the problem can be triggered:
> a) If pi doesn't fork, i.e. just one process calculating pi and
> another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

> b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over and over some
integer adds does not seem to trigger it, while running pi on the
machine at the same time, but not attaching to it, does not seem to
affect the floating point state.

>>>> Well, during context switching the call to unlazy_fpu() does the following:
if (current->flags & PF_USEDFPU)
  save_fpu();

Somebody pointed out earlier the possible race in sys_ptrace, where we modify
child->flags at the time of attach.
It still looks strange to me that software is causing the problem, as
the code that handles the FPU looks pretty clean.
Still, can we check current->flags when the problem occurs?


Amol


-Erik


Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread Erik Paulson


On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> > 
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> 
<...>
> 
> 3.) It might be interesting to know if the problem can be triggered:
> a) If pi doesn't fork, i.e. just one process calculating pi and
> another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

> b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
> 

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over and over some
integer adds does not seem to trigger it, while running pi on the
machine at the same time, but not attaching to it, does not seem to
affect the floating point state.

-Erik


Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread Christian Ehrhardt

On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> 
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

A few comments, not sure if they will help very much:

1.) If I'm not mistaken switch_to changes current->flags without
atomic operations and without any locks and sys_ptrace changes
child->flags only protected by the big kernel lock.
I could imagine that this causes local corruption on an SMP machine
and this is something that changed in 2.4 kernels, but I don't see
how this can corrupt FPU state globally. Maybe there is something else.

2.) I guess a single finit (as proposed by someone else in this thread)
won't assure that both FPUs are in a sane state.

3.) It might be interesting to know if the problem can be triggered:
a) If pi doesn't fork, i.e. just one process calculating pi and
another one doing the attach/detach.
b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.

   regards   Christian

-- 
THAT'S ALL FOLKS!

Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread kees


Hello,

Linux 2.2.19 SMP, confirm report. Even games are going weird after
running this test, (my wife is complaining :-))

Have to reboot.

Kees



Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread Alan Cox


> OK, regardless of how the linux kernel actually manages the FPU for user-space
> programs, does anybody have any comments on the original bugreport?

Complete mystification.

> >of pi begins to look wrong.  Then kill everything and run pi by itself
> >again.  It will no longer produce good results.  We find that the FPU
> >persistently gives bad results until we reboot.
> 
> I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
> described.

This is the oddest bit of all. The FPU state is private to each task, and
each task initializes its own FPU state. In terms of FPU state itself I
don't currently see what there is that can be left behind.


Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread David Konerding


Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first.  The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for user-space
programs, does anybody have any comments on the original bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior.  The first
>program, pi, repeatedly spawns 10 parallel computations of pi.  When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process.  Run pt against the root pi process until the output
>of pi begins to look wrong.  Then kill everything and run pi by itself
>again.  It will no longer produce good results.  We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
described.  If it is a bug in the linux kernel (I can see nothing wrong with
the source code provided), I would suspect problems with SMP and ptrace,
somehow causing the wrong FP registers to be returned to a process after the
scheduler restarted it.  It's very interesting that the PI program works fine
until you run PT, but after you run PT, PI is screwed until reboot.




Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper


"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows.

Maybe you should try to understand the kernel code and the features of
the processor first.  The kernel can detect when the FPU is used for
the first time.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `

Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy



It looks to me like the kernel sets a trap for FP operations when a
process is switched in.  Then when the process executes an FP op, the
kernel clears the trap and either loads the FP context or initializes
it, depending on whether it is the process' first FP operation.  So no
help is needed from anything in user space.

Vic

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> On 20 Apr 2001, Ulrich Drepper wrote:
> 
> > "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> > 
> > > If it "fixes" it, there is no problem with the FPU, but with the
> > > 'C' runtime library which doesn't initialize the FPU to a known
> > > state before it uses it.
> > 
> > It's the kernel which initializes the FPU.  This was always the case
> > and necessary to implement the fast lazy FPU saving/restoring.
> > Processes which never use the FPU never initialize it.
> 
> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows. If the user is using 'asm' or whatever, the user must


Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Victor Zandy wrote:

> 
> No dice.  Your program does not fix the problem.
> 
> If it were a hardware problem, I would expect the problem to occur
> under 2.4.2 as well as 2.2.*, and I would be surprised that we can
> consistently produce the behavior across our 64 node cluster.  But we
> are keeping the possibility in mind.
> 
> Thanks for your suggestions.
> 
> Vic
> 

Then, if the FPU is fine, you have just proven that the storage
where the FPU context is saved gets overwritten. Further, once the
initial write occurs, all subsequent fnsave/frstor operations also
encounter the same spurious write. Or some continuously-running
floating-point code has sneaked into the kernel.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> 
> > If it "fixes" it, there is no problem with the FPU, but with the
> > 'C' runtime library which doesn't initialize the FPU to a known
> > state before it uses it.
> 
> It's the kernel which initializes the FPU.  This was always the case
> and necessary to implement the fast lazy FPU saving/restoring.
> Processes which never use the FPU never initialize it.

The kernel doesn't know if a process is going to use the FPU when
a new process is created. Only the user's code, i.e., the 'C' runtime
library knows. If the user is using 'asm' or whatever, the user must
initialize the FPU before using it, otherwise, the user doesn't know
anything about its state and the results ... (let's see, what was at
TOS, errm, is this a NAN?). The results are indeterminate.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> If it "fixes" it, there is no problem with the FPU, but with the
> 'C' runtime library which doesn't initialize the FPU to a known
> state before it uses it.

It's the kernel which initializes the FPU.  This was always the case
and necessary to implement the fast lazy FPU saving/restoring.
Processes which never use the FPU never initialize it.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `

Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy



No dice.  Your program does not fix the problem.

If it were a hardware problem, I would expect the problem to occur
under 2.4.2 as well as 2.2.*, and I would be surprised that we can
consistently produce the behavior across our 64 node cluster.  But we
are keeping the possibility in mind.

Thanks for your suggestions.

Vic

Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Victor Zandy wrote:

> 
> Victor Zandy <[EMAIL PROTECTED]> writes:
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> 
> We have now tested 2.4.2 and 2.2.19.
> 
> 2.2.19 has the same problem.
> 
> 2.4.3 does not seem to be affected.  Unfortunately, we really need a
> working 2.2 kernel at this time.
> 
> We also patched the 2.2.19 kernel with the PIII patch found in
> /pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
> on ftp.kernel.org.  Same problem.
> 
> Does anyone have any ideas for us?
> 
> Thanks.
> 
> Vic

Just for kicks, do whatever is necessary to "break" the fpu. Then run
this program:

/* Reset the x87 FPU to its power-on defaults, then exit. */
int main(void)
{
    __asm__ volatile ("finit");
    return 0;
}

If it "fixes" it, there is no problem with the FPU, but with the
'C' runtime library which doesn't initialize the FPU to a known
state before it uses it. It is possible for the kernel to work
around the 'C' library problem by clearing the FPU after every
fork(). The last time I checked (years ago), 'finit' was executed
during the fork. Maybe it isn't anymore because it takes many
machine cycles to complete.

If this doesn't "fix" it, then your hardware may have a problem
like overheating, etc., (loose heatsink?).

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy

Victor Zandy <[EMAIL PROTECTED]> writes:
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

We have now tested 2.4.2 and 2.2.19.

2.2.19 has the same problem.

2.4.3 does not seem to be affected.  Unfortunately, we really need a
working 2.2 kernel at this time.

We also patched the 2.2.19 kernel with the PIII patch found in
/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
on ftp.kernel.org.  Same problem.

Does anyone have any ideas for us?

Thanks.

Vic


Re: BUG: Global FPU corruption in 2.2

2001-04-19 Thread Michal Jaegermann

On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> 
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.

> 
> We see this problem on dual 550MHz Xeons with 1GB RAM.

Hm, I started to wonder whether this is somehow related to a recent
report I got.  "The victim" was running 2.2.19 (basically) on an SMP
Alpha UP2000+ with two 800 MHz processors.  He managed to reduce the
problem to a rather small test case, and I attach the sources, a Makefile
and a "loop.sh" driver as a shar archive in case you want a closer look.

This "loop.sh" simply fires off triplets of "harry" processes in a loop.
The guy hit by this gets apparently random floating point exceptions,
starting at roughly the sixth process, and the intervals between later
bombs vary.  I also have 'strace' output from failing processes, but
it does not tell very much.  'gdb' is not very illuminating either:

Program received signal SIGFPE, Arithmetic exception.
0x1200010a8 in vadd_ (a=0x11fff21e4, ia=0x120003294, b=0x11fff7004, 
ib=0x120003294, c=0x11fffbe20, ic=0x120003294, n=0x11c70) at vadd.f:99
99   C(CI) = A(AI) + B(BI)
Current language:  auto; currently fortran

(gdb) p *ia
$10 = 1
(gdb) p *ib
$11 = 1
(gdb) p *ic
$12 = 1
(gdb) p *n
Cannot access memory at address 0x4
(gdb) p *(0x11c70)
$13 = 1024

(gdb) info locals
n = (PTR TO -> ( integer )) 0x4
__g77_expr_0 = 10

He tells me that he is getting that on two different machines he has
around.

The trouble is that I tried to reproduce that with different hardware,
kernels, compilers and libraries and I failed, even on SMP; but I only
had access to a box with 667 MHz processors.  OTOH he is now running
2.4.3-ac9 plus Andrea Arcangeli's patches for rw semaphores
on Alpha and he reports that the problem went away (and, hopefully,
nothing else will crop up :-).

Can anybody offer some insight into what this may really be?  It may be,
of course, totally unrelated to this report from Victor Zandy.

  Michal
  [EMAIL PROTECTED]

 fpbomb.shar
