Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Michal Jaegermann

On Tue, Apr 24, 2001 at 06:56:32PM +0200, Christian Ehrhardt wrote:
> On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> > ptrace only operates on processes that are stopped. So there are no
> > locking issues - we've synchronized on a much higher level than a
> > spinlock or semaphore.
> 
> This is only true for requests other than PTRACE_ATTACH and
> PTRACE_ATTACH is exactly what I'm worried about.

May I remind everybody that at the beginning of this thread I posted
another example, from an SMP Alpha, of FPU problems.  It certainly
was not exactly like the one under discussion, but it looked as if
it had a similar "smell" to it.

It looks as though reproducing this Alpha example requires processors
with a rather fast clock, and this hardware version is not yet very
widely available.

  Michal
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox

> Alan Cox <[EMAIL PROTECTED]> writes:
> > The preferable one for performance is certainly to backport the 2.4 changes
> 
> Is it any more substantial than changing all uses of the ptrace flags
> to the new variable?

It affects asm blocks and offsets on some ports. It's not too bad, though.




Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy

Alan Cox <[EMAIL PROTECTED]> writes:

> The preferable one for performance is certainly to backport the 2.4 changes

Is it any more substantial than changing all uses of the ptrace flags
to the new variable?



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox

> > child->flags |= PF_PTRACED; 
> > 
> > without waiting for the child to have stopped. 
> 
> I can see how this could cause PF_USEDFPU to be cleared inadvertently,
> but I do not have any ideas for testing this.  Is it clear that this
> is the source of the problem?

There is no guarantee that |= is implemented atomically - in fact it's quite
likely to read

get child->flags
or PF_PTRACED
write child->flags

and a PF_USEDFPU update on another processor at the same instant -would- end
up being lost.
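
[Editor's sketch, not part of the original message. The three-step sequence above can be replayed by hand in ordinary C to show how the update gets lost. Everything here is an assumption made for illustration: the flag values, the name lost_update_demo, and the use of a local variable in place of child->flags. It runs single-threaded so the bad interleaving is deterministic; it is not kernel code.]

```c
/* Illustrative flag values; check include/linux/sched.h in your own tree. */
#define PF_PTRACED 0x00000010UL
#define PF_USEDFPU 0x00100000UL

/*
 * Replays one bad interleaving of two read-modify-write sequences on
 * the same word. "cpu0" is the tracer doing flags |= PF_PTRACED;
 * "cpu1" is the child setting PF_USEDFPU in that word meanwhile.
 */
static unsigned long lost_update_demo(void)
{
    unsigned long flags = 0;          /* stands in for child->flags */

    unsigned long cpu0_tmp = flags;   /* cpu0: get child->flags */
    flags |= PF_USEDFPU;              /* cpu1: sets PF_USEDFPU in between */
    cpu0_tmp |= PF_PTRACED;           /* cpu0: or PF_PTRACED */
    flags = cpu0_tmp;                 /* cpu0: write child->flags */

    return flags;                     /* cpu1's bit has been overwritten */
}
```

[In this replay the returned word contains PF_PTRACED only; the PF_USEDFPU bit written by the other "processor" is gone.]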

There are two fixes:

1.  Make all the ops atomic (foo_bit())
2.  Split the flags

The preferable one for performance is certainly the second, i.e. to backport
the 2.4 changes.
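
[Editor's sketch, not part of the original message. As a userspace analogue of the first fix, the flag word can be updated with atomic read-modify-write operations. This uses C11 <stdatomic.h>, which is an assumption of this sketch; the kernel has its own bit-operation helpers, and the struct and function names below are hypothetical.]

```c
#include <stdatomic.h>

/* Illustrative flag values; check include/linux/sched.h in your own tree. */
#define PF_PTRACED 0x00000010UL
#define PF_USEDFPU 0x00100000UL

/* Hypothetical stand-in for the task structure's flag word. */
struct task_sketch {
    atomic_ulong flags;
};

/* Atomically set the bits in mask; concurrent setters cannot lose bits. */
static void flag_set(struct task_sketch *t, unsigned long mask)
{
    atomic_fetch_or(&t->flags, mask);
}

/* Atomically clear the bits in mask. */
static void flag_clear(struct task_sketch *t, unsigned long mask)
{
    atomic_fetch_and(&t->flags, ~mask);
}

/* Read the current flag word. */
static unsigned long flags_read(struct task_sketch *t)
{
    return atomic_load(&t->flags);
}
```

[The cost is that every update of the word becomes an atomic operation, which is presumably why splitting the flags is described as the better option for performance.]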




Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy


Linus Torvalds writes:
> Ahh.. This actually _does_ look like a race on "current->flags": 
> PTRACE_ATTACH will do a 
> 
> child->flags |= PF_PTRACED; 
> 
> without waiting for the child to have stopped. 

I can see how this could cause PF_USEDFPU to be cleared inadvertently,
but I do not have any ideas for testing this.  Is it clear that this
is the source of the problem?

What would be involved in backporting the split ptrace flags to 2.2?
Are there other solutions?

Vic



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy

"Christian Ehrhardt" <[EMAIL PROTECTED]> writes:
> Victor: Could you try to reproduce the system wide corruption if you
> add an explicit call to stts(); at the very end of __switch_to?
> This should prevent the FPU corruption from spreading.

After adding this call, I cannot reproduce the global corruption.
There is still occasional local corruption of individual pi processes
while pt is running.

Vic







Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt

On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

This is only true for requests other than PTRACE_ATTACH and
PTRACE_ATTACH is exactly what I'm worried about.

   regards   Christian

-- 
THAT'S ALL FOLKS!



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt

On Tue, Apr 24, 2001 at 08:05:15AM -0500, Victor Zandy wrote:
> 
> He found that PF_USEDFPU was always set before the machine was broken.
> After he found that it was set about 70% of the time.

If I'm not mistaken this actually can cause GLOBAL FPU corruption.
Here's why:

Assume for a moment that we lose the PF_USEDFPU flag of one
process. This not only means that the current process won't have its
state saved, it also means that the next process won't have the TS bit
set. This in turn means that this new process won't get PF_USEDFPU set
and suddenly we have a second process with a corrupted FPU state.

Victor: Could you try to reproduce the system wide corruption if you
add an explicit call to stts(); at the very end of __switch_to?
This should prevent the FPU corruption from spreading.

NOTE: This is just to prove my theory; it is not meant to be a fix
for the actual problem.
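
[Editor's sketch, not part of the original message. The propagation argument and the stts() experiment can be modelled with a toy lazy-FPU simulation in plain C. All names and values here are hypothetical, the FPU contents are reduced to a single int, and the switch logic is only a model of what the message describes, not the 2.2 source.]

```c
#define PF_USEDFPU 0x00100000UL   /* illustrative value */

/* One CPU: the TS bit plus whatever is currently live in the FPU. */
struct cpu_sketch  { int ts; int fpu_regs; };
/* One task: its flag word and its saved FPU image. */
struct task_sketch { unsigned long flags; int saved_fpu; };

/* Switch away from prev. State is saved and TS set only if PF_USEDFPU is set. */
static void switch_from(struct cpu_sketch *cpu, struct task_sketch *prev,
                        int always_stts)
{
    if (prev->flags & PF_USEDFPU) {
        prev->saved_fpu = cpu->fpu_regs;   /* save the outgoing state */
        prev->flags &= ~PF_USEDFPU;
        cpu->ts = 1;                       /* next FPU use will trap */
    }
    if (always_stts)
        cpu->ts = 1;                       /* the experimental extra stts() */
}

/* A task touches the FPU; returns the register contents it observes. */
static int fpu_touch(struct cpu_sketch *cpu, struct task_sketch *t)
{
    if (cpu->ts) {                         /* trap path: reload own state */
        cpu->ts = 0;
        cpu->fpu_regs = t->saved_fpu;
        t->flags |= PF_USEDFPU;
    }
    return cpu->fpu_regs;
}

/* Task A (live value 111) loses PF_USEDFPU, then the CPU switches to B (222). */
static int value_seen_by_b(int always_stts)
{
    struct cpu_sketch cpu = { 0, 111 };         /* A's state live, TS clear */
    struct task_sketch a = { PF_USEDFPU, 0 };
    struct task_sketch b = { 0, 222 };

    a.flags &= ~PF_USEDFPU;                     /* the flag lost to the race */
    switch_from(&cpu, &a, always_stts);
    return fpu_touch(&cpu, &b);
}
```

[In this model, without the extra stts() task B sees A's value 111 instead of its own 222. With it, B traps and reloads 222, while A's unsaved state is still lost, so the damage stays local to A.]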

   regards   Christian Ehrhardt

-- 

THAT'S ALL FOLKS!



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Linus Torvalds

[ Alan, I'm lazy and only have 2.2.14 sources on-line. Maybe this has
  been fixed already and there's something else going on. Worth a look ]

In article <[EMAIL PROTECTED]>,
Victor Zandy  <[EMAIL PROTECTED]> wrote:
>
>Someone else here traced the process flags of a FP-intensive program
>on a machine before and after it is put in the faulty FPU state.  He
>periodically sampled /proc/pid/stat while the program was running.
>
>He found that PF_USEDFPU was always set before the machine was broken.
>After he found that it was set about 70% of the time.

[ Looks closer at the ptrace synchronization ]

Ahh.. This actually _does_ look like a race on "current->flags":
PTRACE_ATTACH will do a

child->flags |= PF_PTRACED;

without waiting for the child to have stopped.

(Aside: thinking more about the stopping logic - I'm not actually sure
the ptrace synchronization is complete wrt scheduling, as there will be
a window when the process has set the task state to TASK_STOPPED but
hasn't actually yet scheduled away. Oh, well).

All other ptrace operations (not counting killing the child) will check
that the child is quiescent.  But PTRACE_ATTACH will not, as we're just
setting up the stopping.

In 2.4.x, this bug doesn't happen because "flags" was split up into
"current->ptrace" and "current->flags".  Exactly because of locking
concerns.
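
[Editor's sketch, not part of the original message. The effect of the split can be shown by replaying the same kind of interleaving with two separate words. The struct, the function name and the flag values are hypothetical; this is a single-threaded model, not the 2.4 task structure.]

```c
/* Illustrative values; check the headers of your own tree. */
#define PT_PTRACED 0x00000001UL
#define PF_USEDFPU 0x00100000UL

/* Hypothetical task with ptrace state kept apart from the PF_ flags. */
struct task_split {
    unsigned long flags;    /* written by the task itself */
    unsigned long ptrace;   /* written by the tracer */
};

/* Tracer and child interleave, but each touches only its own word. */
static struct task_split split_demo(void)
{
    struct task_split t = { 0, 0 };

    unsigned long cpu0_tmp = t.ptrace;   /* tracer: load the ptrace word */
    t.flags |= PF_USEDFPU;               /* child: touches only flags */
    cpu0_tmp |= PT_PTRACED;              /* tracer: or in its bit */
    t.ptrace = cpu0_tmp;                 /* tracer: store the ptrace word */

    return t;                            /* both updates survive */
}
```

[Because the tracer's store no longer covers the word holding PF_USEDFPU, the attach cannot overwrite it, whatever the interleaving.]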

Linus



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox

> >1.) If I'm not mistaken switch_to changes current->flags without
> >atomic operations and without any locks and sys_ptrace changes
> >child->flags only protected by the big kernel lock.
> 
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

In the 2.2 case the ptrace flags themselves are in the same flag set as
the PF_ flags. In 2.4 that was fixed. That means there are some bizarre cases
where current->flags might not be handled perfectly.




Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Christian Ehrhardt <[EMAIL PROTECTED]> wrote:
>
>1.) If I'm not mistaken switch_to changes current->flags without
>atomic operations and without any locks and sys_ptrace changes
>child->flags only protected by the big kernel lock.

ptrace only operates on processes that are stopped. So there are no
locking issues - we've synchronized on a much higher level than a
spinlock or semaphore.

That said, it does look like 2.2.x has a real bug, and maybe the ptrace
task stopping synchronization is broken.

Linus



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy


Someone else here traced the process flags of a FP-intensive program
on a machine before and after it is put in the faulty FPU state.  He
periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken.
Afterwards he found that it was set about 70% of the time.
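
[Editor's sketch, not part of the original message. One way to sample the flag is to parse the flags field, the ninth field of a /proc/<pid>/stat line, and test the PF_USEDFPU bit. The function name and the flag value are assumptions; the actual sampling tool used here is not shown in the thread.]

```c
#include <stdio.h>
#include <string.h>

#define PF_USEDFPU 0x00100000UL   /* illustrative value */

/*
 * Extracts the flags field (9th) from one /proc/<pid>/stat line.
 * Returns 0 on success and -1 if the line cannot be parsed.
 */
static int stat_flags(const char *line, unsigned long *flags)
{
    /* comm is in parentheses and may contain spaces, so skip to the last ')'. */
    const char *p = strrchr(line, ')');

    if (p == NULL)
        return -1;
    /* After comm: state ppid pgrp session tty_nr tpgid flags */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %lu", flags) != 1)
        return -1;
    return 0;
}
```

[A sampler would read the line periodically and count how often (flags & PF_USEDFPU) is non-zero.]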

Vic






Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread alad






Hi,
 I want to look into this problem. It seems to be very interesting. But I
was not following the thread from the beginning (and I mistakenly deleted all
these mails :( ). I hope you won't mind answering the following questions.

1) Are you doing this on an MP or a uniprocessor?
2) I want to know how you are calling sys_ptrace(attach) and
sys_ptrace(detach), i.e. is it something like the following:

  for(;;){
    sys_ptrace(attach to process);
    sys_wait4();
    sys_ptrace(detach from process);
  }

In short, what is the sequence of system calls you are using for attaching
to and detaching from the process?

3) Have you tried doing attach and detach only once? If not, can you please
try this and let me know whether doing attach and detach just one time also
results in global FPU corruption. Please do not fork in the above process.
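
[Editor's sketch, not part of the original message. From user space, one attach / wait / detach cycle of the kind asked about in question 2 could look like the following. This is not the pt program from the original report, whose source does not appear in this part of the thread; the function name is hypothetical.]

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* One attach / wait / detach cycle against pid; 0 on success, -1 on error. */
static int attach_detach_once(pid_t pid)
{
    int status;

    /* Attaching makes the kernel send SIGSTOP to the target. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;
    /* Wait until the target has actually stopped. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    /* Detach and let the target continue. */
    if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1)
        return -1;
    return 0;
}
```

[Calling this once answers question 3; calling it in a loop gives the repeated attach and detach pattern of question 2.]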

-

Whenever process A calls sys_ptrace(attach) on process B, sys_ptrace sends
SIGSTOP to process B.
Process B, in do_signal, then checks that it is being traced and does the
following:
 current->state = TASK_STOPPED;
 notify_parent(current,SIGCHLD);
 schedule();

So now in schedule() --> __switch_to --> unlazy_fpu() we do the following:
 if (current->flags & PF_USEDFPU)
  save_fpu();

In save_fpu() we do the following:
 fnsave current->tss.i387
 fwait;

I want to ask a question: is it possible that 'somehow' we were not able to
save the complete floating point state with fnsave, i.e. that
current->tss.i387 is 'invalid' after
 fnsave current->tss.i387
 fwait;

Thanks
Amol




David Konerding <[EMAIL PROTECTED]> on 04/23/2001 01:09:27 AM

To:   Ulrich Drepper <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2




Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first.  The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for user-space
programs, does anybody have any comments on the original bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior.  The first
>program, pi, repeatedly spawns 10 parallel computations of pi.  When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process.  Run pt against the root pi process until the output
>of pi begins to look wrong.  Then kill everything and run pi by itself
>again.  It will no longer produce good results.  We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
described. If it is a bug in the linux kernel (I can see nothing wrong with
the source code provided), I would suspect problems with SMP and ptrace,
somehow causing the wrong FP registers to be returned to a process after the
scheduler restarted it.  It's very interesting that the PI program works fine
until you run PT, but after you run PT, PI is screwed until reboot.






Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt

On Tue, Apr 24, 2001 at 08:05:15AM -0500, Victor Zandy wrote:
 
 He found that PF_USEDFPU was always set before the machine was broken.
 After he found that it was set about 70% of the time.

If I'm not mistaken this actully can cause GLOBAL FPU corruption.
Here's why:

Assyme for a moment that we lose either the PF_USEDFPU flag of one
process. This not only means that the current process won't have its
state saved, it also means that the next process won't have the TS bit
set. This in turn means that this new process won't get PF_USEDFPU set
and suddenly we have a second process with a corrupted FPU state.

Victor: Could you try to reproduce the system wide corruption if you
add an explicit call to stts(); at the very end of __switch_to?
This should prevent the FPU corruption from spreading.

NOTE: This is just to prove my theory, it is not and isn't meant
to be a fix for the actual problem.
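
The chain of events above can be replayed in a toy model. This is a sketch under loud assumptions: a single double stands in for the whole FPU state, the structs and function names are hypothetical, and switch_from() only mimics the lazy save logic as described in this thread, not the actual 2.2 source.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_USEDFPU 0x00100000UL	/* assumed value */

struct task { unsigned long flags; bool used_math; double saved; };
struct cpu  { bool ts; double reg; };	/* one register stands in for the FPU */

/* Device-not-available trap: fires on the first FP op while TS is set. */
static void math_trap(struct cpu *c, struct task *t)
{
	c->ts = false;					/* clts */
	c->reg = t->used_math ? t->saved : 0.0;		/* restore or init */
	t->used_math = true;
	t->flags |= PF_USEDFPU;
}

/* The task loads a value into the FPU. */
void fp_store(struct cpu *c, struct task *t, double v)
{
	if (c->ts)
		math_trap(c, t);
	c->reg = v;
}

/* The task reads the FPU register back. */
double fp_load(struct cpu *c, struct task *t)
{
	if (c->ts)
		math_trap(c, t);
	return c->reg;
}

/* Switch away from prev; always_stts models the proposed experiment. */
void switch_from(struct cpu *c, struct task *prev, bool always_stts)
{
	if (prev->flags & PF_USEDFPU) {
		prev->saved = c->reg;		/* fnsave */
		prev->flags &= ~PF_USEDFPU;
		c->ts = true;			/* stts */
	}
	if (always_stts)
		c->ts = true;
}

/* A loses PF_USEDFPU, then B runs on the same CPU.  Returns what B
 * reads; B's own saved value is 7.0. */
double value_seen_by_b(bool always_stts)
{
	struct cpu c = { true, 0.0 };
	struct task a = { 0, false, 0.0 };
	struct task b = { 0, true, 7.0 };

	fp_store(&c, &a, 3.0);		/* A uses the FPU */
	assert(a.flags & PF_USEDFPU);	/* the trap marked A as FPU user */
	a.flags &= ~PF_USEDFPU;		/* the flag is lost to the race */
	switch_from(&c, &a, always_stts);
	return fp_load(&c, &b);
}
```

In this model value_seen_by_b(false) hands B the value A left in the register, while value_seen_by_b(true) gives B its own saved state back. A's state is still lost in both cases, so the extra stts() can only stop the spreading, not the original corruption.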

   regards   Christian Ehrhardt

-- 

THAT'S ALL FOLKS!



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Christian Ehrhardt

On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

This is only true for requests other than PTRACE_ATTACH and
PTRACE_ATTACH is exactly what I'm worried about.

   regards   Christian

-- 
THAT'S ALL FOLKS!



Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Victor Zandy

Christian Ehrhardt <[EMAIL PROTECTED]> writes:
> Victor: Could you try to reproduce the system wide corruption if you
> add an explicit call to stts(); at the very end of __switch_to?
> This should prevent the FPU corruption from spreading.

After adding this call, I cannot reproduce the global corruption.
There is still occasional local corruption of individual pi processes
while pt is running.

Vic







Re: BUG: Global FPU corruption in 2.2

2001-04-24 Thread Alan Cox

> > 	child->flags |= PF_PTRACED;
> >
> > without waiting for the child to have stopped.
>
> I can see how this could cause PF_USEDFPU to be cleared inadvertently,
> but I do not have any ideas for testing this.  Is it clear that this
> is the source of the problem?

There is no guarantee that |= is implemented atomically - in fact it's quite
likely to read

	get child->flags
	or PF_PTRACED
	write child->flags

and a PF_USEDFPU update on another processor at the same instant -would- end
up being lost.

There are two fixes

1.  Make all the ops atomic (foo_bit())
2.  Split the flags

The preferable one for performance is certainly to backport the 2.4 changes
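
The lost update and the first fix can be sketched in portable C. This is an illustration only: the function names are made up, the flag values are the ones I recall from 2.2 and should be verified, and C11 atomics stand in for the kernel's own bit operations. The racy version replays one bad interleaving by hand rather than relying on real threads, so the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Flag values as recalled from the 2.2 headers; treat as illustrative. */
#define PF_PTRACED 0x00000010UL
#define PF_USEDFPU 0x00100000UL

/* The tracer's "|=" split into get / or / write, with the child
 * setting PF_USEDFPU on another CPU in between. */
unsigned long racy_attach(unsigned long flags)
{
	unsigned long tmp = flags;	/* CPU 0: get child->flags   */

	flags |= PF_USEDFPU;		/* CPU 1: child uses the FPU */
	tmp |= PF_PTRACED;		/* CPU 0: or PF_PTRACED      */
	flags = tmp;			/* CPU 0: write child->flags */
	return flags;
}

/* Fix 1: every update is one indivisible read-modify-write. */
unsigned long atomic_attach(unsigned long initial)
{
	atomic_ulong flags;

	atomic_init(&flags, initial);
	atomic_fetch_or(&flags, PF_USEDFPU);	/* CPU 1 */
	atomic_fetch_or(&flags, PF_PTRACED);	/* CPU 0 */
	return atomic_load(&flags);
}

/* The racy replay drops PF_USEDFPU; the atomic one keeps both bits. */
void self_check(void)
{
	assert(racy_attach(0) == PF_PTRACED);
	assert(atomic_attach(0) == (PF_PTRACED | PF_USEDFPU));
}
```

The atomic variant costs a locked instruction on every flag update, which is presumably the performance concern behind preferring the split.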




Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread alad

Erik Paulson <[EMAIL PROTECTED]> on 04/24/2001 01:14:27 AM

To:   Christian Ehrhardt <[EMAIL PROTECTED]>
cc:   [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)

Subject:  Re: BUG: Global FPU corruption in 2.2




On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> >
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
>
<...>
>
> 3.) It might be interesting to know if the problem can be triggered:
> a) If pi doesn't fork, i.e. just one process calculating pi and
> another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

> b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over and over some
integer adds does not seem to trigger it, while running pi on the
machine at the same time, but not attaching to it, does not seem to
affect the floating point state.

>>>> Well... during context switching, the call to unlazy_fpu() does the following:
if (current->flags & PF_USEDFPU)
  save_fpu();

Somebody earlier pointed out a possible race: in sys_ptrace, at the
time of attach, we modify child->flags.
It still looks strange that software is causing the problem, as
the code that handles the FPU looks pretty clean.
Still, can we check current->flags when the problem occurs?


Amol


-Erik




Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread Erik Paulson

On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> > 
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> 
<...>
> 
> 3.) It might be interesting to know if the problem can be triggered:
> a) If pi doesn't fork, i.e. just one process calculating pi and
> another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

> b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over and over some
integer adds does not seem to trigger it, while running pi on the
machine at the same time, but not attaching to it, does not seem to
affect the floating point state.

-Erik




Re: BUG: Global FPU corruption in 2.2

2001-04-23 Thread Christian Ehrhardt

On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> 
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

A few comments, not sure if they will help very much:

1.) If I'm not mistaken switch_to changes current->flags without
atomic operations and without any locks and sys_ptrace changes
child->flags only protected by the big kernel lock.
I could imagine that this causes local corruption on an SMP machine
and this is something that changed in 2.4 kernels, but I don't see
how this can corrupt FPU state globally. Maybe there is something else.

2.) I guess a single finit (as proposed by someone else in this thread)
won't assure that both FPUs are in a sane state.

3.) It might be interesting to know if the problem can be triggered:
a) If pi doesn't fork, i.e. just one process calculating pi and
another one doing the attach/detach.
b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.

   regards   Christian

-- 
THAT'S ALL FOLKS!



Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread kees

Hello,

Linux 2.2.19 SMP, confirm report. Even games are going weird after
running this test, (my wife is complaining :-))

Have to reboot.

Kees





Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread Alan Cox

> OK, regardless of how the linux kernel actually manages the FPU for user-space
> programs, does anybody have any comments on the original bugreport?

Complete mystification.

> >of pi begins to look wrong.  Then kill everything and run pi by itself
> >again.  It will no longer produce good results.  We find that the FPU
> >persistently gives bad results until we reboot.
> 
> I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
> described.

This is the most odd bit of all. The processor state for the FPU is per task
private and each task initializes its own FPU state. In terms of FPU state
itself I don't currently see what there is that can be left behind




Re: BUG: Global FPU corruption in 2.2

2001-04-22 Thread David Konerding

Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first.  The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for user-space
programs, does anybody have any comments on the original bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior.  The first
>program, pi, repeatedly spawns 10 parallel computations of pi.  When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process.  Run pt against the root pi process until the output
>of pi begins to look wrong.  Then kill everything and run pi by itself
>again.  It will no longer produce good results.  We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
described.  If it is a bug in the linux kernel (I can see nothing wrong with
the source code provided), I would suspect problems with SMP and ptrace,
somehow causing the wrong FP registers to be returned to a process after the
scheduler restarted it.  It's very interesting that the PI program works fine
until you run PT, but after you run PT, PI is screwed until reboot.
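
For anyone wanting a quick check of whether a machine is in the broken state, a self-contained FPU sanity test in the spirit of the pi program might look like the sketch below. It is not the original pi source, which is not reproduced in this part of the thread. The function names and the 1e-12 tolerance are my own choices, and Machin's formula is used simply because it converges quickly in plain double arithmetic.

```c
#include <assert.h>

/* arctan(1/n) from its Taylor series; converges quickly for n > 1. */
static double arctan_inv(int n)
{
	double x = 1.0 / n, x2 = x * x, term = x, sum = 0.0;
	int k;

	assert(n > 1);
	for (k = 0; k < 30; k++) {
		sum += (k % 2 ? -term : term) / (2 * k + 1);
		term *= x2;
	}
	return sum;
}

/* Machin's formula: pi = 16 arctan(1/5) - 4 arctan(1/239). */
double machin_pi(void)
{
	return 16.0 * arctan_inv(5) - 4.0 * arctan_inv(239);
}

/* Nonzero if the computed value is within 1e-12 of the known constant. */
int fpu_looks_sane(void)
{
	double d = machin_pi() - 3.14159265358979323846;

	return (d < 0 ? -d : d) < 1e-12;
}
```

On a healthy machine fpu_looks_sane() should return nonzero every time; a zero result would point at corrupted FPU state of the kind described in the report.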






Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows.

Maybe you should try to understand the kernel code and the features of
the processor first.  The kernel can detect when the FPU is used for
the first time.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `



Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy


It looks to me like the kernel sets a trap for FP operations when a
process is switched in.  Then when the process executes an FP op, the
kernel clears the trap and either loads the FP context or initializes
it, depending on whether it is the process' first FP operation.  So no
help is needed from anything in user space.
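
The behaviour described above can be written down as a tiny state machine. It is a sketch only: the type and function names are hypothetical, and it models just the trap decision, not any register contents or the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fp_outcome { FP_NO_TRAP, FP_TRAP_INIT, FP_TRAP_RESTORE };

struct task_model { bool used_math; };	/* has this task ever used the FPU? */
struct cpu_model  { bool ts; };		/* task-switched bit: trap on FP op */

/* Switching a task in arms the trap; nothing is loaded yet. */
void switch_in(struct cpu_model *cpu)
{
	assert(cpu != NULL);
	cpu->ts = true;
}

/* Called for each FP instruction the running task executes. */
enum fp_outcome fp_op(struct cpu_model *cpu, struct task_model *t)
{
	assert(cpu != NULL && t != NULL);
	if (!cpu->ts)
		return FP_NO_TRAP;	/* FPU already belongs to this task */
	cpu->ts = false;		/* kernel clears the trap */
	if (t->used_math)
		return FP_TRAP_RESTORE;	/* load the saved FP context */
	t->used_math = true;
	return FP_TRAP_INIT;		/* first FP op ever: initialize */
}
```

A task that never executes an FP instruction never reaches fp_op(), so in this model it never triggers an init or a restore.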

Vic

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> On 20 Apr 2001, Ulrich Drepper wrote:
> 
> > "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> > 
> > > If it "fixes" it, there is no problem with the FPU, but with the
> > > 'C' runtime library which doesn't initialize the FPU to a known
> > > state before it uses it.
> > 
> > It's the kernel which initializes the FPU.  This was always the case
> > and necessary to implement the fast lazy FPU saving/restoring.
> > Processes which never use the FPU never initialize it.
> 
> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows. If the user is using 'asm' or whatever, the user must




Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Victor Zandy wrote:

> 
> No dice.  Your program does not fix the problem.
> 
> If it were a hardware problem, I would expect the problem to occur
> under 2.4.2 as well as 2.2.*, and I would be surprised that we can
> consistently produce the behavior across our 64 node cluster.  But we
> are keeping the possibility in mind.
> 
> Thanks for your suggestions.
> 
> Vic
> 

Then, if the FPU is fine, you have just proven that the storage
where the FPU context is saved, gets overwritten. Further, once the
initial write occurs, all subsequent fnsave/frstor operations also
encounter the same spurious write. --OR some continuously-running
floating-point has sneaked into the kernel.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> 
> > If it "fixes" it, there is no problem with the FPU, but with the
> > 'C' runtime library which doesn't initialize the FPU to a known
> > state before it uses it.
> 
> It's the kernel which initializes the FPU.  This was always the case
> and necessary to implement the fast lazy FPU saving/restoring.
> Processes which never use the FPU never initialize it.

The kernel doesn't know if a process is going to use the FPU when
a new process is created. Only the user's code, i.e., the 'C' runtime
library knows. If the user is using 'asm' or whatever, the user must
initialize the FPU before using it, otherwise, the user doesn't know
anything about its state and the results ... (let's see, what was at
TOS, errm, is this a NAN?). The results are indeterminate.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> If it "fixes" it, there is no problem with the FPU, but with the
> 'C' runtime library which doesn't initialize the FPU to a known
> state before it uses it.

It's the kernel which initializes the FPU.  This was always the case
and necessary to implement the fast lazy FPU saving/restoring.
Processes which never use the FPU never initialize it.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `



Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy


No dice.  Your program does not fix the problem.

If it were a hardware problem, I would expect the problem to occur
under 2.4.2 as well as 2.2.*, and I would be surprised that we can
consistently produce the behavior across our 64 node cluster.  But we
are keeping the possibility in mind.

Thanks for your suggestions.

Vic



Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Victor Zandy wrote:

> 
> Victor Zandy <[EMAIL PROTECTED]> writes:
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> 
> We have now tested 2.4.2 and 2.2.19.
> 
> 2.2.19 has the same problem.
> 
> 2.4.3 does not seem to be affected.  Unfortunately, we really need a
> working 2.2 kernel at this time.
> 
> We also patched the 2.2.19 kernel with the PIII patch found in
> /pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
> on ftp.kernel.org.  Same problem.
> 
> Does anyone have any ideas for us?
> 
> Thanks.
> 
> Vic

Just for kicks, do whatever is necessary to "break" the fpu. Then run
this program:

int main(void)
{
	/* Reset the x87 FPU to its power-on default state. */
	__asm__("finit\n");
	return 0;
}

If it "fixes" it, there is no problem with the FPU, but with the
'C' runtime library which doesn't initialize the FPU to a known
state before it uses it. It is possible for the kernel to work
around the 'C' library problem by clearing the FPU after every
fork(). The last time I checked (years ago), 'finit' was executed
during the fork. Maybe it isn't anymore because it takes many
machine-cycles to complete.

If this doesn't "fix" it, then your hardware may have a problem
like overheating, etc., (loose heatsink?).


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy


Victor Zandy <[EMAIL PROTECTED]> writes:
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

We have now tested 2.4.2 and 2.2.19.

2.2.19 has the same problem.

2.4.3 does not seem to be affected.  Unfortunately, we really need a
working 2.2 kernel at this time.

We also patched the 2.2.19 kernel with the PIII patch found in
/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
on ftp.kernel.org.  Same problem.

Does anyone have any ideas for us?

Thanks.

Vic




Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy


No dice.  Your program does not fix the problem.

If it were a hardware problem, I would expect the problem to occur
under 2.4.2 as well as 2.2.*, and I would be surprised that we can
consistently produce the behavior across our 64 node cluster.  But we
are keeping the possibility in mind.

Thanks for your suggestions.

Vic



Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> If it "fixes" it, there is no problem with the FPU, but with the
> 'C' runtime library which doesn't initialize the FPU to a known
> state before it uses it.

It's the kernel which initializes the FPU.  This was always the case
and necessary to implement the fast lazy FPU saving/restoring.
Processes which never use the FPU never initialize it.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `



Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> 
> > If it "fixes" it, there is no problem with the FPU, but with the
> > 'C' runtime library which doesn't initialize the FPU to a known
> > state before it uses it.
> 
> It's the kernel which initializes the FPU.  This was always the case
> and necessary to implement the fast lazy FPU saving/restoring.
> Processes which never use the FPU never initialize it.

The kernel doesn't know if a process is going to use the FPU when
a new process is created. Only the user's code, i.e., the 'C' runtime
library knows. If the user is using 'asm' or whatever, the user must
initialize the FPU before using it, otherwise, the user doesn't know
anything about its state and the results ... (let's see, what was at
TOS, errm, is this a NAN?). The results are indeterminate.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Richard B. Johnson

On 20 Apr 2001, Victor Zandy wrote:

> 
> No dice.  Your program does not fix the problem.
> 
> If it were a hardware problem, I would expect the problem to occur
> under 2.4.2 as well as 2.2.*, and I would be surprised that we can
> consistently produce the behavior across our 64 node cluster.  But we
> are keeping the possibility in mind.
> 
> Thanks for your suggestions.
> 
> Vic
> 

Then, if the FPU is fine, you have just proven that the storage
where the FPU context is saved gets overwritten. Further, once the
initial write occurs, all subsequent fnsave/frstor operations also
encounter the same spurious write. Alternatively, some continuously
running floating-point code has sneaked into the kernel.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.





Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Victor Zandy


It looks to me like the kernel sets a trap for FP operations when a
process is switched in.  Then when the process executes an FP op, the
kernel clears the trap and either loads the FP context or initializes
it, depending on whether it is the process' first FP operation.  So no
help is needed from anything in user space.

Vic
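
The mechanism described above can be sketched as a small user-space
model.  This is only an illustration under stated assumptions, not the
2.2 kernel code: struct task, struct cpu, switch_to(), fp_load() and
fp_store() are hypothetical names invented here, a single double stands
in for the whole FPU context, and trap_armed plays the role of the x86
CR0.TS flag, which makes the next FP instruction trap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of lazy FPU switching; not actual kernel code. */
struct task {
    bool used_fpu;     /* has this task ever executed an FP op? */
    double saved_st0;  /* stand-in for the saved FPU context */
};

struct cpu {
    bool trap_armed;        /* models CR0.TS: the next FP op traps */
    double st0;             /* stand-in for the live FPU registers */
    struct task *current;   /* task currently running on this CPU */
    struct task *fpu_owner; /* task whose state is live in the FPU */
};

/* Context switch: do no FPU work, just arm the trap. */
void switch_to(struct cpu *c, struct task *next)
{
    c->current = next;
    c->trap_armed = true;
}

/* Trap handler: runs on the first FP op after a switch. */
static void fpu_trap(struct cpu *c)
{
    struct task *t = c->current;

    assert(t != NULL);
    if (c->fpu_owner != t) {
        if (c->fpu_owner != NULL)
            c->fpu_owner->saved_st0 = c->st0; /* save old owner */
        if (t->used_fpu) {
            c->st0 = t->saved_st0;            /* restore context */
        } else {
            c->st0 = 0.0;                     /* first use: init */
            t->used_fpu = true;
        }
        c->fpu_owner = t;
    }
    c->trap_armed = false;                    /* clear the trap */
}

/* Every modeled FP operation checks the trap flag first. */
double fp_load(struct cpu *c)
{
    if (c->trap_armed)
        fpu_trap(c);
    return c->st0;
}

void fp_store(struct cpu *c, double v)
{
    if (c->trap_armed)
        fpu_trap(c);
    c->st0 = v;
}
```

In this model a task that never calls fp_load() or fp_store() never has
a context initialized or saved for it, and a task's first FP op always
sees a freshly initialized value rather than whatever the previous
owner left behind, with no cooperation from user space.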

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> On 20 Apr 2001, Ulrich Drepper wrote:
> 
> > "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> > 
> > > If it "fixes" it, there is no problem with the FPU, but with the
> > > 'C' runtime library which doesn't initialize the FPU to a known
> > > state before it uses it.
> > 
> > It's the kernel which initializes the FPU.  This was always the case
> > and necessary to implement the fast lazy FPU saving/restoring.
> > Processes which never use the FPU never initialize it.
> 
> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows. If the user is using 'asm' or whatever, the user must




Re: BUG: Global FPU corruption in 2.2

2001-04-20 Thread Ulrich Drepper

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows.

Maybe you should try to understand the kernel code and the features of
the processor first.  The kernel can detect when the FPU is used for
the first time.

-- 
---.  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
Red Hat  `--' drepper at redhat.com   `



Re: BUG: Global FPU corruption in 2.2

2001-04-19 Thread Michal Jaegermann

On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> 
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.

> 
> We see this problem on dual 550MHz Xeons with 1GB RAM.

Hm, I started to wonder if this is not somewhat related to a recent
report I got.  "The victim" was running 2.2.19 (basically) on an SMP
Alpha UP2000+ with two 800 MHz processors.  He managed to reduce the
problem to a rather small test case and I attach sources,  Makefile and
a "loop.sh" driver as a shar archive if you want to have a closer look.

This "loop.sh" simply fires triplets of "harry" processes in a loop.
The guy hit by this gets apparently random floating point exceptions
starting with roughly the sixth process, and later intervals between
bombs vary.  I also have 'strace' outputs from failing processes but
they are not telling very much.  'gdb' is also not very illuminating:

Program received signal SIGFPE, Arithmetic exception.
0x1200010a8 in vadd_ (a=0x11fff21e4, ia=0x120003294, b=0x11fff7004, 
ib=0x120003294, c=0x11fffbe20, ic=0x120003294, n=0x11c70) at vadd.f:99
99   C(CI) = A(AI) + B(BI)
Current language:  auto; currently fortran

(gdb) p *ia
$10 = 1
(gdb) p *ib
$11 = 1
(gdb) p *ic
$12 = 1
(gdb) p *n
Cannot access memory at address 0x4
(gdb) p *(0x11c70)
$13 = 1024

(gdb) info locals
n = (PTR TO -> ( integer )) 0x4
__g77_expr_0 = 10


He tells me that he is getting that on two different machines he has
around.

The trouble is that I tried to repeat that with different hardware,
kernels, compilers and libraries and I failed even on SMP; but I only
had access to a box with 667 MHz processors.  OTOH he is running
right now 2.4.3-ac9 plus Andrea Arcangeli patches for rw semaphores
on Alpha and he reports that the problem went away (and, hopefully,
nothing else will crop up :-).

Can anybody offer an insight into what that may really be?  It may be,
of course, totally unrelated to this report from Victor Zandy.

  Michal
  [EMAIL PROTECTED]


 fpbomb.shar


BUG: Global FPU corruption in 2.2

2001-04-19 Thread Victor Zandy


We have found that one of our programs can cause system-wide
corruption of the x86 FPU under 2.2.16 and 2.2.17.  That is, after we
run this program, the FPU gives bad results to all subsequent
processes.

We see this problem on dual 550MHz Xeons with 1GB RAM.  We have 64 of
these things, and we see the problem on every node we try (dozens).
We don't have other SMPs handy.  Uniprocessors, including other PIIIs,
don't seem to be affected.

While we prepare to test for the problem on more recent 2.2 and 2.4
kernels, we would appreciate hearing from anyone who may have insight
into it.

Below are two programs we use to produce the behavior.  The first
program, pi, repeatedly spawns 10 parallel computations of pi.  When
all is well, each process prints pi as it completes.

The second program, pt, repeatedly attaches to and detaches from
another process.  Run pt against the root pi process until the output
of pi begins to look wrong.  Then kill everything and run pi by itself
again.  It will no longer produce good results.  We find that the FPU
persistently gives bad results until we reboot.

Here is the sort of thing we see:

BEFORE          AFTER
------------    --------------
c36% ./pi       c36% ./pi
[3883]          [4069]
3.141593        6865157.146714
3.141593        inf
3.141593        81705.277947
3.141593        4.742524
3.141593        nan
3.141593        585.810296
3.141593        inf
3.141593        4.578857
3.141593        nan
3.141593        4.578857
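
A check like the one made by eye above can be automated.  The following
is a minimal sketch, not part of the original report:
odd_cube_series(), close_to() and fpu_result_looks_sane() are
hypothetical helper names, and the 1e-6 tolerance is an assumption.  It
re-evaluates the alternating series that pi.c sums,
1 - 1/3^3 + 1/5^3 - ..., whose limit is pi^3/32, and compares 32 times
the partial sum against pi^3, so it needs nothing from libm.

```c
#include <assert.h>
#include <stdbool.h>

/* Reference value of pi, written out so that M_PI is not required. */
static const double PI_REF = 3.14159265358979323846;

/* Same alternating series as do_pi() in pi.c: 1 - 1/3^3 + 1/5^3 - ... */
double odd_cube_series(void)
{
    double sum = 0.0;
    double s = 1.0;
    double x;

    for (x = 1.0; x <= 1000.0; x += 2.0) {
        sum += s / (x * x * x);
        s = -s;
    }
    return sum;
}

/* True if value is within tol of expected; false for NaN and +/-inf. */
bool close_to(double value, double expected, double tol)
{
    double d;

    assert(tol > 0.0);
    d = value - expected;
    if (d < 0.0)
        d = -d;
    return d < tol; /* any comparison with NaN is false */
}

/* The series converges to pi^3/32, so 32*sum should be close to pi^3. */
bool fpu_result_looks_sane(void)
{
    double pi_cubed = 32.0 * odd_cube_series();

    return close_to(pi_cubed, PI_REF * PI_REF * PI_REF, 1e-6);
}
```

Calling fpu_result_looks_sane() from each forked child and printing a
marker when it returns false would flag rows like those in the AFTER
column.  An optimizing compiler may evaluate the loop at compile time,
so a real test would need to keep the arithmetic at run time, for
example by compiling without optimization.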

I am not currently subscribed to linux-kernel.  I'll be checking the
web archives, but please CC replies to me.

Thanks!

Vic Zandy

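/* pi.c: gcc -g -o pi pi.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <errno.h>

/* Approximate pi from the series 1 - 1/3^3 + 1/5^3 - ... = pi^3/32. */
static double
do_pi()
{
    double sum=0.0;
    double x=1.0;
    double s=1.0;
    double pi;

    while (x <= 1000.0) {
        sum += (1.0/pow(x, 3.0))*s;
        s = -s;
        x += 2.0;
    }
    pi = pow(sum*32.0, 1.0/3.0);
    return pi;
}

int
main( int argc, char* argv[] )
{
    int i;
    int pid;
    int m = 1000;   /* runs */
    int n = 10;     /* procs per run */

    pid = getpid();
    fprintf(stderr, "[%d]\n", pid);
    while (m-- > 0) {
        /* Fork n-1 children; each child leaves the loop at once. */
        for (i = 1; i < n; i++)
            if (!fork())
                break;
        /* Parent and children each compute and print pi. */
        fprintf(stderr, "%f\n", do_pi());
        if (getpid() != pid)
            return 0;
        /* Parent reaps whichever children have already exited. */
        while (waitpid(0, 0, WNOHANG) > 0)
            ;
    }
    return 0;
}
/* end of pi.c */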

/* pt.c: gcc -g -o pt pt.c */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <linux/ptrace.h>

/* ptrace wrapper: exit on failure, except for PEEK requests, whose
   return value is data and may legitimately be negative. */
long
dptrace(int req, pid_t pid, void *addr, void *data)
{
    char buf[64];
    int rv;
    rv = ptrace(req, pid, addr, data);
    if ((req != PTRACE_PEEKUSR && req != PTRACE_PEEKTEXT) && 0 > rv) {
        sprintf(buf, "ptrace (req=%d)", req);
        perror(buf);
        exit(1);
    }
    return rv;
}

int
main(int argc, char *argv[])
{
    int pid;
    char buf[1024];
    int n;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s PID\n", argv[0]);
        exit(1);
    }
    pid = atoi(argv[1]);
    /* Attach, wait for the target to stop, detach; repeat forever. */
    while (1) {
        dptrace(PTRACE_ATTACH, pid, 0, 0);
        waitpid(pid, 0, 0);
        dptrace(PTRACE_DETACH, pid, 0, 0);
        fprintf(stderr, ".");
    }
    return 0;
}
/* end of pt.c */




