[PATCH] Backport of 2.4 ptrace flag to 2.2
> Alan Cox <[EMAIL PROTECTED]> writes:
> > The preferable one for performance is certainly to backport the
> > 2.4 changes

This patch against stock 2.2.19 is a backport of the task structure
ptrace flag of Linux 2.4. It is available at

    http://www.cs.wisc.edu/~zandy/ptrace

As we reported a couple of weeks ago, under Linux 2.2 ptrace can
globally corrupt the FPU on SMPs. Linus identified the problem as a
race between ptrace and the FPU trap handler over the process flags.
The ptrace flag introduced in 2.4 eliminates the race.

This port is faithful to the 2.4 design. Essentially it:

- Adds a new variable `ptrace' to the task structure;
- Adds new constants for this variable (PT_PTRACED etc.) and removes
  the corresponding old ones (PF_PTRACED etc.);
- Replaces every ptrace-context reference to `flags' with a reference
  to `ptrace', and updates the constants used accordingly;
- Updates ptrace offset constants, loads, and comparisons in assembly
  files.

The patch is complete for all platforms except ARM. On ARM, I didn't
understand the meaning of the offset constants used in the assembly,
so I didn't try to fix them. The patch does include the necessary
changes to C files on ARM.

We have applied (cleanly), compiled (cleanly) and tested the patch on
an x86 SMP, one of the same ones on which we saw FPU corruption. We
have verified that FPU corruption cannot be produced, and that gdb and
strace still function. We have not tested any other platform.

Please direct any questions or problems with the patch to Victor Zandy
<[EMAIL PROTECTED]>.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: Global FPU corruption in 2.2
Alan Cox <[EMAIL PROTECTED]> writes:
> The preferable one for performance is certainly to backport the 2.4 changes

Is it any more substantial than changing all uses of the ptrace flags
to the new variable?
Re: BUG: Global FPU corruption in 2.2
Linus Torvalds writes:
> Ahh.. This actually _does_ look like a race on "current->flags":
> PTRACE_ATTACH will do a
>
>     child->flags |= PF_PTRACED;
>
> without waiting for the child to have stopped.

I can see how this could cause PF_USEDFPU to be cleared inadvertently,
but I do not have any ideas for testing this. Is it clear that this
is the source of the problem?

What would be involved in backporting the split ptrace flags to 2.2?
Are there other solutions?

Vic
Re: BUG: Global FPU corruption in 2.2
"Christian Ehrhardt" <[EMAIL PROTECTED]> writes:
> Victor: Could you try to reproduce the system wide corruption if you
> add an explicit call to stts(); at the very end of __switch_to?
> This should prevent the FPU corruption from spreading.

After adding this call, I cannot reproduce the global corruption.
There is still occasional local corruption of individual pi processes
while pt is running.

Vic
Re: BUG: Global FPU corruption in 2.2
Someone else here traced the process flags of an FP-intensive program
on a machine before and after it was put in the faulty FPU state. He
periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken.
Afterward, he found that it was set only about 70% of the time.

Vic
Re: BUG: Global FPU corruption in 2.2
It looks to me like the kernel sets a trap for FP operations when a
process is switched in. Then when the process executes an FP op, the
kernel clears the trap and either loads the FP context or initializes
it, depending on whether it is the process' first FP operation.

So no help is needed from anything in user space.

Vic

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> On 20 Apr 2001, Ulrich Drepper wrote:
>
> > "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> >
> > > If it "fixes" it, there is no problem with the FPU, but with the
> > > 'C' runtime library which doesn't initialize the FPU to a known
> > > state before it uses it.
> >
> > It's the kernel which initializes the FPU.  This was always the case
> > and necessary to implement the fast lazy FPU saving/restoring.
> > Processes which never use the FPU never initialize it.
>
> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows. If the user is using 'asm' or whatever, the user must
Re: BUG: Global FPU corruption in 2.2
No dice. Your program does not fix the problem.

If it were a hardware problem, I would expect the problem to occur
under 2.4.2 as well as 2.2.*, and I would be surprised that we can
consistently produce the behavior across our 64-node cluster. But we
are keeping the possibility in mind.

Thanks for your suggestions.

Vic
Re: BUG: Global FPU corruption in 2.2
Victor Zandy <[EMAIL PROTECTED]> writes:
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

We have now tested 2.4.2 and 2.2.19. 2.2.19 has the same problem.
2.4.3 does not seem to be affected. Unfortunately, we really need a
working 2.2 kernel at this time.

We also patched the 2.2.19 kernel with the PIII patch found in
/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
on ftp.kernel.org. Same problem.

Does anyone have any ideas for us?

Thanks.

Vic
BUG: Global FPU corruption in 2.2
We have found that one of our programs can cause system-wide
corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
run this program, the FPU gives bad results to all subsequent
processes.

We see this problem on dual 550MHz Xeons with 1GB RAM. We have 64 of
these things, and we see the problem on every node we try (dozens).
We don't have other SMPs handy. Uniprocessors, including other PIIIs,
don't seem to be affected.

While we prepare to test for the problem on more recent 2.2 and 2.4
kernels, we would appreciate hearing from anyone who may have insight
into it.

Below are two programs we use to produce the behavior. The first
program, pi, repeatedly spawns 10 parallel computations of pi. When
all is well, each process prints pi as it completes.

The second program, pt, repeatedly attaches to and detaches from
another process. Run pt against the root pi process until the output
of pi begins to look wrong. Then kill everything and run pi by itself
again. It will no longer produce good results. We find that the FPU
persistently gives bad results until we reboot.

Here is the sort of thing we see:

    BEFORE              AFTER
    ------              -----
    c36% ./pi           c36% ./pi
    [3883]              [4069]
    3.141593            6865157.146714
    3.141593            inf
    3.141593            81705.277947
    3.141593            4.742524
    3.141593            nan
    3.141593            585.810296
    3.141593            inf
    3.141593            4.578857
    3.141593            nan
    3.141593            4.578857

I am not currently subscribed to linux-kernel. I'll be checking the
web archives, but please CC replies to me.

Thanks!
Vic Zandy

/* pi.c: gcc -g -o pi pi.c -lm */
/* Header names were lost in the archive; these are the ones the code needs. */
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sums the alternating series 1 - 1/3^3 + 1/5^3 - ... and derives pi from it. */
static double do_pi(void)
{
    double sum = 0.0;
    double x = 1.0;
    double s = 1.0;
    double pi;

    while (x <= 1000.0) {
        sum += (1.0 / pow(x, 3.0)) * s;
        s = -s;
        x += 2.0;
    }
    pi = pow(sum * 32.0, 1.0 / 3.0);
    return pi;
}

int main(int argc, char *argv[])
{
    int i;
    int pid;
    int m = 1000;  /* runs */
    int n = 10;    /* procs per run */

    pid = getpid();
    fprintf(stderr, "[%d]\n", pid);
    while (m-- > 0) {
        /* Fork n-1 children; each child leaves the loop immediately. */
        for (i = 1; i < n; i++)
            if (!fork())
                break;
        fprintf(stderr, "%f\n", do_pi());
        if (getpid() != pid)
            return 0;  /* children exit after one computation */
        /* Reap children that have already exited, without blocking. */
        while (waitpid(0, 0, WNOHANG) > 0)
            ;
    }
    return 0;
}
/* end of pi.c */

/* pt.c: gcc -g -o pt pt.c */
/* Header names were lost in the archive; these are the ones the code needs. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

#ifndef PTRACE_PEEKUSR
/* Fallback for C libraries that only provide the PTRACE_PEEKUSER spelling. */
#define PTRACE_PEEKUSR PTRACE_PEEKUSER
#endif

long dptrace(int req, pid_t pid, void *addr, void *data)
{
    char buf[64];
    long rv;

    rv = ptrace(req, pid, addr, data);
    /* PEEK requests return the fetched word, so a negative value
       is not necessarily an error for them. */
    if ((req != PTRACE_PEEKUSR && req != PTRACE_PEEKTEXT) && 0 > rv) {
        sprintf(buf, "ptrace (req=%d)", req);
        perror(buf);
        exit(1);
    }
    return rv;
}

int main(int argc, char *argv[])
{
    int pid;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s PID\n", argv[0]);
        exit(1);
    }
    pid = atoi(argv[1]);

    /* Attach, wait for the target to stop, detach; repeat forever. */
    while (1) {
        dptrace(PTRACE_ATTACH, pid, 0, 0);
        waitpid(pid, 0, 0);
        dptrace(PTRACE_DETACH, pid, 0, 0);
        fprintf(stderr, ".");
    }
    return 0;
}
/* end of pt.c */
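For reference, the value a healthy FPU should print follows from a standard series identity, which is what do_pi() appears to rely on. The truncation bound is a back-of-the-envelope estimate added here, not something stated in the report.

```latex
\sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)^3} = \frac{\pi^3}{32}
\quad\Longrightarrow\quad
\pi = \Bigl( 32 \sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)^3} \Bigr)^{1/3}
```

do_pi() stops after the term with x = 999, so the error in the sum is below the first omitted term, 1/1001^3, or roughly 1e-9. That is far smaller than the six decimals printed, which is why every correct run shows 3.141593.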
Re: 2.0/2.2 Bug: SIGTRAP lost
Victor Zandy <[EMAIL PROTECTED]> writes:
> If a process executes an int3 (breakpoint) instruction while
> another process is attaching to it, the SIGTRAP can be lost.  This bug
> is present in 2.4.0-test8 and 2.2.14.

Uh, this turns out to be my stupid programming error, not a bug in any
of the fine versions of the Linux kernel.

My apologies to anyone who invested time looking at this.

Vic Zandy
2.0/2.2 Bug: SIGTRAP lost
If a process executes an int3 (breakpoint) instruction while another
process is attaching to it, the SIGTRAP can be lost. This bug is
present in 2.4.0-test8 and 2.2.14.

Below is a program that demonstrates this behavior. It forks a child
that repeatedly executes an int3 and handles the SIGTRAP. The parent
repeatedly attaches to and detaches from the child. Eventually the
SIGTRAP generated by the int3 is lost, and the child falls through (to
the fprintf).

Vic Zandy

/* Header names were lost in the archive; these are the ones the code needs. */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

long int dptrace(enum __ptrace_request req, pid_t pid,
                 void *addr, void *data)
{
    long rv;

    rv = ptrace(req, pid, addr, data);
    if (0 > rv) {
        perror("ptrace");
        exit(1);
    }
    return rv;
}

/* Parent: attach, wait for the child to stop, detach; repeat forever. */
void do_trace(int pid)
{
    while (1) {
        dptrace(PTRACE_ATTACH, pid, 0, 0);
        waitpid(pid, 0, 0);
        dptrace(PTRACE_DETACH, pid, 0, 0);
    }
}

/* Intended to back eip up by one byte so the int3 executes again (i386 only). */
void handler(int sig, struct sigcontext uap)
{
    uap.eip--;
}

/* Child: install the SIGTRAP handler, then execute the breakpoint. */
void do_int3()
{
    struct sigaction sa;

    sa.sa_handler = (void (*)(int)) handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGTRAP, &sa, NULL);
    asm("int3");  /* Should loop here */
    fprintf(stderr, "Bug triggered\n");
}

int main(int argc, char *argv[])
{
    int pid;

    pid = fork();
    if (pid)
        do_trace(pid);
    else
        do_int3();
    return 0;
}