On Friday 28 September 2007 7:23:12 pm Kevin Day wrote:
> That aside, here are my problems, of which may or may not be related:
>
> 1) Kernel 2.6.2? compiled with SMP enabled, will systematically
> destroy itself and collapse in a series of kernel panics flood the
> system until the OS comes to a complete windows style lockup.

Userspace shouldn't cause panics, that's a kernel internal issue.

While it's possible for userspace to cause a few selected panics,
they're generally extremely specific. If init exits, or if root writes
to /dev/kmem, that's a panic that's not the kernel's fault. But it's
really not something userspace should ever be able to do by accident.

That really sounds like a kernel bug, which uClibc is merely triggering.

> My first impression was that there was some serious regression in the
> kernel+hardware I was using or that the kernel may have had an
> isolated corrupt compilation. Repeated attempts on different hardware
> and the SMP-kernel crash was consistent.

So you had an intermittent kernel panic that went away when you changed
something in userspace. That isn't a fix, you're just not _triggering_
the problem anymore...

> I later build a uClibc-0.9.28.3 system and I have yet to see the
> SMP-kernel crash reproduce itself on the same hardware.

Whatever the problem is in your kernel, you're no longer triggering it.
Doesn't mean it's our bug...

> The 0.9.29 crash took no more than a day and I have been running an
> intel dual-processor server for a few weeks now under the SMP kernel
> compiled under a uClibc-0.9.28.3.
> My memory checks out clear and compilation between tests was done
> under different systems just in case.

I had a fun problem once where I suspect the power supply was marginal
and everything worked fine until the hard drive sucked too much power.
So intense disk activity _combined_ with intense CPU activity presented
as corrupted data read from the disk.

A friend (Garrett the uClibc++ guy) had a problem where the "string
move" instruction on his CPU went bad. (This instruction is apparently
used by very few programs, but one of them was gcc...)

> I have no way of figuring out what/where this is happening,
> considering that the kernel has its own internal libc equivalent code.
>
> That suggests that gcc and/or binutils is somehow becoming corrupt under
> 0.9.29? If so, this may be the cause of the other obscure problems I am
> having..

If gcc or binutils becomes corrupt, and it builds a screwed up binary,
that screwed up binary will have deterministic behavior. (Might not
happen on a _rebuild_, but a given binary should behave reproducibly.)

> 2) Fuse jumps into a (threaded?) deadlock.
> This problem exists in 0.9.28.3, but I have a work-around under
> 0.9.28.3. That work-around no longer works around the problem in
> 0.9.29. I explecitly need fuse and so long as I cannot use fuse, I
> cannot change to 0.9.29.

If we can reproduce a problem, we can probably debug it. Reproducibility
is good.

However, combined with the other kernel problems you're having, I'm
really not sure it's our bug.

> 4) There are more problems with deadlocking as with #2 or random
> crashing as with #1.
> #1 seems to happen mostly with applications that are graphical (xorg
> or gtk based apps..).
> #2 happens to a very small number of apps, such as qingy (in qingy's
> case the crash is a fatal kernel-level deadlock, time to hard-reboot).

Again, userspace shouldn't be able to hard deadlock the kernel. Can you
still ping the machine when this happens?

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
_______________________________________________
uClibc mailing list
uClibc@uclibc.org
http://busybox.net/cgi-bin/mailman/listinfo/uclibc