Re: Fix for sparc64 cpu hangs.
>> I'll leave the kernel running and make sure the machine gets some more
>> users and load during the next days.
>
> Thanks for testing, let me know if any more issues trigger.

One problem I was pointed to is the build failure of erlang. The erlc
binary created during the build segfaults with a bus error.

- This only happens on US III machines; it works fine on US II.
- On lebrun it doesn't happen on the first call of erlc, but only after
  several successful runs of it, see
  http://buildd.debian.org/fetch.cgi?&pkg=erlang&ver=1%3A11.b.5dfsg-11&arch=sparc&stamp=1197012623&file=log
- On our v880 here (which is still running the kernel with your test
  patch) erlc segfaults instantly.

An strace shows that it is stuck at a well-known place, pretty similar to
the segfault in aptitude which reliably shot the machine to death before
your patch(es) were applied:

[pid 1224] clone(Process 1228 attached
child_stack=0xf7951480, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf7951bd8, tls=0xf7951b90, child_tidptr=0xf7951bd8) = 1228
[pid 1224] SYS_300(0xf7951be0, 0xc, 0, 0, 0xf7951df4) = 0
[pid 1224] futex(0xff993338, 0x80 /* FUTEX_??? */, 2

... and there it hangs.

I guess you should be able to reproduce this on your US III machine:

    dget -x \
      ftp://debian.netcologne.de/debian/pool/main/e/erlang/erlang_11.b.5dfsg-11.dsc
    cd erlang-11.b.5dfsg
    dpkg-buildpackage -rfakeroot

(You'll probably have to install some build-deps...)

When erlc segfaults, change into the directory and set

    ERL_TOP=/home/bzed/erlang-11.b.5dfsg
    PATH=/home/bzed/erlang-11.b.5dfsg/bootstrap/bin:${PATH}

before retrying to run erlc.

Let me know if you need more information or want me to test something.
-- Bernd Zeimetz <[EMAIL PROTECTED]> <http://bzed.de/> - To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix for sparc64 cpu hangs.
> Thanks for testing, let me know if any more issues trigger.

Some random processes (ssh, ping and aptitude) got stuck on the machine
today, but they went away after hitting them with kill -9. They also
didn't eat CPU time, they were just doing nothing. Unfortunately I
didn't have time for a closer look; I'll try to gather more information
the next time it happens.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: Fix for sparc64 cpu hangs.
> It should be easily fixed using the patch below. It records which bits
> we should actually be concerned about, and only tests those specific
> bits in the dispatch status register.
>
> Could you please give this patch a test?

Tested. The patch seems to fix the problem: the machine is still alive
and working well after several hours of running the buggy aptitude -u in
a loop. I'll leave the kernel running and make sure the machine gets
some more users and load during the next days.

Thanks for the fix,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
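
For readers following the thread, the idea described in the quoted text (record which status bits belong to the CPUs actually targeted, and test only those) can be pictured roughly as below. This is a minimal Python sketch, not the kernel patch itself. The bit layout (two bits per dispatch slot, busy at bit 2*i and nack at bit 2*i+1) and the names `relevant_masks` and `dispatch_state` are assumptions made for illustration only.

```python
def relevant_masks(slots):
    """Build (busy_mask, nack_mask) covering only the given dispatch slots.

    Assumed layout: slot i owns two bits, busy at bit 2*i, nack at bit 2*i+1.
    """
    busy = 0
    nack = 0
    for slot in slots:
        busy |= 0x1 << (slot * 2)
        nack |= 0x2 << (slot * 2)
    return busy, nack


def dispatch_state(status, slots):
    """Classify a status word while ignoring bits of slots we did not use."""
    busy, nack = relevant_masks(slots)
    if status & busy:
        return "busy"   # one of our targets has not taken the message yet
    if status & nack:
        return "nack"   # one of our targets refused it, resend is needed
    return "done"       # bits of unrelated slots are deliberately not tested


if __name__ == "__main__":
    # Slot 1 is busy, but we only dispatched to slot 0, so we are done.
    print(dispatch_state(0b0100, [0]))
```

The point of the masks is that a stale or unrelated bit elsewhere in the status word can no longer keep the sender spinning forever.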
Re: Fix for sparc64 cpu hangs.
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sat, 08 Dec 2007 01:14:46 +0100
>
>> works well, thanks for fixing!
>
> Thanks a lot for testing.

You're welcome. Are you going to send the patch for 2.6.23, too?

I've also tried to crash the machine while running the non-SMP kernel,
but it is still running fine.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: Fix for sparc64 cpu hangs.
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Thu, 06 Dec 2007 13:09:18 +0100
>
>> ERROR(0): Cheetah error trap taken afsr[1000]
>> afar[040001c0] TL1(0)
>> ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
>> ERROR(0): TPC
>> ERROR(0): M_SYND(0), E_SYND(0)
>
> Please try this patch:
[...]

titan:~# uname -a
Linux titan 2.6.23.9+davem-nonsmp #1 Fri Dec 7 10:02:01 UTC 2007 sparc64 GNU/Linux
titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom: OBP 4.22.34 2007/07/23 13:01
type: sun4u
ncpus probed: 4
ncpus active: 1
D$ parity tl1 : 0
I$ parity tl1 : 0
Cpu0ClkTck : 2cb41780
MMU Type: Cheetah
titan:~#

works well, thanks for fixing!

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: Fix for sparc64 cpu hangs.
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Thu, 06 Dec 2007 11:43:45 +0100
>
>> David Miller wrote:
>>> From: Bernd Zeimetz <[EMAIL PROTECTED]>
>>> Date: Fri, 16 Nov 2007 22:17:07 +0100
>>>
>>>> The sysrq-g output is attached, I hope you can make sense out of it.
>>>> We'll also add some extra workload to the other machines here to try to
>>>> trigger the bug on other CPUs, too.
>>> I just got back from my vacation and started looking at these
>>> dumps. I think there might be some bug in cheetah_xcall_deliver(),
>>> I'll try to diagnose this some more.
>> I'm not sure if it is related, but non-SMP Kernels don't boot at all on
>> the machine.
>
> I doubt it's related as non-SMP kernels won't even have that
> code compiled in :-)
> What does a failed non-SMP boot say? If it doesn't even bring up the
> console, give it "-p" on the kernel command line.

The output below is from a 2.6.21-2-sparc64 kernel; I had it lying
around here. I can build and install a 2.6.23 and try again if you want.

It would be good to know whether non-SMP kernels work at all on the v880
and larger machines, and the same for more recent CPU models. At the
moment the Sparc installer is non-SMP only, which resulted in some extra
fun when installing the v880.

Rebooting with command: boot net:dhcp -p
Boot device: /[EMAIL PROTECTED],70/[EMAIL PROTECTED],1:dhcp File and args: -p
Timed out waiting for BOOTP/DHCP reply
\
PROMLIB: Sun IEEE Boot Prom 'OBP 4.22.34 2007/07/23 13:01'
PROMLIB: Root node compatible:
Linux version 2.6.21-2-sparc64 (Debian 2.6.21-6) ([EMAIL PROTECTED]) (gcc version 4.1.3 20070629 (prerelease) (Debian 4.1.2 -13)) #1 Thu Jul 12 12:33:00 UTC 2007
ARCH: SUN4U
Ethernet address: 00:03:ba:0b:07:89
Remapping the kernel... done.
PROM: Built device tree with 125090 bytes of memory.
Booting Linux...
CPU[0]: Caches D[sz(65536):line_sz(32)] I[sz(32768):line_sz(32)] E[sz(8388608):line_sz(512)]
Built 1 zonelists.
Total pages: 412546 Kernel command line: -p PID hash table entries: 4096 (order: 12, 32768 bytes) Console: colour dummy device 80x25 Dentry cache hash table entries: 524288 (order: 9, 4194304 bytes) Inode-cache hash table entries: 262144 (order: 8, 2097152 bytes) Memory: 8311800k available (2360k kernel code, 824k data, 144k init) [f800,00b0ffb16000] Calibrating delay using timer specific routine.. 20.00 BogoMIPS (lpj=40009) Security Framework v1.0.0 initialized SELinux: Disabled at boot. Capability LSM initialized Mount-cache hash table entries: 512 NET: Registered protocol family 16 PCI: Probing for controllers. /[EMAIL PROTECTED],70: SCHIZO PCI Bus Module ver[4:0] /[EMAIL PROTECTED],70: PCI CFG[7ffee00] IO[7ffef00] MEM[7fe] /[EMAIL PROTECTED],60: SCHIZO PCI Bus Module ver[4:0] /[EMAIL PROTECTED],60: PCI CFG[7ffec00] IO[7ffed00] MEM[7fd] /[EMAIL PROTECTED],70: SCHIZO PCI Bus Module ver[4:0] /[EMAIL PROTECTED],70: PCI CFG[7ffea00] IO[7ffeb00] MEM[7fc] /[EMAIL PROTECTED],60: SCHIZO PCI Bus Module ver[4:0] /[EMAIL PROTECTED],60: PCI CFG[7ffe800] IO[7ffe900] MEM[7fb] PCI1(PBMB): Bus running at 33MHz PCI1(PBMA): Bus running at 66MHz PCI0(PBMB): Bus running at 33MHz PCI0(PBMA): Bus running at 66MHz ebus0: [flashprom] [bbc] [power] [i2c -> (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (temperature) (temperature) (temperature) (temperature) (temperature) (temperature) (temperature)] [i2c -> (controller) (smbus-ara) (controller) (temperature) (temperature) (temperature) (ioexp) (temperature) (controller) (adio) (adio) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (adio) (adio) (adio) (adio) (temperature-sensor) (fru) (fru) (fru) (fru) (fru) (fru) (rscrtc) (hotplug-controller) (hotplug-controller) (hotplug-controller) (hotplug-controller)] [bbc] [i2c -> (temperature) (temperature) (temperature)] 
[i2c -> (nvram) (idprom)] [rtc] [gpio] [pmc] [rsc-control] [rsc-console] [serial]
power: Control reg at 7fc7e30002e ... not using powerd.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
/[EMAIL PROTECTED],70/[EMAIL PROTECTED]/[EMAIL PROTECTED],300070: Clock regs at 07fc7e300070
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 1048576 bytes)
TCP established hash table entries: 524288 (order: 10, 8388608 bytes)
TCP bind hash table entries: 65536 (order: 6, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
checking if image is initramfs... it is
Freei
Re: Fix for sparc64 cpu hangs.
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Fri, 16 Nov 2007 22:17:07 +0100
>
>> The sysrq-g output is attached, I hope you can make sense out of it.
>> We'll also add some extra workload to the other machines here to try to
>> trigger the bug on other CPUs, too.
>
> I just got back from my vacation and started looking at these
> dumps. I think there might be some bug in cheetah_xcall_deliver(),
> I'll try to diagnose this some more.

I'm not sure if it is related, but non-SMP kernels don't boot at all on
the machine.

> If you cannot reproduce this bug on non-Ultra-III systems that
> would help confirm or deny my theory. Have you been able to
> trigger this on your Ultra-II machine for example? If so, what
> do the sysrq-g traces look like there?

Since your futex bugfix the Ultra-II machine has been running pretty
stably. I did not manage to trigger the bug there, but it was already
hard to trigger in the first place: even with a kernel without the futex
bugfix the machine just hangs at some random point, and I never managed
to reproduce the bug easily on US II.

Best regards,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: [ANNOUNCE] Aurora SPARC Linux Build 2.99 (Beta 2 for 3.0)
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sat, 01 Dec 2007 13:43:30 +0100
>
>>>> - Systems that boot off qlogic attached disks are not supported, because
>>>> there is no working firmware loader in anaconda, and the qlogic driver
>>>> needs firmware.
>>> That's very unfortunate, how are qlogic device handled on other
>>> platforms?
>> In Debian you just install the firmware package, udev will handle it
>> then. If you have to boot from it, you need to rebuild your initrd after
>> installing the firmware package.
>> The installer doesn't support non-free modules yet unfortunately, but
>> with some not too complicated tricks you can install Debian without
>> problems.
>
> I said "other platforms" as in x86, x86_64, powerpc.

Just the same. You can even install the firmware on hardware where you
wouldn't be able to use a qlogic card. It's only loaded if an
appropriate device is detected.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: [ANNOUNCE] Aurora SPARC Linux Build 2.99 (Beta 2 for 3.0)
>> - Systems that boot off qlogic attached disks are not supported, because
>> there is no working firmware loader in anaconda, and the qlogic driver
>> needs firmware.
>
> That's very unfortunate, how are qlogic device handled on other
> platforms?

In Debian you just install the firmware package; udev will handle it
then. If you have to boot from the device, you need to rebuild your
initrd after installing the firmware package.

Unfortunately the installer doesn't support non-free modules yet, but
with some not too complicated tricks you can install Debian without
problems.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: Fix for sparc64 cpu hangs.
Hi David,

> Please let me know if things go smoothly when the
> build becomes active again.

First the good news: the U60 here is still building and working fine,
and I haven't heard any bad news from lebrun.d.o.

The not so good news: the v880 (4x US III) here was hit by a stuck
process again, after running fine for some time. The machine didn't
freeze, though: one CPU was running at 100%, but otherwise the machine
was responsive. I think I'll also run a full diag in service mode to
make sure it's not a CPU bug.

The sysrq-g output is attached, I hope you can make sense out of it.
We'll also add some extra workload to the other machines here to try to
trigger the bug on other CPUs, too.

Best regards,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

Nov 16 21:40:57 titan kernel: [12019.840715] SysRq : Show Global CPU Regs
Nov 16 21:40:57 titan kernel: [12019.886698] CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[NULL:-1]
Nov 16 21:40:57 titan kernel: [12020.003361] TPC[atomic_sub_ret+0x0/0x30]
Nov 16 21:40:58 titan kernel: [12020.063757] O7[schedule+0x6dc/0x7a4]
Nov 16 21:40:58 titan kernel: [12020.120007] I7[do_syslog+0xfc/0x400]
Nov 16 21:40:58 titan kernel: [12020.176249] * CPU[ 1]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:40:58 titan kernel: [12020.295006] CPU[ 2]: TSTATE[11009602] TPC[0042fc30] TNPC[0042fc34] TASK[cat:4365]
Nov 16 21:40:58 titan kernel: [12020.412726] TPC[udelay+0x0/0x1c]
Nov 16 21:40:58 titan kernel: [12020.464809] O7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:40:58 titan kernel: [12020.534581] I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:40:58 titan kernel: [12020.604370] CPU[ 3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 16 21:40:58 titan kernel: [12020.723128] TPC[cpu_idle+0x94/0xb8]
Nov 16 21:40:58 titan kernel: [12020.778323] O7[cpu_idle+0xa8/0xb8]
Nov 16 21:40:58 titan kernel: [12020.832498] I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:05 titan ntpd[2766]: adjusting local clock by -20.711568s
Nov 16 21:41:26 titan kernel: [12048.836922] SysRq : Show Global CPU Regs
Nov 16 21:41:26 titan kernel: [12048.882885] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:41:26 titan kernel: [12049.001617] CPU[ 1]: TSTATE[009911009602] TPC[00407af0] TNPC[00407af4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.120373] TPC[__tsb_context_switch+0xf0/0x100]
Nov 16 21:41:27 titan kernel: [12049.189109] O7[schedule+0x514/0x7a4]
Nov 16 21:41:27 titan kernel: [12049.245354] I7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.299516] CPU[ 2]: TSTATE[11009603] TPC[0042faa0] TNPC[0042fc18] TASK[cat:4365]
Nov 16 21:41:27 titan kernel: [12049.417244] TPC[stick_get_tick+0x10/0x14]
Nov 16 21:41:27 titan kernel: [12049.478681] O7[__delay+0x28/0x48]
Nov 16 21:41:27 titan kernel: [12049.531809] I7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:41:27 titan kernel: [12049.601598] CPU[ 3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.720351] TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:27 titan kernel: [12049.775551] O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.829725] I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:28 titan kernel: [12050.571422] SysRq : Show Global CPU Regs
Nov 16 21:41:28 titan kernel: [12050.617320] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:41:28 titan kernel: [12050.736074] CPU[ 1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 16 21:41:28 titan kernel: [12050.854834] TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:28 titan kernel: [12050.923565] O7[sched_clock+0x10/0x30]
Nov 16 21:41:29 titan kernel: [12050.980856] I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:29 titan kernel: [12051.046480] CPU[ 2]: TSTATE[11009602] TPC[00441a78] TNPC[00441a7c] TASK[cat:4365]
Nov 16 21:41:29 titan kernel: [12051.164194] TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:29 titan kernel: [12051.235018] O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:29 titan kernel: [12051.303771] I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:29 titan kernel: [12051.373560] CPU[ 3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Re: Fix for sparc64 cpu hangs.
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Wed, 07 Nov 2007 15:35:42 +0100
>
>>> But I did the artificial tests, like running dpkg-query --search libc.so.6
>>> in loops, and this seems to work well. Thanks a lot!
>>>
>> I was running aptitude -u in a loop for half an hour now, and it didn't
>> crash, so I assume that fixed the bug. Many thanks for the patch David!
>
> Many thanks for helping me track it down.

You're welcome! The v880 is still running fine. During the next days
I'll set up the services which were supposed to run on the machine, so
we'll soon see how it behaves under a higher load for a longer time.

Thanks again for looking into this annoying bug!

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: klibc sparc trouble with gcc > 4.0
Oleg Verych wrote:
> == Mon, Nov 05, 2007 at 02:55:45PM +0100, maximilian attems ==
> []
>> titan:~# strace -vfF /usr/lib/klibc/bin/fstype
>> execve("/usr/lib/klibc/bin/fstype", ["/usr/lib/klibc/bin/fstype"],
>> ["SHELL=/bin/bash", "TERM=xterm", "SSH_CLIENT=[myip] 39403"...,
>> "SSH_TTY=/dev/pts/0", "USER=root",
>> "LS_COLORS=no=00:fi=00:di=01;34:l"...,
>> "PATH=/usr/local/sbin:/usr/local/"..., "MAIL=/var/mail/root",
>> "PWD=/root", "LANG=en_US.UTF-8", "PS1=\\h:\\w\\$ ", "HOME=/root",
>> "SHLVL=2", "LS_OPTIONS=--color=auto", "LOGNAME=root",
>> "SSH_CONNECTION=[myip] 3"..., "_=/usr/bin/strace", "OLDPWD=/"]) = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>
> gdb doesn't work/help?

(gdb) where
#0 0x8000faac in ?? ()
#1 0x8000facc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Not sure if this is a gdb problem, though - I never even tried to debug
klibc. With the mentioned patch klibc compiles, but all utils just
segfault; the strace is as short as seen above.

>
> []
>> +++ b/usr/klibc/libgcc/__clzdi2.c
>> @@ -0,0 +1,23 @@
>> +/*
>> + * __clzdi2 - Returns the leading number of 0 bits in the argument
>> + */
>> +

Without this patch it doesn't compile at all:

KLIBCLD usr/klibc/libc.so
ld: sparc architecture of input file `/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o)' is incompatible with sparc:v9 output
ld: sparc architecture of input file `/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clz.o)' is incompatible with sparc:v9 output
/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o): In function `__clzdi2':
(.text+0xc): undefined reference to `_GLOBAL_OFFSET_TABLE_'
/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o): In function `__clzdi2':
(.text+0x14): undefined reference to `_GLOBAL_OFFSET_TABLE_'
make[3]: *** [usr/klibc/libc.so] Error 1
make[2]: *** [all] Error 2
make[1]: *** [klibc] Error 2
make[1]: Leaving directory `/root/klibc-1.5.7'

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: Fix for sparc64 cpu hangs.
> But I did the artificial tests, like running dpkg-query --search libc.so.6
> in loops, and this seems to work well. Thanks a lot!
>

I was running aptitude -u in a loop for half an hour now, and it didn't
crash, so I assume that fixed the bug. Many thanks for the patch David!

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Tue, 06 Nov 2007 04:51:07 +0100
>
>> Here's also some output from apt-get which got stuck in my unstable
>> chroot while I wanted to retrieve the klibc source to try to debug it...
>
> So the good news is that I started getting the hang seen
> on the Debian buildd on my workstation.
>
> The bad news is that it's very sporadic, for a while I
> could trigger it during bootup, on every boot, and now
> I can't get it to wedge at all.
>
> Anyways, we're getting closer.

Running stress -c 2 on a 4 CPU machine made things much worse here, so
it will probably help you to trigger the bug, too.

Our US II machine is also running just fine at the moment.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
4070] SysRq : Show Global CPU Regs
Nov 6 04:43:35 titan kernel: [100912.520982] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov 6 04:43:35 titan kernel: [100912.641822] CPU[ 1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 6 04:43:35 titan kernel: [100912.761614] TPC[update_stats_wait_end+0x24/0x88]
Nov 6 04:43:35 titan kernel: [100912.831396] O7[sched_clock+0x10/0x30]
Nov 6 04:43:35 titan kernel: [100912.889728] I7[pick_next_task_fair+0x24/0x44]
Nov 6 04:43:35 titan kernel: [100912.956393] CPU[ 2]: TSTATE[004411009601] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 6 04:43:35 titan kernel: [100913.076191] TPC[cpu_idle+0x94/0xb8]
Nov 6 04:43:35 titan kernel: [100913.132429] O7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:35 titan kernel: [100913.187640] I7[after_lock_tlb+0x19c/0x1b0]
Nov 6 04:43:35 titan kernel: [100913.251181] CPU[ 3]: TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] TASK[apt-get:11759]
Nov 6 04:43:36 titan kernel: [100913.375147] TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 6 04:43:36 titan kernel: [100913.447011] O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 6 04:43:36 titan kernel: [100913.516805] I7[flush_dcache_page_all+0x178/0x240]
Nov 6 04:43:36 titan kernel: [100914.295153] SysRq : Show Global CPU Regs
Nov 6 04:43:37 titan kernel: [100914.342110] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov 6 04:43:37 titan kernel: [100914.462949] CPU[ 1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 6 04:43:37 titan kernel: [100914.582741] TPC[update_stats_wait_end+0x24/0x88]
Nov 6 04:43:37 titan kernel: [100914.652524] O7[sched_clock+0x10/0x30]
Nov 6 04:43:37 titan kernel: [100914.710855] I7[pick_next_task_fair+0x24/0x44]
Nov 6 04:43:37 titan kernel: [100914.777522] CPU[ 2]: TSTATE[009911009601] TPC[0042888c] TNPC[00428890] TASK[swapper:0]
Nov 6 04:43:37 titan kernel: [100914.897319] TPC[cpu_idle+0x80/0xb8]
Nov 6 04:43:37 titan kernel: [100914.953557] O7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:37 titan kernel: [100915.008768] I7[after_lock_tlb+0x19c/0x1b0]
Nov 6 04:43:37 titan kernel: [100915.072310] CPU[ 3]: TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] TASK[apt-get:11759]
Nov 6 04:43:37 titan kernel: [100915.196274] TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 6 04:43:37 titan kernel: [100915.268140] O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 6 04:43:38 titan kernel: [100915.337932] I7[flush_dcache_page_all+0x178/0x240]

Sorry to the klibc people - I'll try it again later.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
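
With many sysrq-g dumps collected in a loop, it can help to extract mechanically which CPU sits in which function in each dump. The following is a minimal Python sketch, not a tool used in this thread. It assumes the log has one syslog entry per line in the format shown above, and the names `parse_dumps` and `cpus_in_symbol` are made up for illustration.

```python
import re
import sys

# "CPU[ 3]: TSTATE[...] TPC[...] TNPC[...] TASK[apt-get:11759]"
CPU_RE = re.compile(r"CPU\[\s*(\d+)\]:.*TASK\[([^\]:]*):(-?\d+)\]")
# "TPC[cheetah_xcall_deliver+0x174/0x23c]", likewise for O7 and I7
SYM_RE = re.compile(r"\b(TPC|O7|I7)\[([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+\]")


def parse_dumps(text):
    """Return one dict per sysrq-g dump: {cpu: {"task", "pid", "syms"}}."""
    dumps = []
    current = None
    cpu = None
    for line in text.splitlines():
        if "Show Global CPU Regs" in line:
            current = {}          # a new dump starts here
            dumps.append(current)
            cpu = None
            continue
        if current is None:
            continue              # ignore anything before the first dump
        match = CPU_RE.search(line)
        if match:
            cpu = int(match.group(1))
            current[cpu] = {
                "task": match.group(2),
                "pid": int(match.group(3)),
                "syms": {},
            }
            continue
        match = SYM_RE.search(line)
        if match and cpu is not None:
            # Symbol lines belong to the most recent CPU header line.
            current[cpu]["syms"][match.group(1)] = match.group(2)
    return dumps


def cpus_in_symbol(dumps, symbol):
    """For each dump, list the CPUs whose TPC/O7/I7 resolve to `symbol`."""
    return [
        sorted(cpu for cpu, info in dump.items()
               if symbol in info["syms"].values())
        for dump in dumps
    ]


if __name__ == "__main__":
    with open(sys.argv[1]) as log:
        print(cpus_in_symbol(parse_dumps(log.read()), "cheetah_xcall_deliver"))
```

A CPU that shows up for the same symbol in every dump is a good candidate for the one that is not making forward progress.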
Re: unkillable dpkg-query processes
> In the meantime I'll build an aptitude which should exit after running
> through the part which usually crashed, so it should be possible to run
> it in a loop...

This was successful: it made crashing the machine pretty simple, even
without libnss-db activated.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done

Some of the aptitudes hit a SIGABRT before one got stuck.

Best regards,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

aptitude.diff
Description: application/pgp-keys

aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sun, 04 Nov 2007 20:55:20 +0100
>
>> So I'm not sure if the result is really useful for you - if not just let
>> me know. I've attached the last ~10-20 sysrq-g outputs - as it was
>> running in a loop I have a ton of them. In case you're wondering: http
>> is aptitude's http method.
>
> The http module is stuck in a different place, I'll try to
> see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running
through the part which usually crashed, so it should be possible to run
it in a loop...

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
> Ok, the key in the trace is:
>
> Nov 2 16:25:30 titan kernel: [ 978.134874] CPU[ 1]:
> TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4]
> TASK[aptitude:3204]
> Nov 2 16:25:30 titan kernel: [ 978.257809]
> TPC[_write_unlock_irq+0x20/0x110]
> ...
> Nov 2 16:25:30 titan kernel: [ 978.507778] CPU[ 3]:
> TSTATE[11009605] TPC[004419f8] TNPC[004419fc]
> TASK[aptitude:3203]
> Nov 2 16:25:30 titan kernel: [ 978.630707]
> TPC[cheetah_xcall_deliver+0x174/0x23c]
>
> The first symbol is misleading, it says _write_unlock_irq but actually
> in the assembler the PC is in the spinlock read spinning loop
> section. So actually it's hanging in _spin_lock().
>
> CPU #3 is trying to send a cross-call message interrupt, but for
> some reason that isn't making forward progress.
>
> Let's see what's calling these things by adding some more debugging
> information. Please retry the test with the following patch on
> top of the original sysrq-g debugging patch and please get new
> logs when it hangs.

Today I was a bit out of luck: either the machine crashed so badly that
it didn't react to anything anymore, or it didn't crash at all.

The machine ran amok a bit more slowly when I did the following things,
which also resulted in the attached sysrq output:

- run stress -c 2 to get the load up (I didn't need that the last
  time...)
- run something like
  `while true; do echo g > /proc/sysrq-trigger; sleep 0.5; done`
- run aptitude -u several times until the machine died.

So I'm not sure if the result is really useful for you - if not, just
let me know. I've attached the last ~10-20 sysrq-g outputs; as it was
running in a loop I have a ton of them. In case you're wondering: http
is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on,
but there it always took longer until it crashed, so we'll see if it
happens at all.

Thanks for your work,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

sysrq2.txt
Description: application/pgp-keys
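
The periodic sysrq trigger from the steps above can also be written as a small script, which makes it easier to bound the number of dumps. This is a minimal Python sketch of the same loop, not something used in this thread. It assumes a kernel where writing `g` to /proc/sysrq-trigger prints the global CPU registers (as with the debugging patch discussed here) and that it runs as root; the function names `write_sysrq` and `sysrq_loop` are hypothetical.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def write_sysrq(key, path=SYSRQ_TRIGGER):
    """Send a single sysrq key by writing it to the trigger file."""
    with open(path, "w") as trigger:
        trigger.write(key)


def sysrq_loop(write, key="g", count=20, interval=0.5):
    """Call write(key) `count` times, sleeping `interval` seconds in between.

    `write` is passed in so the loop can be tried out without touching /proc.
    Returns the number of keys sent.
    """
    sent = 0
    for _ in range(count):
        write(key)
        sent += 1
        time.sleep(interval)
    return sent


if __name__ == "__main__":
    # Roughly: while true; do echo g > /proc/sysrq-trigger; sleep 0.5; done
    # but limited to 20 dumps so the log stays readable.
    sysrq_loop(write_sysrq)
```

Bounding the loop keeps the last dumps before a hang from being buried under thousands of identical ones.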
Re: unkillable dpkg-query processes
David Miller wrote:
> From: David Miller <[EMAIL PROTECTED]>
> Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)
>
>> I'm working on a kernel patch for 2.6.23 that will allow you to get
>> some useful debugging information in situations like this.
>>
>> I'll try to get you that patch by the end of tonight.
>
> As promised, here is the patch below.

Thanks for the patch. Applied and used libnss-db + aptitude -u to hang
the machine. I've sent g several times to sysrq, output is attached.
According to top the two hanging aptitude processes were running on
CPU 1 + 3.

3204 root 20 0 19552 5088 4072 R 100 0.1 6:54.49 1 aptitude
3203 root 20 0 19552 5088 4072 R 100 0.1 6:56.39 3 aptitude

Cheers,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

sysrq-g.txt
Description: application/pgp-keys
Re: unkillable dpkg-query processes
> The futex() calls are definitely from libnss-db.

And on Lenny/testing we have futex calls from libc6.

I didn't have time to come up with any instructions yet, as we have a
public holiday today. I'll try to finish them tomorrow.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Tue, 30 Oct 2007 01:50:30 +0100
>
>> What we're missing here is a probably important piece:
>>
>> If dpkg-query is running during a build, it is running in a fakeroot
>> environment. I've straced that, see the attachment.
>>
>> What I find in the strace are at least several clones, which is the
>> point where aptitude -u crashed according to the straces in
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
>
> Thanks for the fakeroot trace.
>
> I am pretty sure the clone()'s we see here are just normal
> fork()'s, in both the fakeroot's dpkg-query and the aptitude
> case.

I just grepped through the source of aptitude; there's only one fork,
and that one should not be executed if aptitude has been started as
root. If it is of any use to you, I can figure out which piece of code
results in the call to clone, or which piece of code results in the use
of futexes there.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
>>> mount -t devpts none /dev/pts
>> mount --bind /dev /thechroot/dev
>> is what I use here, running udev in a chroot is no fun.
>
> Ok.

AFAIK the buildds only have a minimal /dev, though. But to bootstrap a
system that's usually not enough.

> Let's stick to 2.6.23 testing for pinpointing these bugs.

Ok. Do you have a .deb with a kernel for me? If not, would you like to
have any specific options enabled? I'll have to build one then.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
>> Here you go.
>>
>> (Mind, this is capturing the current status of the chroot, which is fairly
>> unclean, because right now it happens to be building python-qt4-4.3.1.)
>
> What we're missing here is a probably important piece:
>
> If dpkg-query is running during a build, it is running in a fakeroot
> environment. I've straced that, see the attachment.

What I forgot to mention: this strace was taken as a non-root user, of
course. I'm not sure what fakeroot does if it's called as root.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
> Here you go.
>
> (Mind, this is capturing the current status of the chroot, which is fairly
> unclean, because right now it happens to be building python-qt4-4.3.1.)

What we're missing here is a probably important piece:

If dpkg-query is running during a build, it is running in a fakeroot environment. I've straced that, see the attachment.

What I find in the strace are at least several clones, which is the point where aptitude -u crashed according to the straces in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

execve("/usr/bin/fakeroot", ["fakeroot", "dpkg-query", "-S", "libc.so.6"], [/* 12 vars */]) = 0
brk(0) = 0xca000
uname({sys="Linux", node="titan", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fba000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=12402, ...}) = 0
mmap(NULL, 12402, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7fb4000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libncurses.so.5", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\263"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=208688, ...}) = 0
mmap(NULL, 208480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f8
mmap(0xf7fb, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0xf7fb
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libdl.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\f"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=18216, ...}) = 0
mmap(NULL, 82432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f68000
mprotect(0xf7f6c000, 57344, PROT_NONE) = 0
mmap(0xf7f7a000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0xf7f7a000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\1\364"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1419756, ...}) = 0
mmap(NULL, 1489032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7dfc000
mprotect(0xf7f5, 65536, PROT_NONE) = 0
mmap(0xf7f6, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0xf7f6
mmap(0xf7f66000, 6280, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f66000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fde000
mprotect(0xf7f7a000, 8192, PROT_READ) = 0
munmap(0xf7fb4000, 12402) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open("/dev/tty", O_RDWR|O_NONBLOCK|O_LARGEFILE) = 3
close(3) = 0
brk(0) = 0xca000
brk(0xec000) = 0xec000
getuid32() = 1000
getgid32() = 1000
geteuid32() = 1000
getegid32() = 1000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
time(NULL) = 1193705202
open("/proc/meminfo", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fb8000
read(3, "MemTotal: 8314712 kB\nMemFre"..., 1024) = 624
close(3) = 0
munmap(0xf7fb8000, 8192) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
uname({sys="Linux", node="titan", ...}) = 0
stat64("/home/foo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) =
Re: unkillable dpkg-query processes
> mount -t devpts none /dev/pts

mount --bind /dev /thechroot/dev
is what I use here, running udev in a chroot is no fun.

> So, it's a lot more than just running the appropriate debootstrap
> command.

I'm almost done with a howto that is 95% cut&paste to debootstrap and boot a Debian system. Unfortunately it doesn't boot, as the klibc (which is used in the initramfs) is broken on sparc again... So I'll modify it to set up a proper chroot only; it should also be possible to boot into it if you use the kernel/initrd from Ubuntu. This should allow Josip and you to set up a complete chroot.

> I have done a GCC package build and am now running a libc6 build under
> this lenny chroot and haven't hit any problems yet.

The following things also like to crash here (on Etch, not in a chroot):
- running aptitude -u several times (at least with libnss-db installed)
- since I've installed 2.6.24-rc1: vgdisplay (with and without active libnss-db)

> BTW, in your buildroot, can you do something like:
>
> strace -o x.log dpkg-query -S libc.so.6

There are some comparisons of the strace of aptitude -u in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
Probably interesting, as there are futexes in the game. The interesting thing is that it didn't crash the machine while running under strace.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Mon, 29 Oct 2007 02:18:30 +0100
>
>> But if this bug isn't fixed chances are good that the next Debian
>> release won't support Sparc at all.
>
> Please don't use pseudo-threats like this, it only deters me even more
> from working on this bug.

This was not meant as a threat. It's just a fact, and it is the reason why I'm spending way too much time on trying to make this bug reproducible, and also the reason why we're annoying you these days. Sorry for that.

>> This explains why you have trouble to reproduce this, while Josip and me
>> get hit by this bug way too often.
>
> Josip stated explicitly that he has a SunFire280R, which disagrees
> with what you're saying here.

Sorry, I mixed something up here. I was somehow sure that they were using a v440, but it was somebody else.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Sorry, but if there were an easy test case we'd be more than happy to present it - unfortunately there is not. This is more than annoying for us all.

But if this bug isn't fixed chances are good that the next Debian release won't support Sparc at all.

> I have ubuntu gutsy on my SunFire280R, so I can debootstrap
> debian chroots or whatever is needed to trigger this.

You need a Blade 1000/2000 or v440/v880 or an enterprise class machine to reproduce this more easily (still assuming that we're facing the same bug here - at least the symptoms are the same). Those machines use repeater chips as interconnect between two CPUs (and between pairs of CPUs for larger machines); according to my contact at Sun, this is similar to what's implemented in one US IV CPU.

This explains why you have trouble to reproduce this, while Josip and me get hit by this bug way too often.

On all other machines using CPUs <= US III I have no idea how to reproduce this easily - you just get hit by it after $random builds. I don't have access to more recent hardware.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
> Hi,
>
>> Since mono team decided that the mono is broken on Sparc (and despite
>> the fix provided by David Miller), I had to rebuild after enabling the
>> sparc arch in the source.
> Trying this at the moment.

Not reproducible - mono fails to build from source in sid, so it doesn't reach the interesting part of dh_shlibdeps...

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Hi,

> Since mono team decided that the mono is broken on Sparc (and despite
> the fix provided by David Miller), I had to rebuild after enabling the
> sparc arch in the source.
>
> The hangs happens always at the end of the build when invoking
> dh_shgenlibs in the build.
>
> This is not 100% reproducible even in my env.

Trying this at the moment.

> Second was sun blade 2000 SMP with Ubuntu gutsy, I wasn't able to update
> the xemacs21 package.
> The machine hanged with invoking the post installation script.

Does the Blade run with one or two CPUs? If I remember right, it can run with a single CPU, which has to be inserted in a special slot/carrier for that. With two CPUs it should use the same repeater chips and architecture as the v440, v880 and larger machines.

Cheers,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Hi,

please note that the futex bug also happens on US II machines; it is just almost impossible to reproduce there - it'll just hang after random days of building.

> Everyone who sees these UltraSPARC-III problems please send me PRECISE
> and FULL description of how to install from scratch a machine and run
> something that will trigger these errors.

Can you please check if the kernel config I've attached to one of my last mails is fine for you? The normal Debian installer doesn't boot on the US III machines which use two CPUs on one board, as the installer's kernel is a non-SMP kernel. The result is that the machine throws a CPU exception and needs to be power-cycled. I've started to investigate this with the help of a contact from Sun, but neither of us had the time to finish it. See
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440720
if you want to have a look; please ignore those troll postings from chealer in between...

So to give you a recipe to install Debian on such a box, I need to build an installer with an SMP kernel for you. If the config is fine for your needs, I could just use it.

The other option is to use debootstrap, if you have some system on the machine already - so if you want to use that instead of messing with a network installer, please let me know. Debootstrap should run on most systems, as long as they have ar/tar/gunzip and a bash (probably sh is enough...). It would be faster to use that, and faster to write a recipe for it.

I'll mark all Qlogic firmware related points, so the recipe should work on machines with FC (v440, v880, probably the Enterprise models, too) and without FC (I guess the Blade 1000 and 2000).

If you don't have access to a US-III machine, I can find a way to give you access to the RSC and serial console of our machine.

Cheers,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC:
>> 0042f928 Y: Not tainted
>> [29074.884191] TPC:
>
> What kind of OOPS is this? Please provide the kernel log messages
> that appeared right before these register dumps.

Oct 28 03:25:12 titan kernel: [29074.698695] BUG: soft lockup - CPU#0 stuck for 11s! [sh:4252]

This happened while a cronjob was running which updates the libnss-db database... With an older kernel (2.6.23-rcsomething) this didn't crash the machine.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
>> I think things got worse with 2.6.24...
>> The machine shoots itself now, I guess by running cron jobs or so.
>>
>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC:
>> 0042f928 Y: Not tainted
>> [29074.884191] TPC:
>
> What kind of OOPS is this? Please provide the kernel log messages
> that appeared right before these register dumps.

I'll boot the machine and check the logs; I was not in the mood to do this tonight. The pasted messages were dumped on the serial console. As the machine didn't show any reaction, I just powered it down...

Cheers,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
I think things got worse with 2.6.24...
The machine shoots itself now, I guess by running cron jobs or so.

[29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 0042f928 Y: Not tainted
[29074.884191] TPC:
[29074.929988] g0: g1: 004417ec g2: g3:
[29075.034163] g4: f8a00493a4e0 g5: f89fff97c000 g6: f8a006c64000 g7:
[29075.138329] o0: o1: f8a006c67968 o2: 0008 o3: 0001
[29075.242493] o4: 3385 o5: sp: f8a006c67011 ret_pc: 0042f980
[29075.350830] RPC:
[29075.392482] l0: 0020 l1: l2: 0096 l3:
[29075.496658] l4: 0200 l5: 0001c5569e6c l6: 0006c390404c l7: 6204052f31ec823e
[29075.600824] i0: 0044d100 i1: 00b0fcc2c000 i2: i3:
[29075.704989] i4: 0040 i5: 007a0578 i6: f8a006c670d1 i7: 004420d8
[29075.809161] I7:
[29075.867493] BUG: soft lockup - CPU#2 stuck for 11s! [sh:4253]
[29075.936259] TSTATE: 11009600 TPC: 004417a8 TNPC: 004417ac Y: Not tainted
[29076.053980] TPC:
[29076.113311] g0: g1: g2: g3:
[29076.217483] g4: f8a0048f9260 g5: f89fff98c000 g6: f8a006c7 g7:
[29076.321648] o0: 0020 o1: f8a006c73968 o2: 0002 o3: 0001
[29076.425816] o4: 781b o5: sp: f8a006c73011 ret_pc: 004416a0
[29076.534150] RPC:
[29076.592471] l0: 0008 l1: l2: 0096 l3:
[29076.696645] l4: 0200 l5: 0001c5569e6c l6: 0006c3904054 l7: 7e645445948ed154
[29076.800811] i0: 0044d100 i1: 00b0fcf8 i2: i3:
[29076.904977] i4: 0040 i5: 007a0578 i6: f8a006c730d1 i7: 004420d8
[29077.009144] I7:

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
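When a syslog contains many of these dumps, it can help to first list only the soft-lockup headers to see which CPUs and processes are affected. Below is a minimal Python sketch that extracts those fields from a line in the format shown above. The function name `parse_soft_lockup` and the returned dictionary keys are hypothetical, and the pattern assumes the "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" wording seen in this thread, which other kernel versions may phrase differently.

```python
import re

# Pattern for the soft-lockup header line as it appears in this thread.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, duration, command and PID of a soft-lockup line, or None."""
    match = SOFT_LOCKUP.search(line)
    if match is None:
        return None  # some other kernel message, e.g. a register dump line
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("secs")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
    }
```

Applied to "[29075.867493] BUG: soft lockup - CPU#2 stuck for 11s! [sh:4253]" it yields CPU 2, 11 seconds, command "sh" and PID 4253; register dump lines return None and can be skipped.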
Re: unkillable dpkg-query processes
> Luckily much more output of sysrq is in the syslog, so I should be able to
> mail it later when the
> machine is finished with rebooting (which takes some time...).

The sysrq output from the syslog and my kernel config are attached to this mail.

Cheers,

Bernd

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

Attachments:
config-2.6.24-rc1-git2+bzed-farm1.gz (GNU Zip compressed data)
syslog.gz (GNU Zip compressed data)
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
>>> For those who can reproduce it an have something like libnss-db
>>> enabled, try disabling it.
>
> - disabled it
> - running vgdisplay killed the machine (wanted to create a new LV for a
> chroot)... it's not accessible at all anymore, I think the kernel is
> a 2.6.23-something here, I'll build a recent one and give it a try
> again
>
> Will take some time as I need to build on USII...

I just wanted to write that I'm not able to reproduce this bug anymore... but running aptitude -u often enough gave me this nice output:

titan:~#
[ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! [aptitude:13375]
[ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 0042f7d0 Y: Not tainted
[ 2427.506821] TPC: <__delay+0x1c/0x48>
[ 2427.549494] g0: 9000 g1: 0042f7d0 g2: g3:
[ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 g7:
[ 2427.757835] o0: 0020 o1: 0020 o2: o3:
[ 2427.862001] o4: 0030a0d0 o5: sp: f8a007dff071 ret_pc: 0042f938
[ 2427.970337] RPC: <__delay+0x18/0x48>
[ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 l3: 0400
[ 2428.117206] l4: l5: 0001 l6: l7: 0008
[ 2428.221374] i0: i1: f8a007dffa88 i2: 0004 i3: 0001
[ 2428.325538] i4: i5: i6: f8a007dff131 i7: 004417ec
[ 2428.429710] I7:

and an unkillable, CPU-eating aptitude.

While retrieving some info using sysrq, the machine froze after echoing m into sysrq-trigger, producing this output while dying:

[ 3680.006794] BUG: soft lockup - CPU#1 stuck for 11s! [pdflush:265]
[ 3680.078838] TSTATE: 80009603 TPC: 004417a8 TNPC: 004417ac Y: Not tainted
[ 3680.196551] TPC:
[ 3680.255881] g0: g1: g2: 0001869e g3:
[ 3680.360055] g4: f8a0048e3260 g5: f89fff984000 g6: f8a00717c000 g7:
[ 3680.464220] o0: 0020 o1: f8a00717f418 o2: f8a005a84040 o3: 0010
[ 3680.568384] o4: 0015 o5: sp: f8a00717eac1 ret_pc: 004416e4
[ 3680.676719] RPC:
[ 3680.735042] l0: 0002 l1: 0002 l2: 0096 l3:
[ 3680.839217] l4: l5: f8a0048d3cd8 l6: 00024098 l7: f7d31000
[ 3680.943382] i0: 0044d100 i1: 00b0f60f8000 i2: i3: 0001
[ 3681.047548] i4: 0001 i5: 0001 i6: f8a00717eb81 i7: 00442be4
[ 3681.151717] I7:

Luckily much more output of sysrq is in the syslog, so I should be able to mail it later when the machine is finished with rebooting (which takes some time...).

2.6.24-rc1-git2 (SMP)
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom : OBP 4.22.34 2007/07/23 13:01
type : sun4u
ncpus probed : 4
ncpus active : 4
D$ parity tl1 : 0
I$ parity tl1 : 0
Cpu0ClkTck : 2cb41780
Cpu1ClkTck : 2cb41780
Cpu2ClkTck : 2cb41780
Cpu3ClkTck : 2cb41780
MMU Type : Cheetah
State:
CPU0: online
CPU1: online
CPU2: online
CPU3: online

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
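The CpuNClkTck fields in the /proc/cpuinfo output above are hexadecimal. Assuming they give the CPU clock tick rate in Hz (an assumption about the sparc64 cpuinfo format, not something stated in this thread), a small Python sketch can turn them into clock speeds. The function name `cpu_clocks_mhz` is hypothetical.

```python
import re

# Assumed layout of the sparc64 /proc/cpuinfo lines: "Cpu<N>ClkTck : <hex>".
CLKTCK = re.compile(r"Cpu(\d+)ClkTck\s*:\s*([0-9a-fA-F]+)")


def cpu_clocks_mhz(cpuinfo_text):
    """Map CPU number to clock frequency in MHz, assuming ClkTck is in Hz."""
    return {
        int(cpu): int(ticks, 16) / 1_000_000
        for cpu, ticks in CLKTCK.findall(cpuinfo_text)
    }
```

With the value from this machine, `cpu_clocks_mhz("Cpu0ClkTck : 2cb41780")` returns `{0: 750.0}`, since 0x2cb41780 is 750,000,000. Under the stated assumption, all four CPUs would be running at 750 MHz.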
Re: unkillable dpkg-query processes
>> For those who can reproduce it and have something like libnss-db
>> enabled, try disabling it.

- disabled it
- running vgdisplay killed the machine (wanted to create a new LV for a chroot)... it's not accessible at all anymore. I think the kernel is a 2.6.23-something here; I'll build a recent one and give it a try again.

Will take some time as I need to build on USII...

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Josip Rodin wrote:
> On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
>>> Josip, do you guys have libnss-db or similar in use on the buildd
>>> machine?
>> They have, that's what Debian's userdir-ldap uses.
>
> No, I have to correct you, this machine isn't part of that setup
> (at least not yet).
>

Oh ok, I stand corrected - I thought it would have it.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
> Josip, do you guys have libnss-db or similar in use on the buildd
> machine?

They have, that's what Debian's userdir-ldap uses.

> For those who can reproduce it and have something like libnss-db
> enabled, try disabling it.

Will do in a few minutes.

--
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Re: unkillable dpkg-query processes
Hi,

I just got linked to this thread, so here's a bit of input from me :)

>> 1) system type
>
> A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III
> (Cheetah), and 2 GB of RAM. If you need more info, just say.
>
> (Bernd Zeimetz has previously suggested that the problem is linked to
> the processor type, the USIII.)

It seems to hit USIII machines with 2 CPUs in one tray much harder than US II, but once a month our Ultra60 (running two US II) has the same issues. It got much better since 179c85ea53bef807621f335767e41e23f86f01df, though; before the mentioned patch it died a few times per day. It seems it got better on the USIII here, too (we have a v880 here, the large version of Josip's machine, with 2x 2 CPUs), but it still dies way too often and is just not usable in its current state.

>
>> 2) compiler used to build kernel and is it SMP?
>
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Same compiler here. Please note that non-SMP kernels do not boot on those US-III machines at all (at least I didn't find a single one which does).

Cheers,

Bernd
Re: unkillable dpkg-query processes
Hi,

> It seems that instead of getting stuck in the kernel where I
> thought it would, the process gets stuck elsewhere and
> also tends to loop allocating memory until all memory in the
> machine is exhausted and the OOM killer starts to try and
> kill processes left and right.

At least it runs with 100% CPU. Attaching strace to the pid doesn't give any results. strace-ing the whole process doesn't result in more useful output either, but the hanging processes were killable when they were running under strace...

Cheers,

Bernd