Re: unkillable dpkg-query processes
From: Josip Rodin [EMAIL PROTECTED]
Date: Fri, 2 Nov 2007 17:21:06 +0100

> Great. Here you go, three of them, while the load was 3 and this process was stuck:
>
> buildd 10813 100 0.8 987368 17504 ? RN 14:44 155:49 dpkg-query --search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3
> ...
> Nov 2 17:02:04 lebrun kernel: CPU[ 0]: TSTATE[80009604] TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
> Nov 2 17:02:04 lebrun kernel: TPC[sparc64_realfault_common+0x8/0x20]

It looks like dpkg-query is stuck on a page fault. Typically this means the fault processing is not putting a valid translation into the TLB to satisfy the fault, so we loop forever, never making forward progress.

I've had to debug something similar to this before, so I'll piece together a debugging patch you can use to get more information.

-
To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unkillable dpkg-query processes
> Ok, the key in the trace is:
>
> Nov 2 16:25:30 titan kernel: [ 978.134874] CPU[ 1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
> Nov 2 16:25:30 titan kernel: [ 978.257809] TPC[_write_unlock_irq+0x20/0x110]
> ...
> Nov 2 16:25:30 titan kernel: [ 978.507778] CPU[ 3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
> Nov 2 16:25:30 titan kernel: [ 978.630707] TPC[cheetah_xcall_deliver+0x174/0x23c]
>
> The first symbol is misleading: it says _write_unlock_irq, but in the assembler the PC is actually in the spinlock read spinning loop section. So it is really hanging in _spin_lock().
>
> CPU #3 is trying to send a cross-call message interrupt, but for some reason that isn't making forward progress.
>
> Let's see what's calling these things by adding some more debugging information. Please retry the test with the following patch on top of the original sysrq-g debugging patch and please get new logs when it hangs.

Today I was a bit out of luck: either the machine crashed so badly that it didn't react to anything anymore, or it didn't crash at all. The machine went amok a bit more slowly when I did the following things, which also produced the attached sysrq output:

- run stress -c 2 to get the load up (I didn't need that the last time)
- run something like `while true; do echo g > /proc/sysrq-trigger; sleep 0.5; done`
- run aptitude -u several times until the machine died

So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on - but it always took a longer time until it crashed, so we'll see if it happens at all.

Thanks for your work,
Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

sysrq2.txt
Description: application/pgp-keys
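For anyone sifting through a large pile of such dumps, here is a rough sketch of how the per-CPU lines could be tallied to see which task sits at which symbol most often. It assumes the exact line format quoted in the log excerpt in this message; the names `parse_trace` and `hot_spots` are made up for illustration and are not part of any patch in this thread.

```python
import re
from collections import Counter

# Per-CPU register line, in the form quoted in this thread.
CPU_RE = re.compile(
    r"CPU\[\s*(?P<cpu>\d+)\]:\s+TSTATE\[(?P<tstate>[0-9a-f]+)\]\s+"
    r"TPC\[(?P<tpc>[0-9a-f]+)\]\s+TNPC\[(?P<tnpc>[0-9a-f]+)\]\s+"
    r"TASK\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Follow-up line that resolves the TPC to symbol+offset/size.
SYM_RE = re.compile(r"TPC\[(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\]")


def parse_trace(lines):
    """Yield one dict per CPU entry: cpu, comm, pid, tpc, symbol (or None)."""
    current = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            if current is not None:
                yield current
            current = {
                "cpu": int(m["cpu"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "tpc": m["tpc"],
                "symbol": None,
            }
            continue
        m = SYM_RE.search(line)
        # Attach the first symbol line that follows a CPU line.
        if m and current is not None and current["symbol"] is None:
            current["symbol"] = m["sym"]
    if current is not None:
        yield current


def hot_spots(lines):
    """Count how often each (task name, symbol) pair shows up."""
    return Counter((r["comm"], r["symbol"]) for r in parse_trace(lines))


# The four log lines quoted in this message, used as sample input.
SAMPLE = """\
Nov 2 16:25:30 titan kernel: [ 978.134874] CPU[ 1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
Nov 2 16:25:30 titan kernel: [ 978.257809] TPC[_write_unlock_irq+0x20/0x110]
Nov 2 16:25:30 titan kernel: [ 978.507778] CPU[ 3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
Nov 2 16:25:30 titan kernel: [ 978.630707] TPC[cheetah_xcall_deliver+0x174/0x23c]
""".splitlines()

if __name__ == "__main__":
    for (comm, symbol), count in hot_spots(SAMPLE).most_common():
        print(f"{count:4d}  {comm:12s} {symbol}")
```

Keep in mind that the symbol is only the nearest preceding one, so, as noted in the quoted reply, it can point at the wrong function; the raw TPC value is kept in each record for checking against the disassembly.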
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 04 Nov 2007 20:55:20 +0100

> So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.

The http module is stuck in a different place, I'll try to see if I can make sense of it.
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz [EMAIL PROTECTED]
> Date: Sun, 04 Nov 2007 20:55:20 +0100
>
>> So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.
>
> The http module is stuck in a different place, I'll try to see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running through the part which usually crashed, so it should be possible to run it in a loop...

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
> In the meantime I'll build an aptitude which should exit after running through the part which usually crashed, so it should be possible to run it in a loop...

This was successful - it made crashing the machine pretty simple, even without libnss-db activated.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done

Some of the aptitudes hit a SIGABRT before one got stuck.

Best regards,
Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

aptitude.diff
Description: application/pgp-keys

aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data
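If it helps anyone reproducing this, the shell loop above could also be driven from a small script that notices when a run stops making progress and records how the earlier runs exited. This is only a sketch under assumptions: the function name `run_until_stuck`, the 600 second timeout and the polling interval are arbitrary choices for illustration, not values from this thread.

```python
import signal
import subprocess
import sys
import time


def run_until_stuck(cmd, timeout_s, max_runs=None, poll_s=0.5):
    """Run cmd repeatedly until one run exceeds timeout_s.

    Returns (runs_completed, exit_codes, stuck_pid). stuck_pid is None
    if max_runs was reached without any run appearing to hang.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        proc = subprocess.Popen(cmd)
        deadline = time.monotonic() + timeout_s
        while proc.poll() is None:
            if time.monotonic() > deadline:
                # Deliberately no kill()/wait() here: the processes in this
                # thread were unkillable, so waiting on one would hang too.
                return runs, exit_codes, proc.pid
            time.sleep(poll_s)
        # On POSIX a negative return code means "killed by that signal".
        exit_codes.append(proc.returncode)
        runs += 1
    return runs, exit_codes, None


if __name__ == "__main__" and len(sys.argv) > 1:
    # timeout_s=600 is an assumed value; pick one well above a normal run.
    runs, codes, stuck_pid = run_until_stuck(sys.argv[1:], timeout_s=600)
    aborted = sum(1 for c in codes if c == -signal.SIGABRT)
    print(f"{runs} runs finished, {aborted} died with SIGABRT, stuck pid: {stuck_pid}")
```

Saved under a hypothetical name such as repro.py, it would be started as `python3 repro.py aptitude -u`. The reported PID can then be looked up in the sysrq output.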