Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 06 Nov 2007 04:51:07 +0100

> Here's also some output from apt-get which got stuck in my unstable chroot while I wanted to retrieve the klibc source to try to debug it...

So the good news is that I started getting the hang seen on the Debian buildd on my workstation.

The bad news is that it's very sporadic, for a while I could trigger it during bootup, on every boot, and now I can't get it to wedge at all.

Anyways, we're getting closer.

-
To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz [EMAIL PROTECTED]
> Date: Tue, 06 Nov 2007 04:51:07 +0100
>
>> Here's also some output from apt-get which got stuck in my unstable chroot while I wanted to retrieve the klibc source to try to debug it...
>
> So the good news is that I started getting the hang seen on the Debian buildd on my workstation.
>
> The bad news is that it's very sporadic, for a while I could trigger it during bootup, on every boot, and now I can't get it to wedge at all.
>
> Anyways, we're getting closer.

Running stress -c 2 on a 4-CPU machine made things much worse here; it will probably help you trigger the bug, too. Our US II machine is also running just fine at the moment.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
>> So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.
>
> The http module is stuck in a different place, I'll try to see if I can make sense of it.

Here's also some output from apt-get which got stuck in my unstable chroot while I wanted to retrieve the klibc source to try to debug it...

Nov 6 04:43:19 titan kernel: [100896.376237] SysRq : Show Global CPU Regs
Nov 6 04:43:19 titan kernel: [100896.423254] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov 6 04:43:19 titan kernel: [100896.544064] CPU[ 1]: TSTATE[004411009602] TPC[0067b59c] TNPC[0067b5a0] TASK[swapper:0]
Nov 6 04:43:19 titan kernel: [100896.663869] TPC[schedule+0x5f8/0x7a4]
Nov 6 04:43:19 titan kernel: [100896.722179] O7[schedule+0x5cc/0x7a4]
Nov 6 04:43:19 titan kernel: [100896.779474] I7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:19 titan kernel: [100896.834677] CPU[ 2]: TSTATE[009911009601] TPC[0042888c] TNPC[00428890] TASK[swapper:0]
Nov 6 04:43:19 titan kernel: [100896.954474] TPC[cpu_idle+0x80/0xb8]
Nov 6 04:43:19 titan kernel: [100897.010715] O7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:19 titan kernel: [100897.065932] I7[after_lock_tlb+0x19c/0x1b0]
Nov 6 04:43:19 titan kernel: [100897.129468] CPU[ 3]: TSTATE[004411009602] TPC[0053a0c4] TNPC[0053a0c8] TASK[apt-get:11759]
Nov 6 04:43:19 titan kernel: [100897.253443] TPC[__first_cpu+0x4/0x28]
Nov 6 04:43:19 titan kernel: [100897.311767] O7[__delay+0x28/0x48]
Nov 6 04:43:20 titan kernel: [100897.365923] I7[cheetah_xcall_deliver+0x1c0/0x23c]
Nov 6 04:43:31 titan kernel: [100909.020406] SysRq : Show Global CPU Regs
Nov 6 04:43:31 titan kernel: [100909.067374] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov 6 04:43:31 titan kernel: [100909.188209] CPU[ 1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 6 04:43:32 titan kernel: [100909.308013] TPC[update_stats_wait_end+0x24/0x88]
Nov 6 04:43:32 titan kernel: [100909.377808] O7[sched_clock+0x10/0x30]
Nov 6 04:43:32 titan kernel: [100909.436116] I7[pick_next_task_fair+0x24/0x44]
Nov 6 04:43:32 titan kernel: [100909.502782] CPU[ 2]: TSTATE[009911009601] TPC[0042888c] TNPC[00428890] TASK[swapper:0]
Nov 6 04:43:32 titan kernel: [100909.622580] TPC[cpu_idle+0x80/0xb8]
Nov 6 04:43:32 titan kernel: [100909.678817] O7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:32 titan kernel: [100909.734029] I7[after_lock_tlb+0x19c/0x1b0]
Nov 6 04:43:32 titan kernel: [100909.797570] CPU[ 3]: TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] TASK[apt-get:11759]
Nov 6 04:43:32 titan kernel: [100909.921536] TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 6 04:43:32 titan kernel: [100909.993401] O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 6 04:43:32 titan kernel: [100910.063193] I7[flush_dcache_page_all+0x178/0x240]
Nov 6 04:43:33 titan kernel: [100910.766366] SysRq : Show Global CPU Regs
Nov 6 04:43:33 titan kernel: [100910.813292] * CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov 6 04:43:33 titan kernel: [100910.934129] CPU[ 1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 6 04:43:33 titan kernel: [100911.053923] TPC[update_stats_wait_end+0x24/0x88]
Nov 6 04:43:33 titan kernel: [100911.123706] O7[sched_clock+0x10/0x30]
Nov 6 04:43:33 titan kernel: [100911.182037] I7[pick_next_task_fair+0x24/0x44]
Nov 6 04:43:33 titan kernel: [100911.248702] CPU[ 2]: TSTATE[004411009601] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 6 04:43:34 titan kernel: [100911.368498] TPC[cpu_idle+0x94/0xb8]
Nov 6 04:43:34 titan kernel: [100911.424738] O7[cpu_idle+0xa8/0xb8]
Nov 6 04:43:34 titan kernel: [100911.479949] I7[after_lock_tlb+0x19c/0x1b0]
Nov 6 04:43:34 titan kernel: [100911.543490] CPU[ 3]: TSTATE[11009601] TPC[0042fc44] TNPC[0042fbe8] TASK[apt-get:11759]
Nov 6 04:43:34 titan kernel: [100911.667456] TPC[udelay+0x14/0x1c]
Nov 6 04:43:34 titan kernel: [100911.721611] O7[udelay+0x10/0x1c]
Nov 6 04:43:34 titan kernel: [100911.774739] I7[flush_dcache_page_all+0x178/0x240]
Nov 6 04:43:35 titan kernel: [100912.474070] SysRq : Show Global CPU Regs
Nov 6 04:43:35 titan kernel: [100912.520982] *
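For anyone sifting through a pile of these dumps, a small script can group them per snapshot and per CPU. The following is a minimal sketch in Python and is not part of the original thread. It assumes the syslog-style layout shown above (one header line per CPU, optionally followed by TPC/O7/I7 symbol lines); the name parse_snapshots is made up for illustration.

```python
import re

# Per-CPU header line of a "Show Global CPU Regs" dump. The register fields
# can be empty in the archived logs, hence the `*` quantifiers.
CPU_RE = re.compile(
    r"CPU\[\s*(\d+)\]: TSTATE\[([0-9a-f]*)\] TPC\[([0-9a-f]*)\] "
    r"TNPC\[([0-9a-f]*)\] TASK\[([^\]:]+):(-?\d+)\]"
)
# Symbolic follow-up lines such as "TPC[schedule+0x5f8/0x7a4]".
SYM_RE = re.compile(r"\b(TPC|O7|I7)\[([A-Za-z_][\w.]*\+0x[0-9a-f]+/0x[0-9a-f]+)\]")


def parse_snapshots(lines):
    """Return a list of snapshots, each mapping cpu -> {task, pid, TPC, O7, I7}."""
    snapshots = []
    current_cpu = None
    for line in lines:
        if "Show Global CPU Regs" in line:
            snapshots.append({})      # a new dump starts here
            current_cpu = None
            continue
        if not snapshots:
            continue                  # ignore anything before the first dump
        header = CPU_RE.search(line)
        if header:
            current_cpu = int(header.group(1))
            snapshots[-1][current_cpu] = {
                "task": header.group(5),
                "pid": int(header.group(6)),
            }
            continue
        symbol = SYM_RE.search(line)
        if symbol and current_cpu is not None:
            snapshots[-1][current_cpu][symbol.group(1)] = symbol.group(2)
    return snapshots


# Four lines copied from the first dump above, plus its header.
SAMPLE = """\
Nov 6 04:43:19 titan kernel: [100896.376237] SysRq : Show Global CPU Regs
Nov 6 04:43:19 titan kernel: [100897.129468] CPU[ 3]: TSTATE[004411009602] TPC[0053a0c4] TNPC[0053a0c8] TASK[apt-get:11759]
Nov 6 04:43:19 titan kernel: [100897.253443] TPC[__first_cpu+0x4/0x28]
Nov 6 04:43:19 titan kernel: [100897.311767] O7[__delay+0x28/0x48]
Nov 6 04:43:20 titan kernel: [100897.365923] I7[cheetah_xcall_deliver+0x1c0/0x23c]
""".splitlines()

snaps = parse_snapshots(SAMPLE)
for cpu, info in sorted(snaps[0].items()):
    print(cpu, info)
```

On the sample, CPU 3 comes out as apt-get with TPC in __first_cpu, O7 in __delay and I7 in cheetah_xcall_deliver, which matches the cross-call delivery path visible in the later dumps.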
Re: unkillable dpkg-query processes
From: Josip Rodin [EMAIL PROTECTED]
Date: Fri, 2 Nov 2007 17:21:06 +0100

> Great. Here you go, three of them, while the load was 3 and this process was stuck:
>
> buildd 10813 100 0.8 987368 17504 ? RN 14:44 155:49 dpkg-query --search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3
 ...
> Nov 2 17:02:04 lebrun kernel: CPU[ 0]: TSTATE[80009604] TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
> Nov 2 17:02:04 lebrun kernel: TPC[sparc64_realfault_common+0x8/0x20]

It looks like dpkg-query is stuck on a page fault.

Typically this means the fault processing is not putting a valid translation into the TLB to satisfy the fault, so we loop forever, never making forward progress.

I've had to debug something similar to this before, so I'll piece together a debugging patch you can use to get more information.
Re: unkillable dpkg-query processes
> Ok, the key in the trace is:
>
> Nov 2 16:25:30 titan kernel: [ 978.134874] CPU[ 1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
> Nov 2 16:25:30 titan kernel: [ 978.257809] TPC[_write_unlock_irq+0x20/0x110]
> ...
> Nov 2 16:25:30 titan kernel: [ 978.507778] CPU[ 3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
> Nov 2 16:25:30 titan kernel: [ 978.630707] TPC[cheetah_xcall_deliver+0x174/0x23c]
>
> The first symbol is misleading, it says _write_unlock_irq but actually in the assembler the PC is in the spinlock read spinning loop section. So actually it's hanging in _spin_lock().
>
> CPU #3 is trying to send a cross-call message interrupt, but for some reason that isn't making forward progress.
>
> Let's see what's calling these things by adding some more debugging information. Please retry the test with the following patch on top of the original sysrq-g debugging patch and please get new logs when it hangs.

Today I was a bit out of luck: either the machine crashed so badly that it didn't react to anything anymore, or it didn't crash at all.

The machine went amok a bit more slowly when I did the following things, which also resulted in the attached sysrq output.

- run stress -c 2 to get the load up (I didn't need that the last time...)
- run something like `while true; do echo g > /proc/sysrq-trigger; sleep 0.5; done`
- run aptitude -u several times until the machine died.

So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on - but it always took longer until it crashed there, so we'll see if it happens at all.

Thanks for your work,

Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

sysrq2.txt
Description: application/pgp-keys
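The remark quoted above, that the reported symbol can be misleading, follows from how address-to-symbol lookup generally works: an address is named after the nearest symbol at or below it, and the reported "size" is simply the gap to the next symbol. Below is a minimal sketch of that idea in Python. It is an illustration only, not the kernel's implementation; the addresses and the next_symbol entry are invented, and only the symbol names come from the thread.

```python
import bisect

# Hypothetical symbol table, sorted by start address. The addresses are
# invented for illustration; only the symbol names appear in the thread.
SYMBOLS = [
    (0x1000, "_spin_lock"),
    (0x1040, "_write_unlock_irq"),  # real body assumed to be only 0x20 bytes
    (0x1150, "next_symbol"),        # hypothetical following symbol
]


def resolve(pc, symbols=SYMBOLS):
    """Name `pc` after the nearest symbol at or below it."""
    starts = [addr for addr, _ in symbols]
    i = bisect.bisect_right(starts, pc) - 1
    if i < 0 or i + 1 >= len(symbols):
        return None  # below the first symbol, or nothing to bound the size
    start, name = symbols[i]
    size = symbols[i + 1][0] - start  # "size" is just the gap to the next symbol
    return f"{name}+{pc - start:#x}/{size:#x}"


# A PC inside unlabeled out-of-line code placed right after the body of
# _write_unlock_irq is still reported under that name.
print(resolve(0x1060))
```

With these made-up addresses the lookup prints `_write_unlock_irq+0x20/0x110`: the spin loop of a different function, emitted out of line after `_write_unlock_irq`, borrows that symbol's name.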
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 04 Nov 2007 20:55:20 +0100

> So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.

The http module is stuck in a different place, I'll try to see if I can make sense of it.
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz [EMAIL PROTECTED]
> Date: Sun, 04 Nov 2007 20:55:20 +0100
>
>> So I'm not sure if the result is really useful for you - if not just let me know. I've attached the last ~10-20 sysrq-g outputs - as it was running in a loop I have a ton of them. In case you're wondering: http is aptitude's http method.
>
> The http module is stuck in a different place, I'll try to see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running through the part which usually crashed, so it should be possible to run it in a loop...

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
> In the meantime I'll build an aptitude which should exit after running through the part which usually crashed, so it should be possible to run it in a loop...

This was successful - it made crashing the machine pretty simple, even without libnss-db activated.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done

Some of the aptitudes hit a SIGABRT before one got stuck.

Best regards,

Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

aptitude.diff
Description: application/pgp-keys

aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Fri, 02 Nov 2007 16:37:25 +0100

> I've sent g several times to sysrq, output is attached. According to top the two hanging aptitude processes were running on CPU 1 + 3.
>
> 3204 root 20 0 19552 5088 4072 R 100 0.1 6:54.49 1 aptitude
> 3203 root 20 0 19552 5088 4072 R 100 0.1 6:56.39 3 aptitude

Ok, the key in the trace is:

Nov 2 16:25:30 titan kernel: [ 978.134874] CPU[ 1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
Nov 2 16:25:30 titan kernel: [ 978.257809] TPC[_write_unlock_irq+0x20/0x110]
 ...
Nov 2 16:25:30 titan kernel: [ 978.507778] CPU[ 3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
Nov 2 16:25:30 titan kernel: [ 978.630707] TPC[cheetah_xcall_deliver+0x174/0x23c]

The first symbol is misleading, it says _write_unlock_irq but actually in the assembler the PC is in the spinlock read spinning loop section. So actually it's hanging in _spin_lock().

CPU #3 is trying to send a cross-call message interrupt, but for some reason that isn't making forward progress.

Let's see what's calling these things by adding some more debugging information. Please retry the test with the following patch on top of the original sysrq-g debugging patch and please get new logs when it hangs.

Thanks!

--- arch/sparc64/kernel/process.c.ORIG	2007-11-03 20:53:27.0 -0700
+++ arch/sparc64/kernel/process.c	2007-11-03 21:05:47.0 -0700
@@ -49,6 +49,7 @@
 #include <asm/hypervisor.h>
 #include <asm/sstate.h>
 #include <asm/irq_regs.h>
+#include <asm/smp.h>
 
 /* #define VERBOSE_SHOWREGS */
 
@@ -394,7 +395,11 @@ struct global_reg_snapshot {
 	unsigned long		tstate;
 	unsigned long		tpc;
 	unsigned long		tnpc;
+	unsigned long		o7;
+	unsigned long		i7;
 	struct thread_info	*thread;
+	unsigned long		pad1;
+	unsigned long		pad2;
 } global_reg_snapshot[NR_CPUS];
 static DEFINE_SPINLOCK(global_reg_snapshot_lock);
 
@@ -413,6 +418,8 @@ static void sysrq_handle_globreg(int key
 		global_reg_snapshot[cpu].tstate = regs->tstate;
 		global_reg_snapshot[cpu].tpc = regs->tpc;
 		global_reg_snapshot[cpu].tnpc = regs->tnpc;
+		global_reg_snapshot[cpu].o7 = regs->u_regs[UREG_I7];
+		global_reg_snapshot[cpu].i7 = 0;
 	} else {
 		global_reg_snapshot[cpu].tstate = 0;
 		global_reg_snapshot[cpu].tpc = 0;
@@ -432,9 +439,19 @@ static void sysrq_handle_globreg(int key
 		       ((tp && tp->task) ? tp->task->comm : "NULL"),
 		       ((tp && tp->task) ? tp->task->pid : -1));
 #ifdef CONFIG_KALLSYMS
-		if ((gp->tstate & TSTATE_PRIV) && (gp->tpc != 0UL)) {
-			sprint_symbol(buffer, gp->tpc);
-			printk(" TPC[%s]\n", buffer);
+		if (gp->tstate & TSTATE_PRIV) {
+			if (gp->tpc != 0UL) {
+				sprint_symbol(buffer, gp->tpc);
+				printk(" TPC[%s]\n", buffer);
+			}
+			if (gp->o7 != 0UL) {
+				sprint_symbol(buffer, gp->o7);
+				printk(" O7[%s]\n", buffer);
+			}
+			if (gp->i7 != 0UL) {
+				sprint_symbol(buffer, gp->i7);
+				printk(" I7[%s]\n", buffer);
+			}
 		}
 #endif
 	}
--- arch/sparc64/mm/ultra.S.ORIG	2007-11-03 20:53:27.0 -0700
+++ arch/sparc64/mm/ultra.S	2007-11-03 20:57:12.0 -0700
@@ -528,7 +528,7 @@ xcall_fetch_glob_regs:
 	sethi		%hi(global_reg_snapshot), %g1
 	or		%g1, %lo(global_reg_snapshot), %g1
 	__GET_CPUID(%g2)
-	sllx		%g2, 5, %g3
+	sllx		%g2, 6, %g3
 	add		%g1, %g3, %g1
 	rdpr		%tstate, %g7
 	stx		%g7, [%g1 + 0x00]
@@ -536,12 +536,14 @@ xcall_fetch_glob_regs:
 	stx		%g7, [%g1 + 0x08]
 	rdpr		%tnpc, %g7
 	stx		%g7, [%g1 + 0x10]
+	stx		%o7, [%g1 + 0x18]
+	stx		%i7, [%g1 + 0x20]
 	sethi		%hi(trap_block), %g7
 	or		%g7, %lo(trap_block), %g7
 	sllx		%g2, TRAP_BLOCK_SZ_SHIFT, %g2
 	add		%g7, %g2, %g7
 	ldx		[%g7 + TRAP_PER_CPU_THREAD], %g3
-	stx		%g3, [%g1 + 0x18]
+	stx		%g3, [%g1 + 0x28]
 	retry
 
 #ifdef DCACHE_ALIASING_POSSIBLE
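The assembler half of the patch has to stay in step with the C structure: the shift used to index the per-CPU array changes from 5 to 6 and the store offset of the thread pointer moves from 0x18 to 0x28. The sketch below re-derives those numbers in Python with ctypes. It is only a cross-check under the assumption that unsigned long and pointers are 8 bytes on sparc64; the class and function names are made up for illustration.

```python
import ctypes

U64 = ctypes.c_uint64  # stands in for sparc64's 8-byte unsigned long and pointers


class OldSnapshot(ctypes.Structure):
    """struct global_reg_snapshot before the patch."""
    _fields_ = [("tstate", U64), ("tpc", U64), ("tnpc", U64), ("thread", U64)]


class NewSnapshot(ctypes.Structure):
    """struct global_reg_snapshot after the patch (o7, i7 and padding added)."""
    _fields_ = [
        ("tstate", U64), ("tpc", U64), ("tnpc", U64),
        ("o7", U64), ("i7", U64), ("thread", U64),
        ("pad1", U64), ("pad2", U64),
    ]


def shift_for(struct):
    """Shift amount that turns a CPU id into a byte offset into the array."""
    size = ctypes.sizeof(struct)
    # Indexing with a shift only works if the element size is a power of two,
    # which is presumably why the patch pads the structure to 64 bytes.
    assert size & (size - 1) == 0
    return size.bit_length() - 1


print("old shift:", shift_for(OldSnapshot))
print("new shift:", shift_for(NewSnapshot))
for name, _ in NewSnapshot._fields_:
    print(name, hex(getattr(NewSnapshot, name).offset))
```

The old structure is 32 bytes (shift 5) and the new one 64 bytes (shift 6), with o7 at 0x18, i7 at 0x20 and thread at 0x28, which agrees with the stx offsets in the ultra.S hunk.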
Re: unkillable dpkg-query processes
David Miller wrote:
> From: David Miller [EMAIL PROTECTED]
> Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)
>
>> I'm working on a kernel patch for 2.6.23 that will allow you to get some useful debugging information in situations like this. I'll try to get you that patch by the end of tonight.
>
> As promised, here is the patch below.

Thanks for the patch. Applied and used libnss-db + aptitude -u to hang the machine.

I've sent g several times to sysrq, output is attached. According to top the two hanging aptitude processes were running on CPU 1 + 3.

3204 root 20 0 19552 5088 4072 R 100 0.1 6:54.49 1 aptitude
3203 root 20 0 19552 5088 4072 R 100 0.1 6:56.39 3 aptitude

Cheers,

Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

sysrq-g.txt
Description: application/pgp-keys
Re: unkillable dpkg-query processes
On Thu, Nov 01, 2007 at 09:55:44PM -0700, David Miller wrote:
> > I'm working on a kernel patch for 2.6.23 that will allow you to get some useful debugging information in situations like this. I'll try to get you that patch by the end of tonight.
>
> As promised, here is the patch below.

> echo g > /proc/sysrq-trigger

> So when you get a stuck process or whatever, trigger this and send the output :-)

Great. Here you go, three of them, while the load was 3 and this process was stuck:

buildd 10813 100 0.8 987368 17504 ? RN 14:44 155:49 dpkg-query --search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3

--
     2. That which causes joy or happiness.

Nov 2 17:01:52 lebrun kernel: SysRq : Show Global CPU Regs
Nov 2 17:01:52 lebrun kernel: CPU[ 0]: TSTATE[] TPC[] TNPC[] TASK[NULL:-1]
Nov 2 17:01:52 lebrun kernel: TPC[sparc64_realfault_common+0x8/0x20]
Nov 2 17:01:52 lebrun kernel: * CPU[ 1]: TSTATE[] TPC[] TNPC[] TASK[sh:12919]
Nov 2 17:02:04 lebrun kernel: SysRq : Show Global CPU Regs
Nov 2 17:02:04 lebrun kernel: CPU[ 0]: TSTATE[80009604] TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
Nov 2 17:02:04 lebrun kernel: TPC[sparc64_realfault_common+0x8/0x20]
Nov 2 17:02:04 lebrun kernel: * CPU[ 1]: TSTATE[] TPC[] TNPC[] TASK[sh:12928]
Nov 2 17:17:02 lebrun kernel: SysRq : Show Global CPU Regs
Nov 2 17:17:02 lebrun kernel: CPU[ 0]: TSTATE[] TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
Nov 2 17:17:02 lebrun kernel: TPC[sparc64_realfault_common+0x8/0x20]
Nov 2 17:17:02 lebrun kernel: * CPU[ 1]: TSTATE[] TPC[] TNPC[] TASK[sh:16444]
Re: unkillable dpkg-query processes
Hi,

lebrun.d.o hasn't crashed in a while now, but it has this in the process list:

buildd    2382  0.0  0.2    8144  4736 ?  Ss   Oct30    0:00 /usr/bin/perl /usr/bin/buildd
buildd    2407  0.0  0.5   13920 11296 ?  SN   Oct30    0:10  \_ /usr/bin/perl /usr/bin/sbuild --batch --stats-dir=/home/buildd/
buildd   18174  0.0  0.0       0     0 ?  ZNs  Oct30    0:00      \_ [su] <defunct>
buildd   23305  100  1.6 1007296 33288 ?  RN   Oct30 3507:30 dpkg-query --status squashfs-source

At the same time:

% free
             total       used       free     shared    buffers     cached
Mem:       2073040    2021224      51816          0     196808      21144
-/+ buffers/cache:    1803272     269768
Swap:      1048688        104    1048584
% uptime
 22:38:36 up 2 days, 10:53, 1 user, load average: 3.00, 3.01, 3.00

Given that it's still not catatonic, can I do something to provide some debugging information?

(BTW, I'm subscribed to the sparclinux list now.)

--
     2. That which causes joy or happiness.
Re: unkillable dpkg-query processes
The futex() calls are definitely from libnss-db. And on Lenny/testing we have futex calls from libc6. Didn't have the time to come up with any instructions yet as we have public holidays today, I'll try to finish them tomorrow.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
From: David Miller [EMAIL PROTECTED]
Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)

> I'm working on a kernel patch for 2.6.23 that will allow you to get some useful debugging information in situations like this. I'll try to get you that patch by the end of tonight.

As promised, here is the patch below.

To trigger the debugging log, simply give the console an Alt-SysRq then a g. On a serial console you can do this by giving a single BREAK then a g.

If you're having trouble triggering the sysrq on the console, try instead:

	bash# echo g > /proc/sysrq-trigger

Here is some sample output from my Niagara-2 system while running a benchmark. The current CPU is denoted by the leading * character.

[81940.250994] SysRq : Show Global CPU Regs
[81940.251800] * CPU[ 0]: TSTATE[e2001602] TPC[0055813c] TNPC[00558140] TASK[dd:2940]
[81940.252206] TPC[NGbzero_loop+0x1c/0x38]
[81940.252422] CPU[ 1]: TSTATE[004411001607] TPC[0055c9bc] TNPC[0055c9c0] TASK[dd:2926]
[81940.252739] TPC[atomic_sub_ret+0x4/0x30]
[81940.252936] CPU[ 2]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2899]
[81940.253238] TPC[NG2copy_to_user+0x46c/0x680]
[81940.253451] CPU[ 3]: TSTATE[e2001602] TPC[00558130] TNPC[00558134] TASK[dd:2929]
[81940.253776] TPC[NGbzero_loop+0x10/0x38]
[81940.253993] CPU[ 4]: TSTATE[e2001602] TPC[00558124] TNPC[00558128] TASK[dd:2947]
[81940.254325] TPC[NGbzero_loop+0x4/0x38]
[81940.254497] CPU[ 5]: TSTATE[004411001606] TPC[00495f94] TNPC[00495f98] TASK[dd:2908]
[81940.254893] TPC[do_generic_mapping_read+0xbc/0x428]
[81940.255203] CPU[ 6]: TSTATE[11001607] TPC[0055fee8] TNPC[0055feec] TASK[dd:2920]
[81940.255699] TPC[NG2copy_to_user+0x468/0x680]
[81940.256104] CPU[ 7]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2935]
[81940.256574] TPC[NG2copy_to_user+0x46c/0x680]
[81940.256972] CPU[ 8]: TSTATE[e2001602] TPC[00558124] TNPC[00558128] TASK[dd:2903]
[81940.257399] TPC[NGbzero_loop+0x4/0x38]
[81940.257899] CPU[ 9]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2904]
[81940.258240] TPC[NG2copy_to_user+0x46c/0x680]
[81940.258482] CPU[ 10]: TSTATE[e2001602] TPC[00558138] TNPC[0055813c] TASK[dd:2902]
[81940.258808] TPC[NGbzero_loop+0x18/0x38]
[81940.258999] CPU[ 11]: TSTATE[e2001602] TPC[00558120] TNPC[00558124] TASK[dd:2941]
[81940.259319] TPC[NGbzero_loop+0x0/0x38]
[81940.259487] CPU[ 12]: TSTATE[e2001602] TPC[00558130] TNPC[00558134] TASK[dd:2919]
[81940.259801] TPC[NGbzero_loop+0x10/0x38]
[81940.260012] CPU[ 13]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2950]
[81940.260350] TPC[NG2copy_to_user+0x46c/0x680]
[81940.260564] CPU[ 14]: TSTATE[e2001602] TPC[00558134] TNPC[00558138] TASK[dd:2936]
[81940.260937] TPC[NGbzero_loop+0x14/0x38]
[81940.261150] CPU[ 15]: TSTATE[11001607] TPC[0055fee8] TNPC[0055feec] TASK[dd:2905]
[81940.261457] TPC[NG2copy_to_user+0x468/0x680]
[81940.261677] CPU[ 16]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2923]
[81940.261973] TPC[NG2copy_to_user+0x46c/0x680]
[81940.262167] CPU[ 17]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2897]
[81940.262462] TPC[NG2copy_to_user+0x46c/0x680]
[81940.262643] CPU[ 18]: TSTATE[e2001602] TPC[00558128] TNPC[0055812c] TASK[dd:2909]
[81940.262987] TPC[NGbzero_loop+0x8/0x38]
[81940.263180] CPU[ 19]: TSTATE[11001607] TPC[0055fee8] TNPC[0055feec] TASK[dd:2913]
[81940.263500] TPC[NG2copy_to_user+0x468/0x680]
[81940.263901] CPU[ 20]: TSTATE[e2001602] TPC[00558128] TNPC[0055812c] TASK[dd:2890]
[81940.264403] TPC[NGbzero_loop+0x8/0x38]
[81940.264679] CPU[ 21]: TSTATE[11001607] TPC[0055fee8] TNPC[0055feec] TASK[dd:2906]
[81940.265152] TPC[NG2copy_to_user+0x468/0x680]
[81940.265535] CPU[ 22]: TSTATE[11001607] TPC[0055feec] TNPC[0055fef0] TASK[dd:2918]
[81940.266075] TPC[NG2copy_to_user+0x46c/0x680]
[81940.266448] CPU[ 23]: TSTATE[11001607] TPC[0055fee8] TNPC[0055feec] TASK[dd:2900]
[81940.266942] TPC[NG2copy_to_user+0x468/0x680]
[81940.267328] CPU[ 24]: TSTATE[11001602] TPC[0049a618] TNPC[0049a61c] TASK[dd:2938]
[81940.267710]
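The /proc/sysrq-trigger route described above can also be scripted when several snapshots in a row are wanted, as in the shell loop (with a 0.5 second sleep) used elsewhere in this thread. The following Python sketch does the same thing. It is an illustration, not something posted on the list: the function names are made up, writing to the real trigger file needs root, and the g key only produces the register dump on a kernel carrying this debugging patch.

```python
import time


def trigger_sysrq(key="g", path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the trigger file."""
    with open(path, "w") as trigger:
        trigger.write(key)


def snapshot_loop(count, interval=0.5, path="/proc/sysrq-trigger"):
    """Request `count` register dumps, pausing `interval` seconds after each."""
    for _ in range(count):
        trigger_sysrq("g", path)
        time.sleep(interval)
```

A call such as `snapshot_loop(20)` would request twenty dumps over roughly ten seconds; the output itself still lands in the kernel log, not in the script. The `path` parameter exists so the helper can be tried against an ordinary file first.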
Re: unkillable dpkg-query processes
From: Josip Rodin [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 00:37:13 +0100

> I'd try doing a debootstrap of lenny (that's Debian testing), and then inside it, run one or more of those 'dpkg-query -S libc.so.6'.

Thanks for the info.

While waiting for you to reply I created a lenny buildd build root on my SunFire 280R using:

	debootstrap --variant=buildd lenny /org/buildd/chroots/lenny \
		http://mirrors.kernel.org/debian

roughly following the instructions at:

	http://www.debian.org/devel/buildd/setting-up

And then once chroot'ed into the lenny build root you have to set up a few things manually, like the /proc, /sys, and /dev/pts mounts, for anything to work:

	chroot /org/buildd/chroots/lenny
	mount -t proc none /proc
	mount -t sysfs none /sys
	mount -t devpts none /dev/pts

So, it's a lot more than just running the appropriate debootstrap command.

I have done a GCC package build and am now running a libc6 build under this lenny chroot and haven't hit any problems yet. This is with a stock 2.6.23.1 kernel.

BTW, in your buildroot, can you do something like:

	strace -o x.log dpkg-query -S libc.so.6

and send me that x.log file? That might give some important clues.

Thanks.
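The build-root recipe above can be written down as data so that it is easy to repeat for another path or suite. The Python sketch below only prints the commands and runs nothing. It is an illustration with a made-up function name, and prefixing each mount with `chroot ROOT` (instead of entering the chroot interactively as in the message) is an assumed, scriptable variation.

```python
import shlex


def chroot_setup_commands(root, suite="lenny",
                          mirror="http://mirrors.kernel.org/debian"):
    """Return the setup commands as shell strings, without running them."""
    commands = [
        ["debootstrap", "--variant=buildd", suite, root, mirror],
        # The message runs these inside an interactive chroot; prefixing each
        # with `chroot ROOT` is an assumed equivalent that can be scripted.
        ["chroot", root, "mount", "-t", "proc", "none", "/proc"],
        ["chroot", root, "mount", "-t", "sysfs", "none", "/sys"],
        ["chroot", root, "mount", "-t", "devpts", "none", "/dev/pts"],
    ]
    return [shlex.join(command) for command in commands]


for line in chroot_setup_commands("/org/buildd/chroots/lenny"):
    print(line)
```

Keeping the steps as argument lists avoids quoting mistakes if the root path ever contains spaces, since shlex.join quotes each argument as needed.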
Re: unkillable dpkg-query processes
Hi,

(Sorry for breaking the threading - I didn't subscribe to the list, I just found this in the web archive. I should probably subscribe... :)

David Miller wrote:
> Ok, since I have a 280R just like Josip, I think a good plan is for him to show me the commands he used to create the build root where he can trigger bad things.

I can't be 100% sure, because it was James Troup who initially set it up, but I believe that the chroot on lebrun.d.o was set up by just doing something mundane like running debootstrap, more specifically something like this:

sudo debootstrap lenny /mnt http://ftp.us.debian.org/debian

I conclude this because it has a var/log/bootstrap.log in it, dated 2007-06-19 12:15, which has:

Selecting previously deselected package base-files.
(Reading database ... 0 files and directories currently installed.)
Unpacking base-files (from .../base-files_4.0.0_sparc.deb) ...
[...]
Setting up build-essential (11.3) ...

And it also has a var/log/dpkg.log which has:

2007-06-19 12:13:10 install base-files <none> 4.0.0
[...]
2007-06-19 12:15:23 status installed build-essential 11.3

Again I can't be 100% sure of the exact command line used, but that really should be it :)

After that, dpkg.log in the chroot also has a purge of the 'procps' package, and an installation of the 'sparc-utils' package. A few hours after those two, a random selection of package installations starts - the buildd went online.

I'd try doing a debootstrap of lenny (that's Debian testing), and then inside it, run one or more of those 'dpkg-query -S libc.so.6'.

--
     2. That which causes joy or happiness.
Re: unkillable dpkg-query processes
> mount -t devpts none /dev/pts

mount --bind /dev /thechroot/dev
is what I use here, running udev in a chroot is no fun.

> So, it's a lot more than just running the appropriate debootstrap command.

I'm almost done with a howto which is 95% cut&paste to debootstrap and boot a Debian system. Unfortunately it doesn't boot, as the klibc (which is used in the initramfs) is broken on sparc again... So I'll modify it to set up a proper chroot only; it should also allow booting into it if you use the kernel/initrd from Ubuntu. This should allow Josip and you to set up a complete chroot.

> I have done a GCC package build and am now running a libc6 build under this lenny chroot and haven't hit any problems yet.

The following things also like to crash here (on Etch, not in a chroot):
- running aptitude -u several times (at least with libnss-db installed)
- since I've installed 2.6.24-rc1: vgdisplay (with and without active libnss-db)

> BTW, in your buildroot, can you do something like:
>
> strace -o x.log dpkg-query -S libc.so.6

There are some comparisons of the strace of aptitude -u in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
Probably interesting, as there are futexes in the game. The interesting thing is that it didn't crash the machine while running under strace.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
> Here you go. (Mind, this is capturing the current status of the chroot, which is fairly unclean, because right now it happens to be building python-qt4-4.3.1.)

What we're missing here is a probably important piece: if dpkg-query is running during a build, it is running in a fakeroot environment. I've straced that, see the attachment. What I find in the strace are at least several clones, which is the point where aptitude -u crashed according to the straces in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/

execve(/usr/bin/fakeroot, [fakeroot, dpkg-query, -S, libc.so.6], [/* 12 vars */]) = 0
brk(0) = 0xca000
uname({sys=Linux, node=titan, ...}) = 0
access(/etc/ld.so.nohwcap, F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fba000
access(/etc/ld.so.preload, R_OK) = -1 ENOENT (No such file or directory)
open(/etc/ld.so.cache, O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=12402, ...}) = 0
mmap(NULL, 12402, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7fb4000
close(3) = 0
access(/etc/ld.so.nohwcap, F_OK) = -1 ENOENT (No such file or directory)
open(/lib/libncurses.so.5, O_RDONLY) = 3
read(3, \177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\263..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=208688, ...}) = 0
mmap(NULL, 208480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f8
mmap(0xf7fb, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0xf7fb
close(3) = 0
access(/etc/ld.so.nohwcap, F_OK) = -1 ENOENT (No such file or directory)
open(/lib/libdl.so.2, O_RDONLY) = 3
read(3, \177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\f..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=18216, ...}) = 0
mmap(NULL, 82432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f68000
mprotect(0xf7f6c000, 57344, PROT_NONE) = 0
mmap(0xf7f7a000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0xf7f7a000
close(3) = 0
access(/etc/ld.so.nohwcap, F_OK) = -1 ENOENT (No such file or directory)
open(/lib/libc.so.6, O_RDONLY) = 3
read(3, \177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\1\364..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1419756, ...}) = 0
mmap(NULL, 1489032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7dfc000
mprotect(0xf7f5, 65536, PROT_NONE) = 0
mmap(0xf7f6, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0xf7f6
mmap(0xf7f66000, 6280, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f66000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fde000
mprotect(0xf7f7a000, 8192, PROT_READ) = 0
munmap(0xf7fb4000, 12402) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open(/dev/tty, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 3
close(3) = 0
brk(0) = 0xca000
brk(0xec000) = 0xec000
getuid32() = 1000
getgid32() = 1000
geteuid32() = 1000
getegid32() = 1000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
time(NULL) = 1193705202
open(/proc/meminfo, O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fb8000
read(3, MemTotal: 8314712 kB\nMemFre..., 1024) = 624
close(3) = 0
munmap(0xf7fb8000, 8192) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
uname({sys=Linux, node=titan, ...}) = 0
stat64(/home/foo, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getpid() = 18537
getppid() = 18536
getpgrp() = 18536
rt_sigaction(SIGCHLD, {0x42460, [], 0}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open(/usr/bin/fakeroot,
Re: unkillable dpkg-query processes
>> mount -t devpts none /dev/pts
>> mount --bind /dev /thechroot/dev
>> is what I use here, running udev in a chroot is no fun.
>
> Ok.

AFAIK the buildds only have a minimal /dev, though. But to bootstrap a system that's usually not enough.

> Let's stick to 2.6.23 testing for pinpointing these bugs.

Ok. Do you have a .deb with a kernel for me? If not, would you like to have any specific options enabled? I would have to build one then.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
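Before starting a build inside the chroot it can help to confirm that both mounts are really in place. What follows is only a sketch: the helper name is_mounted is made up for illustration, /thechroot is the placeholder path from the message, and the optional second argument exists only so the check can be run against a saved copy of the mount table instead of /proc/mounts.

```shell
# Hypothetical helper: succeed if MOUNTPOINT appears as a mount point
# in the mount table (second field of /proc/mounts).
is_mounted() {
    # usage: is_mounted MOUNTPOINT [MOUNTS_FILE]
    # A trailing slash on MOUNTPOINT is dropped before comparing.
    awk -v mp="${1%/}" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example (commented out, paths are the placeholders from the mail):
#   is_mounted /thechroot/dev || echo "bind mount of /dev is missing" >&2
#   is_mounted /dev/pts       || echo "devpts is not mounted" >&2
```

The check compares whole fields, so /thechroot/dev does not accidentally match /thechroot/dev/pts.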
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 01:50:30 +0100

> What we're missing here is a probably important piece: if dpkg-query is running during a build, it is running in a fakeroot environment. I've straced that, see the attachment. What I find in the strace are at least several clones, which is the point where aptitude -u crashed according to the straces in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102

Thanks for the fakeroot trace.

I am pretty sure the clone()'s we see here are just normal fork()'s, in both the fakeroot's dpkg-query and the aptitude case.

I'll go study up on fakeroot's implementation to look for potential clues.
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 01:47:33 +0100

> mount -t devpts none /dev/pts
> mount --bind /dev /thechroot/dev
> is what I use here, running udev in a chroot is no fun.

Ok.

> I'm almost done with a howto which is 95% cut&paste to debootstrap and boot a Debian system. Unfortunately it doesn't boot, as the klibc (which is used in the initramfs) is broken on sparc again... So I'll modify it to set up a proper chroot only; it should also allow booting into it if you use the kernel/initrd from Ubuntu. This should allow Josip and you to set up a complete chroot.

Thanks. I have done a GCC package build and am now running a libc6 build under this lenny chroot and haven't hit any problems yet.

> The following things also like to crash here (on Etch, not in a chroot):
> - running aptitude -u several times (at least with libnss-db installed)
> - since I've installed 2.6.24-rc1: vgdisplay (with and without active libnss-db)

There are several issues with 2.6.24, stay away from it for now. I will fix things there. Let's stick to 2.6.23 testing for pinpointing these bugs.

> There are some comparisons of the strace of aptitude -u in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
> Probably interesting as there are futexes in the game.

Of course there are, as soon as you start using libnss-db there will be futexes.

Can you reproduce the aptitude problems under 2.6.23 with libnss-db disabled?

> The interesting thing is that it didn't crash the machine while running under strace.

If the futex problem I suspect is in fact the issue, strace'ing would definitely make that problem go away.
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 02:54:14 +0100

> Ok. Do you have a .deb with a kernel for me? If not - would you like to have any specific options enabled - I have to build one then.

I usually just cp arch/sparc64/defconfig ./.config in a fresh vanilla kernel tree and tweak from there.

For my 280R I enabled SMP, accepted the NR_CPUS default value (64), set SERIAL_SUNSAB to y and enabled console support, and then enabled the qlogic fibrechannel driver and the SUNGEM driver as modules.

Oh yes, I also enabled INITRD support so I can use initramfs to get the firmware loaded properly in the qlogic FC card.

Really, I don't use anything fancy, just enough to get the machine functional.
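The tweaks described above could also be applied without going through menuconfig. The following is a rough sketch and not David's actual procedure: set_kconfig is a hypothetical helper, and the symbol names in the commented example (SMP, SERIAL_SUNSAB, SUNGEM, BLK_DEV_INITRD) are assumed spellings of the options named in the message, so check them against the Kconfig files of the tree being built.

```shell
# Hypothetical helper: force one option in a kernel .config file.
set_kconfig() {
    # usage: set_kconfig FILE SYMBOL VALUE   (VALUE is y, m or n)
    file=$1 sym=$2 val=$3
    tmp=$(mktemp) || return 1
    # Drop any existing setting of the symbol, including the "is not set" form.
    grep -v -e "^CONFIG_${sym}=" -e "^# CONFIG_${sym} is not set\$" "$file" > "$tmp"
    if [ "$val" = n ]; then
        echo "# CONFIG_${sym} is not set" >> "$tmp"
    else
        echo "CONFIG_${sym}=${val}" >> "$tmp"
    fi
    # Write back in place so the permissions of the original file are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (commented out; run from the top of a kernel tree):
#   cp arch/sparc64/defconfig .config
#   set_kconfig .config SMP y
#   set_kconfig .config SERIAL_SUNSAB y
#   set_kconfig .config SUNGEM m
#   set_kconfig .config BLK_DEV_INITRD y
```

After editing the file this way, running make oldconfig lets the kernel build system resolve any dependent options and ask about new ones.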
Re: unkillable dpkg-query processes
>> I think things got worse with 2.6.24... The machine shoots itself now, I guess by running cron jobs or so.
>>
>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 0042f928 Y: Not tainted
>> [29074.884191] TPC: sched_clock+0x0/0x30
>
> What kind of OOPS is this? Please provide the kernel log messages that appeared right before these register dumps.

I'll boot the machine and check the logs, I was not in the mood to do this tonight. The pasted messages were dumped on the serial console - as the machine didn't show any reaction I only powered it down...

Cheers,
Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 0042f928 Y: Not tainted
>> [29074.884191] TPC: sched_clock+0x0/0x30
>
> What kind of OOPS is this? Please provide the kernel log messages that appeared right before these register dumps.

Oct 28 03:25:12 titan kernel: [29074.698695] BUG: soft lockup - CPU#0 stuck for 11s! [sh:4252]

This happened while a cronjob was running which updates the libnss-db database... With an older kernel (2.6.23-rcsomething) this didn't crash the machine.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
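Since the soft lockup line is the piece that was asked for, it may be handy to pull all such lines out of a saved kernel log in one go. This is a minimal sketch that assumes the message format shown above ("BUG: soft lockup - CPU#N stuck for Ns! [task:pid]"); the function name soft_lockups is invented for illustration.

```shell
# Hypothetical helper: read kernel log text on stdin and print one line
# per soft lockup report in the form "cpu seconds task:pid".
soft_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\([^]]*\)].*/\1 \2 \3/p'
}

# Example (commented out; the log path is an assumption):
#   soft_lockups < /var/log/kern.log | sort | uniq -c
```

Piping the result through sort and uniq -c gives a quick count of how often each CPU and task shows up.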
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
> Hi,
>
> please note that the futex bug also happens on US II machines, it is just almost impossible to reproduce it - it'll just hang after random days of building.
>
>> Everyone who sees these UltraSPARC-III problems please send me PRECISE and FULL description of how to install from scratch a machine and run something that will trigger these errors.
>
> Can you please check if the kernel config I've attached to one of my last mails is fine for you? The normal Debian installer doesn't boot on the US III machines which use two CPUs in one board, as the installer's kernel is a non-SMP kernel, and the result is that the machine throws a CPU exception and needs to be power-cycled. I've started to investigate there with the help of a contact from Sun, but we both didn't have the time to finish this. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440720 if you want to have a look, please ignore those troll postings from chealer in between...
>
> So to give you a recipe to install Debian on such a box, I need to build an installer with an SMP kernel for you. If the config is fine for your needs, I could just use it.
>
> The other option is to use debootstrap, if you have some system on the machine already - so if you want to use that instead of messing with a network installer, please let me know. Debootstrap should run on most systems, as long as they have ar/tar/gunzip and a bash (probably sh is enough...). It would be faster to use that, and faster to write a recipe for that.
>
> I'll mark all Qlogic firmware related points, so the recipe should work on machines with (v440, v880, probably the Enterprise models, too) and without FC (I guess the Blade 1000 and 2000).
>
> If you don't have access to an US-III machine, I can find a way to give you access to the RSC and serial console of our machine.
>
> Cheers,
> Bernd

Well, I got bitten twice by this bug.

The first time was on a U60 running Debian unstable. Since the mono team decided that mono is broken on Sparc (despite the fix provided by David Miller), I had to rebuild it after enabling the sparc arch in the source. The hang always happens at the end of the build when invoking dh_shgenlibs. This is not 100% reproducible even in my environment.

The second time was on a Sun Blade 2000 SMP with Ubuntu gutsy, where I wasn't able to update the xemacs21 package. The machine hung while invoking the post-installation script. This is not really reproducible now that I have upgraded the packages.

The mono build is, in my humble opinion, the most interesting track to catch the bug.

Seb
Re: unkillable dpkg-query processes
Hi,

> Since mono team decided that the mono is broken on Sparc (and despite the fix provided by David Miller), I had to rebuild after enabling the sparc arch in the source. The hang always happens at the end of the build when invoking dh_shgenlibs in the build. This is not 100% reproducible even in my env.

Trying this at the moment.

> Second was sun blade 2000 SMP with Ubuntu gutsy, I wasn't able to update the xemacs21 package. The machine hung while invoking the post installation script.

Does the Blade run with one or two CPUs? If I remember right they support running with one CPU, which has to be inserted in a special slot/carrier for that. With two CPUs it should use the same repeater chips and architecture as the v440, v880 and larger machines.

Cheers,
Bernd

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
> Hi,
>
>> Since mono team decided that the mono is broken on Sparc (and despite the fix provided by David Miller), I had to rebuild after enabling the sparc arch in the source.
>
> Trying this at the moment.

Not reproducible - mono fails to build from source in sid... so it doesn't reach the interesting part of dh_shlibdeps...

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 02:18:30 +0100

> But if this bug isn't fixed chances are good that the next Debian release won't support Sparc at all.

Please don't use pseudo-threats like this, it only deters me even more from working on this bug.

> This explains why you have trouble to reproduce this, while Josip and me get hit by this bug way too often.

Josip stated explicitly that he has a SunFire280R, which disagrees with what you're saying here.
Re: unkillable dpkg-query processes
David Miller wrote:
> From: Bernd Zeimetz [EMAIL PROTECTED]
> Date: Mon, 29 Oct 2007 02:18:30 +0100
>
>> But if this bug isn't fixed chances are good that the next Debian release won't support Sparc at all.
>
> Please don't use pseudo-threats like this, it only deters me even more from working on this bug.

This was not meant as a threat, it's just a fact and the reason why I'm spending way too much time on trying to make this bug reproducible and also the reason why we're annoying you these days. Sorry for that.

>> This explains why you have trouble to reproduce this, while Josip and me get hit by this bug way too often.
>
> Josip stated explicitly that he has a SunFire280R, which disagrees with what you're saying here.

Sorry, I mixed something up here. I was somehow sure that they were using a v440, but it was somebody else.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 03:06:13 +0100

> David Miller wrote:
>> Josip stated explicitly that he has a SunFire280R, which disagrees with what you're saying here.
>
> Sorry, I mixed something up here. I was somehow sure that they were using a v440, but it was somebody else.

Ok, since I have a 280R just like Josip, I think a good plan is for him to show me the commands he used to create the build root where he can trigger bad things.

I think we can move forward much better starting with this.
Re: unkillable dpkg-query processes
Bernd Zeimetz wrote:
>> For those who can reproduce it and have something like libnss-db enabled, try disabling it.
>
> - disabled it
> - running vgdisplay killed the machine (wanted to create a new LV for a chroot)... it's not accessible at all anymore, I think the kernel is a 2.6.23-something here, I'll build a recent one and give it a try again
>
> Will take some time as I need to build on USII...

I just wanted to write that I'm not able to reproduce this bug anymore... but running aptitude -u often enough gave me this nice output:

titan:~# [ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! [aptitude:13375]
[ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 0042f7d0 Y: Not tainted
[ 2427.506821] TPC: __delay+0x1c/0x48
[ 2427.549494] g0: 9000 g1: 0042f7d0 g2: g3:
[ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 g7:
[ 2427.757835] o0: 0020 o1: 0020 o2: o3:
[ 2427.862001] o4: 0030a0d0 o5: sp: f8a007dff071 ret_pc: 0042f938
[ 2427.970337] RPC: __delay+0x18/0x48
[ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 l3: 0400
[ 2428.117206] l4: l5: 0001 l6: l7: 0008
[ 2428.221374] i0: i1: f8a007dffa88 i2: 0004 i3: 0001
[ 2428.325538] i4: i5: i6: f8a007dff131 i7: 004417ec
[ 2428.429710] I7: cheetah_xcall_deliver+0x1c0/0x23c

and an unkillable, cpu-eating aptitude.

While retrieving some info using sysrq the machine froze after echoing m into sysrq-trigger, producing this output while dying:

[ 3680.006794] BUG: soft lockup - CPU#1 stuck for 11s! [pdflush:265]
[ 3680.078838] TSTATE: 80009603 TPC: 004417a8 TNPC: 004417ac Y: Not tainted
[ 3680.196551] TPC: cheetah_xcall_deliver+0x17c/0x23c
[ 3680.255881] g0: g1: g2: 0001869e g3:
[ 3680.360055] g4: f8a0048e3260 g5: f89fff984000 g6: f8a00717c000 g7:
[ 3680.464220] o0: 0020 o1: f8a00717f418 o2: f8a005a84040 o3: 0010
[ 3680.568384] o4: 0015 o5: sp: f8a00717eac1 ret_pc: 004416e4
[ 3680.676719] RPC: cheetah_xcall_deliver+0xb8/0x23c
[ 3680.735042] l0: 0002 l1: 0002 l2: 0096 l3:
[ 3680.839217] l4: l5: f8a0048d3cd8 l6: 00024098 l7: f7d31000
[ 3680.943382] i0: 0044d100 i1: 00b0f60f8000 i2: i3: 0001
[ 3681.047548] i4: 0001 i5: 0001 i6: f8a00717eb81 i7: 00442be4
[ 3681.151717] I7: smp_flush_dcache_page_impl+0x21c/0x228

Luckily much more output of sysrq is in the syslog, so I should be able to mail it later when the machine is finished with rebooting (which takes some time...).

2.6.24-rc1-git2 (SMP)
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom: OBP 4.22.34 2007/07/23 13:01
type: sun4u
ncpus probed: 4
ncpus active: 4
D$ parity tl1 : 0
I$ parity tl1 : 0
Cpu0ClkTck : 2cb41780
Cpu1ClkTck : 2cb41780
Cpu2ClkTck : 2cb41780
Cpu3ClkTck : 2cb41780
MMU Type: Cheetah
State:
CPU0: online
CPU1: online
CPU2: online
CPU3: online

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
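For anyone collecting the same kind of output: "echoing m into sysrq-trigger" refers to writing a single key to /proc/sysrq-trigger as root. A small wrapper might look like the sketch below. The function name sysrq and the SYSRQ_TRIGGER override are made up here; the override exists only so the function can be tried against an ordinary file. On a machine that is already wedged, triggering SysRq may of course make things worse, as happened above.

```shell
# Hypothetical wrapper around /proc/sysrq-trigger.
sysrq() {
    # usage: sysrq KEY   e.g. "sysrq m" (Show Memory), "sysrq w" (blocked tasks)
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    case $1 in
        [a-z0-9]) printf '%s\n' "$1" > "$trigger" ;;
        *) echo "sysrq: expected a single key such as m, t or w" >&2
           return 2 ;;
    esac
}

# Example (commented out; needs root and a kernel with Magic SysRq enabled):
#   sysrq m    # dump memory info to the kernel log
#   sysrq w    # dump tasks in uninterruptible (blocked) state
```

The results land in the kernel log, so they can be read back with dmesg or from the syslog after a reboot.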
Re: unkillable dpkg-query processes
I think things got worse with 2.6.24... The machine shoots itself now, I guess by running cron jobs or so.

[29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 0042f928 Y: Not tainted
[29074.884191] TPC: sched_clock+0x0/0x30
[29074.929988] g0: g1: 004417ec g2: g3:
[29075.034163] g4: f8a00493a4e0 g5: f89fff97c000 g6: f8a006c64000 g7:
[29075.138329] o0: o1: f8a006c67968 o2: 0008 o3: 0001
[29075.242493] o4: 3385 o5: sp: f8a006c67011 ret_pc: 0042f980
[29075.350830] RPC: udelay+0x18/0x1c
[29075.392482] l0: 0020 l1: l2: 0096 l3:
[29075.496658] l4: 0200 l5: 0001c5569e6c l6: 0006c390404c l7: 6204052f31ec823e
[29075.600824] i0: 0044d100 i1: 00b0fcc2c000 i2: i3:
[29075.704989] i4: 0040 i5: 007a0578 i6: f8a006c670d1 i7: 004420d8
[29075.809161] I7: flush_dcache_page_all+0x16c/0x1c0
[29075.867493] BUG: soft lockup - CPU#2 stuck for 11s! [sh:4253]
[29075.936259] TSTATE: 11009600 TPC: 004417a8 TNPC: 004417ac Y: Not tainted
[29076.053980] TPC: cheetah_xcall_deliver+0x17c/0x23c
[29076.113311] g0: g1: g2: g3:
[29076.217483] g4: f8a0048f9260 g5: f89fff98c000 g6: f8a006c7 g7:
[29076.321648] o0: 0020 o1: f8a006c73968 o2: 0002 o3: 0001
[29076.425816] o4: 781b o5: sp: f8a006c73011 ret_pc: 004416a0
[29076.534150] RPC: cheetah_xcall_deliver+0x74/0x23c
[29076.592471] l0: 0008 l1: l2: 0096 l3:
[29076.696645] l4: 0200 l5: 0001c5569e6c l6: 0006c3904054 l7: 7e645445948ed154
[29076.800811] i0: 0044d100 i1: 00b0fcf8 i2: i3:
[29076.904977] i4: 0040 i5: 007a0578 i6: f8a006c730d1 i7: 004420d8
[29077.009144] I7: flush_dcache_page_all+0x16c/0x1c0

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 28 Oct 2007 04:03:44 +0100

> I think things got worse with 2.6.24... The machine shoots itself now, I guess by running cron jobs or so.
>
> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 0042f928 Y: Not tainted
> [29074.884191] TPC: sched_clock+0x0/0x30

What kind of OOPS is this? Please provide the kernel log messages that appeared right before these register dumps.

Thanks.
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sat, 27 Oct 2007 20:09:47 +0200

> titan:~# [ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! [aptitude:13375]
> [ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 0042f7d0 Y: Not tainted
> [ 2427.506821] TPC: __delay+0x1c/0x48
> [ 2427.549494] g0: 9000 g1: 0042f7d0 g2: g3:
> [ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 g7:
> [ 2427.757835] o0: 0020 o1: 0020 o2: o3:
> [ 2427.862001] o4: 0030a0d0 o5: sp: f8a007dff071 ret_pc: 0042f938
> [ 2427.970337] RPC: __delay+0x18/0x48
> [ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 l3: 0400
> [ 2428.117206] l4: l5: 0001 l6: l7: 0008
> [ 2428.221374] i0: i1: f8a007dffa88 i2: 0004 i3: 0001
> [ 2428.325538] i4: i5: i6: f8a007dff131 i7: 004417ec
> [ 2428.429710] I7: cheetah_xcall_deliver+0x1c0/0x23c
>
> and an unkillable, cpu-eating aptitude.

One cpu can't send a message successfully to another cpu, likely because it is stuck somewhere with interrupts off.

I was going to give you a patch like the one at the end of this email to try and get a register dump from all cpus with Alt-Sysrq-p but that is guaranteed not to work. It will just call back into cheetah_xcall_deliver() and wedge further.

Again, don't use the patch, trying to get a register dump with it in this state will just wedge the machine further.

I don't know how to suggest a way to debug this further, sorry.

I'm sick of these bugs and I need to reproduce all of these UltraSPARC-III issues locally to fix them. So let's go.

Everyone who sees these UltraSPARC-III problems please send me PRECISE and FULL description of how to install from scratch a machine and run something that will trigger these errors.

DO NOT leave out any detail of your installation. Any minor omission will mean that I potentially won't be able to reproduce this bug and therefore I won't be able to fix it either.

If you are using NIS, say so and give the exact configuration. If you have any modifications to some core configuration file like /etc/nsswitch.conf, tell me. If you are using static IP addresses, tell me. If you have netfilter enabled, tell me. If you have even installed some extra package, like libnss-db or anything else, tell me even if you think it's not in use.

In short I want a flawless cook-book style recipe for installing a machine that I can reproduce this problem on. Do not omit any detail.

Thanks!

diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index ca7cdfd..e10fdce 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -348,7 +348,7 @@ void show_regs(struct pt_regs *regs)
 	extern long etrap, etraptl1;
 #endif
 	__show_regs(regs);
-#if 0
+#if 1
 #ifdef CONFIG_SMP
 	{
 		extern void smp_report_regs(void);
Re: unkillable dpkg-query processes
Hi,

> It seems that instead of getting stuck in the kernel where I thought it would, the process gets stuck elsewhere and also tends to loop allocating memory until all memory in the machine is exhausted and the OOM killer starts to try and kill processes left and right.

At least it runs with 100% CPU, and attaching strace to the pid doesn't give any results.

strace-ing the whole process doesn't result in more useful output, but the hanging processes were killable when they were running under strace...

Cheers,
Bernd
Re: unkillable dpkg-query processes
Hi,

just got linked to this thread, so here's a bit of input from me :)

>> 1) system type
>
> A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III (Cheetah), and 2 GB of RAM. If you need more info, just say.
>
> (Bernd Zeimetz has previously suggested that the problem is linked to the processor type, the USIII.)

It seems to hit USIII machines with 2 CPUs in one tray much harder than US II, but once a month our Ultra60 (running two US II) has the same issues - it got much better since 179c85ea53bef807621f335767e41e23f86f01df, though. Before the mentioned patch it died a few times per day. It seems it got better on the USIII here, too (we have a v880 here, the large version of Josip's machine, with 2x 2 CPUs), but it still dies way too often; it is just not usable in the current state.

>> 2) compiler used to build kernel and is it SMP?
>
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Same compiler here. Please note that non-SMP kernels do not boot on those US-III machines at all (at least I didn't find a single one which does).

Cheers,
Bernd
Re: unkillable dpkg-query processes
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Fri, 26 Oct 2007 14:30:21 +0200

> at least it runs with 100% CPU, attaching strace to the pid doesn't give any results
>
> strace-ing the whole process doesn't result in more useful output, but the hanging processes were killable when they were running under strace...

When it runs with 100% CPU that's what makes me suspect it's spinning in the kernel futex code somewhere or similar.

One thing I notice in the Debian bug report is a mention of libnss-db.

So I did some testing here and without libnss-db installed, running dpkg-query does not use futexes at all.

But once I install libnss-db and enable it (by running 'make' under /var/lib/misc then editing /etc/nsswitch.conf to make 'db' get searched first) indeed dpkg-query starts using futexes via the libnss-db library.

Josip, do you guys have libnss-db or similar in use on the buildd machine?

For those who can reproduce it and have something like libnss-db enabled, try disabling it.
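To see quickly whether 'db' is configured as a lookup source on a given machine, something like the following could be used. It is a minimal sketch under the assumption that /etc/nsswitch.conf uses the usual "database: source source ..." layout; the function name nss_db_users is invented, and a listed 'db' source only matters if the libnss-db library is actually installed.

```shell
# Hypothetical helper: print every nsswitch database whose source list
# contains "db". Reads /etc/nsswitch.conf unless another file is given.
nss_db_users() {
    # usage: nss_db_users [FILE]
    awk -F: '
        /^[ \t]*#/ { next }                 # skip comment lines
        NF >= 2 {
            n = split($2, src, " ")         # sources are blank-separated
            for (i = 1; i <= n; i++)
                if (src[i] == "db") {
                    gsub(/[ \t]/, "", $1)   # strip blanks around the name
                    print $1
                    break
                }
        }' "${1:-/etc/nsswitch.conf}"
}

# Example (commented out):
#   nss_db_users            # e.g. prints "protocols" if that line lists db
```

An empty result means no database is configured to consult 'db' at all, which is one less variable when comparing setups.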
Re: unkillable dpkg-query processes
On Fri, Oct 26, 2007 at 03:01:24PM -0700, David Miller wrote:
> One thing I notice in the debian bug report is a mention of libnss-db
>
> So I did some testing here and without libnss-db installed, running dpkg-query does not use futexes at all.
>
> But once I install libnss-db and enable it (by running 'make' under /var/lib/misc then editing /etc/nsswitch.conf to make 'db' get searched first) indeed dpkg-query starts using futexes via the libnss-db library.
>
> Josip, do you guys have libnss-db or similar in use on the buildd machine?

lebrun.d.o doesn't have libnss-db installed, neither outside nor inside the chroot, sorry. Both setups have the default /etc/nsswitch.conf that searches 'db' before 'files' for protocols, services, ethers, rpc, but that's it.

BTW, would you benefit from having an account on this machine?

--
2. That which causes joy or happiness.
Re: unkillable dpkg-query processes
On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
>> Josip, do you guys have libnss-db or similar in use on the buildd machine?
>
> They have, that's what Debian's userdir-ldap uses.

No, I have to correct you, this machine isn't part of that setup (at least not yet).

--
2. That which causes joy or happiness.
Re: unkillable dpkg-query processes
Josip Rodin wrote:
> On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
>>> Josip, do you guys have libnss-db or similar in use on the buildd machine?
>>
>> They have, that's what Debian's userdir-ldap uses.
>
> No, I have to correct you, this machine isn't part of that setup (at least not yet).

Oh ok, I stand corrected - I thought it would have it.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
> For those who can reproduce it and have something like libnss-db enabled, try disabling it.

- disabled it
- running vgdisplay killed the machine (wanted to create a new LV for a chroot)... it's not accessible at all anymore, I think the kernel is a 2.6.23-something here, I'll build a recent one and give it a try again

Will take some time as I need to build on USII...

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
> Josip, do you guys have libnss-db or similar in use on the buildd machine?

They have, that's what Debian's userdir-ldap uses.

> For those who can reproduce it and have something like libnss-db enabled, try disabling it.

Will do in a few minutes.

--
Bernd Zeimetz [EMAIL PROTECTED] http://bzed.de/
Re: unkillable dpkg-query processes
Josip, give this debugging patch a try. It is against 2.6.23.1 but it should apply to most recent kernels.

It should give you debugging messages in the kernel log that start with FUTEX_BUG if the debugging code triggers. Please post just a few samples of whatever it spits out.

Thanks!

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc94e7..6da8b3c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1874,6 +1874,25 @@ err_unlock:
 	return ret;
 }
 
+static void log_futex_bug(u32 __user *uaddr, struct task_struct *curr, int pi)
+{
+	struct mm_struct *mm = curr->mm;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+
+	printk(KERN_ERR "FUTEX_BUG: Looping too much in futex death\n");
+	printk(KERN_ERR "FUTEX_BUG: uaddr[%p] task[%s:%d] pi(%d)\n",
+	       uaddr, curr->comm, curr->pid, pi);
+
+	addr = (unsigned long) uaddr;
+	vma = find_vma(mm, addr);
+	if (vma)
+		printk(KERN_ERR "FUTEX_BUG: VMA start[%lx] end[%lx] flags[%lx]\n",
+		       vma->vm_start,
+		       vma->vm_end,
+		       vma->vm_flags);
+}
+
 /*
  * Process a futex-list entry, check whether it's owned by the
  * dying task, and do notification if so:
@@ -1881,6 +1900,7 @@ err_unlock:
 int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
 {
 	u32 uval, nval, mval;
+	int limit = 0;
 
 retry:
 	if (get_user(uval, uaddr))
@@ -1903,8 +1923,12 @@ retry:
 		if (nval == -EFAULT)
 			return -1;
 
-		if (nval != uval)
-			goto retry;
+		if (nval != uval) {
+			if (++limit < 100)
+				goto retry;
+			log_futex_bug(uaddr, curr, pi);
+			put_user(mval, uaddr);
+		}
 
 		/*
 		 * Wake robust non-PI futexes here. The wakeup of
Re: unkillable dpkg-query processes
On Wed, Oct 24, 2007 at 11:41:13PM -0700, David Miller wrote:
> Josip, give this debugging patch a try. It is against 2.6.23.1 but it should apply to most recent kernels.

OK, after resurrecting the machine once again (it had died in the meantime, reliably as ever), I did:

patching file kernel/futex.c
Hunk #1 succeeded at 1877 (offset 3 lines).
Hunk #2 succeeded at 1903 (offset 3 lines).
Hunk #3 succeeded at 1926 (offset 3 lines).

> It should give you debugging messages in the kernel log that start with FUTEX_BUG if the debugging code triggers. Please post just a few samples of whatever it spits out.

It's been running with the patched kernel for some 6.5 hours now, no problems yet. I'll let you know as soon as it starts to misbehave.

--
2. That which causes joy or happiness.
Re: unkillable dpkg-query processes
On Thu, Oct 25, 2007 at 05:07:36PM +0200, joy wrote:
> > If you try, within that troublesome build-root, a few times to try to fork off a couple hundred: dpkg-query --something python-2.5 or whatever, can you get some of the processes to wedge under that build root?
>
> I did this in a chrooted bash:
>
> for i in $(seq 0 100); do (dpkg-query -s python2.5-minimal > /dev/null &); done
>
> And now the machine went catatonic. :( Thankfully the console is still vaguely operational - I can enter my username to log in, but I can't get the Password prompt to appear. Magic SysRq still works - if you need any output from it, tell me.

The machine continued in this state for a couple of hours or so; it didn't come back to life. When I went to check up on it, the kernel showed one message on the console - OOM killer killed a make process. I then gave up, used SysRq to S+U+B, and it booted again, and I was able to retrieve the following data from kern.log, which is in the attachment. Hope that helps.

--
2. That which causes joy or happiness.

Oct 25 17:04:09 lebrun kernel: SysRq : Emergency Sync
Oct 25 17:04:20 lebrun kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Oct 25 17:04:20 lebrun kernel: SysRq : Show Memory
Oct 25 17:04:20 lebrun kernel: Mem-info:
Oct 25 17:04:20 lebrun kernel: Normal per-cpu:
Oct 25 17:04:20 lebrun kernel: CPU0: Hot: hi: 90, btch: 15 usd: 0 Cold: hi: 30, btch: 7 usd: 0
Oct 25 17:04:20 lebrun kernel: CPU1: Hot: hi: 90, btch: 15 usd: 4 Cold: hi: 30, btch: 7 usd: 24
Oct 25 17:04:20 lebrun kernel: Active:202209 inactive:46687 dirty:39 writeback:279 unstable:0
Oct 25 17:04:20 lebrun kernel: free:723 slab:2826 mapped:2986 pagetables:875 bounce:0
Oct 25 17:04:20 lebrun kernel: Normal free:5616kB min:5760kB low:7200kB high:8640kB active:1619760kB inactive:371344kB present:2077352kB pages_scanned:178 all_unreclaimable? no
Oct 25 17:04:20 lebrun kernel: lowmem_reserve[]: 0 0
Oct 25 17:04:21 lebrun kernel: Normal: 780*8kB 11*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 6512kB
Oct 25 17:04:21 lebrun kernel: Swap cache: add 251630, delete 187227, find 26426/42924, race 80+86
Oct 25 17:04:21 lebrun kernel: Free swap = 174880kB
Oct 25 17:04:24 lebrun kernel: Total swap = 1048688kB
Oct 25 17:04:24 lebrun kernel: Free swap: 174648kB
Oct 25 17:04:24 lebrun kernel: 261865 pages of RAM
Oct 25 17:04:24 lebrun kernel: 3001 reserved pages
Oct 25 17:04:24 lebrun kernel: 155176 pages shared
Oct 25 17:04:24 lebrun kernel: 64407 pages swap cached
Oct 25 17:04:24 lebrun kernel: 39 pages dirty
Oct 25 17:04:24 lebrun kernel: 124 pages writeback
Oct 25 17:04:24 lebrun kernel: 2986 pages mapped
Oct 25 17:04:24 lebrun kernel: 2826 pages slab
Oct 25 17:04:24 lebrun kernel: 875 pages pagetables
Oct 25 17:05:01 lebrun kernel: SysRq : Emergency Sync
Oct 25 17:05:04 lebrun kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Oct 25 17:05:07 lebrun kernel: SysRq : Show Blocked State
Oct 25 17:05:07 lebrun kernel: task PC stack pid father
Oct 25 17:05:07 lebrun kernel: kswapd0 D 00528bc8 0 181 2
Oct 25 17:05:07 lebrun kernel: Call Trace:
Oct 25 17:05:08 lebrun kernel: [006258e0] io_schedule+0x2c/0x38
Oct 25 17:05:08 lebrun kernel: [00528bc8] get_request_wait+0x11c/0x15c
Oct 25 17:12:13 lebrun kernel: [0052a220] ges+0x144/0x258
Oct 25 17:12:13 lebrun kernel: [0048cf34] __alloc_pages+0x1b0/0x330
Oct 25 17:12:13 lebrun kernel: [0049f50c] read_swap_cache_async+0x40/0x150
Oct 25 17:12:13 lebrun kernel: [00495908] swapin_readahead+0x3c/0x7c
Oct 25 17:12:13 lebrun kernel: [004973b4] handle_mm_fault+0x3fc/0x7cc
Oct 25 17:12:13 lebrun kernel: [0044e084] do_sparc64_fault+0x314/0x594
Oct 25 17:12:13 lebrun kernel: [0040794c] sparc64_realfault_common+0x18/0x20
Oct 25 17:12:13 lebrun kernel: [00015078] 0x15080
Oct 25 17:12:13 lebrun kernel: dpkg-query D 00528bc8 0 3924 1
Oct 25 17:12:13 lebrun kernel: Call Trace:
Oct 25 17:12:13 lebrun kernel: [006258e0] io_schedule+0x2c/0x38
Oct 25 17:12:13 lebrun kernel: [00528bc8] get_request_wait+0x11c/0x15c
Oct 25 17:12:13 lebrun kernel: [0052a220] __make_request+0x5f0/0x6a8
Oct 25 17:12:13 lebrun kernel: [00526bac] generic_make_request+0x2f8/0x31c
Oct 25 17:12:13 lebrun kernel: [00526cd4] submit_bio+0x104/0x10c
Oct 25 17:12:13 lebrun kernel: [0049f30c] swap_writepage+0xa4/0xb4
Oct 25 17:12:13 lebrun kernel: [004918c4] shrink_page_list+0x410/0x6f4
Oct 25 17:12:13 lebrun kernel: [004922c8] shrink_zone+0x720/0xa38
Oct 25 17:12:13 lebrun kernel: [00492de8]
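
As a side check on the Show Memory dump above: the per-order free-list line for the Normal zone should add up to the total the kernel prints, and the two swap figures give a rough idea of how much swap was in use. A short Python sketch of that arithmetic follows, with the numbers copied from the log; the helper name is made up for illustration.

```python
import re


def buddy_free_kb(line: str) -> int:
    """Sum the 'count*sizekB' terms of a zone free-list line from SysRq Show Memory."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)


# Copied from the log above.
normal = ("Normal: 780*8kB 11*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB "
          "0*1024kB 0*2048kB 0*4096kB 0*8192kB = 6512kB")
print(buddy_free_kb(normal))  # 6512, the same total the kernel reported

# Swap figures from the same dump (first "Free swap" value).
total_swap_kb = 1048688
free_swap_kb = 174880
used_fraction = (total_swap_kb - free_swap_kb) / total_swap_kb
print(f"{used_fraction:.1%} of swap in use")  # 83.3% of swap in use
```

Only a few MB free in the zone, nearly all of it in the smallest blocks, and most of the swap consumed seems consistent with the OOM kill mentioned in the message.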
Re: unkillable dpkg-query processes
From: Josip Rodin [EMAIL PROTECTED]
Date: Thu, 25 Oct 2007 00:33:32 +0200

> We've been having grave issues with a few of our sparc build daemon machines in Debian. Something causes dpkg-query(8) processes, otherwise harmless, to run amok and allocate too much memory, but keep running and become resilient to killing. They eventually push the machine to the point where you can only ping it, but all the userland and the console is dead.

I know, I've seen this report a million times :-)

I can't reproduce it, I've even tried the fabled test case where you spawn thousands of dpkg-query instances and it never does anything wrong on my Niagara boxes.

So something is different about your environment than mine. Let's see if there is some aspect of the environment that contributed to the problem occurring.

Please reproduce with 2.6.23-final and then list (I know this is redundant, just humor me :-):

1) system type
2) compiler used to build kernel and is it SMP?
3) glibc in use
4) compiler used to build running glibc

If you have a reproducible test case, that's even better.

If necessary I'll try to install a replica of your build environment here in order to reproduce.
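
The "fabled test case" mentioned above, spawning many dpkg-query instances at once and checking whether any fail to exit, might be sketched in Python as follows. This is only a sketch under assumptions: the helper name is hypothetical, the instance count and timeout are arbitrary choices, and the command is passed in by the caller rather than hard-coded.

```python
import subprocess
import time
from typing import List, Sequence


def spawn_and_count_stuck(cmd: Sequence[str], count: int, timeout: float) -> int:
    """Start `count` copies of `cmd` at once; return how many are still running after `timeout` seconds."""
    procs: List[subprocess.Popen] = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(count)
    ]
    deadline = time.monotonic() + timeout
    stuck = 0
    for proc in procs:
        # All processes share one overall deadline rather than getting `timeout` each.
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            stuck += 1
            # Best effort only: a task wedged inside the kernel will ignore this,
            # so we deliberately do not wait for it again afterwards.
            proc.kill()
    return stuck
```

A call such as `spawn_and_count_stuck(["dpkg-query", "-s", "python2.5-minimal"], count=200, timeout=60.0)` would approximate the test, with the package name borrowed from elsewhere in this thread. A non-zero result only says that some instances did not finish in time; whether they are truly unkillable still has to be checked by hand.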
Re: unkillable dpkg-query processes
On Wed, Oct 24, 2007 at 03:58:29PM -0700, David Miller wrote:
> I know, I've seen this report a million times :-)

Oh, I know you know, I mailed you a while ago and you told me to mail the mailing list :)

> I can't reproduce it, I've even tried the fabled test case where you spawn thousands of dpkg-query instances and it never does anything wrong on my Niagara boxes.
>
> So something is different about your environment than mine. Let's see if there is some aspect of the environment that contributed to the problem occurring.
>
> Please reproduce with 2.6.23-final and then list (I know this is redundant, just humor me :-):

Confirming that the machine could reproduce the problem with 2.6.23.1. (I can send over the .config if it matters.)

> 1) system type

A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III (Cheetah), and 2 GB of RAM. If you need more info, just say.

(Bernd Zeimetz has previously suggested that the problem is linked to the processor type, the USIII.)

> 2) compiler used to build kernel and is it SMP?

gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

I've no idea if that compiler is SMP, if you want I'll ask someone else.

> 3) glibc in use
> 4) compiler used to build running glibc

In that particular chroot, it's:

chroot-unstable% lib/libc-2.6.1.so
GNU C Library stable release version 2.6.1, by Roland McGrath et al.
[...]
Compiled by GNU CC version 4.2.1 (Debian 4.2.1-5).
Compiled on a Linux 2.6.17-rc1 system on 2007-09-04.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
    software FPU emulation by Richard Henderson, Jakub Jelinek and others
[...]

Outside of that chroot, it's:

% /lib/libc-2.3.6.so
GNU C Library stable release version 2.3.6, by Roland McGrath et al.
[...]
Compiled by GNU CC version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21).
Compiled on a Linux 2.6.18 system on 2007-03-01.
Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    linuxthreads-0.10 by Xavier Leroy
    BIND-8.2.3-T5B
    libthread_db work sponsored by Alpha Processor Inc
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
    software FPU emulation by Richard Henderson, Jakub Jelinek and others
Thread-local storage support included.
[...]

> If you have a reproducible test case, that's even better.

There doesn't appear to be a pattern, on this machine at least - I just let the buildd run, building whatever comes up, and after a few hours it inevitably runs into a wall.

--
2. That which causes joy or happiness.
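
Most of the four items requested above can be collected programmatically. Below is a rough Python sketch, assuming a Linux system with /proc mounted; the function name and the dictionary keys are invented for illustration. The compiler that built glibc is not covered here. Running the libc shared object directly, as done in the message, prints that.

```python
import platform
from pathlib import Path
from typing import Dict


def collect_environment() -> Dict[str, str]:
    """Gather basic system facts of the kind asked for in the thread, where available."""
    info = {
        "machine": platform.machine(),            # architecture name as reported by uname
        "kernel": platform.release(),             # running kernel release
        "libc": " ".join(platform.libc_ver()),    # C library name and version, if detectable
    }
    version = Path("/proc/version")
    if version.exists():
        # /proc/version names the compiler that built the kernel and
        # carries an "SMP" tag when the kernel was built with SMP support.
        text = version.read_text().strip()
        info["kernel_build"] = text
        info["smp"] = "yes" if "SMP" in text.split() else "no"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Run inside and outside the chroot, this would show the same kernel but possibly different libc versions, which is the distinction the reply above draws by hand.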