Re: unkillable dpkg-query processes

2007-11-06 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 06 Nov 2007 04:51:07 +0100

 Here's also some output from apt-get which got stuck in my unstable
 chroot while I wanted to retrieve the klibc source to try to debug it...

So the good news is that I started getting the hang seen
on the Debian buildd on my workstation.

The bad news is that it's very sporadic, for a while I
could trigger it during bootup, on every boot, and now
I can't get it to wedge at all.

Anyways, we're getting closer.
-
To unsubscribe from this list: send the line unsubscribe sparclinux in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unkillable dpkg-query processes

2007-11-06 Thread Bernd Zeimetz
David Miller wrote:
 From: Bernd Zeimetz [EMAIL PROTECTED]
 Date: Tue, 06 Nov 2007 04:51:07 +0100
 
 Here's also some output from apt-get which got stuck in my unstable
 chroot while I wanted to retrieve the klibc source to try to debug it...
 
 So the good news is that I started getting the hang seen
 on the Debian buildd on my workstation.
 
 The bad news is that it's very sporadic, for a while I
 could trigger it during bootup, on every boot, and now
 I can't get it to wedge at all.
 
 Anyways, we're getting closer.


Running stress -c 2 on a 4 CPU machine made things much worse here;
it will probably help you trigger the bug, too.
Our US II machine is also running fine at the moment.



-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-11-05 Thread Bernd Zeimetz

 So I'm not sure if the result is really useful for you - if not just let
 me know. I've attached the last ~10-20 sysrq-g outputs - as it was
 running in a loop I have a ton of them. In case you're wondering: http
 is aptitude's http method.
 
 The http module is stuck in a different place, I'll try to
 see if I can make sense of it.


Here's also some output from apt-get which got stuck in my unstable
chroot while I wanted to retrieve the klibc source to try to debug it...

Nov  6 04:43:19 titan kernel: [100896.376237] SysRq : Show Global CPU Regs
Nov  6 04:43:19 titan kernel: [100896.423254] * CPU[  0]: 
TSTATE[] TPC[] TNPC[] 
TASK[bash:11762]
Nov  6 04:43:19 titan kernel: [100896.544064]   CPU[  1]: 
TSTATE[004411009602] TPC[0067b59c] TNPC[0067b5a0] 
TASK[swapper:0]
Nov  6 04:43:19 titan kernel: [100896.663869]  
TPC[schedule+0x5f8/0x7a4]
Nov  6 04:43:19 titan kernel: [100896.722179]  
O7[schedule+0x5cc/0x7a4]
Nov  6 04:43:19 titan kernel: [100896.779474]  
I7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:19 titan kernel: [100896.834677]   CPU[  2]: 
TSTATE[009911009601] TPC[0042888c] TNPC[00428890] 
TASK[swapper:0]
Nov  6 04:43:19 titan kernel: [100896.954474]  
TPC[cpu_idle+0x80/0xb8]
Nov  6 04:43:19 titan kernel: [100897.010715]  
O7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:19 titan kernel: [100897.065932]  
I7[after_lock_tlb+0x19c/0x1b0]
Nov  6 04:43:19 titan kernel: [100897.129468]   CPU[  3]: 
TSTATE[004411009602] TPC[0053a0c4] TNPC[0053a0c8] 
TASK[apt-get:11759]
Nov  6 04:43:19 titan kernel: [100897.253443]  
TPC[__first_cpu+0x4/0x28]
Nov  6 04:43:19 titan kernel: [100897.311767]  O7[__delay+0x28/0x48]
Nov  6 04:43:20 titan kernel: [100897.365923]  
I7[cheetah_xcall_deliver+0x1c0/0x23c]
Nov  6 04:43:31 titan kernel: [100909.020406] SysRq : Show Global CPU Regs
Nov  6 04:43:31 titan kernel: [100909.067374] * CPU[  0]: 
TSTATE[] TPC[] TNPC[] 
TASK[bash:11762]
Nov  6 04:43:31 titan kernel: [100909.188209]   CPU[  1]: 
TSTATE[004411009604] TPC[0045731c] TNPC[00457320] 
TASK[swapper:0]
Nov  6 04:43:32 titan kernel: [100909.308013]  
TPC[update_stats_wait_end+0x24/0x88]
Nov  6 04:43:32 titan kernel: [100909.377808]  
O7[sched_clock+0x10/0x30]
Nov  6 04:43:32 titan kernel: [100909.436116]  
I7[pick_next_task_fair+0x24/0x44]
Nov  6 04:43:32 titan kernel: [100909.502782]   CPU[  2]: 
TSTATE[009911009601] TPC[0042888c] TNPC[00428890] 
TASK[swapper:0]
Nov  6 04:43:32 titan kernel: [100909.622580]  
TPC[cpu_idle+0x80/0xb8]
Nov  6 04:43:32 titan kernel: [100909.678817]  
O7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:32 titan kernel: [100909.734029]  
I7[after_lock_tlb+0x19c/0x1b0]
Nov  6 04:43:32 titan kernel: [100909.797570]   CPU[  3]: 
TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] 
TASK[apt-get:11759]
Nov  6 04:43:32 titan kernel: [100909.921536]  
TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov  6 04:43:32 titan kernel: [100909.993401]  
O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov  6 04:43:32 titan kernel: [100910.063193]  
I7[flush_dcache_page_all+0x178/0x240]
Nov  6 04:43:33 titan kernel: [100910.766366] SysRq : Show Global CPU Regs
Nov  6 04:43:33 titan kernel: [100910.813292] * CPU[  0]: 
TSTATE[] TPC[] TNPC[] 
TASK[bash:11762]
Nov  6 04:43:33 titan kernel: [100910.934129]   CPU[  1]: 
TSTATE[004411009604] TPC[0045731c] TNPC[00457320] 
TASK[swapper:0]
Nov  6 04:43:33 titan kernel: [100911.053923]  
TPC[update_stats_wait_end+0x24/0x88]
Nov  6 04:43:33 titan kernel: [100911.123706]  
O7[sched_clock+0x10/0x30]
Nov  6 04:43:33 titan kernel: [100911.182037]  
I7[pick_next_task_fair+0x24/0x44]
Nov  6 04:43:33 titan kernel: [100911.248702]   CPU[  2]: 
TSTATE[004411009601] TPC[004288a0] TNPC[004288a4] 
TASK[swapper:0]
Nov  6 04:43:34 titan kernel: [100911.368498]  
TPC[cpu_idle+0x94/0xb8]
Nov  6 04:43:34 titan kernel: [100911.424738]  
O7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:34 titan kernel: [100911.479949]  
I7[after_lock_tlb+0x19c/0x1b0]
Nov  6 04:43:34 titan kernel: [100911.543490]   CPU[  3]: 
TSTATE[11009601] TPC[0042fc44] TNPC[0042fbe8] 
TASK[apt-get:11759]
Nov  6 04:43:34 titan kernel: [100911.667456]  TPC[udelay+0x14/0x1c]
Nov  6 04:43:34 titan kernel: [100911.721611]  O7[udelay+0x10/0x1c]
Nov  6 04:43:34 titan kernel: [100911.774739]  
I7[flush_dcache_page_all+0x178/0x240]
Nov  6 04:43:35 titan kernel: [100912.474070] SysRq : Show Global CPU Regs
Nov  6 04:43:35 titan kernel: [100912.520982] * 

Re: unkillable dpkg-query processes

2007-11-04 Thread David Miller
From: Josip Rodin [EMAIL PROTECTED]
Date: Fri, 2 Nov 2007 17:21:06 +0100

 Great. Here you go, three of them, while the load was 3 and this process was
 stuck:
 
 buildd   10813  100  0.8 987368 17504 ?RN   14:44 155:49 dpkg-query 
 --search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 
 libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 
 libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3
 ...
Nov  2 17:02:04 lebrun kernel:   CPU[  0]: TSTATE[80009604] 
TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
Nov  2 17:02:04 lebrun kernel:  
TPC[sparc64_realfault_common+0x8/0x20]

It looks like dpkg-query is stuck on a page fault.  Typically
this means the fault processing is not putting a valid
translation into the TLB to satisfy the fault, so we loop
forever never making forward progress.

I've had to debug something similar to this before, so I'll
piece together a debugging patch you can use to get more
information.
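
A rough way to check whether a task stays pinned at the same trap PC
across several snapshots is to pull the CPU number, task and TPC symbol
out of the log. This is only a sketch: the function name
sysrq_tpc_summary is made up for illustration, and it assumes each
kernel message sits on a single line (unwrapped, unlike the
mail-wrapped excerpts in this thread).

```shell
#!/bin/sh
# sysrq_tpc_summary: hypothetical helper, not part of any patch in this thread.
# Reads "Show Global CPU Regs" output on stdin and prints one line per
# symbolized TPC entry: "<cpu> <task:pid> <symbol+offset/size>".
# Assumption: one kernel message per input line.
sysrq_tpc_summary() {
    awk '
        /CPU\[ *[0-9]+\]:/ {
            # Remember which CPU and task the following symbol lines belong to.
            cpu = $0;  sub(/.*CPU\[ */, "", cpu);  sub(/\].*/, "", cpu)
            task = $0; sub(/.*TASK\[/, "", task);  sub(/\].*/, "", task)
            next
        }
        /TPC\[[A-Za-z_]/ {
            # A TPC[...] entry that starts with a letter or underscore is a symbol.
            sym = $0; sub(/.*TPC\[/, "", sym); sub(/\].*/, "", sym)
            print cpu, task, sym
        }
    '
}
```

Feeding it several snapshots, for example
"sysrq_tpc_summary < kern.log | sort | uniq -c", shows quickly whether
the same task sits at the same symbol every time.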


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

 Ok, the key in the trace is:
 
 Nov  2 16:25:30 titan kernel: [  978.134874]   CPU[  1]: 
 TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] 
 TASK[aptitude:3204]
 Nov  2 16:25:30 titan kernel: [  978.257809]  
 TPC[_write_unlock_irq+0x20/0x110]
  ...
 Nov  2 16:25:30 titan kernel: [  978.507778]   CPU[  3]: 
 TSTATE[11009605] TPC[004419f8] TNPC[004419fc] 
 TASK[aptitude:3203]
 Nov  2 16:25:30 titan kernel: [  978.630707]  
 TPC[cheetah_xcall_deliver+0x174/0x23c]
 
 The first symbol is misleading, it says _write_unlock_irq but actually
 in the assembler the PC is in the spinlock read spinning loop
 section.  So actually it's hanging in _spin_lock().
 
 CPU #3 is trying to send a cross-call message interrupt, but for
 some reason that isn't making forward progress.
 
 Let's see what's calling these things by adding some more debugging
 information.  Please retry the test with the following patch on
 top of the original sysrq-g debugging patch and please get new
 logs when it hangs.


Today I was a bit out of luck: either the machine crashed so badly that
it didn't react to anything anymore, or it didn't crash at all.
The machine went amok a bit more slowly when I did the following things,
which also resulted in the attached sysrq output.
- run stress -c 2 to get the load up (I didn't need that the last time...)
- run something like `while true; do echo g > /proc/sysrq-trigger; sleep
0.5; done`
- run aptitude -u several times until the machine died.
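
The sysrq polling step from the list above can be wrapped in a small
function so that the interval and the number of snapshots are explicit.
A sketch only: sysrq_g_loop and its arguments are invented for
illustration, the "g" key only dumps the global CPU registers on a
kernel carrying the debugging patch discussed in this thread, and a
fractional interval assumes a sleep that accepts one (GNU coreutils
does).

```shell
#!/bin/sh
# sysrq_g_loop COUNT [INTERVAL] [TRIGGER]
# Hypothetical helper: writes "g" to the sysrq trigger COUNT times,
# sleeping INTERVAL seconds after each write.
# TRIGGER defaults to /proc/sysrq-trigger (writing there needs root); it is
# a parameter so the loop can first be tried against an ordinary file.
sysrq_g_loop() {
    count=$1
    interval=${2:-0.5}
    trigger=${3:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo g > "$trigger" || return 1
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, "sysrq_g_loop 20 0.5" as root would take twenty snapshots
about half a second apart.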

So I'm not sure if the result is really useful for you - if not just let
me know. I've attached the last ~10-20 sysrq-g outputs - as it was
running in a loop I have a ton of them. In case you're wondering: http
is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on,
but it has always taken longer to crash there, so we'll see if it
happens at all.

Thanks for your work,


Bernd


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


sysrq2.txt
Description: application/pgp-keys


Re: unkillable dpkg-query processes

2007-11-04 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 04 Nov 2007 20:55:20 +0100

 So I'm not sure if the result is really useful for you - if not just let
 me know. I've attached the last ~10-20 sysrq-g outputs - as it was
 running in a loop I have a ton of them. In case you're wondering: http
 is aptitude's http method.

The http module is stuck in a different place, I'll try to
see if I can make sense of it.


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz
David Miller wrote:
 From: Bernd Zeimetz [EMAIL PROTECTED]
 Date: Sun, 04 Nov 2007 20:55:20 +0100
 
 So I'm not sure if the result is really useful for you - if not just let
 me know. I've attached the last ~10-20 sysrq-g outputs - as it was
 running in a loop I have a ton of them. In case you're wondering: http
 is aptitude's http method.
 
 The http module is stuck in a different place, I'll try to
 see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running
through the part which usually crashed, so it should be possible to run
it in a loop...

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

 In the meantime I'll build an aptitude which should exit after running
 through the part which usually crashed, so it should be possible to run
 it in a loop...

This was successful - it made crashing the machine pretty simple, even
without activated libnss-db.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done

Some of the aptitudes hit a SIGABRT before one got stuck.
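
When the loop is left running unattended, it helps to record how each
run ended. Under bash or dash, a run killed by SIGABRT shows up as exit
status 134 (128 plus signal number 6). The following is a sketch under
that assumption; repeat_and_log is a made-up name, not something from
the attached patch.

```shell
#!/bin/sh
# repeat_and_log MAX CMD [ARGS...]
# Hypothetical helper: runs CMD up to MAX times and prints one line per run
# with its exit status. Status 134 is counted as a SIGABRT (128 + 6).
# If a run wedges, the loop simply stops advancing and the last printed
# line tells how many runs completed before the hang.
repeat_and_log() {
    max=$1
    shift
    n=0
    aborted=0
    while [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        status=0
        "$@" || status=$?
        if [ "$status" -eq 134 ]; then
            aborted=$((aborted + 1))
        fi
        echo "run $n: exit status $status"
    done
    echo "completed $n runs, $aborted aborted"
}
```

With the patched aptitude installed this would be called as
"repeat_and_log 1000 aptitude -u".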

Best regards,

Bernd

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


aptitude.diff
Description: application/pgp-keys


aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data


Re: unkillable dpkg-query processes

2007-11-03 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Fri, 02 Nov 2007 16:37:25 +0100

 I've sent g several times to sysrq, output is attached.
 According to top the two hanging aptitude processes were running on CPU
 1 + 3.
 
  3204 root  20   0 19552 5088 4072 R  100  0.1   6:54.49 1 aptitude
  3203 root  20   0 19552 5088 4072 R  100  0.1   6:56.39 3 aptitude

Ok, the key in the trace is:

Nov  2 16:25:30 titan kernel: [  978.134874]   CPU[  1]: 
TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] 
TASK[aptitude:3204]
Nov  2 16:25:30 titan kernel: [  978.257809]  
TPC[_write_unlock_irq+0x20/0x110]
 ...
Nov  2 16:25:30 titan kernel: [  978.507778]   CPU[  3]: 
TSTATE[11009605] TPC[004419f8] TNPC[004419fc] 
TASK[aptitude:3203]
Nov  2 16:25:30 titan kernel: [  978.630707]  
TPC[cheetah_xcall_deliver+0x174/0x23c]

The first symbol is misleading, it says _write_unlock_irq but actually
in the assembler the PC is in the spinlock read spinning loop
section.  So actually it's hanging in _spin_lock().

CPU #3 is trying to send a cross-call message interrupt, but for
some reason that isn't making forward progress.

Let's see what's calling these things by adding some more debugging
information.  Please retry the test with the following patch on
top of the original sysrq-g debugging patch and please get new
logs when it hangs.

Thanks!

--- arch/sparc64/kernel/process.c.ORIG	2007-11-03 20:53:27.0 -0700
+++ arch/sparc64/kernel/process.c	2007-11-03 21:05:47.0 -0700
@@ -49,6 +49,7 @@
 #include <asm/hypervisor.h>
 #include <asm/sstate.h>
 #include <asm/irq_regs.h>
+#include <asm/smp.h>
 
 /* #define VERBOSE_SHOWREGS */
 
@@ -394,7 +395,11 @@ struct global_reg_snapshot {
 	unsigned long		tstate;
 	unsigned long		tpc;
 	unsigned long		tnpc;
+	unsigned long		o7;
+	unsigned long		i7;
 	struct thread_info	*thread;
+	unsigned long		pad1;
+	unsigned long		pad2;
 } global_reg_snapshot[NR_CPUS];
 static DEFINE_SPINLOCK(global_reg_snapshot_lock);
 
@@ -413,6 +418,8 @@ static void sysrq_handle_globreg(int key
 		global_reg_snapshot[cpu].tstate = regs->tstate;
 		global_reg_snapshot[cpu].tpc = regs->tpc;
 		global_reg_snapshot[cpu].tnpc = regs->tnpc;
+		global_reg_snapshot[cpu].o7 = regs->u_regs[UREG_I7];
+		global_reg_snapshot[cpu].i7 = 0;
 	} else {
 		global_reg_snapshot[cpu].tstate = 0;
 		global_reg_snapshot[cpu].tpc = 0;
@@ -432,9 +439,19 @@ static void sysrq_handle_globreg(int key
 		       ((tp && tp->task) ? tp->task->comm : "NULL"),
 		       ((tp && tp->task) ? tp->task->pid : -1));
 #ifdef CONFIG_KALLSYMS
-		if ((gp->tstate & TSTATE_PRIV) && (gp->tpc != 0UL)) {
-			sprint_symbol(buffer, gp->tpc);
-			printk(" TPC[%s]\n", buffer);
+		if (gp->tstate & TSTATE_PRIV) {
+			if (gp->tpc != 0UL) {
+				sprint_symbol(buffer, gp->tpc);
+				printk(" TPC[%s]\n", buffer);
+			}
+			if (gp->o7 != 0UL) {
+				sprint_symbol(buffer, gp->o7);
+				printk(" O7[%s]\n", buffer);
+			}
+			if (gp->i7 != 0UL) {
+				sprint_symbol(buffer, gp->i7);
+				printk(" I7[%s]\n", buffer);
+			}
 		}
 #endif
 	}
--- arch/sparc64/mm/ultra.S.ORIG	2007-11-03 20:53:27.0 -0700
+++ arch/sparc64/mm/ultra.S	2007-11-03 20:57:12.0 -0700
@@ -528,7 +528,7 @@ xcall_fetch_glob_regs:
 	sethi		%hi(global_reg_snapshot), %g1
 	or		%g1, %lo(global_reg_snapshot), %g1
 	__GET_CPUID(%g2)
-	sllx		%g2, 5, %g3
+	sllx		%g2, 6, %g3
 	add		%g1, %g3, %g1
 	rdpr		%tstate, %g7
 	stx		%g7, [%g1 + 0x00]
@@ -536,12 +536,14 @@ xcall_fetch_glob_regs:
 	stx		%g7, [%g1 + 0x08]
 	rdpr		%tnpc, %g7
 	stx		%g7, [%g1 + 0x10]
+	stx		%o7, [%g1 + 0x18]
+	stx		%i7, [%g1 + 0x20]
 	sethi		%hi(trap_block), %g7
 	or		%g7, %lo(trap_block), %g7
 	sllx		%g2, TRAP_BLOCK_SZ_SHIFT, %g2
 	add		%g7, %g2, %g7
 	ldx		[%g7 + TRAP_PER_CPU_THREAD], %g3
-	stx		%g3, [%g1 + 0x18]
+	stx		%g3, [%g1 + 0x28]
 	retry
 
 #ifdef DCACHE_ALIASING_POSSIBLE
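
The shift in ultra.S changes from 5 to 6 because the snapshot structure
grows from four to eight 8-byte words, so each CPU's slot becomes 64
bytes and "thread" moves from offset 0x18 to 0x28. The arithmetic can
be checked quickly; this is a sketch in shell arithmetic, with the
field list copied from the patch and an 8-byte unsigned long and
pointer assumed (as on sparc64). The function name snapshot_layout is
made up for illustration.

```shell
#!/bin/sh
# snapshot_layout: hypothetical helper that recomputes the field offsets and
# slot size of struct global_reg_snapshot after the patch, assuming every
# field is 8 bytes wide (unsigned long / pointer on sparc64).
snapshot_layout() {
    word=8
    off=0
    for field in tstate tpc tnpc o7 i7 thread pad1 pad2; do
        printf '%s 0x%02x\n' "$field" "$off"
        off=$((off + word))
    done
    # The slot size has to equal 1 << 6 for the "sllx %g2, 6, %g3" indexing
    # to land on slot boundaries.
    printf 'size %d stride %d\n' "$off" "$((1 << 6))"
}
```

The printed offsets for o7, i7 and thread (0x18, 0x20, 0x28) line up
with the stx instructions in the assembly hunk.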

Re: unkillable dpkg-query processes

2007-11-02 Thread Bernd Zeimetz
David Miller wrote:
 From: David Miller [EMAIL PROTECTED]
 Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)
 
 I'm working on a kernel patch for 2.6.23 that will allow you to get
 some useful debugging information in situations like this.

 I'll try to get you that patch by the end of tonight.
 
 As promised, here is the patch below.

Thanks for the patch. Applied and used libnss-db + aptitude -u to hang
the machine.

I've sent g several times to sysrq, output is attached.
According to top the two hanging aptitude processes were running on CPU
1 + 3.

 3204 root  20   0 19552 5088 4072 R  100  0.1   6:54.49 1 aptitude
 3203 root  20   0 19552 5088 4072 R  100  0.1   6:56.39 3 aptitude


Cheers,

Bernd

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


sysrq-g.txt
Description: application/pgp-keys


Re: unkillable dpkg-query processes

2007-11-02 Thread Josip Rodin
On Thu, Nov 01, 2007 at 09:55:44PM -0700, David Miller wrote:
  I'm working on a kernel patch for 2.6.23 that will allow you to get
  some useful debugging information in situations like this.
 
  I'll try to get you that patch by the end of tonight.
 
 As promised, here is the patch below.
 echo g > /proc/sysrq-trigger
 
 So when you get a stuck process or whatever, trigger this and
 send the output :-)

Great. Here you go, three of them, while the load was 3 and this process was
stuck:

buildd   10813  100  0.8 987368 17504 ?RN   14:44 155:49 dpkg-query 
--search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 
libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 
libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3

-- 
 2. That which causes joy or happiness.
Nov  2 17:01:52 lebrun kernel: SysRq : Show Global CPU Regs
Nov  2 17:01:52 lebrun kernel:   CPU[  0]: TSTATE[] 
TPC[] TNPC[] TASK[NULL:-1]
Nov  2 17:01:52 lebrun kernel:  
TPC[sparc64_realfault_common+0x8/0x20]
Nov  2 17:01:52 lebrun kernel: * CPU[  1]: TSTATE[] 
TPC[] TNPC[] TASK[sh:12919]
Nov  2 17:02:04 lebrun kernel: SysRq : Show Global CPU Regs
Nov  2 17:02:04 lebrun kernel:   CPU[  0]: TSTATE[80009604] 
TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
Nov  2 17:02:04 lebrun kernel:  
TPC[sparc64_realfault_common+0x8/0x20]
Nov  2 17:02:04 lebrun kernel: * CPU[  1]: TSTATE[] 
TPC[] TNPC[] TASK[sh:12928]
Nov  2 17:17:02 lebrun kernel: SysRq : Show Global CPU Regs
Nov  2 17:17:02 lebrun kernel:   CPU[  0]: TSTATE[] 
TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
Nov  2 17:17:02 lebrun kernel:  
TPC[sparc64_realfault_common+0x8/0x20]
Nov  2 17:17:02 lebrun kernel: * CPU[  1]: TSTATE[] 
TPC[] TNPC[] TASK[sh:16444]


Re: unkillable dpkg-query processes

2007-11-01 Thread Josip Rodin
Hi,

lebrun.d.o hasn't crashed in a while now, but it has this in the
process list:

buildd    2382  0.0  0.2    8144  4736 ?  Ss   Oct30    0:00 /usr/bin/perl /usr/bin/buildd
buildd    2407  0.0  0.5   13920 11296 ?  SN   Oct30    0:10  \_ /usr/bin/perl /usr/bin/sbuild --batch --stats-dir=/home/buildd/
buildd   18174  0.0  0.0       0     0 ?  ZNs  Oct30    0:00  \_ [su] <defunct>
buildd   23305  100  1.6 1007296 33288 ?  RN   Oct30 3507:30 dpkg-query --status squashfs-source

At the same time:

% free
             total       used       free     shared    buffers     cached
Mem:       2073040    2021224      51816          0     196808      21144
-/+ buffers/cache:    1803272     269768
Swap:      1048688        104    1048584
% uptime
 22:38:36 up 2 days, 10:53,  1 user,  load average: 3.00, 3.01, 3.00

Given that it's still not catatonic, can I do something to provide some
debugging information?

(BTW, I'm subscribed to the sparclinux list now.)

-- 
 2. That which causes joy or happiness.


Re: unkillable dpkg-query processes

2007-11-01 Thread Bernd Zeimetz


 The futex() calls are definitely from libnss-db.

And on Lenny/testing we have futex calls from libc6.
I haven't had time to write up any instructions yet, as today is a
public holiday here; I'll try to finish them tomorrow.

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-11-01 Thread David Miller
From: David Miller [EMAIL PROTECTED]
Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)

 I'm working on a kernel patch for 2.6.23 that will allow you to get
 some useful debugging information in situations like this.

 I'll try to get you that patch by the end of tonight.

As promised, here is the patch below.

To trigger the debugging log, simply give the console
an Alt-SysRq, then a g.

On a serial console you can do this by giving a single
BREAK then a g.

If you're having trouble triggering the sysrq on the
console, try instead:

bash# echo g > /proc/sysrq-trigger

Here is some sample output from my Niagara-2 system while
running a benchmark.  The current CPU is denoted by the
leading * character.

[81940.250994] SysRq : Show Global CPU Regs
[81940.251800] * CPU[  0]: TSTATE[e2001602] TPC[0055813c] 
TNPC[00558140] TASK[dd:2940]
[81940.252206]  TPC[NGbzero_loop+0x1c/0x38]
[81940.252422]   CPU[  1]: TSTATE[004411001607] TPC[0055c9bc] 
TNPC[0055c9c0] TASK[dd:2926]
[81940.252739]  TPC[atomic_sub_ret+0x4/0x30]
[81940.252936]   CPU[  2]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2899]
[81940.253238]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.253451]   CPU[  3]: TSTATE[e2001602] TPC[00558130] 
TNPC[00558134] TASK[dd:2929]
[81940.253776]  TPC[NGbzero_loop+0x10/0x38]
[81940.253993]   CPU[  4]: TSTATE[e2001602] TPC[00558124] 
TNPC[00558128] TASK[dd:2947]
[81940.254325]  TPC[NGbzero_loop+0x4/0x38]
[81940.254497]   CPU[  5]: TSTATE[004411001606] TPC[00495f94] 
TNPC[00495f98] TASK[dd:2908]
[81940.254893]  TPC[do_generic_mapping_read+0xbc/0x428]
[81940.255203]   CPU[  6]: TSTATE[11001607] TPC[0055fee8] 
TNPC[0055feec] TASK[dd:2920]
[81940.255699]  TPC[NG2copy_to_user+0x468/0x680]
[81940.256104]   CPU[  7]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2935]
[81940.256574]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.256972]   CPU[  8]: TSTATE[e2001602] TPC[00558124] 
TNPC[00558128] TASK[dd:2903]
[81940.257399]  TPC[NGbzero_loop+0x4/0x38]
[81940.257899]   CPU[  9]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2904]
[81940.258240]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.258482]   CPU[ 10]: TSTATE[e2001602] TPC[00558138] 
TNPC[0055813c] TASK[dd:2902]
[81940.258808]  TPC[NGbzero_loop+0x18/0x38]
[81940.258999]   CPU[ 11]: TSTATE[e2001602] TPC[00558120] 
TNPC[00558124] TASK[dd:2941]
[81940.259319]  TPC[NGbzero_loop+0x0/0x38]
[81940.259487]   CPU[ 12]: TSTATE[e2001602] TPC[00558130] 
TNPC[00558134] TASK[dd:2919]
[81940.259801]  TPC[NGbzero_loop+0x10/0x38]
[81940.260012]   CPU[ 13]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2950]
[81940.260350]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.260564]   CPU[ 14]: TSTATE[e2001602] TPC[00558134] 
TNPC[00558138] TASK[dd:2936]
[81940.260937]  TPC[NGbzero_loop+0x14/0x38]
[81940.261150]   CPU[ 15]: TSTATE[11001607] TPC[0055fee8] 
TNPC[0055feec] TASK[dd:2905]
[81940.261457]  TPC[NG2copy_to_user+0x468/0x680]
[81940.261677]   CPU[ 16]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2923]
[81940.261973]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.262167]   CPU[ 17]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2897]
[81940.262462]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.262643]   CPU[ 18]: TSTATE[e2001602] TPC[00558128] 
TNPC[0055812c] TASK[dd:2909]
[81940.262987]  TPC[NGbzero_loop+0x8/0x38]
[81940.263180]   CPU[ 19]: TSTATE[11001607] TPC[0055fee8] 
TNPC[0055feec] TASK[dd:2913]
[81940.263500]  TPC[NG2copy_to_user+0x468/0x680]
[81940.263901]   CPU[ 20]: TSTATE[e2001602] TPC[00558128] 
TNPC[0055812c] TASK[dd:2890]
[81940.264403]  TPC[NGbzero_loop+0x8/0x38]
[81940.264679]   CPU[ 21]: TSTATE[11001607] TPC[0055fee8] 
TNPC[0055feec] TASK[dd:2906]
[81940.265152]  TPC[NG2copy_to_user+0x468/0x680]
[81940.265535]   CPU[ 22]: TSTATE[11001607] TPC[0055feec] 
TNPC[0055fef0] TASK[dd:2918]
[81940.266075]  TPC[NG2copy_to_user+0x46c/0x680]
[81940.266448]   CPU[ 23]: TSTATE[11001607] TPC[0055fee8] 
TNPC[0055feec] TASK[dd:2900]
[81940.266942]  TPC[NG2copy_to_user+0x468/0x680]
[81940.267328]   CPU[ 24]: TSTATE[11001602] TPC[0049a618] 
TNPC[0049a61c] TASK[dd:2938]
[81940.267710]  

Re: unkillable dpkg-query processes

2007-10-29 Thread David Miller
From: Josip Rodin [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 00:37:13 +0100

 I'd try doing a debootstrap of lenny (that's Debian testing),
 and then inside it, run one or more of those 'dpkg-query -S libc.so.6'.

Thanks for the info.

While waiting for you to reply I created a lenny buildd
build root on my SunFire 280R using:

debootstrap --variant=buildd lenny /org/buildd/chroots/lenny \
http://mirrors.kernel.org/debian

roughly following the instructions at:

http://www.debian.org/devel/buildd/setting-up

And then, once chroot'ed into the lenny build root, you have
to set up a few things manually, like the /proc, /sys, and /dev/pts
mounts, for anything to work:

chroot /org/buildd/chroots/lenny
mount -t proc none /proc
mount -t sysfs none /sys
mount -t devpts none /dev/pts

So, it's a lot more than just running the appropriate debootstrap
command.
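
The manual mounts can be collected into one function so they are not
forgotten when a chroot is recreated. A minimal sketch: chroot_mounts
is an invented name, it mounts from outside the chroot rather than
after chroot'ing in, and by default it only prints the commands instead
of running them.

```shell
#!/bin/sh
# chroot_mounts ROOT
# Hypothetical helper: mounts /proc, /sys and /dev/pts inside the chroot at
# ROOT, from the host side. While RUN is unset the commands are only printed
# (dry run); call it as "RUN= chroot_mounts ROOT" as root to execute them.
chroot_mounts() {
    root=$1
    ${RUN-echo} mount -t proc none "$root/proc"
    ${RUN-echo} mount -t sysfs none "$root/sys"
    ${RUN-echo} mount -t devpts none "$root/dev/pts"
}
```

A dry run such as "chroot_mounts /org/buildd/chroots/lenny" prints the
three mount commands with the chroot path prefixed, which makes it easy
to review them before running anything as root.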

I have done a GCC package build and am now running a libc6 build under
this lenny chroot and haven't hit any problems yet.

This is with a stock 2.6.23.1 kernel.

BTW, in your buildroot, can you do something like:

strace -o x.log dpkg-query -S libc.so.6

and send me that x.log file?

That might give some important clues.

Thanks.


Re: unkillable dpkg-query processes

2007-10-29 Thread Josip Rodin
Hi,

(Sorry for breaking the threading - I didn't subscribe to the list,
I just found this in the web archive. I should probably subscribe... :)

David Miller wrote:
 Ok, since I have a 280R just like Josip, I think a good plan
 is for him to show me the commands he used to create the
 build root where he can trigger bad things.

I can't be 100% sure, because it was James Troup who initially set it up,
but I believe that the chroot on lebrun.d.o was set up by just doing
something mundane like running debootstrap, more specifically something
like this:

sudo debootstrap lenny /mnt http://ftp.us.debian.org/debian

I conclude this because it has a var/log/bootstrap.log in it,
dated 2007-06-19 12:15, which has:

Selecting previously deselected package base-files.
(Reading database ... 0 files and directories currently installed.)
Unpacking base-files (from .../base-files_4.0.0_sparc.deb) ...
[...]
Setting up build-essential (11.3) ...

And it also has a var/log/dpkg.log which has:

2007-06-19 12:13:10 install base-files none 4.0.0
[...]
2007-06-19 12:15:23 status installed build-essential 11.3

Again I can't be 100% sure of the exact command line used, but that
really should be it :)

After that, dpkg.log in the chroot also has a purge of the 'procps' package,
and an installation of the 'sparc-utils' package. A few hours after those
two, a random selection of package installations starts - the buildd went
online.

I'd try doing a debootstrap of lenny (that's Debian testing),
and then inside it, run one or more of those 'dpkg-query -S libc.so.6'.

-- 
 2. That which causes joy or happiness.


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

   mount -t devpts none /dev/pts

mount --bind /dev /thechroot/dev
is what I use here, running udev in a chroot is no fun.

 So, it's a lot more than just running the appropriate debootstrap
 command.

I'm almost done with a howto which is 95% cut&paste to debootstrap
and boot a Debian system; unfortunately it doesn't boot, as klibc
(which is used in the initramfs) is broken on sparc again...
So I'll modify it to set up a proper chroot only; it should also allow
booting into it if you use the kernel/initrd from Ubuntu.
This should allow Josip and you to set up a complete chroot.

 I have done a GCC package build and am now running a libc6 build under
 this lenny chroot and haven't hit any problems yet.

The following things also like to crash here (on Etch, not in a chroot):
- running aptitude -u several times (at least with libnss-db installed)
- since I've installed 2.6.24-rc1: vgdisplay (with and without active
libnss-db)


 BTW, in your buildroot, can you do something like:
 
   strace -o x.log dpkg-query -S libc.so.6

There are some comparisons of straces of aptitude -u in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
They are probably interesting, as there are futexes involved.

The interesting thing is that it didn't crash the machine while running
under strace.

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

 Here you go.
 
 (Mind, this is capturing the current status of the chroot, which is fairly
 unclean, because right now it happens to be building python-qt4-4.3.1.)

What we're missing here is a piece that is probably important:

If dpkg-query is running during a build, it is running in a fakeroot
environment. I've straced that, see the attachment.

What I find in the strace are, at least, several clone() calls, which is
the point where aptitude -u crashed according to the straces in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/
execve(/usr/bin/fakeroot, [fakeroot, dpkg-query, -S, libc.so.6], [/* 
12 vars */]) = 0
brk(0)  = 0xca000
uname({sys=Linux, node=titan, ...}) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fba000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=12402, ...}) = 0
mmap(NULL, 12402, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7fb4000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libncurses.so.5", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\263"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=208688, ...}) = 0
mmap(NULL, 208480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f8
mmap(0xf7fb, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0xf7fb
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libdl.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\f"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=18216, ...}) = 0
mmap(NULL, 82432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f68000
mprotect(0xf7f6c000, 57344, PROT_NONE) = 0
mmap(0xf7f7a000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0xf7f7a000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\1\364"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1419756, ...}) = 0
mmap(NULL, 1489032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7dfc000
mprotect(0xf7f5, 65536, PROT_NONE) = 0
mmap(0xf7f6, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0xf7f6
mmap(0xf7f66000, 6280, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f66000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fde000
mprotect(0xf7f7a000, 8192, PROT_READ) = 0
munmap(0xf7fb4000, 12402) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open("/dev/tty", O_RDWR|O_NONBLOCK|O_LARGEFILE) = 3
close(3) = 0
brk(0) = 0xca000
brk(0xec000) = 0xec000
getuid32() = 1000
getgid32() = 1000
geteuid32() = 1000
getegid32() = 1000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
time(NULL) = 1193705202
open("/proc/meminfo", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fb8000
read(3, "MemTotal:  8314712 kB\nMemFre"..., 1024) = 624
close(3) = 0
munmap(0xf7fb8000, 8192) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
uname({sys="Linux", node="titan", ...}) = 0
stat64("/home/foo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getpid() = 18537
getppid() = 18536
getpgrp() = 18536
rt_sigaction(SIGCHLD, {0x42460, [], 0}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open("/usr/bin/fakeroot", 

Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

 mount -t devpts none /dev/pts
 mount --bind /dev /thechroot/dev
 is what I use here, running udev in a chroot is no fun.
 
 Ok.

AFAIK the buildds only have a minimal /dev, though, and that's usually
not enough to bootstrap a system.

 Let's stick to 2.6.23 testing for pinpointing these bugs.

Ok. Do you have a .deb with a kernel for me? If not - would you like to
have any specific options enabled - I have to build one then.


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-29 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 01:50:30 +0100

 What we're missing here is a probably important piece:
 
 If dpkg-query is running during a build, it is running in a fakeroot
 environment. I've straced that, see the attachment.
 
 What I find in the strace are at least several clones, which is the
 point where aptitude -u crashed according to the straces in
 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102

Thanks for the fakeroot trace.

I am pretty sure the clone()'s we see here are just normal
fork()'s, in both the fakeroot's dpkg-query and the aptitude
case.

I'll go study up on fakeroot's implementation to look for
potential clues.



Re: unkillable dpkg-query processes

2007-10-29 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 01:47:33 +0100

  mount -t devpts none /dev/pts
 
 mount --bind /dev /thechroot/dev
 is what I use here, running udev in a chroot is no fun.

Ok.

 I'm almost done with a howto which is cut&paste for 95% to debootstrap
 and boot a debian system, unfortunately it doesn't boot as the klibc
 (which is used in the initramfs) is broken on sparc again...
 So I'll modify it to set up a proper chroot only, it should also allow to
 boot into it if you use the Kernel/initrd from Ubuntu.
 This should allow Josip and you to set up a complete chroot.

Thanks.
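
For readers following along, the chroot setup discussed above can be sketched roughly as below. This is only an illustration, not the howto mentioned in the quoted text: the suite (lenny), the target directory (/thechroot) and the mirror URL are assumptions, and `run` is a hypothetical helper that merely prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: bootstrap a Debian chroot and give it a usable /dev.
# SUITE, TARGET and MIRROR are assumed values; adjust them to the real setup.
set -eu

SUITE="${SUITE:-lenny}"
TARGET="${TARGET:-/thechroot}"
MIRROR="${MIRROR:-http://ftp.debian.org/debian}"

# Hypothetical helper: print the command in dry-run mode (the default),
# execute it only when DRY_RUN=0 is set (this needs root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run debootstrap --arch sparc "$SUITE" "$TARGET" "$MIRROR"
# Bind the host /dev instead of running udev inside the chroot.
run mount --bind /dev "$TARGET/dev"
run mount -t devpts none "$TARGET/dev/pts"
run mount -t proc proc "$TARGET/proc"
run chroot "$TARGET" /bin/bash
```

Run as-is it only lists the commands; with DRY_RUN=0 and root privileges it would perform them.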

  I have done a GCC package build and am now running a libc6 build under
  this lenny chroot and haven't hit any problems yet.
 
 The following things also like to crash here (on Etch, not in a chroot):
 - running aptitude -u several times (at least with libnss-db installed)
 - since I've installed 2.6.24-rc1: vgdisplay (with and without active
 libnss-db)

There are several issues with 2.6.24, stay away from it for now.
I will fix things there.

Let's stick to 2.6.23 testing for pinpointing these bugs.

 there're some comparisons of the strace of aptitude -u in
 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
 Probably interesting as there're futexes in the game.

Of course there are, as soon as you start using libnss-db there
will be futexes.

Can you reproduce the aptitude problems under 2.6.23 with libnss-db
disabled?

 The interesting thing is that it didn't crash the machine while running
 under strace.

If the futex problem I suspect is in fact the issue, strace'ing
would definitely make that problem go away.


Re: unkillable dpkg-query processes

2007-10-29 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Tue, 30 Oct 2007 02:54:14 +0100

 Ok. Do you have a .deb with a kernel for me? If not - would you like to
 have any specific options enabled - I have to build one then.

I usually just cp arch/sparc64/defconfig ./.config in a fresh
vanilla kernel tree and tweak from there.

For my 280R I enabled SMP, accepted the NR_CPUS default value (64),
set SERIAL_SUNSAB to y and enabled console support, and then enabled
the qlogic fibrechannel driver and the SUNGEM driver as modules.

Oh yes, I also enabled INITRD support so I can use initramfs to get
the firmware loaded properly in the qlogic FC card.

Really, I don't use anything fancy, just enough to get the machine
functional.
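
A rough sketch of the configuration steps described above, for anyone who wants to script them. The CONFIG_ symbol names are my reading of the options named in the message (assumed to match the mainline Kconfig names, so verify them in menuconfig), KSRC is a hypothetical variable pointing at a vanilla kernel tree, and a native sparc64 build is assumed.

```shell
#!/bin/sh
# Sketch only: start from the sparc64 defconfig and append the options
# mentioned in the message. Symbol names are assumptions to be verified.
set -eu

KSRC="${KSRC:-}"            # hypothetical: path to a vanilla kernel tree

fragment=$(mktemp)
cat > "$fragment" <<'EOF'
CONFIG_SMP=y
CONFIG_NR_CPUS=64
CONFIG_SERIAL_SUNSAB=y
CONFIG_SERIAL_SUNSAB_CONSOLE=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_SCSI_QLA_FC=m
CONFIG_SUNGEM=m
EOF

if [ -n "$KSRC" ] && [ -f "$KSRC/arch/sparc64/defconfig" ]; then
    cp "$KSRC/arch/sparc64/defconfig" "$KSRC/.config"
    # Later assignments override earlier ones when the config is re-read.
    cat "$fragment" >> "$KSRC/.config"
    (cd "$KSRC" && make oldconfig)
else
    echo "KSRC not set; config fragment written to $fragment"
fi
```

Without KSRC the script only writes the fragment to a temporary file, which can then be reviewed by hand before touching a real tree.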


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz

 I think things got worse with 2.6.24...
 The machine shoots itself now, I guess by running cron jobs or so.

 [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
 0042f928 Y: Not tainted
 [29074.884191] TPC: sched_clock+0x0/0x30
 
 What kind of OOPS is this?  Please provide the kernel log messages
 that appeared right before these register dumps.

I'll boot the machine and check the logs, was not in the mood to do
this tonight. The pasted messages were dumped on the serial console -
as the machine didn't show any reaction I only powered it down...


Cheers,

Bernd

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz

 [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
 0042f928 Y: Not tainted
 [29074.884191] TPC: sched_clock+0x0/0x30
 
 What kind of OOPS is this?  Please provide the kernel log messages
 that appeared right before these register dumps.


Oct 28 03:25:12 titan kernel: [29074.698695] BUG: soft lockup - CPU#0 stuck for 11s! [sh:4252]

This happened while a cronjob was running which updates the libnss-db
database... With an older kernel (2.6.23-rcsomething) this didn't crash
the machine.


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-28 Thread Sébastien Bernard

Bernd Zeimetz a écrit :

Hi,

please note that the futex bug also happens on US II machines,
it is just almost impossible to reproduce it - it'll just hang
after random days of building.


Everyone who sees these UltraSPARC-III problems please send me PRECISE
and FULL description of how to install from scratch a machine and run
something that will trigger these errors.


Can you please check if the Kernel config I've attached to one of my
last mails is fine for you? The normal Debian installer doesn't
boot on the US III machines which use two CPUs in one board as the
installer's Kernel is a non-SMP Kernel, and the result is that the
machine throws a CPU exception and needs to be power-cycled

I've started to investigate there with the help of a contact from
Sun, but we both didn't have the time to finish this.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440720 if you want
to have a look, please ignore those troll postings from chealer
in between...

So to give you a recipe to install Debian on such a box, I need to
build an installer with a SMP Kernel for you. If the config is fine
for your needs, I could just use it.


The other option is to use debootstrap, if you have some system
on the machine already - so if you want to use that instead of
messing with a network installer, please let me know.
Debootstrap should run on most systems, as long as they have
ar/tar/gunzip and a bash (probably sh is enough...).
Would be faster to use that, and faster to write a recipe for
that.

I'll mark all Qlogic firmware related points, so the recipe should
work on machines with (v440, v880, probably the Enterprise models,
too) and without FC (I guess the Blade 1000 and 2000).


If you don't have access to an US-III machine, I can find a way
to give you access to the RSC and serial console of our machine.


Cheers,

Bernd


Well, I got bitten twice by this bug.

The first was on a U60 running Debian unstable.
Since the mono team decided that mono is broken on Sparc (despite
the fix provided by David Miller), I had to rebuild it after enabling the
sparc arch in the source.

The hang always happens at the end of the build, when dh_shgenlibs is
invoked.

This is not 100% reproducible, even in my environment.

The second was a Sun Blade 2000 (SMP) running Ubuntu gutsy, where I wasn't
able to update the xemacs21 package.
The machine hung while invoking the post installation script.

This is not really reproducible now that I have upgraded the packages.

The mono build is, in my humble opinion, the most promising track for
catching the bug.

Seb


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
Hi,

 Since mono team decided that the mono is broken on Sparc (and despite
 the fix provided by David Miller), I had to rebuild after enabling the
 sparc
 arch in the source.
 
 The hangs happens always at the end of the buid when invoking
 dh_shgenlibs in the build.
 
 This is not 100% reproducable even in my env.

Trying this at the moment.

 Second was sun blade 2000 SMP with Ubuntu gutsy, I wasn't able to update
 the xemacs21 package.
 The machine hanged with invoking the post installation script.

Does the Blade run with one or two CPUs? If I remember right, it can
also run with a single CPU, which has to be inserted in a special
slot/carrier for that. With two CPUs it should use the same repeater
chips and architecture as the v440, v880 and larger machines.


Cheers,

Bernd

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
Bernd Zeimetz wrote:
 Hi,
 
 Since mono team decided that the mono is broken on Sparc (and despite
 the fix provided by David Miller), I had to rebuild after enabling the
 sparc
 arch in the source.

 Trying this at the moment.

not reproducible - mono fails to build from source in sid... so it
doesn't reach the interesting part of dh_shlibdeps...


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-28 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 02:18:30 +0100

 But if this bug isn't fixed chances are good that the next Debian
 release won't support Sparc at all.

Please don't use pseudo-threats like this, it only deters me even more
from working on this bug.

 This explains why you have trouble to reproduce this, while Josip and me
 get hit by this bug way too often.

Josip stated explicitly that he has a SunFire280R, which disagrees
with what you're saying here.


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
David Miller wrote:
 From: Bernd Zeimetz [EMAIL PROTECTED]
 Date: Mon, 29 Oct 2007 02:18:30 +0100
 
 But if this bug isn't fixed chances are good that the next Debian
 release won't support Sparc at all.
 
 Please don't use pseudo-threats like this, it only deters me even more
 from working on this bug.

This was not meant as a threat, it's just a fact and the reason why I'm
spending way too much time on trying to make this bug reproducible and
also the reason why we're annoying you these days. Sorry for that.


 This explains why you have trouble to reproduce this, while Josip and me
 get hit by this bug way too often.
 
 Josip stated explicitly that he has a SunFire280R, which disagrees
 with what you're saying here.

Sorry, I mixed something up here. I was somehow sure that they were
using a v440, but it was somebody else.



-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-28 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Mon, 29 Oct 2007 03:06:13 +0100

 David Miller wrote:
  Josip stated explicitly that he has a SunFire280R, which disagrees
  with what you're saying here.
 
 Sorry, I mixed something up here. I was somehow sure that they were
 using a v440, but it was somebody else.

Ok, since I have a 280R just like Josip, I think a good plan
is for him to show me the commands he used to create the
build root where he can trigger bad things.

I think we can move forward much better starting with this.


Re: unkillable dpkg-query processes

2007-10-27 Thread Bernd Zeimetz
Bernd Zeimetz wrote:
 For those who can reproduce it an have something like libnss-db
 enabled, try disabling it.
 
 - disabled it
 - running vgdisplay killed the machine (wanted to create a new LV for a
 chroot)... it's not accessible at all anymore, I think the kernel is
 a 2.6.23-something here, I'll build a recent one and give it a try
 again Will take some time as I need to build on USII...


I just wanted to write that I'm not able to reproduce this bug
anymore... but running aptitude -u often enough gave me this nice output:


titan:~# [ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! [aptitude:13375]
[ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 
0042f7d0 Y: Not tainted
[ 2427.506821] TPC: __delay+0x1c/0x48
[ 2427.549494] g0: 9000 g1: 0042f7d0 g2:  
g3: 
[ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 
g7: 
[ 2427.757835] o0: 0020 o1: 0020 o2:  
o3: 
[ 2427.862001] o4: 0030a0d0 o5:  sp: f8a007dff071 
ret_pc: 0042f938
[ 2427.970337] RPC: __delay+0x18/0x48
[ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 
l3: 0400
[ 2428.117206] l4:  l5: 0001 l6:  
l7: 0008
[ 2428.221374] i0:  i1: f8a007dffa88 i2: 0004 
i3: 0001
[ 2428.325538] i4:  i5:  i6: f8a007dff131 
i7: 004417ec
[ 2428.429710] I7: cheetah_xcall_deliver+0x1c0/0x23c

and an unkillable, cpu-eating aptitude.


While retrieving some info using sysrq the machine froze after
echoing m into sysrq-trigger, producing this output while dying:

[ 3680.006794] BUG: soft lockup - CPU#1 stuck for 11s! [pdflush:265]
[ 3680.078838] TSTATE: 80009603 TPC: 004417a8 TNPC: 
004417ac Y: Not tainted
[ 3680.196551] TPC: cheetah_xcall_deliver+0x17c/0x23c
[ 3680.255881] g0:  g1:  g2: 0001869e 
g3: 
[ 3680.360055] g4: f8a0048e3260 g5: f89fff984000 g6: f8a00717c000 
g7: 
[ 3680.464220] o0: 0020 o1: f8a00717f418 o2: f8a005a84040 
o3: 0010
[ 3680.568384] o4: 0015 o5:  sp: f8a00717eac1 
ret_pc: 004416e4
[ 3680.676719] RPC: cheetah_xcall_deliver+0xb8/0x23c
[ 3680.735042] l0: 0002 l1: 0002 l2: 0096 
l3: 
[ 3680.839217] l4:  l5: f8a0048d3cd8 l6: 00024098 
l7: f7d31000
[ 3680.943382] i0: 0044d100 i1: 00b0f60f8000 i2:  
i3: 0001
[ 3681.047548] i4: 0001 i5: 0001 i6: f8a00717eb81 
i7: 00442be4
[ 3681.151717] I7: smp_flush_dcache_page_impl+0x21c/0x228



Luckily much more output of sysrq is in the syslog, so I should be able to mail
it later when the machine has finished rebooting (which takes some time...).
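
For reference, writing a key into sysrq-trigger as described above can be wrapped in a small helper. This is a minimal sketch, assuming a kernel built with magic SysRq support; SYSRQ_TRIGGER is a hypothetical override so the helper can be tried against a scratch file, and the real interface is /proc/sysrq-trigger (root only). As this thread shows, triggering a dump on an already wedged machine may make matters worse.

```shell
#!/bin/sh
# Sketch only: send a single SysRq key to the kernel.
# SYSRQ_TRIGGER is a hypothetical override used for dry testing.
sysrq() {
    key=$1
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    case "$key" in
        ?) ;;   # exactly one character is accepted
        *) echo "sysrq: expected a single key, got '$key'" >&2
           return 2 ;;
    esac
    printf '%s\n' "$key" > "$trigger"
}

# Example use (as root); the output lands in the kernel log:
#   sysrq m    # memory usage summary
#   sysrq t    # task list with stack traces
#   sysrq p    # registers of the current CPU
#   dmesg | tail -n 200
```

The block only defines the function, so sourcing it has no side effects.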


 2.6.24-rc1-git2 (SMP)
 gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)


titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom: OBP 4.22.34 2007/07/23 13:01
type: sun4u
ncpus probed: 4
ncpus active: 4
D$ parity tl1   : 0
I$ parity tl1   : 0
Cpu0ClkTck  : 2cb41780
Cpu1ClkTck  : 2cb41780
Cpu2ClkTck  : 2cb41780
Cpu3ClkTck  : 2cb41780
MMU Type: Cheetah
State:
CPU0:   online
CPU1:   online
CPU2:   online
CPU3:   online



-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/



Re: unkillable dpkg-query processes

2007-10-27 Thread Bernd Zeimetz



I think things got worse with 2.6.24...
The machine shoots itself now, I guess by running cron jobs or so.

[29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
0042f928 Y: Not tainted
[29074.884191] TPC: sched_clock+0x0/0x30
[29074.929988] g0:  g1: 004417ec g2:  
g3: 
[29075.034163] g4: f8a00493a4e0 g5: f89fff97c000 g6: f8a006c64000 
g7: 
[29075.138329] o0:  o1: f8a006c67968 o2: 0008 
o3: 0001
[29075.242493] o4: 3385 o5:  sp: f8a006c67011 
ret_pc: 0042f980
[29075.350830] RPC: udelay+0x18/0x1c
[29075.392482] l0: 0020 l1:  l2: 0096 
l3: 
[29075.496658] l4: 0200 l5: 0001c5569e6c l6: 0006c390404c 
l7: 6204052f31ec823e
[29075.600824] i0: 0044d100 i1: 00b0fcc2c000 i2:  
i3: 
[29075.704989] i4: 0040 i5: 007a0578 i6: f8a006c670d1 
i7: 004420d8
[29075.809161] I7: flush_dcache_page_all+0x16c/0x1c0
[29075.867493] BUG: soft lockup - CPU#2 stuck for 11s! [sh:4253]
[29075.936259] TSTATE: 11009600 TPC: 004417a8 TNPC: 
004417ac Y: Not tainted
[29076.053980] TPC: cheetah_xcall_deliver+0x17c/0x23c
[29076.113311] g0:  g1:  g2:  
g3: 
[29076.217483] g4: f8a0048f9260 g5: f89fff98c000 g6: f8a006c7 
g7: 
[29076.321648] o0: 0020 o1: f8a006c73968 o2: 0002 
o3: 0001
[29076.425816] o4: 781b o5:  sp: f8a006c73011 
ret_pc: 004416a0
[29076.534150] RPC: cheetah_xcall_deliver+0x74/0x23c
[29076.592471] l0: 0008 l1:  l2: 0096 
l3: 
[29076.696645] l4: 0200 l5: 0001c5569e6c l6: 0006c3904054 
l7: 7e645445948ed154
[29076.800811] i0: 0044d100 i1: 00b0fcf8 i2:  
i3: 
[29076.904977] i4: 0040 i5: 007a0578 i6: f8a006c730d1 
i7: 004420d8
[29077.009144] I7: flush_dcache_page_all+0x16c/0x1c0

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-27 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 28 Oct 2007 04:03:44 +0100

 
 
 
 I think things got worse with 2.6.24...
 The machine shoots itself now, I guess by running cron jobs or so.
 
 [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
 0042f928 Y: Not tainted
 [29074.884191] TPC: sched_clock+0x0/0x30

What kind of OOPS is this?  Please provide the kernel log messages
that appeared right before these register dumps.

Thanks.


Re: unkillable dpkg-query processes

2007-10-27 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sat, 27 Oct 2007 20:09:47 +0200

 titan:~# [ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! 
 [aptitude:13375]
 [ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 
 0042f7d0 Y: Not tainted
 [ 2427.506821] TPC: __delay+0x1c/0x48
 [ 2427.549494] g0: 9000 g1: 0042f7d0 g2:  
 g3: 
 [ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 
 g7: 
 [ 2427.757835] o0: 0020 o1: 0020 o2:  
 o3: 
 [ 2427.862001] o4: 0030a0d0 o5:  sp: f8a007dff071 
 ret_pc: 0042f938
 [ 2427.970337] RPC: __delay+0x18/0x48
 [ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 
 l3: 0400
 [ 2428.117206] l4:  l5: 0001 l6:  
 l7: 0008
 [ 2428.221374] i0:  i1: f8a007dffa88 i2: 0004 
 i3: 0001
 [ 2428.325538] i4:  i5:  i6: f8a007dff131 
 i7: 004417ec
 [ 2428.429710] I7: cheetah_xcall_deliver+0x1c0/0x23c
 
 and an unkillable, cpu-eating aptitude.

One cpu can't send a message successfully to another cpu, likely
because it is stuck somewhere with interrupts off.

I was going to give you a patch like the one at the end of this email
to try and get a register dump from all cpus with Alt-Sysrq-p but that
is guaranteed not to work.  It will just call back into
cheetah_xcall_deliver() and wedge further.  Again, don't use the
patch, trying to get a register dump with it in this state will just
wedge the machine further.

I don't know how to suggest a way to debug this further, sorry.

I'm sick of these bugs and I need to reproduce all of these
UltraSPARC-III issues locally to fix them.  So let's go.

Everyone who sees these UltraSPARC-III problems please send me PRECISE
and FULL description of how to install from scratch a machine and run
something that will trigger these errors.

DO NOT leave out any detail of your installation.  Any minor omission
will mean that I potentially won't be able to reproduce this bug and
therefore I won't be able to fix it either.

If you are using NIS, say so and give the exact configuration.  If you
have any modifications to some core configuration file like
/etc/nsswitch.conf, tell me.  If you are using static IP addresses,
tell me.  If you have netfilter enabled, tell me.  If you have even
installed some extra package, like libnss-db or anything else, tell me
even if you think it's not in use.

In short I want a flawless cook-book style recipe for installing a
machine that I can reproduce this problem on.  Do not omit any detail.

Thanks!

diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index ca7cdfd..e10fdce 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -348,7 +348,7 @@ void show_regs(struct pt_regs *regs)
extern long etrap, etraptl1;
 #endif
__show_regs(regs);
-#if 0
+#if 1
 #ifdef CONFIG_SMP
{
extern void smp_report_regs(void);


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Hi,


 It seems that instead of getting stuck in the kernel where I
 thought it would, the process gets stuck elsewhere and
 also tends to loop allocating memory until all memory in the
 machine is exhausted and the OOM killer starts to try and
 kill processes left and right.

At least it runs with 100% CPU; attaching strace to the pid doesn't give
any results.
Strace-ing the whole process doesn't result in more useful output, but
the hanging processes were killable when they were running under strace...


Cheers,

Bernd


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Hi,

just got linked to this thread, so here's a bit of input from me :)


 1) system type
 
 A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III
 (Cheetah), and 2 GB of RAM. If you need more info, just say.
 
 (Bernd Zeimetz has previously suggested that the problem is linked to
 the processor type, the USIII.)

It seems to hit USIII machines with 2 CPUs in one tray much harder
than US II, but once a month our Ultra60 (running two US II) has the
same issues. It got much better since
179c85ea53bef807621f335767e41e23f86f01df, though; before the mentioned
patch it died a few times per day. It seems to have got better on the USIII
here, too (we have a v880, the large version of Josip's machine,
with 2x 2 CPUs), but it still dies way too often and is just not usable in
its current state.


 
 2) compiler used to build kernel and is it SMP?
 
 gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

same compiler here.
Please note that non-SMP kernels do not boot on those US-III machines at
all (at least I didn't find a single one which does).



Cheers,

Bernd


Re: unkillable dpkg-query processes

2007-10-26 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Fri, 26 Oct 2007 14:30:21 +0200

 at least it runs with 100% CPU, attaching strace to the pid doesn't give
 any results
 strace-ing the whole process doesn't result in more useful output, but
 the hanging processes were killable when they were running under strace...

When it runs with 100% CPU that's what makes me suspect it's
spinning in the kernel futex code somewhere or similar.

One thing I notice in the debian bug report is a mention of libnss-db

So I did some testing here and without libnss-db installed, running
dpkg-query does not use futexes at all.

But once I install libnss-db and enable it (by running 'make' under
/var/lib/misc then editing /etc/nsswitch.conf to make 'db' get
searched first) indeed dpkg-query starts using futexes via the
libnss-db library.

Josip, do you guys have libnss-db or similar in use on the buildd
machine?

For those who can reproduce it and have something like libnss-db
enabled, try disabling it.
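
The "make 'db' get searched first" step mentioned above might look like the sketch below. Which databases get the change (passwd, group and shadow here) is an assumption, as are the sample file contents; NSS_CONF is a hypothetical variable that defaults to a scratch file so that nothing under /etc is touched. Disabling libnss-db again would mean removing the "db" entries.

```shell
#!/bin/sh
# Sketch only: put "db" in front of the sources for selected NSS databases.
# By default this edits a scratch copy, not /etc/nsswitch.conf.
set -eu

NSS_CONF="${NSS_CONF:-$(mktemp)}"

# Seed the scratch file with a made-up example if it is empty.
if [ ! -s "$NSS_CONF" ]; then
    printf '%s\n' \
        'passwd:         compat' \
        'group:          compat' \
        'shadow:         compat' \
        'hosts:          files dns' > "$NSS_CONF"
fi

# (On the real system the db files are built first, per the message,
#  by running "make" under /var/lib/misc.)

# Skip lines that already list db first, so a second run changes nothing.
sed -E '/^(passwd|group|shadow):[[:space:]]+db([[:space:]]|$)/!s/^(passwd|group|shadow):([[:space:]]+)/\1:\2db /' \
    "$NSS_CONF" > "$NSS_CONF.new"
mv "$NSS_CONF.new" "$NSS_CONF"

cat "$NSS_CONF"
```

Pointing NSS_CONF at a copy of the real file shows the resulting lines before anything is changed on the system.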


Re: unkillable dpkg-query processes

2007-10-26 Thread Josip Rodin
On Fri, Oct 26, 2007 at 03:01:24PM -0700, David Miller wrote:
 One thing I notice in the debian bug report is a mention of libnss-db
 
 So I did some testing here and without libnss-db installed, running
 dpkg-query does not use futexes at all.
 
 But once I install libnss-db and enable it (by running 'make' under
 /var/lib/misc then editing /etc/nsswitch.conf to make 'db' get
 searched first) indeed dpkg-query starts using futexes via the
 libnss-db library.
 
 Josip, do you guys have libnss-db or similar in use on the buildd
 machine?

lebrun.d.o doesn't have libnss-db installed, neither outside nor inside
the chroot, sorry.

Both setups have the default /etc/nsswitch.conf that searches 'db' before
'files' for protocols, services, ethers, rpc, but that's it.

BTW, would you benefit from having an account on this machine?

-- 
 2. That which causes joy or happiness.


Re: unkillable dpkg-query processes

2007-10-26 Thread Josip Rodin
On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
  Josip, do you guys have libnss-db or similar in use on the buildd
  machine?
 
 They have, that's what Debian's userdir-ldap uses.

No, I have to correct you, this machine isn't part of that setup
(at least not yet).

-- 
 2. That which causes joy or happiness.


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Josip Rodin wrote:
 On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
 Josip, do you guys have libnss-db or similar in use on the buildd
 machine?
 They have, that's what Debian's userdir-ldap uses.
 
 No, I have to correct you, this machine isn't part of that setup
 (at least not yet).
 

Oh ok, I stand corrected - thought it would have it.

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz

 For those who can reproduce it an have something like libnss-db
 enabled, try disabling it.

- disabled it
- running vgdisplay killed the machine (wanted to create a new LV for a
chroot)... it's not accessible at all anymore. I think the kernel is
a 2.6.23-something here; I'll build a recent one and give it a try
again. Will take some time as I need to build on USII...


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz

 Josip, do you guys have libnss-db or similar in use on the buildd
 machine?

They have, that's what Debian's userdir-ldap uses.

 For those who can reproduce it an have something like libnss-db
 enabled, try disabling it.

Will do in a few minutes.



-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-10-25 Thread David Miller

Josip, give this debugging patch a try.  It is against 2.6.23.1
but it should apply to most recent kernels.

It should give you debugging messages in the kernel log that
start with FUTEX_BUG if the debugging code triggers.

Please post just a few samples of whatever it spits out.

Thanks!

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc94e7..6da8b3c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1874,6 +1874,25 @@ err_unlock:
return ret;
 }
 
+static void log_futex_bug(u32 __user *uaddr, struct task_struct *curr, int pi)
+{
+   struct mm_struct *mm = curr-mm;
+   struct vm_area_struct *vma;
+   unsigned long addr;
+
+   printk(KERN_ERR FUTEX_BUG: Looping too much in futex death\n);
+   printk(KERN_ERR FUTEX_BUG: uaddr[%p] task[%s:%d] pi(%d)\n,
+  uaddr, curr-comm, curr-pid, pi);
+
+   addr = (unsigned long) uaddr;
+   vma = find_vma(mm, addr);
+   if (vma)
+   printk(KERN_ERR FUTEX_BUG: VMA start[%lx] end[%lx] 
flags[%lx]\n,
+  vma-vm_start,
+  vma-vm_end,
+  vma-vm_flags);
+}
+
 /*
  * Process a futex-list entry, check whether it's owned by the
  * dying task, and do notification if so:
@@ -1881,6 +1900,7 @@ err_unlock:
 int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
 {
 	u32 uval, nval, mval;
+	int limit = 0;
 
 retry:
 	if (get_user(uval, uaddr))
@@ -1903,8 +1923,12 @@ retry:
 		if (nval == -EFAULT)
 			return -1;
 
-		if (nval != uval)
-			goto retry;
+		if (nval != uval) {
+			if (++limit < 100)
+				goto retry;
+			log_futex_bug(uaddr, curr, pi);
+			put_user(mval, uaddr);
+		}
 
/*
 * Wake robust non-PI futexes here. The wakeup of
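
The retry cap in the last hunk can be tried outside the kernel. What
follows is a userspace sketch only, not the kernel code:
mark_owner_died, working_cas and stuck_cas are made-up names, C11
atomics stand in for the kernel's compare-and-exchange on the user word,
and the three mask constants copy the values of FUTEX_WAITERS,
FUTEX_OWNER_DIED and FUTEX_TID_MASK. The stuck_cas stub is a
hypothetical fault model (a swap that never reports success), which is
the kind of endless loop the debugging patch is meant to catch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAITERS    0x80000000u  /* same value as FUTEX_WAITERS */
#define OWNER_DIED 0x40000000u  /* same value as FUTEX_OWNER_DIED */
#define TID_MASK   0x3fffffffu  /* same value as FUTEX_TID_MASK */

/* Compare-and-swap hook, so a misbehaving primitive can be simulated. */
typedef bool (*cas_fn)(_Atomic uint32_t *word, uint32_t expected,
                       uint32_t desired);

bool working_cas(_Atomic uint32_t *word, uint32_t expected, uint32_t desired)
{
    return atomic_compare_exchange_strong(word, &expected, desired);
}

/* Hypothetical fault model: the swap never reports success. */
bool stuck_cas(_Atomic uint32_t *word, uint32_t expected, uint32_t desired)
{
    (void)word; (void)expected; (void)desired;
    return false;
}

/*
 * Mark the lock word "owner died" if tid owns it.
 * Returns -1 if tid is not the owner, otherwise the number of failed
 * swap attempts. After max_tries failures the new value is stored
 * unconditionally, like the put_user() fallback in the patch.
 */
int mark_owner_died(_Atomic uint32_t *word, uint32_t tid,
                    int max_tries, cas_fn cas)
{
    int failures = 0;

    assert(max_tries > 0);
    for (;;) {
        uint32_t uval = atomic_load(word);
        uint32_t mval;

        if ((uval & TID_MASK) != tid)
            return -1;
        mval = (uval & WAITERS) | OWNER_DIED;
        if (cas(word, uval, mval))
            return failures;
        if (++failures >= max_tries) {
            fprintf(stderr, "gave up after %d attempts on word %p\n",
                    failures, (void *)word);
            atomic_store(word, mval);
            return failures;
        }
    }
}
```

With max_tries set to 100 the loop gives up at the same point as the
patched handle_futex_death(), and the stuck_cas case shows the forced
store taking over.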


Re: unkillable dpkg-query processes

2007-10-25 Thread Josip Rodin
On Wed, Oct 24, 2007 at 11:41:13PM -0700, David Miller wrote:
 Josip, give this debugging patch a try.  It is against 2.6.23.1
 but it should apply to most recent kernels.

OK, after resurrecting the machine once again (it had died in the meantime,
reliably as ever), I did:

patching file kernel/futex.c
Hunk #1 succeeded at 1877 (offset 3 lines).
Hunk #2 succeeded at 1903 (offset 3 lines).
Hunk #3 succeeded at 1926 (offset 3 lines).

 It should give you debugging messages in the kernel log that
 start with FUTEX_BUG if the debugging code triggers.
 
 Please post just a few samples of whatever it spits out.

It's been running with the patched kernel for some 6.5 hours now, no
problems yet. I'll let you know as soon as it starts to misbehave.

-- 
 2. That which causes joy or happiness.


Re: unkillable dpkg-query processes

2007-10-25 Thread Josip Rodin
On Thu, Oct 25, 2007 at 05:07:36PM +0200, joy wrote:
  If you try, within that troublesome build-root, a few times to try to
  fork off a couple hundred:
  
  dpkg-query --something python-2.5
  
  or whatever, can you get some of processes to wedge under that
  build root?
 
 I did this in a chrooted bash:
 
 for i in $(seq 0 100); do (dpkg-query -s python2.5-minimal > /dev/null ); done
 
 And now the machine went catatonic. :(
 
 Thankfully the console is still vaguely operational - I can enter my
 username to log in, but I can't get the Password prompt to appear.
 Magic SysRq still works - if you need any output from it, tell me.
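
The shell loop above could also be written as a small C helper that
starts a batch of processes and reports how many failed to exit in time.
This is a rough sketch assuming POSIX fork/exec; count_stragglers and
its parameters are invented here, and a process that merely runs long is
counted the same as one that is truly wedged.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Start `count` copies of argv (stdout sent to /dev/null), poll for up
 * to `timeout_s` seconds, and return how many had not exited by then.
 * Returns -1 if the bookkeeping array cannot be allocated.
 */
int count_stragglers(char *const argv[], int count, int timeout_s)
{
    struct timespec nap = { 0, 20 * 1000 * 1000 };  /* 20 ms between polls */
    pid_t *pids;
    time_t deadline;
    int i, remaining = 0;

    assert(argv != NULL && argv[0] != NULL && count > 0);
    pids = calloc((size_t)count, sizeof(*pids));
    if (pids == NULL)
        return -1;

    for (i = 0; i < count; i++) {
        pids[i] = fork();
        if (pids[i] == 0) {
            /* Child: discard stdout, then run the command. */
            int fd = open("/dev/null", O_WRONLY);
            if (fd >= 0)
                dup2(fd, STDOUT_FILENO);
            execvp(argv[0], argv);
            _exit(127);  /* exec failed */
        }
        if (pids[i] > 0)
            remaining++;  /* a failed fork() leaves -1 and is skipped */
    }

    deadline = time(NULL) + timeout_s;
    while (remaining > 0 && time(NULL) < deadline) {
        for (i = 0; i < count; i++) {
            if (pids[i] > 0 && waitpid(pids[i], NULL, WNOHANG) == pids[i]) {
                pids[i] = 0;  /* reaped */
                remaining--;
            }
        }
        if (remaining > 0)
            nanosleep(&nap, NULL);
    }

    /* Best effort: a task stuck in D state will ignore this too. */
    for (i = 0; i < count; i++)
        if (pids[i] > 0)
            kill(pids[i], SIGKILL);

    free(pids);
    return remaining;
}
```

With an argv of { "dpkg-query", "-s", "python2.5-minimal", NULL }, a
count of 100 and a timeout of, say, 60 seconds, this would mirror the
loop. It is best run only on a machine you can afford to lose.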

The machine continued in this state for a couple of hours or so, it didn't
come back to life. When I went to check up on it, the kernel showed one
message on the console - OOM killer killed a make process. I then gave up,
used SysRq to S+U+B, and it booted again, and I was able to retrieve the
following data from kern.log that is in the attachment. Hope that helps.

-- 
 2. That which causes joy or happiness.
Oct 25 17:04:09 lebrun kernel: SysRq : Emergency Sync
Oct 25 17:04:20 lebrun kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Oct 25 17:04:20 lebrun kernel: SysRq : Show Memory
Oct 25 17:04:20 lebrun kernel: Mem-info:
Oct 25 17:04:20 lebrun kernel: Normal per-cpu:
Oct 25 17:04:20 lebrun kernel: CPU0: Hot: hi:   90, btch:  15 usd:   0   Cold: hi:   30, btch:   7 usd:   0
Oct 25 17:04:20 lebrun kernel: CPU1: Hot: hi:   90, btch:  15 usd:   4   Cold: hi:   30, btch:   7 usd:  24
Oct 25 17:04:20 lebrun kernel: Active:202209 inactive:46687 dirty:39 writeback:279 unstable:0
Oct 25 17:04:20 lebrun kernel:  free:723 slab:2826 mapped:2986 pagetables:875 bounce:0
Oct 25 17:04:20 lebrun kernel: Normal free:5616kB min:5760kB low:7200kB high:8640kB active:1619760kB inactive:371344kB present:2077352kB pages_scanned:178 all_unreclaimable? no
Oct 25 17:04:20 lebrun kernel: lowmem_reserve[]: 0 0
Oct 25 17:04:21 lebrun kernel: Normal: 780*8kB 11*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 6512kB
Oct 25 17:04:21 lebrun kernel: Swap cache: add 251630, delete 187227, find 26426/42924, race 80+86
Oct 25 17:04:21 lebrun kernel: Free swap  = 174880kB
Oct 25 17:04:24 lebrun kernel: Total swap = 1048688kB
Oct 25 17:04:24 lebrun kernel: Free swap:   174648kB
Oct 25 17:04:24 lebrun kernel: 261865 pages of RAM
Oct 25 17:04:24 lebrun kernel: 3001 reserved pages
Oct 25 17:04:24 lebrun kernel: 155176 pages shared
Oct 25 17:04:24 lebrun kernel: 64407 pages swap cached
Oct 25 17:04:24 lebrun kernel: 39 pages dirty
Oct 25 17:04:24 lebrun kernel: 124 pages writeback
Oct 25 17:04:24 lebrun kernel: 2986 pages mapped
Oct 25 17:04:24 lebrun kernel: 2826 pages slab
Oct 25 17:04:24 lebrun kernel: 875 pages pagetables
Oct 25 17:05:01 lebrun kernel: SysRq : Emergency Sync
Oct 25 17:05:04 lebrun kernel: SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Oct 25 17:05:07 lebrun kernel: SysRq : Show Blocked State
Oct 25 17:05:07 lebrun kernel:   task      PC stack   pid father
Oct 25 17:05:07 lebrun kernel: kswapd0   D 00528bc8 0   181  2
Oct 25 17:05:07 lebrun kernel: Call Trace:
Oct 25 17:05:08 lebrun kernel:  [006258e0] io_schedule+0x2c/0x38
Oct 25 17:05:08 lebrun kernel:  [00528bc8] get_request_wait+0x11c/0x15c
Oct 25 17:12:13 lebrun kernel:  [0052a220] ges+0x144/0x258
Oct 25 17:12:13 lebrun kernel:  [0048cf34] __alloc_pages+0x1b0/0x330
Oct 25 17:12:13 lebrun kernel:  [0049f50c] read_swap_cache_async+0x40/0x150
Oct 25 17:12:13 lebrun kernel:  [00495908] swapin_readahead+0x3c/0x7c
Oct 25 17:12:13 lebrun kernel:  [004973b4] handle_mm_fault+0x3fc/0x7cc
Oct 25 17:12:13 lebrun kernel:  [0044e084] do_sparc64_fault+0x314/0x594
Oct 25 17:12:13 lebrun kernel:  [0040794c] sparc64_realfault_common+0x18/0x20
Oct 25 17:12:13 lebrun kernel:  [00015078] 0x15080
Oct 25 17:12:13 lebrun kernel: dpkg-query D 00528bc8 0  3924  1
Oct 25 17:12:13 lebrun kernel: Call Trace:
Oct 25 17:12:13 lebrun kernel:  [006258e0] io_schedule+0x2c/0x38
Oct 25 17:12:13 lebrun kernel:  [00528bc8] get_request_wait+0x11c/0x15c
Oct 25 17:12:13 lebrun kernel:  [0052a220] __make_request+0x5f0/0x6a8
Oct 25 17:12:13 lebrun kernel:  [00526bac] generic_make_request+0x2f8/0x31c
Oct 25 17:12:13 lebrun kernel:  [00526cd4] submit_bio+0x104/0x10c
Oct 25 17:12:13 lebrun kernel:  [0049f30c] swap_writepage+0xa4/0xb4
Oct 25 17:12:13 lebrun kernel:  [004918c4] shrink_page_list+0x410/0x6f4
Oct 25 17:12:13 lebrun kernel:  [004922c8] shrink_zone+0x720/0xa38
Oct 25 17:12:13 lebrun kernel:  [00492de8] 

Re: unkillable dpkg-query processes

2007-10-24 Thread David Miller
From: Josip Rodin [EMAIL PROTECTED]
Date: Thu, 25 Oct 2007 00:33:32 +0200

 We've been having grave issues with a few of our sparc build daemon machines
 in Debian. Something causes dpkg-query(8) processes, otherwise harmless, to
 run amok and allocate too much memory, but keep running and become resilient
 to killing. They eventually push the machine to the point where you can only
 ping it, but all the userland and the console is dead.

I know, I've seen this report a million times :-)

I can't reproduce it, I've even tried the fabled test case
where you spawn thousands of dpkg-query instances and it never
does anything wrong on my Niagara boxes.

So something is different about your environment than mine.

Let's see if there is some aspect of the environment that
contributed to the problem occurring.  Please reproduce
with 2.6.23-final and then list (I know this is redundant,
just humor me :-):

1) system type
2) compiler used to build kernel and is it SMP?
3) glibc in use
4) compiler used to build running glibc
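
Most of the items in that list can be read from a running system. Here
is a minimal sketch, assuming Linux with /proc mounted and a glibc that
answers confstr(_CS_GNU_LIBC_VERSION); the function names are made up.
Item 4 (the compiler that built glibc) is not covered, since that comes
from executing the libc shared object directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Returns 1 if a /proc/version banner carries the SMP tag, else 0. */
int banner_is_smp(const char *banner)
{
    assert(banner != NULL);
    return strstr(banner, " SMP ") != NULL;
}

/*
 * Write machine type, kernel release, kernel banner (which names the
 * compiler used for the kernel) and libc version to out.
 * Returns 0 on success, -1 if uname() fails.
 */
int print_environment(FILE *out)
{
    struct utsname u;
    char banner[512] = "";
    char libc[64] = "unknown";
    FILE *f;

    assert(out != NULL);
    if (uname(&u) != 0)
        return -1;

    f = fopen("/proc/version", "r");
    if (f != NULL) {
        if (fgets(banner, sizeof banner, f) == NULL)
            banner[0] = '\0';
        fclose(f);
    }
    banner[strcspn(banner, "\n")] = '\0';  /* drop trailing newline */

    /* glibc extension; other C libraries may leave the buffer untouched. */
    (void)confstr(_CS_GNU_LIBC_VERSION, libc, sizeof libc);

    fprintf(out, "machine:       %s\n", u.machine);
    fprintf(out, "kernel:        %s %s\n", u.sysname, u.release);
    fprintf(out, "kernel banner: %s\n", banner);
    fprintf(out, "SMP kernel:    %s\n", banner_is_smp(banner) ? "yes" : "no");
    fprintf(out, "libc:          %s\n", libc);
    return 0;
}
```

Calling print_environment(stdout) both inside and outside the chroot
would show whether the two environments differ in libc.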

If you have a reproducible test case, that's even better.

If necessary I'll try to install a replica of your build
environment here in order to reproduce.


Re: unkillable dpkg-query processes

2007-10-24 Thread Josip Rodin
On Wed, Oct 24, 2007 at 03:58:29PM -0700, David Miller wrote:
 I know, I've seen this report a million times :-)

Oh, I know you know, I mailed you a while ago and you told me to mail
the mailing list :)

 I can't reproduce it, I've even tried the fabled test case
 where you spawn thousands of dpkg-query instances and it never
 does anything wrong on my Niagara boxes.
 
 So something is different about your environment than mine.
 
 Let's see if there is some aspect of the environment that
 contributed to the problem occurring.  Please reproduce
 with 2.6.23-final and then list (I know this is redundant,
 just humor me :-):

Confirming that the machine could reproduce the problem with 2.6.23.1.
(I can send over the .config if it matters.)

 1) system type

A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III
(Cheetah), and 2 GB of RAM. If you need more info, just say.

(Bernd Zeimetz has previously suggested that the problem is linked to
the processor type, the USIII.)

 2) compiler used to build kernel and is it SMP?

gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

I've no idea if that compiler is SMP, if you want I'll ask someone else.

 3) glibc in use
 4) compiler used to build running glibc

In that particular chroot, it's:

chroot-unstable% lib/libc-2.6.1.so
GNU C Library stable release version 2.6.1, by Roland McGrath et al.
[...]
Compiled by GNU CC version 4.2.1 (Debian 4.2.1-5).
Compiled on a Linux 2.6.17-rc1 system on 2007-09-04.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
software FPU emulation by Richard Henderson, Jakub Jelinek and
others
[...]

Outside of that chroot, it's:

% /lib/libc-2.3.6.so 
GNU C Library stable release version 2.3.6, by Roland McGrath et al.
[...]
Compiled by GNU CC version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21).
Compiled on a Linux 2.6.18 system on 2007-03-01.
Available extensions:
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
linuxthreads-0.10 by Xavier Leroy
BIND-8.2.3-T5B
libthread_db work sponsored by Alpha Processor Inc
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
software FPU emulation by Richard Henderson, Jakub Jelinek and
others
Thread-local storage support included.
[...]

 If you have a reproducible test case, that's even better.

There doesn't appear to be a pattern, on this machine at least - I just let
the buildd run, building whatever comes up, and after a few hours it
inevitably runs into a wall.

-- 
 2. That which causes joy or happiness.