Re: unkillable dpkg-query processes

2007-11-04 Thread David Miller
From: Josip Rodin [EMAIL PROTECTED]
Date: Fri, 2 Nov 2007 17:21:06 +0100

 Great. Here you go, three of them, while the load was 3 and this process was
 stuck:
 
 buildd   10813  100  0.8 987368 17504 ? RN   14:44 155:49 dpkg-query 
 --search libpthread.so.0 libdl.so.2 libstdc++.so.6 libm.so.6 libgcc_s.so.1 
 libc.so.6 libFLAC.so.8 libid3tag.so.0 libz.so.1 libmad.so.0 libglib-2.0.so.0 
 libmikmod.so.2 libsndfile.so.1 libvorbis.so.0 libogg.so.0 libvorbisfile.so.3
 ...
 Nov  2 17:02:04 lebrun kernel:   CPU[  0]: TSTATE[80009604] TPC[00407924] TNPC[00407928] TASK[dpkg-query:10813]
 Nov  2 17:02:04 lebrun kernel:   TPC[sparc64_realfault_common+0x8/0x20]

It looks like dpkg-query is stuck on a page fault.  Typically
this means the fault processing is not putting a valid
translation into the TLB to satisfy the fault, so we loop
forever, never making forward progress.

I've had to debug something similar to this before, so I'll
piece together a debugging patch you can use to get more
information.
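
As a rough userspace cross-check (this is not the debugging patch mentioned above), a task looping on a fault should stay in state R while its CPU time keeps growing, which matches the ps line quoted earlier. The sketch below pulls the relevant fields out of /proc/<pid>/stat; the helper names stat_fields and sample_twice are made up for illustration, and whether the minor-fault counter advances in this particular hang is an assumption that would need checking.

```shell
#!/bin/sh
# Print "state minflt utime stime" from one line of /proc/<pid>/stat.
stat_fields() {
    # Drop the leading "pid (comm) ": comm may contain spaces, so cut at
    # the last ") " rather than splitting on whitespace from the start.
    rest=${1##*') '}
    # Intentional word splitting: state is now $1, minflt $8,
    # utime ${12} and stime ${13}.
    set -- $rest
    echo "$1 $8 ${12} ${13}"
}

# Hypothetical helper: sample a pid twice, one second apart.
# State R both times with growing utime/stime means the task is
# burning CPU, not sleeping.
sample_twice() {
    stat_fields "$(cat "/proc/$1/stat")"
    sleep 1
    stat_fields "$(cat "/proc/$1/stat")"
}
```

For the process above this would be run as `sample_twice 10813`. Times are reported in clock ticks.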
-
To unsubscribe from this list: send the line unsubscribe sparclinux in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

 Ok, the key in the trace is:
 
 Nov  2 16:25:30 titan kernel: [  978.134874]   CPU[  1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
 Nov  2 16:25:30 titan kernel: [  978.257809]   TPC[_write_unlock_irq+0x20/0x110]
  ...
 Nov  2 16:25:30 titan kernel: [  978.507778]   CPU[  3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
 Nov  2 16:25:30 titan kernel: [  978.630707]   TPC[cheetah_xcall_deliver+0x174/0x23c]
 
 The first symbol is misleading, it says _write_unlock_irq but actually
 in the assembler the PC is in the spinlock read spinning loop
 section.  So actually it's hanging in _spin_lock().
 
 CPU #3 is trying to send a cross-call message interrupt, but for
 some reason that isn't making forward progress.
 
 Let's see what's calling these things by adding some more debugging
 information.  Please retry the test with the following patch on
 top of the original sysrq-g debugging patch and please get new
 logs when it hangs.


Today I was a bit out of luck: either the machine crashed so badly that
it no longer reacted to anything, or it didn't crash at all.
The machine went down a bit more slowly when I did the following,
which also produced the attached sysrq output:
- run stress -c 2 to get the load up (I didn't need that last time...)
- run something like `while true; do echo g > /proc/sysrq-trigger; sleep 0.5; done`
- run aptitude -u several times until the machine died.
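
The steps above could be wrapped up roughly like this. It is only a sketch: the function names are invented here, the loop is bounded instead of endless, and the trigger file is a parameter so the loop can be tried against a scratch file first. Writing to the real /proc/sysrq-trigger needs root, and the g key only produces the traces discussed in this thread with the sysrq-g debugging patch applied.

```shell
#!/bin/sh
# Write one SysRq key to the trigger file (the real one needs root).
trigger_sysrq() {
    key=$1
    file=${2:-/proc/sysrq-trigger}
    echo "$key" > "$file"
}

# Fire the key `count` times, `interval` seconds apart.
sysrq_loop() {
    key=$1
    count=$2
    interval=$3
    file=${4:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        trigger_sysrq "$key" "$file"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Intended use, as root, with the load generator in the background:
#   stress -c 2 &
#   sysrq_loop g 1000 0.5 &
#   aptitude -u        # repeat until the machine dies
```

The fractional sleep relies on GNU sleep, as the original one-liner does.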

So I'm not sure if the result is really useful for you - if not just let
me know. I've attached the last ~10-20 sysrq-g outputs - as it was
running in a loop I have a ton of them. In case you're wondering: http
is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on,
but it has always taken longer to crash there, so we'll see if it
happens at all.

Thanks for your work,


Bernd


-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


sysrq2.txt
Description: application/pgp-keys


Re: unkillable dpkg-query processes

2007-11-04 Thread David Miller
From: Bernd Zeimetz [EMAIL PROTECTED]
Date: Sun, 04 Nov 2007 20:55:20 +0100

 So I'm not sure if the result is really useful for you - if not just let
 me know. I've attached the last ~10-20 sysrq-g outputs - as it was
 running in a loop I have a ton of them. In case you're wondering: http
 is aptitude's http method.

The http module is stuck in a different place, I'll try to
see if I can make sense of it.


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz
David Miller wrote:
 From: Bernd Zeimetz [EMAIL PROTECTED]
 Date: Sun, 04 Nov 2007 20:55:20 +0100
 
 So I'm not sure if the result is really useful for you - if not just let
 me know. I've attached the last ~10-20 sysrq-g outputs - as it was
 running in a loop I have a ton of them. In case you're wondering: http
 is aptitude's http method.
 
 The http module is stuck in a different place, I'll try to
 see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running
through the part which usually crashed, so it should be possible to run
it in a loop...

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

 In the meantime I'll build an aptitude which should exit after running
 through the part which usually crashed, so it should be possible to run
 it in a loop...

This was successful - it made crashing the machine pretty simple, even
without libnss-db activated.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done
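
Spelled out as commands, the recipe above might look like the sketch below. The function names, the -p1 patch level and the location of aptitude.diff one directory above the unpacked source are assumptions, so adjust them to match the attachment. The loop variant is bounded and prints each non-zero exit status, which makes SIGABRT deaths (status 134, i.e. 128 + 6) easy to spot before one run finally hangs.

```shell
#!/bin/sh
# Hypothetical helper: fetch, patch and rebuild aptitude in a subshell
# so the caller's working directory is left alone.
build_patched_aptitude() (
    apt-get source aptitude &&
    cd aptitude-*/ &&
    patch -p1 < ../aptitude.diff &&       # patch level is an assumption
    dpkg-buildpackage -rfakeroot -us -uc -b &&
    echo "now install the resulting ../aptitude_*.deb with dpkg -i as root"
)

# Run a command `runs` times and report every non-zero exit status.
run_loop() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i: exit status $status"
        fi
        i=$((i + 1))
    done
    return 0
}

# Intended use: run_loop 1000 aptitude -u
```

A run that gets stuck will simply stop the loop from advancing, which is the hang being reproduced.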

Some of the aptitudes hit a SIGABRT before one got stuck.

Best regards,

Bernd

-- 
Bernd Zeimetz
[EMAIL PROTECTED] http://bzed.de/


aptitude.diff
Description: application/pgp-keys


aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data