Re: Fix for sparc64 cpu hangs.

2007-12-16 Thread Bernd Zeimetz

>> I'll leave the kernel running and make sure the machine gets some more
>> users and load during the next days.
> 
> Thanks for testing, let me know if any more issues trigger.

One problem I was pointed to is the build failure of erlang: the freshly
built erlc binary crashes with a bus error.

- this only happens on US III machines, works fine on US II.

- on lebrun it doesn't happen on the first call of erlc, but after
several successful runs of it - see
http://buildd.debian.org/fetch.cgi?&pkg=erlang&ver=1%3A11.b.5dfsg-11&arch=sparc&stamp=1197012623&file=log

- on our v880 here (which is still running the kernel with your test
patch) erlc segfaults instantly. A strace shows that it is stuck at a
well known place - pretty similar to the segfault in aptitude which
successfully shot the machine to death before your patch(es) was(were)
applied:

[pid  1224] clone(Process 1228 attached
child_stack=0xf7951480,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0xf7951bd8, tls=0xf7951b90, child_tidptr=0xf7951bd8) = 1228
[pid  1224] SYS_300(0xf7951be0, 0xc, 0, 0, 0xf7951df4) = 0
[pid  1224] futex(0xff993338, 0x80 /* FUTEX_??? */, 2

... there it hangs.


I guess you should be able to reproduce this on your US III machine:

dget -x \
ftp://debian.netcologne.de/debian/pool/main/e/erlang/erlang_11.b.5dfsg-11.dsc
cd erlang-11.b.5dfsg
dpkg-buildpackage -rfakeroot

(You'll probably have to install some build-deps first.)
When erlc segfaults, change into the directory and set

ERL_TOP=/home/bzed/erlang-11.b.5dfsg
PATH=/home/bzed/erlang-11.b.5dfsg/bootstrap/bin:${PATH}

before retrying to run erlc.
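
Since the crash on lebrun only shows up after several successful runs, a
small loop helps to see how many runs it takes. This is only a sketch: the
function name run_until_failure and the file name in the usage comment are
made up for illustration, and it assumes ERL_TOP and PATH have been set as
described above.

```shell
# Run a command repeatedly until it fails or MAX_RUNS is reached.
# Prints how many runs succeeded before the first failure.
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
# Hypothetical example: run_until_failure 50 erlc some_module.erl
run_until_failure() {
    max_runs=$1
    shift
    ok=0
    while [ "$ok" -lt "$max_runs" ]; do
        # Any non-zero exit status (including death by signal) counts as failure.
        if ! "$@" >/dev/null 2>&1; then
            echo "failed after $ok successful runs"
            return 1
        fi
        ok=$(( ok + 1 ))
    done
    echo "no failure in $ok runs"
    return 0
}
```

Note that this only catches runs which exit with an error; a run that hangs
(as on the v880) will simply block the loop.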


Let me know if you need more information or want me to test something.


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
-
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix for sparc64 cpu hangs.

2007-12-13 Thread Bernd Zeimetz

> Thanks for testing, let me know if any more issues trigger.


The machine had some random processes (ssh, ping and aptitude) stuck
today, but they went away after hitting them with kill -9. They also
didn't eat CPU time; they were just doing nothing.
Unfortunately I didn't have time for a closer look. I'll try to gather
some more information the next time it happens.



-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-12-12 Thread Bernd Zeimetz

> It should be easily fixed using the patch below.  It records which bits
> we should actually be concerned about, and only tests those specific
> bits in the dispatch status register.
> 
> Could you please give this patch a test?

Tested - the patch seems to fix the problem as the machine is still
alive and working well after several hours of running the buggy aptitude
-u in a loop.
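
For readers who have not seen the patch: the idea quoted above (remember
which bits matter and test only those) can be illustrated with shell
arithmetic. This is a toy sketch, not the kernel code. The bit layout (one
busy bit per dispatch slot at bit 2*slot) and both function names are
assumptions made up for the illustration.

```shell
# Build a mask with one "busy" bit for every slot we actually dispatched to.
# ASSUMPTION: slot i reports busy in bit 2*i (illustrative layout only).
# Usage: busy_mask_for_slots SLOT [SLOT...]
busy_mask_for_slots() {
    mask=0
    for slot in "$@"; do
        mask=$(( mask | (1 << (2 * slot)) ))
    done
    printf '%d\n' "$mask"
}

# Succeed only when none of the bits we care about are still set.
# Bits outside the mask are ignored, whatever their value.
# Usage: dispatch_done STATUS MASK
dispatch_done() {
    status=$1
    mask=$2
    [ $(( status & mask )) -eq 0 ]
}
```

With slots 0 and 2 in use the mask is 17 (0x11). A status of 0x04 then
counts as done because that bit belongs to a slot we never used, while a
status of 0x10 means slot 2 is still busy.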

I'll leave the kernel running and make sure the machine gets some more
users and load during the next days.


Thanks for the fix,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-12-10 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sat, 08 Dec 2007 01:14:46 +0100
> 
>> works well, thanks for fixing!
> 
> Thanks a lot for testing.


You're welcome.
Are you going to send the patch for 2.6.23, too?

Also I've tried to crash the machine while running the non-SMP kernel -
but it is still running fine.


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-12-07 Thread Bernd Zeimetz

David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Thu, 06 Dec 2007 13:09:18 +0100
> 
>> ERROR(0): Cheetah error trap taken afsr[1000]
>> afar[040001c0] TL1(0)
>> ERROR(0): TPC[4351dc] TNPC[4351e0] O7[4353b4] TSTATE[80001606]
>> ERROR(0): TPC
>> ERROR(0): M_SYND(0),  E_SYND(0)
> 
> Please try this patch:
[...]

titan:~# uname -a
Linux titan 2.6.23.9+davem-nonsmp #1 Fri Dec 7 10:02:01 UTC 2007 sparc64
GNU/Linux
titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom: OBP 4.22.34 2007/07/23 13:01
type: sun4u
ncpus probed: 4
ncpus active: 1
D$ parity tl1   : 0
I$ parity tl1   : 0
Cpu0ClkTck  : 2cb41780
MMU Type: Cheetah
titan:~#

works well, thanks for fixing!

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-12-06 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Thu, 06 Dec 2007 11:43:45 +0100
> 
>> David Miller wrote:
>>> From: Bernd Zeimetz <[EMAIL PROTECTED]>
>>> Date: Fri, 16 Nov 2007 22:17:07 +0100
>>>
>>>> The sysrq-g output is attached, I hope you can make sense out of it.
>>>> We'll also add some extra workload to the other machines here to try to
>>>> trigger the bug on other CPUs, too.
>>> I just got back from my vacation and started looking at these
>>> dumps.  I think there might be some bug in cheetah_xcall_deliver(),
>>> I'll try to diagnose this some more.
>> I'm not sure if it is related, but non-SMP Kernels don't boot at all on
>> the machine.
> 
> I doubt it's related as non-SMP kernels won't even have that
> code compiled in :-)
> What does a failed non-SMP boot say?  If it doesn't even bring up the
> console, give it "-p" on the kernel command line.


The output below is from a 2.6.21-2-sparc64 kernel; I had it lying around
here. I can build and install a 2.6.23 and try again if you want. It would
be good to know whether non-SMP kernels work at all on the v880 and larger
machines, and on more recent CPU models. At the moment the Sparc installer
is non-SMP only, which resulted in some extra fun when installing the
v880.


Rebooting with command: boot net:dhcp -p
Boot device: /[EMAIL PROTECTED],70/[EMAIL PROTECTED],1:dhcp  File and args: -p
Timed out waiting for BOOTP/DHCP reply
\
PROMLIB: Sun IEEE Boot Prom 'OBP 4.22.34 2007/07/23 13:01'
PROMLIB: Root node compatible:
Linux version 2.6.21-2-sparc64 (Debian 2.6.21-6) ([EMAIL PROTECTED]) (gcc version 4.1.3 20070629 (prerelease) (Debian 4.1.2-13)) #1 Thu Jul 12 12:33:00 UTC 2007
ARCH: SUN4U
Ethernet address: 00:03:ba:0b:07:89
Remapping the kernel... done.
PROM: Built device tree with 125090 bytes of memory.
Booting Linux...
CPU[0]: Caches D[sz(65536):line_sz(32)] I[sz(32768):line_sz(32)] E[sz(8388608):line_sz(512)]
Built 1 zonelists.  Total pages: 412546
Kernel command line: -p
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 524288 (order: 9, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 8, 2097152 bytes)
Memory: 8311800k available (2360k kernel code, 824k data, 144k init) [f800,00b0ffb16000]
Calibrating delay using timer specific routine.. 20.00 BogoMIPS (lpj=40009)
Security Framework v1.0.0 initialized
SELinux:  Disabled at boot.
Capability LSM initialized
Mount-cache hash table entries: 512
NET: Registered protocol family 16
PCI: Probing for controllers.
/[EMAIL PROTECTED],70: SCHIZO PCI Bus Module ver[4:0]
/[EMAIL PROTECTED],70: PCI CFG[7ffee00] IO[7ffef00] MEM[7fe]
/[EMAIL PROTECTED],60: SCHIZO PCI Bus Module ver[4:0]
/[EMAIL PROTECTED],60: PCI CFG[7ffec00] IO[7ffed00] MEM[7fd]
/[EMAIL PROTECTED],70: SCHIZO PCI Bus Module ver[4:0]
/[EMAIL PROTECTED],70: PCI CFG[7ffea00] IO[7ffeb00] MEM[7fc]
/[EMAIL PROTECTED],60: SCHIZO PCI Bus Module ver[4:0]
/[EMAIL PROTECTED],60: PCI CFG[7ffe800] IO[7ffe900] MEM[7fb]
PCI1(PBMB): Bus running at 33MHz
PCI1(PBMA): Bus running at 66MHz
PCI0(PBMB): Bus running at 33MHz
PCI0(PBMA): Bus running at 66MHz
ebus0: [flashprom] [bbc] [power] [i2c -> (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru) (fru)
(fru) (fru) (fru) (fru) (fru) (fru) (fru) (temperature) (temperature)
(temperature) (temperature) (temperature) (temperature) (temperature)]
[i2c -> (controller) (smbus-ara) (controller) (temperature)
(temperature) (temperature) (ioexp) (temperature) (controller) (adio)
(adio) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (ioexp) (adio)
(adio) (adio) (adio) (temperature-sensor) (fru) (fru) (fru) (fru) (fru)
(fru) (rscrtc) (hotplug-controller) (hotplug-controller)
(hotplug-controller) (hotplug-controller)] [bbc] [i2c -> (temperature)
(temperature) (temperature)] [i2c -> (nvram) (idprom)] [rtc] [gpio]
[pmc] [rsc-control] [rsc-console] [serial]
power: Control reg at 7fc7e30002e ... not using powerd.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
/[EMAIL PROTECTED],70/[EMAIL PROTECTED]/[EMAIL PROTECTED],300070: Clock regs at 07fc7e300070
NET: Registered protocol family 2
IP route cache hash table entries: 131072 (order: 7, 1048576 bytes)
TCP established hash table entries: 524288 (order: 10, 8388608 bytes)
TCP bind hash table entries: 65536 (order: 6, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
checking if image is initramfs... it is
Freei

Re: Fix for sparc64 cpu hangs.

2007-12-06 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Fri, 16 Nov 2007 22:17:07 +0100
> 
>> The sysrq-g output is attached, I hope you can make sense out of it.
>> We'll also add some extra workload to the other machines here to try to
>> trigger the bug on other CPUs, too.
> 
> I just got back from my vacation and started looking at these
> dumps.  I think there might be some bug in cheetah_xcall_deliver(),
> I'll try to diagnose this some more.

I'm not sure if it is related, but non-SMP Kernels don't boot at all on
the machine.

> If you cannot reproduce this bug on non-Ultra-III systems that
> would help confirm or deny my theory.  Have you been able to
> trigger this on your Ultra-II machine for example?  If so, what
> do the sysrq-g traces look like there?

Since your futex bugfix the Ultra-II machine has been running pretty
stably. I did not manage to trigger the bug there, but it was already hard
to trigger there in the first place: even with a kernel without the futex
bugfix the machine would just hang at some random point, and I never
managed to reproduce the bug easily on US II.


Best regards,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: [ANNOUNCE] Aurora SPARC Linux Build 2.99 (Beta 2 for 3.0)

2007-12-01 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sat, 01 Dec 2007 13:43:30 +0100
> 
>>>> - Systems that boot off qlogic attached disks are not supported, because 
>>>> there is no working firmware loader in anaconda, and the qlogic driver 
>>>> needs firmware.
>>> That's very unfortunate, how are qlogic devices handled on other
>>> platforms?
>> In Debian you just install the firmware package, udev will handle it
>> then. If you have to boot from it, you need to rebuild your initrd after
>> installing the firmware package.
>> The installer doesn't support non-free modules yet unfortunately, but
>> with some not too complicated tricks you can install Debian without
>> problems.
> 
> I said "other platforms" as in x86, x86_64, powerpc.

Just the same. You can even install the firmware on hardware where you
wouldn't be able to use a qlogic card. It's only loaded if an appropriate
device is detected.

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: [ANNOUNCE] Aurora SPARC Linux Build 2.99 (Beta 2 for 3.0)

2007-12-01 Thread Bernd Zeimetz

>> - Systems that boot off qlogic attached disks are not supported, because 
>> there is no working firmware loader in anaconda, and the qlogic driver 
>> needs firmware.
> 
> That's very unfortunate, how are qlogic devices handled on other
> platforms?

In Debian you just install the firmware package, udev will handle it
then. If you have to boot from it, you need to rebuild your initrd after
installing the firmware package.
The installer doesn't support non-free modules yet unfortunately, but
with some not too complicated tricks you can install Debian without
problems.


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-11-16 Thread Bernd Zeimetz
Hi David,

> Please let me know if things go smoothly when the
> build becomes active again.

First the good news:
The U60 here is still building and working fine, and I didn't hear any
bad news from lebrun.d.o either.

The not so good news:
The v880 (4x US III) here was hit by a stuck process again, after
running fine for some time. The machine didn't freeze, though: one CPU
was running at 100%, but otherwise the machine was responsive.

I think I'll also run a full diag in service mode to make sure it's not a
CPU bug.
The sysrq-g output is attached, I hope you can make sense out of it.
We'll also add some extra workload to the other machines here to try to
trigger the bug on other CPUs, too.

Best regards,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>
Nov 16 21:40:57 titan kernel: [12019.840715] SysRq : Show Global CPU Regs
Nov 16 21:40:57 titan kernel: [12019.886698]   CPU[  0]: TSTATE[] TPC[] TNPC[] TASK[NULL:-1]
Nov 16 21:40:57 titan kernel: [12020.003361]  TPC[atomic_sub_ret+0x0/0x30]
Nov 16 21:40:58 titan kernel: [12020.063757]  O7[schedule+0x6dc/0x7a4]
Nov 16 21:40:58 titan kernel: [12020.120007]  I7[do_syslog+0xfc/0x400]
Nov 16 21:40:58 titan kernel: [12020.176249] * CPU[  1]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:40:58 titan kernel: [12020.295006]   CPU[  2]: TSTATE[11009602] TPC[0042fc30] TNPC[0042fc34] TASK[cat:4365]
Nov 16 21:40:58 titan kernel: [12020.412726]  TPC[udelay+0x0/0x1c]
Nov 16 21:40:58 titan kernel: [12020.464809]  O7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:40:58 titan kernel: [12020.534581]  I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:40:58 titan kernel: [12020.604370]   CPU[  3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 16 21:40:58 titan kernel: [12020.723128]  TPC[cpu_idle+0x94/0xb8]
Nov 16 21:40:58 titan kernel: [12020.778323]  O7[cpu_idle+0xa8/0xb8]
Nov 16 21:40:58 titan kernel: [12020.832498]  I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:05 titan ntpd[2766]: adjusting local clock by -20.711568s
Nov 16 21:41:26 titan kernel: [12048.836922] SysRq : Show Global CPU Regs
Nov 16 21:41:26 titan kernel: [12048.882885] * CPU[  0]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:41:26 titan kernel: [12049.001617]   CPU[  1]: TSTATE[009911009602] TPC[00407af0] TNPC[00407af4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.120373]  TPC[__tsb_context_switch+0xf0/0x100]
Nov 16 21:41:27 titan kernel: [12049.189109]  O7[schedule+0x514/0x7a4]
Nov 16 21:41:27 titan kernel: [12049.245354]  I7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.299516]   CPU[  2]: TSTATE[11009603] TPC[0042faa0] TNPC[0042fc18] TASK[cat:4365]
Nov 16 21:41:27 titan kernel: [12049.417244]  TPC[stick_get_tick+0x10/0x14]
Nov 16 21:41:27 titan kernel: [12049.478681]  O7[__delay+0x28/0x48]
Nov 16 21:41:27 titan kernel: [12049.531809]  I7[cheetah_xcall_deliver+0x1b8/0x23c]
Nov 16 21:41:27 titan kernel: [12049.601598]   CPU[  3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov 16 21:41:27 titan kernel: [12049.720351]  TPC[cpu_idle+0x94/0xb8]
Nov 16 21:41:27 titan kernel: [12049.775551]  O7[cpu_idle+0xa8/0xb8]
Nov 16 21:41:27 titan kernel: [12049.829725]  I7[start_kernel+0x31c/0x32c]
Nov 16 21:41:28 titan kernel: [12050.571422] SysRq : Show Global CPU Regs
Nov 16 21:41:28 titan kernel: [12050.617320] * CPU[  0]: TSTATE[] TPC[] TNPC[] TASK[bash:3157]
Nov 16 21:41:28 titan kernel: [12050.736074]   CPU[  1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov 16 21:41:28 titan kernel: [12050.854834]  TPC[update_stats_wait_end+0x24/0x88]
Nov 16 21:41:28 titan kernel: [12050.923565]  O7[sched_clock+0x10/0x30]
Nov 16 21:41:29 titan kernel: [12050.980856]  I7[pick_next_task_fair+0x24/0x44]
Nov 16 21:41:29 titan kernel: [12051.046480]   CPU[  2]: TSTATE[11009602] TPC[00441a78] TNPC[00441a7c] TASK[cat:4365]
Nov 16 21:41:29 titan kernel: [12051.164194]  TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov 16 21:41:29 titan kernel: [12051.235018]  O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov 16 21:41:29 titan kernel: [12051.303771]  I7[flush_dcache_page_all+0x178/0x240]
Nov 16 21:41:29 titan kernel: [12051.373560]   CPU[  3]: TSTATE[004480009602] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]

Re: Fix for sparc64 cpu hangs.

2007-11-10 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Wed, 07 Nov 2007 15:35:42 +0100
> 
>>> But I did the artificial tests, like running dpkg-query --search libc.so.6
>>> in loops, and this seems to work well. Thanks a lot!
>>>
>> I was running aptitude -u in a loop for half an hour now, and it didn't
>> crash, so I assume that fixed the bug. Many thanks for the patch David!
> 
> Many thanks for helping me track it down.

You're welcome!

The v880 is still running fine, I'll setup the stuff which was supposed
to be running on the machine during the next days, so we'll see how it
behaves under a higher load for a longer time soon.

Thanks again for looking into this annoying bug!


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: klibc sparc trouble with gcc > 4.0

2007-11-09 Thread Bernd Zeimetz
Oleg Verych wrote:
> == Mon, Nov 05, 2007 at 02:55:45PM +0100, maximilian attems ==
> []
>> titan:~# strace -vfF /usr/lib/klibc/bin/fstype
>> execve("/usr/lib/klibc/bin/fstype", ["/usr/lib/klibc/bin/fstype"],
>> ["SHELL=/bin/bash", "TERM=xterm", "SSH_CLIENT=[myip] 39403"...,
>> "SSH_TTY=/dev/pts/0", "USER=root",
>> "LS_COLORS=no=00:fi=00:di=01;34:l"...,
>> "PATH=/usr/local/sbin:/usr/local/"..., "MAIL=/var/mail/root",
>> "PWD=/root", "LANG=en_US.UTF-8", "PS1=\\h:\\w\\$ ", "HOME=/root",
>> "SHLVL=2", "LS_OPTIONS=--color=auto", "LOGNAME=root",
>> "SSH_CONNECTION=[myip] 3"..., "_=/usr/bin/strace", "OLDPWD=/"]) = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
> 
> gdb doesn't work/help?

(gdb) where
#0  0x8000faac in ?? ()
#1  0x8000facc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Not sure if this is a gdb problem, though - never even tried to debug klibc.
With the mentioned patch klibc compiles, but all utils just segfault,
strace is as short as seen above.

> 
> []
>> +++ b/usr/klibc/libgcc/__clzdi2.c
>> @@ -0,0 +1,23 @@
>> +/*
>> + * __clzdi2 - Returns the leading number of 0 bits in the argument
>> + */
>> +


without this patch it doesn't compile at all:


  KLIBCLD usr/klibc/libc.so
ld: sparc architecture of input file
`/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o)' is incompatible
with sparc:v9 output
ld: sparc architecture of input file
`/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clz.o)' is incompatible
with sparc:v9 output
/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o): In function
`__clzdi2':
(.text+0xc): undefined reference to `_GLOBAL_OFFSET_TABLE_'
/usr/lib/gcc/sparc-linux-gnu/4.2.3/libgcc.a(_clzdi2.o): In function
`__clzdi2':
(.text+0x14): undefined reference to `_GLOBAL_OFFSET_TABLE_'
make[3]: *** [usr/klibc/libc.so] Error 1
make[2]: *** [all] Error 2
make[1]: *** [klibc] Error 2
make[1]: Leaving directory `/root/klibc-1.5.7'

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: Fix for sparc64 cpu hangs.

2007-11-07 Thread Bernd Zeimetz

> But I did the artificial tests, like running dpkg-query --search libc.so.6
> in loops, and this seems to work well. Thanks a lot!
> 

I was running aptitude -u in a loop for half an hour now, and it didn't
crash, so I assume that fixed the bug. Many thanks for the patch David!

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-11-06 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Tue, 06 Nov 2007 04:51:07 +0100
> 
>> Here's also some output from apt-get which got stuck in my unstable
>> chroot while I wanted to retrieve the klibc source to try to debug it...
> 
> So the good news is that I started getting the hang seen
> on the Debian buildd on my workstation.
> 
> The bad news is that it's very sporadic, for a while I
> could trigger it during bootup, on every boot, and now
> I can't get it to wedge at all.
> 
> Anyways, we're getting closer.


Running stress -c 2 on a 4 CPU machine made things much worse here, so
it will probably help you trigger the bug, too.
Our US II machine is also running just fine at the moment.



-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-11-05 Thread Bernd Zeimetz
4070] SysRq : Show Global CPU Regs
Nov  6 04:43:35 titan kernel: [100912.520982] * CPU[  0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov  6 04:43:35 titan kernel: [100912.641822]   CPU[  1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov  6 04:43:35 titan kernel: [100912.761614]  TPC[update_stats_wait_end+0x24/0x88]
Nov  6 04:43:35 titan kernel: [100912.831396]  O7[sched_clock+0x10/0x30]
Nov  6 04:43:35 titan kernel: [100912.889728]  I7[pick_next_task_fair+0x24/0x44]
Nov  6 04:43:35 titan kernel: [100912.956393]   CPU[  2]: TSTATE[004411009601] TPC[004288a0] TNPC[004288a4] TASK[swapper:0]
Nov  6 04:43:35 titan kernel: [100913.076191]  TPC[cpu_idle+0x94/0xb8]
Nov  6 04:43:35 titan kernel: [100913.132429]  O7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:35 titan kernel: [100913.187640]  I7[after_lock_tlb+0x19c/0x1b0]
Nov  6 04:43:35 titan kernel: [100913.251181]   CPU[  3]: TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] TASK[apt-get:11759]
Nov  6 04:43:36 titan kernel: [100913.375147]  TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov  6 04:43:36 titan kernel: [100913.447011]  O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov  6 04:43:36 titan kernel: [100913.516805]  I7[flush_dcache_page_all+0x178/0x240]
Nov  6 04:43:36 titan kernel: [100914.295153] SysRq : Show Global CPU Regs
Nov  6 04:43:37 titan kernel: [100914.342110] * CPU[  0]: TSTATE[] TPC[] TNPC[] TASK[bash:11762]
Nov  6 04:43:37 titan kernel: [100914.462949]   CPU[  1]: TSTATE[004411009604] TPC[0045731c] TNPC[00457320] TASK[swapper:0]
Nov  6 04:43:37 titan kernel: [100914.582741]  TPC[update_stats_wait_end+0x24/0x88]
Nov  6 04:43:37 titan kernel: [100914.652524]  O7[sched_clock+0x10/0x30]
Nov  6 04:43:37 titan kernel: [100914.710855]  I7[pick_next_task_fair+0x24/0x44]
Nov  6 04:43:37 titan kernel: [100914.777522]   CPU[  2]: TSTATE[009911009601] TPC[0042888c] TNPC[00428890] TASK[swapper:0]
Nov  6 04:43:37 titan kernel: [100914.897319]  TPC[cpu_idle+0x80/0xb8]
Nov  6 04:43:37 titan kernel: [100914.953557]  O7[cpu_idle+0xa8/0xb8]
Nov  6 04:43:37 titan kernel: [100915.008768]  I7[after_lock_tlb+0x19c/0x1b0]
Nov  6 04:43:37 titan kernel: [100915.072310]   CPU[  3]: TSTATE[11009601] TPC[00441a78] TNPC[00441a7c] TASK[apt-get:11759]
Nov  6 04:43:37 titan kernel: [100915.196274]  TPC[cheetah_xcall_deliver+0x174/0x23c]
Nov  6 04:43:37 titan kernel: [100915.268140]  O7[cheetah_xcall_deliver+0x6c/0x23c]
Nov  6 04:43:38 titan kernel: [100915.337932]  I7[flush_dcache_page_all+0x178/0x240]


Sorry to the klibc people - I'll try it again later.

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>



Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

> In the meantime I'll build an aptitude which should exit after running
> through the part which usually crashed, so it should be possible to run
> it in a loop...

This was successful - it made crashing the machine pretty simple, even
without activated libnss-db.

To reproduce on Etch:
- get the source of aptitude
- apply the attached patch
- rebuild the .deb, install it
- while true; do aptitude -u; done
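
If you want the loop to notice a stuck run instead of blocking forever, the
last step can be wrapped in a time limit. The following is only a sketch:
the function name loop_until_stuck is made up, and it relies on timeout(1)
from GNU coreutils, which exits with status 124 when the limit is hit.

```shell
# Run COMMAND repeatedly, each run under a time limit, and stop at the first
# run that exceeds it. COMMAND must be an executable, not a shell function,
# because it is started through timeout(1).
# Usage: loop_until_stuck LIMIT_SECONDS MAX_RUNS COMMAND [ARGS...]
# Example: loop_until_stuck 300 1000 aptitude -u
loop_until_stuck() {
    limit=$1
    max_runs=$2
    shift 2
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        run=$(( run + 1 ))
        status=0
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        # 124 is the exit status timeout(1) uses for "time limit reached".
        if [ "$status" -eq 124 ]; then
            echo "run $run exceeded ${limit}s"
            return 1
        fi
    done
    echo "completed $run runs without exceeding ${limit}s"
    return 0
}
```

A caveat: timeout only sends a signal. A process that is wedged inside the
kernel may not react to it, in which case this loop blocks as well and the
sysrq output is the only way to see what is going on.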

Some of the aptitudes hit a SIGABRT before one got stuck.

Best regards,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


aptitude.diff
Description: application/pgp-keys


aptitude-sysrq-q.txt.gz
Description: GNU Zip compressed data


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Sun, 04 Nov 2007 20:55:20 +0100
> 
>> So I'm not sure if the result is really useful for you - if not just let
>> me know. I've attached the last ~10-20 sysrq-g outputs - as it was
>> running in a loop I have a ton of them. In case you're wondering: http
>> is aptitude's http method.
> 
> The http module is stuck in a different place, I'll try to
> see if I can make sense of it.

In the meantime I'll build an aptitude which should exit after running
through the part which usually crashed, so it should be possible to run
it in a loop...

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-11-04 Thread Bernd Zeimetz

> Ok, the key in the trace is:
> 
> Nov  2 16:25:30 titan kernel: [  978.134874]   CPU[  1]: TSTATE[80009603] TPC[0067d2e0] TNPC[0067d2d4] TASK[aptitude:3204]
> Nov  2 16:25:30 titan kernel: [  978.257809]  TPC[_write_unlock_irq+0x20/0x110]
>  ...
> Nov  2 16:25:30 titan kernel: [  978.507778]   CPU[  3]: TSTATE[11009605] TPC[004419f8] TNPC[004419fc] TASK[aptitude:3203]
> Nov  2 16:25:30 titan kernel: [  978.630707]  TPC[cheetah_xcall_deliver+0x174/0x23c]
> 
> The first symbol is misleading, it says _write_unlock_irq but actually
> in the assembler the PC is in the spinlock read spinning loop
> section.  So actually it's hanging in _spin_lock().
> 
> CPU #3 is trying to send a cross-call message interrupt, but for
> some reason that isn't making forward progress.
> 
> Let's see what's calling these things by adding some more debugging
> information.  Please retry the test with the following patch on
> top of the original sysrq-g debugging patch and please get new
> logs when it hangs.
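
With a ton of sysrq-g dumps it helps to count which symbols show up as the
trap PC most often. A rough sketch follows; the function name
tpc_histogram is made up, and it assumes the symbolic form
TPC[symbol+0xoff/0xlen] seen in the dumps in this thread. Raw entries such
as TPC[004419f8] are skipped on purpose.

```shell
# Read sysrq register dumps on stdin and print "count symbol" for every
# symbolic TPC entry, most frequent first.
# Usage: tpc_histogram < logfile
tpc_histogram() {
    # Keep only entries of the form TPC[symbol+0xoff/0xlen]; TNPC[...] and
    # plain hex TPC[...] entries do not match this pattern.
    grep -o 'TPC\[[A-Za-z_][A-Za-z0-9_]*+0x[0-9a-f]*/0x[0-9a-f]*]' |
        sed 's/^TPC\[\([^+]*\)+.*/\1/' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

Keep in mind what is said above about the symbol lookup: the name printed
for a PC can be misleading, so the histogram is a starting point and not a
diagnosis.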


Today I was a bit out of luck: either the machine crashed so badly that
it just didn't react to anything anymore, or it didn't crash at all.
The machine went amok a bit slower when I did the following things,
which also resulted in the attached sysrq output.
- run stress -c 2 to get the load up, didn't need that the last time...
- run something like `while true; do echo g > /proc/sysrq-trigger; sleep
0.5; done`
- run aptitude -u several times until the machine died.
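
The sysrq loop from the second step can be written as a small bounded
function, so it does not have to be killed by hand. This is a sketch only;
the function name sysrq_loop is made up. The trigger file is a parameter so
the function can be tried against an ordinary file first. On the real
/proc/sysrq-trigger it needs root, and what the g key does depends on the
kernel (here it is the debugging patch from this thread).

```shell
# Write a SysRq key to a trigger file COUNT times, sleeping INTERVAL seconds
# after each write.
# Usage: sysrq_loop TRIGGER_FILE KEY COUNT INTERVAL
# Example (as root): sysrq_loop /proc/sysrq-trigger g 1000 0.5
sysrq_loop() {
    trigger=$1
    key=$2
    count=$3
    interval=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each write is a separate request, just like "echo g > file".
        printf '%s\n' "$key" > "$trigger" || return 1
        i=$(( i + 1 ))
        sleep "$interval"
    done
    echo "sent '$key' $i times"
}
```

Fractional intervals such as 0.5 work with GNU sleep; other sleep
implementations may only accept whole seconds.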

So I'm not sure if the result is really useful for you - if not just let
me know. I've attached the last ~10-20 sysrq-g outputs - as it was
running in a loop I have a ton of them. In case you're wondering: http
is aptitude's http method.

We'll also run the patched kernel on a US II machine from tomorrow on,
but it always took longer until that one crashed, so we'll see if it
happens at all.

Thanks for your work,


Bernd


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


sysrq2.txt
Description: application/pgp-keys


Re: unkillable dpkg-query processes

2007-11-02 Thread Bernd Zeimetz
David Miller wrote:
> From: David Miller <[EMAIL PROTECTED]>
> Date: Thu, 01 Nov 2007 15:01:13 -0700 (PDT)
> 
>> I'm working on a kernel patch for 2.6.23 that will allow you to get
>> some useful debugging information in situations like this.
>>
>> I'll try to get you that patch by the end of tonight.
> 
> As promised, here is the patch below.

Thanks for the patch. Applied and used libnss-db + aptitude -u to hang
the machine.

I've sent g several times to sysrq, output is attached.
According to top the two hanging aptitude processes were running on CPU
1 + 3.

 3204 root  20   0 19552 5088 4072 R  100  0.1   6:54.49 1 aptitude
 3203 root  20   0 19552 5088 4072 R  100  0.1   6:56.39 3 aptitude


Cheers,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


sysrq-g.txt
Description: application/pgp-keys


Re: unkillable dpkg-query processes

2007-11-01 Thread Bernd Zeimetz


> The futex() calls are definitely from libnss-db.

And on Lenny/testing we have futex calls from libc6.
Didn't have the time to come up with any instructions yet as we have
public holidays today, I'll try to finish them tomorrow.

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Tue, 30 Oct 2007 01:50:30 +0100
> 
>> What we're missing here is a probably important piece:
>>
>> If dpkg-query is running during a build, it is running in a fakeroot
>> environment. I've straced that, see the attachment.
>>
>> What I find in the strace are at least several clones, which is the
>> point where aptitude -u crashed according to the straces in
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
> 
> Thanks for the fakeroot trace.
> 
> I am pretty sure the clone()'s we see here are just normal
> fork()'s, in both the fakeroot's dpkg-query and the aptitude
> case.

I just grepped through the source of aptitude: there's only one fork, but
that one should not be executed if aptitude has been started as root.
If it is of any use to you, I can figure out which piece of code
resulted in the call to clone, or which piece of code results in the use
of futexes there.
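
For anybody who wants to repeat the search: on Linux with NPTL,
pthread_create() also shows up in strace as clone(), so it makes sense to
look for thread creation as well as fork. A minimal sketch, with a made-up
function name, using grep's -r, -n and -w options:

```shell
# List source lines that mention calls which typically show up as clone()
# in strace output. Matches whole words only, so "forklift" is not reported.
# Usage: find_spawn_calls SOURCE_DIR
find_spawn_calls() {
    grep -rnw -e fork -e vfork -e clone -e pthread_create "$1"
}
```

This only finds direct calls in the tree that is searched; calls made
inside libraries that the program links against will not show up.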

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

>>> mount -t devpts none /dev/pts
>> mount --bind /dev /thechroot/dev
>> is what I use here, running udev in a chroot is no fun.
> 
> Ok.

AFAIK the buildds only have a minimal /dev, though. But to bootstrap a
system that's usually not enough.

> Let's stick to 2.6.23 testing for pinpointing these bugs.

Ok. Do you have a .deb with a kernel for me? If not, would you like to
have any specific options enabled? I'd have to build one then.


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz
Bernd Zeimetz wrote:
>> Here you go.
>>
>> (Mind, this is capturing the current status of the chroot, which is fairly
>> unclean, because right now it happens to be building python-qt4-4.3.1.)
> 
> What we're missing here is a probably important piece:
> 
> If dpkg-query is running during a build, it is running in a fakeroot
> environment. I've straced that, see the attachment.

What I forgot to mention: this strace was taken as a non-root user, of
course; I'm not sure what fakeroot does if it's called as root.


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

> Here you go.
> 
> (Mind, this is capturing the current status of the chroot, which is fairly
> unclean, because right now it happens to be building python-qt4-4.3.1.)

What we're missing here is a probably important piece:

If dpkg-query is running during a build, it is running in a fakeroot
environment. I've straced that, see the attachment.

What I find in the strace are at least several clones, which is the
point where aptitude -u crashed according to the straces in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>

execve("/usr/bin/fakeroot", ["fakeroot", "dpkg-query", "-S", "libc.so.6"], [/* 12 vars */]) = 0
brk(0) = 0xca000
uname({sys="Linux", node="titan", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fba000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=12402, ...}) = 0
mmap(NULL, 12402, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7fb4000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libncurses.so.5", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\263"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=208688, ...}) = 0
mmap(NULL, 208480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f8
mmap(0xf7fb, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0xf7fb
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libdl.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\0\f"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=18216, ...}) = 0
mmap(NULL, 82432, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7f68000
mprotect(0xf7f6c000, 57344, PROT_NONE) = 0
mmap(0xf7f7a000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0xf7f7a000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\22\0\0\0\1\0\1\364"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1419756, ...}) = 0
mmap(NULL, 1489032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7dfc000
mprotect(0xf7f5, 65536, PROT_NONE) = 0
mmap(0xf7f6, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0xf7f6
mmap(0xf7f66000, 6280, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f66000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fde000
mprotect(0xf7f7a000, 8192, PROT_READ) = 0
munmap(0xf7fb4000, 12402) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
open("/dev/tty", O_RDWR|O_NONBLOCK|O_LARGEFILE) = 3
close(3) = 0
brk(0) = 0xca000
brk(0xec000) = 0xec000
getuid32() = 1000
getgid32() = 1000
geteuid32() = 1000
getegid32() = 1000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
time(NULL) = 1193705202
open("/proc/meminfo", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7fb8000
read(3, "MemTotal:  8314712 kB\nMemFre"..., 1024) = 624
close(3) = 0
munmap(0xf7fb8000, 8192) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGCHLD, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 0xf7e32cb8, 4294967295) = 0
uname({sys="Linux", node="titan", ...}) = 0
stat64("/home/foo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) =

Re: unkillable dpkg-query processes

2007-10-29 Thread Bernd Zeimetz

>   mount -t devpts none /dev/pts

mount --bind /dev /thechroot/dev
is what I use here, running udev in a chroot is no fun.

> So, it's a lot more than just running the appropriate debootstrap
> command.

I'm almost done with a howto which is 95% cut&paste to debootstrap
and boot a Debian system. Unfortunately it doesn't boot, as the klibc
(which is used in the initramfs) is broken on sparc again...
So I'll modify it to set up a proper chroot only; it should also allow you
to boot into it if you use the kernel/initrd from Ubuntu.
This should allow Josip and you to set up a complete chroot.

> I have done a GCC package build and am now running a libc6 build under
> this lenny chroot and haven't hit any problems yet.

The following things also like to crash here (on Etch, not in a chroot):
- running aptitude -u several times (at least with libnss-db installed)
- since I've installed 2.6.24-rc1: vgdisplay (with and without active
libnss-db)


> BTW, in your buildroot, can you do something like:
> 
>   strace -o x.log dpkg-query -S libc.so.6

There are some comparisons of the strace of aptitude -u in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433187#102
Probably interesting, as there are futexes in the game.

The interesting thing is that it didn't crash the machine while running
under strace.
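
A rough way to compare such straces is to count how often clone() and
futex() show up in each log. The helper below is only a sketch: the
function name count_syscall is made up for illustration, and it assumes
the usual strace line layout (syscall name at the start of the line,
optionally preceded by "[pid N] ").

```shell
# Count how often one syscall appears in an strace log.
# Usage: count_syscall <syscall-name> <logfile>
# (count_syscall is a hypothetical helper, not part of strace.)
count_syscall() {
    # Match "name(" at line start, with an optional "[pid  1234] " prefix.
    grep -c -E "^(\[pid +[0-9]+\] )?$1\(" "$2"
}
```

For example, "count_syscall futex x.log" on the log written by
"strace -o x.log dpkg-query -S libc.so.6" prints the number of futex
calls; note that grep -c prints 0 and returns a non-zero status when
nothing matches.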

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
David Miller wrote:
> From: Bernd Zeimetz <[EMAIL PROTECTED]>
> Date: Mon, 29 Oct 2007 02:18:30 +0100
> 
>> But if this bug isn't fixed chances are good that the next Debian
>> release won't support Sparc at all.
> 
> Please don't use pseudo-threats like this, it only deters me even more
> from working on this bug.

This was not meant as a threat, it's just a fact and the reason why I'm
spending way too much time on trying to make this bug reproducible and
also the reason why we're annoying you these days. Sorry for that.


>> This explains why you have trouble to reproduce this, while Josip and me
>> get hit by this bug way too often.
> 
> Josip stated explicitly that he has a SunFire280R, which disagrees
> with what you're saying here.

Sorry, I mixed something up here. I was somehow sure that they were
using a v440, but it was somebody else.



-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz

Sorry, but if there were an easy test case we'd be more than happy
to present it - unfortunately there is not. This is more than
annoying for us all. But if this bug isn't fixed, chances are good that
the next Debian release won't support Sparc at all.

> I have ubuntu gutsy on my SunFire280R, so I can debootstrap
> debian chroots or whatever is needed to trigger this.

You need a Blade 1000/2000 or v440/v880 or an enterprise class machine
to reproduce this more easily (still assuming that we're facing the same
bug here - at least the symptoms are the same). Those machines use
repeater chips as interconnect between two CPUs (and between pairs of
CPUs on larger machines); according to my contact from Sun this is
similar to what's implemented in one US IV CPU.
This explains why you have trouble reproducing this, while Josip and I
get hit by this bug way too often.

On all other machines using CPUs <= US III I have no idea how to
reproduce this easily - you just get hit by it after $random builds.
I don't have access to more recent hardware.

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
Bernd Zeimetz wrote:
> Hi,
> 
>> Since mono team decided that the mono is broken on Sparc (and despite
>> the fix provided by David Miller), I had to rebuild after enabling the
>> sparc
>> arch in the source.

> Trying this at the moment.

Not reproducible - mono fails to build from source in sid, so it
doesn't reach the interesting part of dh_shlibdeps...


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
Hi,

> Since mono team decided that the mono is broken on Sparc (and despite
> the fix provided by David Miller), I had to rebuild after enabling the
> sparc
> arch in the source.
> 
> The hangs happens always at the end of the buid when invoking
> dh_shgenlibs in the build.
> 
> This is not 100% reproducable even in my env.

Trying this at the moment.

> Second was sun blade 2000 SMP with Ubuntu gutsy, I wasn't able to update
> the xemacs21 package.
> The machine hanged with invoking the post installation script.

Does the Blade run with one or two CPUs? If I remember right, they
support running with a single CPU, which has to be inserted in a special
slot/carrier for that. With two CPUs it should use the same repeater
chips and architecture as the v440, v880 and larger machines.


Cheers,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz
Hi,

Please note that the futex bug also happens on US II machines;
it is just almost impossible to reproduce it there - the machine will
simply hang after random days of building.

> Everyone who sees these UltraSPARC-III problems please send me PRECISE
> and FULL description of how to install from scratch a machine and run
> something that will trigger these errors.

Can you please check if the kernel config I've attached to one of my
last mails is fine for you? The normal Debian installer doesn't
boot on the US III machines which use two CPUs on one board, as the
installer's kernel is a non-SMP kernel; the result is that the
machine throws a CPU exception and needs to be power-cycled.

I've started to investigate there with the help of a contact from
Sun, but neither of us had the time to finish this. See
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=440720 if you want
to have a look; please ignore those troll postings from chealer
in between...

So to give you a recipe to install Debian on such a box, I need to
build an installer with an SMP kernel for you. If the config is fine
for your needs, I could just use it.


The other option is to use debootstrap, if you have some system
on the machine already - so if you want to use that instead of
messing with a network installer, please let me know.
Debootstrap should run on most systems, as long as they have
ar/tar/gunzip and a bash (probably sh is enough...).
Would be faster to use that, and faster to write a recipe for
that.
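
As a rough idea of what such a debootstrap recipe might boil down to,
here is a dry-run sketch. It is not the promised howto: the helper
names run and setup_chroot, the DRY_RUN switch and the mirror URL are
assumptions for illustration, and the /proc mount is an addition that
is not mentioned in this thread. By default it only prints the
commands; set DRY_RUN=0 (as root) to actually execute them.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
# (run, setup_chroot and DRY_RUN are hypothetical names.)
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: setup_chroot <target-dir> <suite> <mirror-url>
setup_chroot() {
    target=$1 suite=$2 mirror=$3
    # Unpack a minimal Debian system for sparc into the target directory.
    run debootstrap --arch sparc "$suite" "$target" "$mirror"
    # Reuse the host's /dev instead of running udev inside the chroot.
    run mount --bind /dev "$target/dev"
    # Many tools inside the chroot expect /proc to be mounted.
    run mount -t proc proc "$target/proc"
}
```

Calling "setup_chroot /thechroot lenny <mirror-url>" then lists the
three commands in order; the suite and target shown are only examples.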

I'll mark all Qlogic firmware related points, so the recipe should
work on machines with (v440, v880, probably the Enterprise models,
too) and without FC (I guess the Blade 1000 and 2000).


If you don't have access to a US-III machine, I can find a way
to give you access to the RSC and serial console of our machine.


Cheers,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz

>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
>> 0042f928 Y: Not tainted
>> [29074.884191] TPC: 
> 
> What kind of OOPS is this?  Please provide the kernel log messages
> that appeared right before these register dumps.


Oct 28 03:25:12 titan kernel: [29074.698695] BUG: soft lockup - CPU#0
stuck for 11s! [sh:4252]

This happened while a cronjob was running which updates the libnss-db
database... With an older kernel (2.6.23-rcsomething) this didn't crash
the machine.
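
To pull all such soft lockup reports out of a kernel log in one go,
something like the following could be used. It is a sketch only: the
function name soft_lockups is made up, it assumes each report sits on
a single line (unlike the mail-wrapped copy above), and it relies on a
sed that understands -E.

```shell
# Summarise "BUG: soft lockup" lines from kernel logs (files or stdin).
# (soft_lockups is a hypothetical helper name.)
soft_lockups() {
    # Capture CPU number, stuck time, command name and pid of each report.
    sed -n -E 's/.*BUG: soft lockup - CPU#([0-9]+) stuck for ([0-9]+)s! \[([^:]+):([0-9]+)\].*/cpu=\1 secs=\2 comm=\3 pid=\4/p' "$@"
}
```

Run as "soft_lockups /var/log/syslog" or with the log piped in on
stdin; lines that are not soft lockup reports are dropped.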


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-28 Thread Bernd Zeimetz

>> I think things got worse with 2.6.24...
>> The machine shoots itself now, I guess by running cron jobs or so.
>>
>> [29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
>> 0042f928 Y: Not tainted
>> [29074.884191] TPC: 
> 
> What kind of OOPS is this?  Please provide the kernel log messages
> that appeared right before these register dumps.

I'll boot the machine and check the logs, was not in the mood to do
this tonight. The pasted messages were dumped on the serial console -
as the machine didn't show any reaction I only powered it down...


Cheers,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-27 Thread Bernd Zeimetz



I think things got worse with 2.6.24...
The machine shoots itself now, I guess by running cron jobs or so.

[29074.766486] TSTATE: 11009600 TPC: 0042f984 TNPC: 
0042f928 Y: Not tainted
[29074.884191] TPC: 
[29074.929988] g0:  g1: 004417ec g2:  
g3: 
[29075.034163] g4: f8a00493a4e0 g5: f89fff97c000 g6: f8a006c64000 
g7: 
[29075.138329] o0:  o1: f8a006c67968 o2: 0008 
o3: 0001
[29075.242493] o4: 3385 o5:  sp: f8a006c67011 
ret_pc: 0042f980
[29075.350830] RPC: 
[29075.392482] l0: 0020 l1:  l2: 0096 
l3: 
[29075.496658] l4: 0200 l5: 0001c5569e6c l6: 0006c390404c 
l7: 6204052f31ec823e
[29075.600824] i0: 0044d100 i1: 00b0fcc2c000 i2:  
i3: 
[29075.704989] i4: 0040 i5: 007a0578 i6: f8a006c670d1 
i7: 004420d8
[29075.809161] I7: 
[29075.867493] BUG: soft lockup - CPU#2 stuck for 11s! [sh:4253]
[29075.936259] TSTATE: 11009600 TPC: 004417a8 TNPC: 
004417ac Y: Not tainted
[29076.053980] TPC: 
[29076.113311] g0:  g1:  g2:  
g3: 
[29076.217483] g4: f8a0048f9260 g5: f89fff98c000 g6: f8a006c7 
g7: 
[29076.321648] o0: 0020 o1: f8a006c73968 o2: 0002 
o3: 0001
[29076.425816] o4: 781b o5:  sp: f8a006c73011 
ret_pc: 004416a0
[29076.534150] RPC: 
[29076.592471] l0: 0008 l1:  l2: 0096 
l3: 
[29076.696645] l4: 0200 l5: 0001c5569e6c l6: 0006c3904054 
l7: 7e645445948ed154
[29076.800811] i0: 0044d100 i1: 00b0fcf8 i2:  
i3: 
[29076.904977] i4: 0040 i5: 007a0578 i6: f8a006c730d1 
i7: 004420d8
[29077.009144] I7: 

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-27 Thread Bernd Zeimetz

> 
> Luckily much more output of sysrq is in the syslog, so I should be able to 
> mail it later when the
> machine is finished with rebooting (which takes some time...).

The sysrq output from the syslog and my kernel config are attached to this mail.

Cheers,

Bernd

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


config-2.6.24-rc1-git2+bzed-farm1.gz
Description: GNU Zip compressed data


syslog.gz
Description: GNU Zip compressed data


Re: unkillable dpkg-query processes

2007-10-27 Thread Bernd Zeimetz
Bernd Zeimetz wrote:
>>> For those who can reproduce it an have something like libnss-db
>>> enabled, try disabling it.
> 
> - disabled it
> - running vgdisplay killed the machine (wanted to create a new LV for a
> chroot)... it's not accessible at all anymore, I think the kernel is
> a 2.6.23-something here, I'll build a recent one and give it a try
> again Will take some time as I need to build on USII...


I just wanted to write that I'm not able to reproduce this bug
anymore... but running aptitude -u often enough gave me this nice output:


titan:~# [ 2427.313946] BUG: soft lockup - CPU#3 stuck for 11s! [aptitude:13375]
[ 2427.389128] TSTATE: 11009602 TPC: 0042f93c TNPC: 
0042f7d0 Y: Not tainted
[ 2427.506821] TPC: <__delay+0x1c/0x48>
[ 2427.549494] g0: 9000 g1: 0042f7d0 g2:  
g3: 
[ 2427.653670] g4: f8a00793c960 g5: f89fff994000 g6: f8a007dfc000 
g7: 
[ 2427.757835] o0: 0020 o1: 0020 o2:  
o3: 
[ 2427.862001] o4: 0030a0d0 o5:  sp: f8a007dff071 
ret_pc: 0042f938
[ 2427.970337] RPC: <__delay+0x18/0x48>
[ 2428.013031] l0: 0005a6cab647 l1: 11009601 l2: 004417a8 
l3: 0400
[ 2428.117206] l4:  l5: 0001 l6:  
l7: 0008
[ 2428.221374] i0:  i1: f8a007dffa88 i2: 0004 
i3: 0001
[ 2428.325538] i4:  i5:  i6: f8a007dff131 
i7: 004417ec
[ 2428.429710] I7: 

and an unkillable, cpu-eating aptitude.
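
"Running it often enough" can be automated with a small loop like the
one below. This is only a sketch under assumptions: repeat_until_hang
is a made-up name, and it relies on timeout(1) from GNU coreutils,
which exits with status 124 when the time limit is hit. A process that
is stuck in the kernel, like the aptitude above, may not react to the
signal at all; in that case the loop itself simply stops making
progress, which is still a visible symptom.

```shell
# Run a command repeatedly and stop at the first run that exceeds a limit.
# Usage: repeat_until_hang <runs> <limit-seconds> <command> [args...]
# (repeat_until_hang is a hypothetical helper name.)
repeat_until_hang() {
    runs=$1 limit=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        # timeout(1) exits with 124 if the command had to be terminated.
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i exceeded ${limit}s"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs runs finished"
}
```

For the case described here one might call
"repeat_until_hang 50 300 aptitude -u"; the run count and time limit
are arbitrary examples.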


While retrieving some info using sysrq, the machine froze after
echoing m into sysrq-trigger, producing this output while dying:

[ 3680.006794] BUG: soft lockup - CPU#1 stuck for 11s! [pdflush:265]
[ 3680.078838] TSTATE: 80009603 TPC: 004417a8 TNPC: 
004417ac Y: Not tainted
[ 3680.196551] TPC: 
[ 3680.255881] g0:  g1:  g2: 0001869e 
g3: 
[ 3680.360055] g4: f8a0048e3260 g5: f89fff984000 g6: f8a00717c000 
g7: 
[ 3680.464220] o0: 0020 o1: f8a00717f418 o2: f8a005a84040 
o3: 0010
[ 3680.568384] o4: 0015 o5:  sp: f8a00717eac1 
ret_pc: 004416e4
[ 3680.676719] RPC: 
[ 3680.735042] l0: 0002 l1: 0002 l2: 0096 
l3: 
[ 3680.839217] l4:  l5: f8a0048d3cd8 l6: 00024098 
l7: f7d31000
[ 3680.943382] i0: 0044d100 i1: 00b0f60f8000 i2:  
i3: 0001
[ 3681.047548] i4: 0001 i5: 0001 i6: f8a00717eb81 
i7: 00442be4
[ 3681.151717] I7: 



Luckily much more output of sysrq is in the syslog, so I should be able to
mail it later when the machine has finished rebooting (which takes some
time...).
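
For reference, sending sysrq keys from a shell can be wrapped like
this. It is a minimal sketch: the function name sysrq and the
SYSRQ_TRIGGER override are assumptions added so the helper can be
tried against an ordinary file; on a real system it writes to
/proc/sysrq-trigger and needs root.

```shell
# File the keys are written to; override it for a harmless dry run.
# (SYSRQ_TRIGGER and sysrq are hypothetical names.)
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}

# Usage: sysrq <key> [key...]   e.g. "sysrq m" as used above
sysrq() {
    for key in "$@"; do
        # Each key letter is written on its own, one write per key.
        printf '%s\n' "$key" > "$SYSRQ_TRIGGER"
    done
}
```

The resulting output ends up in the kernel log, which is why it could
still be found in the syslog after the freeze.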


 2.6.24-rc1-git2 (SMP)
 gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)


titan:~# cat /proc/cpuinfo
cpu : TI UltraSparc III (Cheetah)
fpu : UltraSparc III integrated FPU
prom: OBP 4.22.34 2007/07/23 13:01
type: sun4u
ncpus probed: 4
ncpus active: 4
D$ parity tl1   : 0
I$ parity tl1   : 0
Cpu0ClkTck  : 2cb41780
Cpu1ClkTck  : 2cb41780
Cpu2ClkTck  : 2cb41780
Cpu3ClkTck  : 2cb41780
MMU Type: Cheetah
State:
CPU0:   online
CPU1:   online
CPU2:   online
CPU3:   online
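
The CpuNClkTck values above are hexadecimal tick rates in Hz, so
2cb41780 corresponds to 750000000 Hz, i.e. 750 MHz per CPU. A small
sketch of the conversion (the function name clktck_mhz is made up for
illustration):

```shell
# Convert a hexadecimal tick rate in Hz, as shown in the CpuNClkTck
# fields, to whole MHz. (clktck_mhz is a hypothetical helper name.)
clktck_mhz() {
    echo $(( 0x$1 / 1000000 ))
}
```

"clktck_mhz 2cb41780" prints 750.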



-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>



Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz

>> For those who can reproduce it an have something like libnss-db
>> enabled, try disabling it.

- disabled it
- running vgdisplay killed the machine (I wanted to create a new LV for a
chroot)... it's not accessible at all anymore. I think the kernel is
a 2.6.23-something here; I'll build a recent one and give it a try
again. This will take some time as I need to build on USII...


-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Josip Rodin wrote:
> On Sat, Oct 27, 2007 at 12:30:56AM +0200, Bernd Zeimetz wrote:
>>> Josip, do you guys have libnss-db or similar in use on the buildd
>>> machine?
>> They have, that's what Debian's userdir-ldap uses.
> 
> No, I have to correct you, this machine isn't part of that setup
> (at least not yet).
> 

Oh ok, I stand corrected - thought it would have it.

-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz

> Josip, do you guys have libnss-db or similar in use on the buildd
> machine?

They have, that's what Debian's userdir-ldap uses.

> For those who can reproduce it an have something like libnss-db
> enabled, try disabling it.

Will do in a few minutes.



-- 
Bernd Zeimetz
<[EMAIL PROTECTED]> <http://bzed.de/>


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Hi,

I just got linked to this thread, so here's a bit of input from me :)


>> 1) system type
> 
> A Sun Fire 280R, with two CPU boards, each carrying a TI UltraSparc III
> (Cheetah), and 2 GB of RAM. If you need more info, just say.
> 
> (Bernd Zeimetz has previously suggested that the problem is linked to
> the processor type, the USIII.)

It seems to hit USIII machines with 2 CPUs in one tray much harder
than US II, but once a month our Ultra60 (running two US II) has the
same issues - it got much better since
179c85ea53bef807621f335767e41e23f86f01df, though. Before the mentioned
patch it died a few times per day. It seems it got better on the USIII
here, too (we have a v880 here, the large version of Josip's machine,
with 2x 2 CPUs), but it still dies way too often; it's just not usable
in the current state.


> 
>> 2) compiler used to build kernel and is it SMP?
> 
> gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Same compiler here.
Please note that non-SMP kernels do not boot on those US-III machines at
all (at least I didn't find a single one which does).



Cheers,

Bernd


Re: unkillable dpkg-query processes

2007-10-26 Thread Bernd Zeimetz
Hi,


> It seems that instead of getting stuck in the kernel where I
> thought it would, the process gets stuck elsewhere and
> also tends to loop allocating memory until all memory in the
> machine is exhausted and the OOM killer starts to try and
> kill processes left and right.

At least it runs with 100% CPU; attaching strace to the pid doesn't give
any results.
Strace-ing the whole process doesn't result in more useful output, but
the hanging processes were killable when they were running under strace...


Cheers,

Bernd