Hi! I got bitten by another hang, which will hopefully provide more information...

On 08/09/2012 05:42 PM, Andrew Deason wrote:
> On Thu, 09 Aug 2012 11:48:25 +0200
> Alexander 'Leo' Bergolth <l...@strike.wu.ac.at> wrote:
>> My box, using openafs-1.6.1 and kernel-2.6.32-131.17.1.el6.i686 on
>> Centos 6, just hung completely and had to be rebooted. It looks like
>> the problem was caused by a locking problem of the openafs kernel
>> module, all processes that e.g. used AFS authentication got stuck
>> inside libafs. (See the kernel call-traces below.)
>
> This would be more useful with a trace of all processes; all those show
> is that we're waiting for a lock. You can get that with 'echo t >
> /proc/sysrq-trigger'.

The output is available at:
http://leo.kloburg.at/tmp/openafs-1.6.1-hang/sysrq-show-state3.txt

> If you have the ability to run 'crash' (requires crash to be installed,
> and the running kernel debuginfo), you could also run something like
> this:
>
> # crash
> [...]
> crash> sym afs_global_owner
> crash> print ((int*)0xADDR)[0]
>
> where ADDR is the address printed out by 'sym'. If that prints out a
> valid pid, knowing information about that pid would be helpful.
> You could even:
>
> crash> set <pid>
> crash> bt

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.1.1.el6.i686/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-09-02-13:54:06/vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Sun Sep  2 13:52:55 2012
      UPTIME: 19 days, 23:36:15
LOAD AVERAGE: 32.27, 20.22, 9.60
       TASKS: 371
    NODENAME: strike.wu-wien.ac.at
     RELEASE: 2.6.32-279.1.1.el6.i686
     VERSION: #1 SMP Tue Jul 10 12:30:45 UTC 2012
     MACHINE: i686  (2991 Mhz)
      MEMORY: 3.8 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: c0a425e0  (1 of 2)  [THREAD_INFO: c0a1a000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> sym afs_global_owner
fa400228 (b) afs_global_owner [openafs]
crash> print ((int*)0xfa400228)[0]
$3 = 17283
crash> set 17283
    PID: 17283
COMMAND: "auth"
   TASK: eef0a550  [THREAD_INFO: f4554000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 17283  TASK: eef0a550  CPU: 1  COMMAND: "auth"
 #0 [f4555cc8] schedule at c083c5b3
 #1 [f4555d8c] __mutex_lock_slowpath at c083d943
 #2 [f4555db4] mutex_lock at c083d848
 #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
 #4 [f4555ddc] dentry_iput at c0540e18
 #5 [f4555df4] d_kill at c0540f3d
 #6 [f4555e00] dput at c05422f8
 #7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
 #8 [f4555e64] afs_syscall at fa373770 [openafs]
 #9 [f4555eac] afs_unlocked_ioctl at fa387cba [openafs]
#10 [f4555edc] proc_reg_unlocked_ioctl at c057bac1
#11 [f4555f00] vfs_ioctl at c053d6e9
#12 [f4555f1c] do_vfs_ioctl at c053d8c7
#13 [f4555f90] sys_ioctl at c053de91
#14 [f4555fb0] ia32_sysenter_target at c0409a98
    EAX: 00000036  EBX: 00000003  ECX: 40044301  EDX: bfb8ee6c
    DS:  007b      ESI: 00000014  ES:  007b      EDI: 00000003
    SS:  007b      ESP: bfb8ee18  EBP: bfb8ee98  GS:  0033
    CS:  0073      EIP: 00d32424  ERR: 00000036  EFLAGS: 00200213

> ('exit' to exit crash).
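
To make the trace above easier to scan, here is a minimal sketch that parses the 'bt' frames and picks out the innermost frame inside the openafs module. It is not part of crash or OpenAFS; the helper names parse_bt and first_module_frame are made up for illustration, and the regular expression only assumes the frame layout shown in the listing above.

```python
import re

# One frame of a crash 'bt' listing, e.g.
#   "#3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]"
# The trailing "[module]" part is optional (absent for core-kernel frames).
FRAME_RE = re.compile(
    r"#(?P<num>\d+)[ \t]+\[(?P<sp>[0-9a-f]+)\][ \t]+(?P<func>\S+)"
    r"[ \t]+at[ \t]+(?P<addr>[0-9a-f]+)(?:[ \t]+\[(?P<module>[^\]]+)\])?"
)


def parse_bt(text):
    """Return one dict per stack frame, innermost frame first."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "num": int(m.group("num")),
            "func": m.group("func"),
            "addr": m.group("addr"),
            "module": m.group("module"),  # None for core-kernel frames
        })
    return frames


def first_module_frame(frames, module):
    """Return the innermost frame belonging to `module`, or None."""
    for frame in frames:
        if frame["module"] == module:
            return frame
    return None


# Frames copied from the 'bt' output of PID 17283 above.
BT = """\
 #0 [f4555cc8] schedule at c083c5b3
 #1 [f4555d8c] __mutex_lock_slowpath at c083d943
 #2 [f4555db4] mutex_lock at c083d848
 #3 [f4555dc0] afs_dentry_iput at fa3d7208 [openafs]
 #4 [f4555ddc] dentry_iput at c0540e18
 #5 [f4555df4] d_kill at c0540f3d
 #6 [f4555e00] dput at c05422f8
 #7 [f4555e0c] afs_syscall_pioctl at fa3e4337 [openafs]
 #8 [f4555e64] afs_syscall at fa373770 [openafs]
 #9 [f4555eac] afs_unlocked_ioctl at fa387cba [openafs]
#10 [f4555edc] proc_reg_unlocked_ioctl at c057bac1
#11 [f4555f00] vfs_ioctl at c053d6e9
#12 [f4555f1c] do_vfs_ioctl at c053d8c7
#13 [f4555f90] sys_ioctl at c053de91
#14 [f4555fb0] ia32_sysenter_target at c0409a98
"""

frames = parse_bt(BT)
caller = first_module_frame(frames, "openafs")
openafs_funcs = [f["func"] for f in frames if f["module"] == "openafs"]

if __name__ == "__main__":
    # Innermost kernel frame, then the innermost openafs frame that led to it.
    print(frames[0]["func"], "<-", caller["func"])
    print("openafs frames:", ", ".join(openafs_funcs))
```

Run against the listing above, this reports that the task sleeping in schedule got there through afs_dentry_iput, and that four of the fifteen frames belong to the openafs module.
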
> Or, you could just cause the machine to dump
> core instead of simply rebooting, via 'echo c > /proc/sysrq-trigger'
> (assuming the machine is configured to capture core on a crash, but I
> think that's the default), and provide the resulting core. Such a core
> would contain a lot of information about everything that's running on
> the box, so you would not want to make that generally publicly
> available.

Please let me know if you need further information. (The crash dump is
available.)

I'd greatly appreciate if some AFS expert could take a look at the
problem!

Thanks,
--leo

P.S.: I am using the openafs-1.6.1-1.el6.i686 RPM for RHEL6.

-- 
e-mail   ::: Leo.Bergolth (at) wu.ac.at
fax      ::: +43-1-31336-906050
location ::: IT-Services | Vienna University of Economics | Austria

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info