We get an occasional deadlock happening on Solaris 5.10 using OpenAFS 1.4.11. After the problem starts, any attempt to use AFS on the machine freezes: For example:
% truss -f touch /afs/not-here 15694: execve("/usr/bin/touch", 0x08047E20, 0x08047E2C) argc = 2 15694: resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12 15694: resolvepath("/usr/bin/touch", "/usr/bin/touch", 1023) = 14 15694: sysconfig(_CONFIG_PAGESIZE) = 4096 15694: xstat(2, "/usr/bin/touch", 0x08047BF8) = 0 15694: open("/var/ld/ld.config", O_RDONLY) = 3 15694: fxstat(2, 3, 0x08047B38) = 0 15694: mmap(0x00000000, 128104, PROT_READ, MAP_SHARED, 3, 0) = 0xFEFA1000 15694: close(3) = 0 15694: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEF90000 15694: xstat(2, "/lib/libc.so.1", 0x08047440) = 0 15694: resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14 15694: open("/lib/libc.so.1", O_RDONLY) = 3 15694: mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEF80000 15694: mmap(0x00010000, 880640, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEEA0000 15694: mmap(0xFEEA0000, 775469, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEEA0000 15694: mmap(0xFEF6E000, 26855, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 778240) = 0xFEF6E000 15694: mmap(0xFEF75000, 5016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xFEF75000 15694: munmap(0xFEF5E000, 65536) = 0 15694: memcntl(0xFEEA0000, 123376, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0 15694: close(3) = 0 15694: munmap(0xFEF80000, 32768) = 0 15694: mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEF80000 15694: getcontext(0x080479B0) 15694: getrlimit(RLIMIT_STACK, 0x080479A8) = 0 15694: getpid() = 15694 [15692] 15694: lwp_private(0, 1, 0xFEF82000) = 0x000001C3 15694: setustack(0xFEF82060) 15694: sysi86(SI86FPSTART, 0xFEF75A58, 0x0000133F, 0x00001F80) = 0x00000001 15694: brk(0x08062758) = 0 15694: brk(0x08064758) = 0 15694: xstat(2, "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 0x08046D08) = 015694: resolvepath("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", "/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", 1023) = 44 15694: open("/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3", O_RDONLY) = 3 15694: mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFEE90000 15694: mmap(0x00010000, 2297856, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEC00000 15694: mmap(0xFEC00000, 2225278, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEC00000 15694: mmap(0xFEE2F000, 4234, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 2224128) = 0xFEE2F000 15694: munmap(0xFEE20000, 61440) = 0 15694: memcntl(0xFEC00000, 7188, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0 15694: close(3) = 0 15694: xstat(2, "/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", 0x08046C60) = 0 15694: resolvepath("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", "/usr/lib/locale/common/methods_unicode.so.3", 1023) = 43 15694: open("/usr/lib/locale/en_US.UTF-8/methods_en_US.UTF-8.so.3", O_RDONLY) = 3 15694: mmap(0xFEE90000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFEE90000 15694: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFEE80000 15694: mmap(0x00010000, 122880, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFEE60000 15694: mmap(0xFEE60000, 55437, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFEE60000 15694: mmap(0xFEE7D000, 2524, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 53248) = 0xFEE7D000 15694: munmap(0xFEE6E000, 61440) = 0 15694: memcntl(0xFEE60000, 2532, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0 15694: close(3) = 0 15694: xstat(2, "/usr/lib/locale/en_US.UTF-8/libc.so.1", 0x08046C60) Err#2 ENOENT 15694: munmap(0xFEE90000, 32768) = 0 15694: sysconfig(_CONFIG_PAGESIZE) = 4096 FREEZE On a machine which has not had the problem (yet...) the output continues ... 24608: sysconfig(_CONFIG_PAGESIZE) = 4096 24608: stat64("/afs/not-here", 0x08047C90) Err#2 ENOENT 24608: creat64("/afs/not-here", 0666) Err#30 EROFS 24608: open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSCMD.mo", O_RDONLY) Err#2 ENOENT 24608: fstat64(2, 0x08046EA0) = 0 24608: write(2, " t o u c h", 5) = 5 24608: write(2, " : ", 2) = 2 24608: write(2, " / a f s / n o t - h e r".., 13) = 13 24608: write(2, " c a n n o t c r e a".., 15) = 15 24608: _exit(1) In other words, the stat64 call accesses AFS and (on the machine with the problem), the thread gets stuck in the AFS tarbaby. I suspected it was due to logging, so I changed the configuration to mount a dedicated partition for /usr/vice/cache, and rebooted. The ' machine was fine for a month or two, but problem has re-occurred. The machine is used frequently (it's our main computer server for undergraduate classes) but "fortunately" AFS is not very popular here so most courses don't use it (partly because of nasty things like this happening now and then) and so the machine is still being used for non-AFS courses. Hence I hadn't tried to install a newer version of OpenAFS. If this is a known bug with OpenAFS, I will indeed ask them to take the machine offline long enough to fix this. (Political capital and all; I hope people understand.) I haven't tried to reproduce this bug (and wouldn't want to on the computer server!): it only seems to happen on these main compute servers -- never on my little research machines... :-( Any help would be appreciated. John _______________________________________________ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info