I don't know if our fileserver crashes are related to what Matthew Cocker and others are seeing, but we are indeed seeing problems here at UMich.
Background information: the machines in question are dual Pentium 4 machines with hyperthreading enabled, running Linux 2.4.26 (SMP) and glibc 2.3.2. The actual file storage is on "cheap" RAID devices that use multiple IDE drives internally but talk SCSI to the rest of the world. These RAIDs have their own set of problems, so I would not count them as super-reliable file storage. We're running the "pthreads" version of the fileserver.

I think we're seeing at least 3 distinct problems with OpenAFS 1.2.11.

The first may actually be networking. We get these with varying frequency in VolserLog:

    Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
    Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
    Tue Oct 12 13:39:23 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
    Tue Oct 12 15:06:46 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1

Helpful message, eh? It doesn't say what volume was being dumped, or where it was going. Most of these probably occur while backing things up to TSM. If I understand it right, we run buta there, and there are apparently issues with the amount of CPU it eats. I've wondered what percentage of our backups are failing, but I haven't heard of any negative consequences here so far. We have also at various times had problems with read-only replicas that are oddly truncated; that might or might not be a consequence of the previous problem.

Another, probably completely different, problem concerns volumes with really small volume IDs. Modern AFS software creates large 10-digit volume IDs, but we have volumes that were created long before AFS 3.1, with small 3-digit volume IDs. Those volumes are rapidly disappearing: one by one, during various restarts, the fileserver and salvager discard all the data, then the volume header.

The latest problem is of course "signal 6". That's generally from an assertion failure, which probably writes a nice message to stderr (probably pointed at /dev/null), and might leave an equally useful core dump, except Linux defaults to no core dumps. Oops.
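(An aside on that last point: the usual fix is to raise RLIMIT_CORE before the server gets going, either with "ulimit -c unlimited" in the init script or from the server's own startup code. Below is a minimal sketch of the latter using the standard setrlimit(2) interface; enable_core_dumps is my name for it, not anything in the OpenAFS tree.)

    #include <stdio.h>
    #include <sys/resource.h>

    /* Lift the core-file size limit so a later abort() from a failed
     * assertion actually leaves a core behind.  Daemons started from
     * init scripts often inherit a soft limit of 0. */
    static void
    enable_core_dumps(void)
    {
        struct rlimit rl;

        rl.rlim_cur = RLIM_INFINITY;   /* soft limit */
        rl.rlim_max = RLIM_INFINITY;   /* hard limit; raising it needs root */
        if (setrlimit(RLIMIT_CORE, &rl) < 0)
            perror("setrlimit(RLIMIT_CORE)");
    }

    int main(void)
    {
        enable_core_dumps();
        /* ... server proper would start here ... */
        return 0;
    }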
We've now managed to harvest one core dump, so here's interesting data from it.

[ In the below, I've obscured the following for privacy reasons: @IPADDR@ is a hexadecimal umich IP address belonging to some random machine about which we know little (though a transient portable or lab machine running Windows or MacOS is quite possible); @VICEID@ is the vice ID of some random and likely innocent person who might logically be using @IPADDR@. ]

    (gdb) where
    #0  0x400a6281 in __kill () at __kill:-1
    #1  0x40021811 in pthread_kill (thread=17978, signo=0) at signals.c:65
    #2  0x40021b1b in __pthread_raise (sig=1073904656) at signals.c:187
    #3  0x400a5ec4 in *__GI_raise (sig=17978) at ../linuxthreads/sysdeps/unix/sysv/linux/raise.c:34
    #4  0x400a75ed in *__GI_abort () at ../sysdeps/generic/abort.c:117
    #5  0x08096603 in osi_Panic (msg=0x0, a1=0, a2=0, a3=0) at ../rx/rx_user.c:199
    #6  0x08096637 in osi_AssertFailU (expr=0x0, file=0x0, line=0) at ../rx/rx_user.c:208
    #7  0x080a47fc in rx_SetSpecific (conn=0x4002a510, key=1, ptr=0x0) at ../rx/rx.c:6632
    #8  0x0805defc in h_TossStuff_r (host=@IPADDR@) at ../viced/host.c:765
    #9  0x0805cb6d in h_Release_r (host=@IPADDR@) at ../viced/host.c:280
    #10 0x0805ca71 in h_Release (host=0x0) at ../viced/host.c:258
    #11 0x0805e14e in h_Enumerate (proc=0x8060fa0 <CheckHost>, param=0x0) at ../viced/host.c:913
    #12 0x080613cd in h_CheckHosts () at ../viced/host.c:2080
    #13 0x0804b6c7 in HostCheckLWP () at ../viced/viced.c:731
    #14 0x4001ed03 in pthread_start_thread (arg=0xb31ffbe0) at manager.c:300

In case that's not completely obvious:

    0x80a47e8 <rx_SetSpecific+312>: movl   $0x80c8a00,0x4(%esp,1)
    0x80a47f0 <rx_SetSpecific+320>: movl   $0x80c7480,(%esp,1)
    0x80a47f7 <rx_SetSpecific+327>: call   0x8096610 <osi_AssertFailU>

    (gdb) x/8x $ebp
    0xb31ffa14:     0xb31ffa44      0x080a47fc      0x080c7480      0x080c8a00
    0xb31ffa24:     0x000019e5      0x400f0ce6      0x40a00010      0x08f3d84c
    (gdb) x/i 0x080a47fc
    0x80a47fc <rx_SetSpecific+332>: jmp    0x80a46db <rx_SetSpecific+43>
    (gdb) x/s 0x080c7480
    0x80c7480 <rcsid+3972>: "pthread_mutex_lock(&conn->conn_data_lock) == 0"
    (gdb) x/s 0x080c8a00
    0x80c8a00 <rcsid+80>:   "../rx/rx.c"
    (gdb) print/d 0x000019e5
    $8 = 6629

Basically, it's dying at line 6629 in rx/rx.c, which reads thusly:

    MUTEX_ENTER(&conn->conn_data_lock);

The return code from pthread_mutex_lock would be useful, but the assertion logic in OpenAFS doesn't save it.
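(For what it's worth, MUTEX_ENTER in the pthread build amounts to asserting that pthread_mutex_lock() returned 0, so the interesting error code is gone by the time abort() runs. Here's a sketch of a variant that keeps it around for the core dump; the names are mine, not OpenAFS's.)

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Like MUTEX_ENTER, but keep pthread_mutex_lock's return code in a
     * frame-local variable so gdb can still read it out of the core
     * (e.g. EINVAL for a scribbled-on mutex, EDEADLK for a relock on an
     * error-checking mutex). */
    static void
    mutex_enter_checked(pthread_mutex_t *m, const char *file, int line)
    {
        int code = pthread_mutex_lock(m);

        if (code != 0) {
            fprintf(stderr, "pthread_mutex_lock => %d at %s:%d\n",
                    code, file, line);
            abort();    /* 'code' is still live in this frame */
        }
    }

    #define MUTEX_ENTER_CHECKED(m) mutex_enter_checked((m), __FILE__, __LINE__)

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

        MUTEX_ENTER_CHECKED(&m);    /* succeeds here; a corrupt mutex would abort */
        pthread_mutex_unlock(&m);
        return 0;
    }

Of course, a mutex as thoroughly trashed as the one below might just as easily hang or crash inside pthread_mutex_lock rather than politely return an error.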
Here's that data structure:

    (gdb) print *conn
    $7 = {next = 0x0, peer = 0x0, conn_call_lock = {__m_reserved = -1289749536,
        __m_count = -1291841536, __m_owner = 0x0, __m_kind = 0, __m_lock = {
          __status = 0, __spinlock = 0}}, conn_call_cv = {__c_lock = {__status = 0,
          __spinlock = 0}, __c_waiting = 0x0}, conn_data_lock = {__m_reserved = 0,
        __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0,
          __spinlock = 0}}, epoch = 0, cid = 0, error = 0, call = {0x0, 0x0, 0x0, 0x0},
      callNumber = {0, 0, 0, 0}, serial = 0, lastSerial = 0, maxSerial = 0,
      challengeEvent = 0x0, delayedAbortEvent = 0x0, checkReachEvent = 0x0,
      abortCount = 0, service = 0x0, serviceId = 0, refCount = 0, flags = 0 '\0',
      type = 0 '\0', secondsUntilPing = 0 '\0', securityIndex = 0 '\0',
      securityObject = 0x0, securityData = 0x0, securityHeaderSize = 0,
      securityMaxTrailerSize = 0, timeout = 0, lastSendTime = 0,
      secondsUntilDead = 0, hardDeadTime = 0, ackRate = 0 '\0',
      makeCallWaiters = 0 '\0', nSpecific = 0, specific = 0x0}

Unfortunately, that's not really the data structure in question. The next stack frame up looks like this:

    (gdb) x/16x $ebp
    0xb31ffa44:     0xb31ffa74      0x0805defc      0x08f3d820      0x00000001
    0xb31ffa54:     0x00000000      0x40022050      0x00000000      0x40027c10
    0xb31ffa64:     0xb31ffa74      0x00000003      0x00000008      0x08132ba0
    0xb31ffa74:     0xb31ffa94      0x0805cb6d      0x08132ba0      0x00000000

The first 3 words here are the stack frame back link, the return address, and the first parameter, 'conn' ([ebp+0], [ebp+4], and [ebp+8] in the usual x86 frame layout). Gdb, however, thinks that's:

    (gdb) print/x conn
    $9 = 0x4002a510

So, yet again, source-level debuggers lose. (Admittedly, juggling -g and -O is a hard problem.) This is probably the real data:

    (gdb) print *((struct rx_connection *) 0x08f3d820)
    $10 = {next = 0x0, peer = 0x0, conn_call_lock = {__m_reserved = 0, __m_count = 0,
        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}},
      conn_call_cv = {__c_lock = {__status = 0, __spinlock = 105},
        __c_waiting = 0xffffb923}, conn_data_lock = {__m_reserved = -15030,
        __m_count = -14533, __m_owner = 0xffffc74e, __m_kind = -13409, __m_lock = {
          __status = -9283, __spinlock = -9021}}, epoch = 4294958286,
      cid = 4294960122, error = -7173, call = {0xffffe70b, 0xffffe75f, 0xfffff36f,
        0xfffff371}, callNumber = {4294964109, 4294964111, 4294965251, 4294965253},
      serial = 4294966684, lastSerial = 4294966684, maxSerial = -612,
      challengeEvent = 0xffffff9a, delayedAbortEvent = 0xffffff9b,
      checkReachEvent = 0x22e35, abortCount = 1145384520, service = 0x39,
      serviceId = 45936, refCount = 2308, flags = 16 '\020', type = 116 't',
      secondsUntilPing = 158 '\236', securityIndex = 8 '\b', securityObject = 0x0,
      securityData = 0x0, securityHeaderSize = 0, securityMaxTrailerSize = 0,
      timeout = 0, lastSendTime = 0, secondsUntilDead = 0, hardDeadTime = 0,
      ackRate = 0 '\0', makeCallWaiters = 0 '\0', nSpecific = 0, specific = 0x0}

Hmm. Looks kinda junky too, but that's apparently what's really there.

The caller is h_TossStuff_r, and here's the client structure it was probably working on:

    (gdb) print *host->FirstClient->next->next->next
    $16 = {next = 0x4242d5e4, host = @IPADDR@, sid = -562995324, tcon = 0x8f3d820,
      CPS = {prlist_len = 7, prlist_val = 0x0}, ViceId = @VICEID@,
      expTime = 1097781124, LastCall = 1097601371, VenusEpoch = 2334476595,
      refCount = 0, deleted = 1 '\001', authClass = 2 '\002', prfail = 0 '\0',
      InSameNetwork = 0 '\0', lock = {wait_states = 0 '\0', excl_locked = 0 '\0',
        readers_reading = 0 '\0', num_waiting = 0 '\0', mutex = {__m_reserved = 0,
          __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0,
            __spinlock = 0}}, read_cv = {__c_lock = {__status = 0, __spinlock = 0},
        __c_waiting = 0x0}, write_cv = {__c_lock = {__status = 0, __spinlock = 0},
        __c_waiting = 0x0}}}

There's the same connection structure, here called "tcon", and it looks like the logic should be calling rx_SetSpecific() on that. I don't see anything further in that client chain that has a non-zero tcon. That doesn't mean there wasn't one; I kinda lost interest in looking for it after I got to pasting this:

    print *host->FirstClient->next->next->next->next->next->next->next->
    next->next->next->next->next->next->next->next->next->next->next->
    next->next->next->next->next->next->next->next->next->next->next->
    next->next->next->next->next->next->next->next->next->next->next->
    next->next->next->next->next->next->next->next->next->next->next->
    next->next->next

(on one line, not wrapped like this)

All but the first 3 were marked "deleted", and nearly all of them belonged to the same person; 2 near the start were for 32766, "nobody".

Anyways, I expect we'll get another core dump in a day or so.

				-Marcus Watts
				 UM ITCS
				 Umich Systems Group