Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Mon, Apr 15, 2013 at 4:20 PM, Andrew Deason adea...@sinenomine.net wrote:

> On Thu, 28 Mar 2013 16:38:55 -0500 Andrew Deason adea...@sinenomine.net wrote:
> What I was after is the stack trace of all of the LWPs in the buserver process. You cannot get at those easily, since LWP is a threading system that is not understood by the debugger (dbx or gdb). That's kinda why I was treating the 'core file' option as something where you give the core file to a developer. Getting that information by providing instructions to you makes this a bit more difficult... but is probably doable.

So, while I was waiting for some stuff to compile while trying this, I realized this might be fixed by http://git.openafs.org/?p=openafs.git;a=patch;h=dce2d8206ecd35c96e75cc0662432c2a4f9c3d7a

I'm not clear on what exactly the principal is for, but that patch does fix a bug that was introduced in the 1.6 series. Since there have not been many substantial changes to budb in general, and that change impacts the CreateDump function, it seems like a likely culprit. (To devs: the original change doesn't make a lot of sense to me; the commit messages suggest there are different structures in play, but the args and function parameters are all ktc_principal.)

*If* that's it, it would have to be a missing null termination. I looked at that case specifically and couldn't reproduce an issue, but it's conceivable I missed it due to differences in the host I tested on. The only other interesting thing was the potential for differences due to the offsetof changes, but that was also a red herring.

This actually isn't so bad if you rely on mdb to give you the stack traces. Attached is a dbx script that can be used to get some traces. This should probably live in a repo or something... somewhere. Do people have an opinion on where this should go?
We have scripts from Robert Milkowski for dtrace which similarly lack a home; properly they might be in their own module instead of openafs, but I don't think it would be particularly abusive to include them here.

Anyway, you can use it like this. If you compiled with LWP debug turned on, it's more likely to work (this means running ./configure with --enable-debug-lwp), but it's not required. Run:

  $ /opt/SUNWspro/bin/dbx /path/to/buserver /path/to/core
  [...]
  (dbx) source lwpstacks.ksh
  (dbx) lwpstacks

If you don't have LWP debug, this will fail (probably with something like "dbx: struct lwp_pcb is not defined[...]"). You can try running it without debug symbols (we'll guess at where some data is) by running this instead:

  (dbx) lwpstacks nodebug

With the script as-is, the 'nodebug' mode seems to work with OpenAFS 1.6.2 on Solaris 10 SPARC, but it may need fiddling to work anywhere else. If either of those works, you'll see something like:

  (dbx) lwpstacks nodebug
  !# NOT using debug symbols
  !# looking for threads in blocked
  ::echo stack pointer for thread 14a530: 1562d8
  0x001562d8::stack 0 ! sed 's/^/  /'
  ::echo
  ::echo stack pointer for thread 180cf8: 18caa0
  0x0018caa0::stack 0 ! sed 's/^/  /'
  [...]

To get actual stack traces out of that, pipe the output through mdb:

  (dbx) lwpstacks nodebug | mdb /path/to/buserver /path/to/core
  stack pointer for thread 14a530: 1562d8
    LWP_WaitProcess+0x38()
    rxi_Sleep+4()
    rx_GetCall+0x320()
    rxi_ServerProc+0x40()
    rx_ServerProc+0x74()
    Create_Process_Part2+0x40()
    0x68388()
    ubik_ServerInitCommon+0x23c()
  stack pointer for thread 180cf8: 18caa0
    LWP_WaitProcess+0x38()
  [...]

This output is similar enough to mdb ::findstack output that it will work with David Powell's munges script if you have that. But it's also pretty useful just by itself. Surprisingly, that doesn't require any manual core editing.
I think mdb is the only debugger I've used that lets you get stack trace information from arbitrary context (at least, I haven't seen an easy way to do this with gdb or dbx), but the way state is stored on Solaris on SPARC probably helps make that easier. If you want to provide such stack output from the core you captured, it may show what's going on.

-- Derrick
Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Thu, 28 Mar 2013, Andrew Deason wrote:

> On Thu, 28 Mar 2013 15:22:29 -0400 Eric Sturdivant sturd...@umd.edu wrote:
>> Ok, I've got core+tcpdump files from the 3 buserver processes, as well as one of the butc processes. Anything in particular you are interested in seeing?
>
> What I was after is the stack trace of all of the LWPs in the buserver process. You cannot get at those easily, since LWP is a threading system that is not understood by the debugger (dbx or gdb). That's kinda why I was treating the 'core file' option as something where you give the core file to a developer. Getting that information by providing instructions to you makes this a bit more difficult... but is probably doable. I've only done that a couple of times before, and it involved hex-editing the core file; it may be a bit easier with Solaris dbx, but I'll need a little time to look into it. If someone else has information on that handy or something, go ahead :)

Ah, gotcha. I was assuming the core file would contain a copy of the KeyFile and shouldn't be passed around -- is this not the case?

--
Eric Sturdivant
University of Maryland
Division of Information Technology
Enterprise Unix Services
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Wed, 27 Mar 2013, Andrew Deason wrote:

> On Wed, 27 Mar 2013 16:05:06 -0400 (EDT) Eric Sturdivant sturd...@umd.edu wrote:
>> Info from the buserver and butc:
>
> Oh, butc is threaded; cool. You'd be more likely to get more useful results if you used 'pstack' like Derrick said, or dbx instead of gdb.

  core 'butc-core.19453' of 19453:  /usr/afsws/etc/butc 3106 -localauth
  -----------------  lwp# 1 / thread# 1  --------------------
   ff17ac78 lwp_park (0, 0, 0)
   ff174cc0 cond_wait_queue (447920, 10f6d8, 0, 0, 1c00, 0) + 4c
   ff175208 cond_wait (447920, 10f6d8, ffbfedb4, 4478f8, f70c8, 107c98) + 10
   ff175244 pthread_cond_wait (447920, 10f6d8, 0, cc800, 10f400, 0) + 8
   0007df94 rx_GetCall (4478b0, 4476c0, ffbfedb4, 4478f8, f70c8, 107c98) + 52c
   0007d560 rxi_ServerProc (6, 0, ffbfedb4, 0, cdc00, f7044) + 38
   00079e4c rx_ServerProc (110400, , 7c00, cb800, 4f, f7000) + 168
   0007b848 rx_StartServer (0, 10f400, 0, 1, f6c00, f70fc) + 12c
   0002a8e0 (0, 22c00, ea400, f8558, 1, 0)
   000b0050 cmd_Dispatch (3, 1, 1118b0, 5, 111914, f7368) + 8f4
   000224c0 _start (0, 0, 0, 0, 0, 0) + 108
  -----------------  lwp# 2 / thread# 2  --------------------
   ff17dd64 recvmsg (8, fde4beb0, 0)
   00079c00 (8, fde4bf94, fde4bf98, f6c00, 10f500, 1107c0)
   00079c94 (8, fde4bf9c, 110400, , fde4bf98, fde4bf94)
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 3 / thread# 3  --------------------
   ff17ac78 lwp_park (0, fdd4be50, 0)
   ff174cc0 cond_wait_queue (10f710, 10d0e8, fdd4be50, 0, 9, 0) + 4c
   ff175104 cond_wait_common (10f710, 10d0e8, fdd4be50, 0, 0, 0) + 294
   ff175298 _cond_timedwait (10f710, 10d0e8, fdd4bf98, f4238, 9, 0) + 34
   ff17538c cond_timedwait (10f710, 10d0e8, fdd4bf98, 5151e7d0, 354bb, fff0bdc0) + 14
   ff1753cc pthread_cond_timedwait (10f710, 10d0e8, fdd4bf98, f4238, 9, 5151e7da) + c
   00079a0c (cb400, f4240, 10d000, 5151e7d0, 354bb, fff0bdc0)
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 4 / thread# 4  --------------------
   ff17dacc nanosleep (fdc48880, fdc48878)
   00023334 dbWatcher (2, 1, 951, 110400, 2, 1) + 488
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 5 / thread# 5  --------------------
   ff17ac78 lwp_park (0, 0, 0)
   ff174cc0 cond_wait_queue (4477f8, 10f6d8, 0, 0, 1c00, 0) + 4c
   ff175208 cond_wait (4477f8, 10f6d8, fdb4bf34, 4477d0, f70c8, 107c98) + 10
   ff175244 pthread_cond_wait (4477f8, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
   0007df94 rx_GetCall (10f538, 0, fdb4bf34, 4477d0, f70c8, 107c98) + 52c
   0007d560 rxi_ServerProc (3, 0, fdb4bf34, 0, fe091a00, f7044) + 38
   00079e4c rx_ServerProc (110400, , 0, cb800, 1f, f7000) + 168
   00079758 (79ce4, fdb4c000, 0, 0, 79754, 1)
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 6 / thread# 6  --------------------
   ff17ac78 lwp_park (0, 0, 0)
   ff174cc0 cond_wait_queue (447840, 10f6d8, 0, 0, 1c00, 0) + 4c
   ff175208 cond_wait (447840, 10f6d8, fda4bf34, 447818, f70c8, 107c98) + 10
   ff175244 pthread_cond_wait (447840, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
   0007df94 rx_GetCall (4477d0, 0, fda4bf34, 447818, f70c8, 107c98) + 52c
   0007d560 rxi_ServerProc (4, 0, fda4bf34, 0, fe092200, f7044) + 38
   00079e4c rx_ServerProc (110400, , 0, cb800, 2f, f7000) + 168
   00079758 (79ce4, fda4c000, 0, 0, 79754, 1)
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 7 / thread# 7  --------------------
   ff17ac78 lwp_park (0, 0, 0)
   ff174cc0 cond_wait_queue (4478d8, 10f6d8, 0, 0, 1c00, 0) + 4c
   ff175208 cond_wait (4478d8, 10f6d8, fd94bf34, 4478b0, f70c8, 107c98) + 10
   ff175244 pthread_cond_wait (4478d8, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
   0007df94 rx_GetCall (447818, 0, fd94bf34, 4478b0, f70c8, 107c98) + 52c
   0007d560 rxi_ServerProc (5, 0, fd94bf34, 0, fe092a00, f7044) + 38
   00079e4c rx_ServerProc (110400, , 0, cb800, 3f, f7000) + 168
   00079758 (79ce4, fd94c000, 0, 0, 79754, 1)
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
  -----------------  lwp# 8 / thread# 8  --------------------
   ff17ac78 lwp_park (0, 0, 0)
   ff174cc0 cond_wait_queue (46c338, 46c308, 0, 0, 1c00, 0) + 4c
   ff175208 cond_wait (46c338, 46c308, 4, d7690, 46c2e0, 4) + 10
   ff175244 pthread_cond_wait (46c338, 46c308, 0, 4, 5, 5) + 8
   000989d8 rxi_ReadProc (46c2d0, fd84aa74, 4, d7690, 46c2e0, 4) + 640
   0009c2ec (fd84ab38, fd84bb10, 0, fd84bb10, 46c2d0, 1)
   00059b44 xdr_budb_dumpEntry (fd84ab38, fd84bb10, 1, 9b800, 9b800, 9b800) + 8
   00053310 BUDB_CreateDump (447af8, fd84bb10, 0, fd84bb10, 46c2d0, 1) + 58
   0005350c ubik_BUDB_CreateDump (150d, 46c230, fd84bb10, 0, 46c230, 0) + 1bc
   000328dc createDump (2, 10f400, fd84b2b4, fd84bb10, fd84b6dc, 5) + e8
   00030d94 Dumper (4bfa28, 2f64d1, 0, 0, 4bfa38, 10f400) + 260
   ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)

--
Eric Sturdivant
University of Maryland
Division of Information Technology
Enterprise Unix Services
Re: [OpenAFS] Re: 1.6.2 buserver + butc
The backtrace appeared not to be pthreaded buserver, only butc (which has been true since IBM days).

On Thu, Mar 28, 2013 at 10:59 AM, Simon Wilkinson simonxwilkin...@gmail.com wrote:

> On 28 Mar 2013, at 14:44, Andrew Deason wrote:
>> We're waiting for the buserver to respond to a BUDB_CreateDump. So, presumably it is hanging on something.
>
> A quick glance at that suggests to me that it's most likely hanging on waiting for a write lock to the ubik database, which is presumably hanging because it's waiting for another transaction to finish or something. pthreaded ubik servers don't work correctly in 1.6 -- they're missing all of the work that Marc Dionne did to stabilise them. So, I suspect the problem may be that you're trying to use a pthreaded set of backup servers. Are you deliberately building with --enable-pthreaded-ubik?
>
> S.

-- Derrick
Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Thu, 28 Mar 2013 15:22:29 -0400 Eric Sturdivant sturd...@umd.edu wrote:
> Ok, I've got core+tcpdump files from the 3 buserver processes, as well as one of the butc processes. Anything in particular you are interested in seeing?

What I was after is the stack trace of all of the LWPs in the buserver process. You cannot get at those easily, since LWP is a threading system that is not understood by the debugger (dbx or gdb). That's kinda why I was treating the 'core file' option as something where you give the core file to a developer. Getting that information by providing instructions to you makes this a bit more difficult... but is probably doable.

I've only done that a couple of times before, and it involved hex-editing the core file; it may be a bit easier with Solaris dbx, but I'll need a little time to look into it. If someone else has information on that handy or something, go ahead :)

--
Andrew Deason
adea...@sinenomine.net
Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Wed, Mar 27, 2013 at 2:44 PM, Andrew Deason adea...@sinenomine.net wrote:

> On Tue, 26 Mar 2013 20:04:15 -0400 (EDT) Prasad Dharmasena p...@glue.umd.edu wrote:
>> The vicep* partitions (or volsets), for which the backup dump/butc hang, are not consistent. If we kill and restart the dump process, some of the previously hung volsets finish while others hang. What info do we need to grab from butc and buserver in order to track the problem?
>
> I assume there's nothing helpful in BackupLog? I haven't worked with butc/buserver for a long time, so I don't remember if there are ways to get more information out of them specifically. However, just going by what works in general: one pretty surefire way of getting to know what's happening is to grab a core from the butc and buserver processes while they are hanging ('gcore pid'). You'll need a developer to look at that to say what's going on, and those cores will contain sensitive information. But if there is someone you trust enough with it, that will let you know what's happening.

There won't be anything sensitive in just a stack trace, and on Solaris one is easily generated by running pstack, so it can be shared with impunity. It may or may not be enough information, but it's certainly a place to start.

-- Derrick
Re: [OpenAFS] Re: 1.6.2 buserver + butc
On Wed, 27 Mar 2013, Andrew Deason wrote:

> On Wed, 27 Mar 2013 14:53:26 -0400 Derrick Brashear sha...@gmail.com wrote:
>> there won't be anything sensitive in just a stack trace, and on solaris that is easily generated with just running pstack and could be shared with impunity. it may or may not be enough information but it's certainly a place to start.
>
> These are LWP, so I was expecting all we'd see is IOMGR. I suppose it's _possible_ it could be something useful; worth a shot, at least.

Info from the buserver and butc:

  Core was generated by `/usr/afs/bin/buserver'.
  #0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  (gdb) info threads
    Id   Target Id         Frame
  * 2    Thread 1 (LWP 1)  0xff18e4a4 in __pollsys () from /lib/libc.so.1
    1    LWP 1             0xff18e4a4 in __pollsys () from /lib/libc.so.1
  (gdb) thread 1
  [Switching to thread 1 (LWP 1)]
  #0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  (gdb) where
  #0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  #1  0xff17e190 in _pollsys () from /lib/libc.so.1
  #2  0xff12a150 in pselect_large_fdset () from /lib/libc.so.1
  #3  0xff12a598 in select_large_fdset () from /lib/libc.so.1
  #4  0x0007473c in ?? ()
  #5  0x0007473c in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) thread 2
  [Switching to thread 2 (Thread 1 (LWP 1))]
  #0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  (gdb) where
  #0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  #1  0xff17e190 in _pollsys () from /lib/libc.so.1
  #2  0xff12a150 in pselect_large_fdset () from /lib/libc.so.1
  #3  0xff12a598 in select_large_fdset () from /lib/libc.so.1
  #4  0x0007473c in ?? ()
  #5  0x0007473c in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)

  Core was generated by `/usr/afsws/etc/butc 3106 -localauth'.
  #0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  (gdb) info threads
    Id   Target Id         Frame
    16   Thread 8          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    15   Thread 7          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    14   Thread 6          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    13   Thread 5          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    12   Thread 4 (LWP 4)  0xff17dacc in ___nanosleep () from /lib/libc.so.1
    11   Thread 3          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    10   Thread 2 (LWP 2)  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
  * 9    Thread 1          0xff17ac78 in __lwp_park () from /lib/libc.so.1
    8    LWP 8             0xff17ac78 in __lwp_park () from /lib/libc.so.1
    7    LWP 7             0xff17ac78 in __lwp_park () from /lib/libc.so.1
    6    LWP 6             0xff17ac78 in __lwp_park () from /lib/libc.so.1
    5    LWP 5             0xff17ac78 in __lwp_park () from /lib/libc.so.1
    4    LWP 4             0xff17dacc in ___nanosleep () from /lib/libc.so.1
    3    LWP 3             0xff17ac78 in __lwp_park () from /lib/libc.so.1
    2    LWP 2             0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
    1    LWP 1             0xff17ac78 in __lwp_park () from /lib/libc.so.1
  (gdb) thread 1
  [Switching to thread 1 (LWP 1)]
  #0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  (gdb) where
  #0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  #1  0xff174b04 in cond_sleep_queue () from /lib/libc.so.1
  #2  0xff174cc8 in cond_wait_queue () from /lib/libc.so.1
  #3  0xff175210 in cond_wait () from /lib/libc.so.1
  #4  0xff17524c in pthread_cond_wait () from /lib/libc.so.1
  #5  0x0007df9c in rx_GetCall ()
  #6  0x0007d568 in rxi_ServerProc ()
  #7  0x00079e54 in rx_ServerProc ()
  #8  0x0007b850 in rx_StartServer ()
  #9  0x0002a8e8 in ?? ()
  #10 0x0002a8e8 in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) thread 2
  [Switching to thread 2 (LWP 2)]
  #0  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
  (gdb) where
  #0  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
  #1  0x00095b44 in rxi_ReadPacket ()
  #2  0x00079c08 in ?? ()
  #3  0x00079c08 in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) thread 3
  [Switching to thread 3 (LWP 3)]
  #0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  (gdb) where
  #0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  #1  0xff174b04 in cond_sleep_queue () from /lib/libc.so.1
  #2  0xff174cc8 in cond_wait_queue () from /lib/libc.so.1
  #3  0xff17510c in cond_wait_common () from /lib/libc.so.1
  #4  0xff1752a0 in _cond_timedwait () from /lib/libc.so.1
  #5  0xff175394 in cond_timedwait () from /lib/libc.so.1
  #6  0xff1753d4 in pthread_cond_timedwait () from /lib/libc.so.1
  #7  0x00079a14 in ?? ()
  #8  0x00079a14 in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) thread 4
  [Switching to thread 4 (LWP 4)]
  #0  0xff17dacc in ___nanosleep () from /lib/libc.so.1
  (gdb) where
  #0  0xff17dacc in ___nanosleep () from /lib/libc.so.1
  #1  0xff16d408 in sleep () from /lib/libc.so.1
  #2  0x0002333c in