Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-04-15 Thread Derrick Brashear
On Mon, Apr 15, 2013 at 4:20 PM, Andrew Deason adea...@sinenomine.net wrote:

 On Thu, 28 Mar 2013 16:38:55 -0500
 Andrew Deason adea...@sinenomine.net wrote:

  What I was after is the stack trace of all of the LWPs in the buserver
  process. You cannot get at those easily, since LWP is a threading
  system that is not understood by the debugger (dbx or gdb). That's
  kinda why I was treating the 'core file' option as something where you
  give the core file to a developer. Getting that information by
  providing instructions to you makes this a bit more difficult... but
  is probably doable.

 So, while I was waiting for some stuff to compile while trying this, I
 realized this might be fixed by
 http://git.openafs.org/?p=openafs.git;a=patch;h=dce2d8206ecd35c96e75cc0662432c2a4f9c3d7a
 I'm not clear on what exactly the principal is for, but that does fix a
 bug that was introduced in the 1.6 series. Since there have not been
 many substantial changes to budb in general, and that change impacts the
 CreateDump function, that seems like a likely culprit. (To devs: the
 original change doesn't make a lot of sense to me; the commit messages
 suggest there are different structures in play, but the args and function
 parameters are all ktc_principal.)


*If* this is it, it would have to be a missing null termination. I looked
at that case specifically and couldn't reproduce an issue, but it's
conceivable I missed it due to differences in the host I tested on. The
only other interesting thing was the potential for differences due to the
offsetof changes, but that was also a red herring.
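
For anyone following along, here is a minimal sketch of the bug class in
question. The field size and helper names are illustrative stand-ins, not
the actual OpenAFS code paths:

/* nul_sketch.c: illustrative only; NAMELEN stands in for the real
 * fixed-size principal fields (e.g. MAXKTCNAMELEN). */
#include <stdio.h>
#include <string.h>

#define NAMELEN 8                      /* small, to make the hazard obvious */

struct principal {
    char name[NAMELEN];
};

/* Buggy: when strlen(src) >= NAMELEN, strncpy() writes no terminating
 * NUL, so a later strlen()/strcmp() on p->name reads past the buffer. */
static void set_name_unsafe(struct principal *p, const char *src)
{
    strncpy(p->name, src, NAMELEN);
}

/* Safe: always leave room for, and explicitly write, the terminator. */
static void set_name_safe(struct principal *p, const char *src)
{
    strncpy(p->name, src, NAMELEN - 1);
    p->name[NAMELEN - 1] = '\0';
}

int main(void)
{
    struct principal p;
    set_name_safe(&p, "backup.buserver");  /* truncated, but terminated */
    printf("%s\n", p.name);
    (void)set_name_unsafe;                 /* shown for contrast only */
    return 0;
}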

 This actually isn't so bad if you rely on mdb to give you the stack
 traces. Attached is a dbx script that can be used to get some traces. This
 should probably live in a repo or something... somewhere. Do people have
 an opinion on where this should go?


We have scripts from Robert Milkowski for dtrace which similarly lack a
home. Properly, they might live in their own module instead of in openafs,
but I don't think it would be particularly abusive to include them here.


 Anyway, you can use it like this. If you compiled with LWP debug turned
 on, it's more likely to work (this means running ./configure with
 --enable-debug-lwp), but it's not required. Run:

 $ /opt/SUNWspro/bin/dbx /path/to/buserver /path/to/core
 [...]
 (dbx) source lwpstacks.ksh
 (dbx) lwpstacks

 If you don't have LWP debug, this will fail (probably with something
 like dbx: struct lwp_pcb is not defined[...]). You can try running
 this without using debug symbols (we'll guess at where some data is), by
 running this instead:

 (dbx) lwpstacks nodebug

 With the script as-is, the 'nodebug' stuff seems to work with OpenAFS
 1.6.2 on Solaris 10 SPARC, but it may need fiddling to work anywhere
 else.

 If either of those works, you'll see something like:

 (dbx) lwpstacks nodebug
 !# NOT using debug symbols
 !# looking for threads in blocked
 ::echo stack pointer for thread 14a530: 1562d8
 0x001562d8::stack 0 ! sed 's/^/  /'
 ::echo
 ::echo stack pointer for thread 180cf8: 18caa0
 0x0018caa0::stack 0 ! sed 's/^/  /'
 [...]

 To get actual stack traces out of that, pipe the output through mdb:

 (dbx) lwpstacks nodebug | mdb /path/to/buserver /path/to/core
 stack pointer for thread 14a530: 1562d8
   LWP_WaitProcess+0x38()
   rxi_Sleep+4()
   rx_GetCall+0x320()
   rxi_ServerProc+0x40()
   rx_ServerProc+0x74()
   Create_Process_Part2+0x40()
   0x68388()
   ubik_ServerInitCommon+0x23c()

 stack pointer for thread 180cf8: 18caa0
   LWP_WaitProcess+0x38()
 [...]

 This output is similar enough to mdb ::findstack output that it will
 work with David Powell's munges script if you have that. But it's also
 pretty useful just by itself.

 Surprisingly, that doesn't require any manual core editing. mdb is, I
 think, the only debugger I've used that lets you get stack trace
 information from an arbitrary context (at least, I haven't seen an easy
 way to do this in gdb or dbx), but the way state is stored on Solaris on
 SPARC probably helps make that easier.
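
 As an aside, you can also run one of those emitted lines through mdb by
 hand, without the script; the stack pointer below is just the one from
 the example output above:

 $ echo '0x001562d8::stack 0' | mdb /path/to/buserver /path/to/core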

 If you want to provide such stack output from the core you captured, it
 may show what's going on.



-- 
Derrick


Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-29 Thread Eric Sturdivant

On Thu, 28 Mar 2013, Andrew Deason wrote:


On Thu, 28 Mar 2013 15:22:29 -0400
Eric Sturdivant sturd...@umd.edu wrote:


Ok, I've got core+tcpdump files from the 3 buserver processes, as well
as one of the butc processes. Anything in particular you are
interested in seeing?


What I was after is the stack trace of all of the LWPs in the buserver
process. You cannot get at those easily, since LWP is a threading system
that is not understood by the debugger (dbx or gdb). That's kinda why I
was treating the 'core file' option as something where you give the core
file to a developer. Getting that information by providing instructions
to you makes this a bit more difficult... but is probably doable. I've
only done that a couple of times before, and it involved hex editing the
core file; it may be a bit easier with Solaris dbx, but I'll need a
little time to look into it.

If someone else has information on that handy or something, go ahead :)

Ah, gotcha. I was assuming the core file would contain a copy of the
KeyFile and shouldn't be passed around; is this not the case?


--
Eric Sturdivant
University of Maryland
Division of Information Technology
Enterprise Unix Services


Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-28 Thread Eric Sturdivant

On Wed, 27 Mar 2013, Andrew Deason wrote:


On Wed, 27 Mar 2013 16:05:06 -0400 (EDT)
Eric Sturdivant sturd...@umd.edu wrote:


Info from the buserver and butc:


Oh, butc is threaded; cool. You'd be more likely to get more useful
results if you used 'pstack' like Derrick said, or dbx instead of gdb.

core 'butc-core.19453' of 19453:  /usr/afsws/etc/butc 3106 -localauth

-  lwp# 1 / thread# 1  
 ff17ac78 lwp_park (0, 0, 0)
 ff174cc0 cond_wait_queue (447920, 10f6d8, 0, 0, 1c00, 0) + 4c
 ff175208 cond_wait (447920, 10f6d8, ffbfedb4, 4478f8, f70c8, 107c98) + 10
 ff175244 pthread_cond_wait (447920, 10f6d8, 0, cc800, 10f400, 0) + 8
 0007df94 rx_GetCall (4478b0, 4476c0, ffbfedb4, 4478f8, f70c8, 107c98) + 52c
 0007d560 rxi_ServerProc (6, 0, ffbfedb4, 0, cdc00, f7044) + 38
 00079e4c rx_ServerProc (110400, , 7c00, cb800, 4f, f7000) + 168
 0007b848 rx_StartServer (0, 10f400, 0, 1, f6c00, f70fc) + 12c
 0002a8e0  (0, 22c00, ea400, f8558, 1, 0)
 000b0050 cmd_Dispatch (3, 1, 1118b0, 5, 111914, f7368) + 8f4
 000224c0 _start   (0, 0, 0, 0, 0, 0) + 108
-  lwp# 2 / thread# 2  
 ff17dd64 recvmsg  (8, fde4beb0, 0)
 00079c00  (8, fde4bf94, fde4bf98, f6c00, 10f500, 1107c0)
 00079c94  (8, fde4bf9c, 110400, , fde4bf98, fde4bf94)
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 3 / thread# 3  
 ff17ac78 lwp_park (0, fdd4be50, 0)
 ff174cc0 cond_wait_queue (10f710, 10d0e8, fdd4be50, 0, 9, 0) + 4c
 ff175104 cond_wait_common (10f710, 10d0e8, fdd4be50, 0, 0, 0) + 294
 ff175298 _cond_timedwait (10f710, 10d0e8, fdd4bf98, f4238, 9, 0) + 34
 ff17538c cond_timedwait (10f710, 10d0e8, fdd4bf98, 5151e7d0, 354bb, fff0bdc0) + 14
 ff1753cc pthread_cond_timedwait (10f710, 10d0e8, fdd4bf98, f4238, 9, 5151e7da) + c
 00079a0c  (cb400, f4240, 10d000, 5151e7d0, 354bb, fff0bdc0)
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 4 / thread# 4  
 ff17dacc nanosleep (fdc48880, fdc48878)
 00023334 dbWatcher (2, 1, 951, 110400, 2, 1) + 488
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 5 / thread# 5  
 ff17ac78 lwp_park (0, 0, 0)
 ff174cc0 cond_wait_queue (4477f8, 10f6d8, 0, 0, 1c00, 0) + 4c
 ff175208 cond_wait (4477f8, 10f6d8, fdb4bf34, 4477d0, f70c8, 107c98) + 10
 ff175244 pthread_cond_wait (4477f8, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
 0007df94 rx_GetCall (10f538, 0, fdb4bf34, 4477d0, f70c8, 107c98) + 52c
 0007d560 rxi_ServerProc (3, 0, fdb4bf34, 0, fe091a00, f7044) + 38
 00079e4c rx_ServerProc (110400, , 0, cb800, 1f, f7000) + 168
 00079758  (79ce4, fdb4c000, 0, 0, 79754, 1)
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 6 / thread# 6  
 ff17ac78 lwp_park (0, 0, 0)
 ff174cc0 cond_wait_queue (447840, 10f6d8, 0, 0, 1c00, 0) + 4c
 ff175208 cond_wait (447840, 10f6d8, fda4bf34, 447818, f70c8, 107c98) + 10
 ff175244 pthread_cond_wait (447840, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
 0007df94 rx_GetCall (4477d0, 0, fda4bf34, 447818, f70c8, 107c98) + 52c
 0007d560 rxi_ServerProc (4, 0, fda4bf34, 0, fe092200, f7044) + 38
 00079e4c rx_ServerProc (110400, , 0, cb800, 2f, f7000) + 168
 00079758  (79ce4, fda4c000, 0, 0, 79754, 1)
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 7 / thread# 7  
 ff17ac78 lwp_park (0, 0, 0)
 ff174cc0 cond_wait_queue (4478d8, 10f6d8, 0, 0, 1c00, 0) + 4c
 ff175208 cond_wait (4478d8, 10f6d8, fd94bf34, 4478b0, f70c8, 107c98) + 10
 ff175244 pthread_cond_wait (4478d8, 10f6d8, 0, 0, 10f6d8, 7d560) + 8
 0007df94 rx_GetCall (447818, 0, fd94bf34, 4478b0, f70c8, 107c98) + 52c
 0007d560 rxi_ServerProc (5, 0, fd94bf34, 0, fe092a00, f7044) + 38
 00079e4c rx_ServerProc (110400, , 0, cb800, 3f, f7000) + 168
 00079758  (79ce4, fd94c000, 0, 0, 79754, 1)
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)
-  lwp# 8 / thread# 8  
 ff17ac78 lwp_park (0, 0, 0)
 ff174cc0 cond_wait_queue (46c338, 46c308, 0, 0, 1c00, 0) + 4c
 ff175208 cond_wait (46c338, 46c308, 4, d7690, 46c2e0, 4) + 10
 ff175244 pthread_cond_wait (46c338, 46c308, 0, 4, 5, 5) + 8
 000989d8 rxi_ReadProc (46c2d0, fd84aa74, 4, d7690, 46c2e0, 4) + 640
 0009c2ec  (fd84ab38, fd84bb10, 0, fd84bb10, 46c2d0, 1)
 00059b44 xdr_budb_dumpEntry (fd84ab38, fd84bb10, 1, 9b800, 9b800, 9b800) + 8
 00053310 BUDB_CreateDump (447af8, fd84bb10, 0, fd84bb10, 46c2d0, 1) + 58
 0005350c ubik_BUDB_CreateDump (150d, 46c230, fd84bb10, 0, 46c230, 0) + 1bc
 000328dc createDump (2, 10f400, fd84b2b4, fd84bb10, fd84b6dc, 5) + e8
 00030d94 Dumper   (4bfa28, 2f64d1, 0, 0, 4bfa38, 10f400) + 260
 ff17abd8 _lwp_start (0, 0, 0, 0, 0, 0)


--
Eric Sturdivant
University of Maryland
Division of Information Technology
Enterprise Unix Services

Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-28 Thread Derrick Brashear
The backtrace appeared to show a non-pthreaded buserver; only butc is
pthreaded (which has been true since IBM days).
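
One rough way to check which flavor you're running from the outside (PID
lookup illustrative, and this is a heuristic, not definitive): an LWP-only
binary runs on a single kernel LWP, while a pthreaded build shows one per
thread:

$ ps -o nlwp= -p `pgrep -x buserver`
 1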

On Thu, Mar 28, 2013 at 10:59 AM, Simon Wilkinson
simonxwilkin...@gmail.com wrote:

 On 28 Mar 2013, at 14:44, Andrew Deason wrote:
 We're waiting for the buserver to respond to a BUDB_CreateDump. So,
 presumably it is hanging on something. A quick glance at that suggests
 to me that it's most likely hanging waiting for a write lock on the
 ubik database, which is presumably waiting for another transaction to
 finish or something.

 pthreaded ubik servers don't work correctly in 1.6 - they're missing all of 
 the work that Marc Dionne did to stabilise them. So, I suspect the problem 
 may be that you're trying to use a pthreaded set of backup servers.

 Are you deliberately building with --enable-pthreaded-ubik?
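
 If the build tree is still around, configure records its command line
 near the top of config.log, so a quick check (assuming a standard
 autoconf layout) is:

 $ grep -- '--enable-pthreaded-ubik' config.log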

 S.





-- 
Derrick


Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-28 Thread Andrew Deason
On Thu, 28 Mar 2013 15:22:29 -0400
Eric Sturdivant sturd...@umd.edu wrote:

 Ok, I've got core+tcpdump files from the 3 buserver processes, as well
 as one of the butc processes. Anything in particular you are
 interested in seeing?

What I was after is the stack trace of all of the LWPs in the buserver
process. You cannot get at those easily, since LWP is a threading system
that is not understood by the debugger (dbx or gdb). That's kinda why I
was treating the 'core file' option as something where you give the core
file to a developer. Getting that information by providing instructions
to you makes this a bit more difficult... but is probably doable. I've
only done that a couple of times before, and it involved hex editing the
core file; it may be a bit easier with Solaris dbx, but I'll need a
little time to look into it.

If someone else has information on that handy or something, go ahead :)

-- 
Andrew Deason
adea...@sinenomine.net


Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-27 Thread Derrick Brashear
On Wed, Mar 27, 2013 at 2:44 PM, Andrew Deason adea...@sinenomine.net wrote:
 On Tue, 26 Mar 2013 20:04:15 -0400 (EDT)
 Prasad Dharmasena p...@glue.umd.edu wrote:

 The vicep* partitions (or volsets) for which the backup dump/butc
 hangs are not consistent. If we kill and restart the dump process,
 some of the previously hung volsets finish while others hang.

 What info do we need to grab from butc and buserver in order to
 track the problem?

 I assume there's nothing helpful in BackupLog?
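
 For reference, on a server using the traditional Transarc paths that
 log lives under /usr/afs/logs (the path is an assumption; adjust to
 your layout):

 $ tail -50 /usr/afs/logs/BackupLog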

 I haven't worked with butc/buserver for a long time, so I don't remember
 if there are ways to get more information out of them specifically.
 However, just going by what works in general:

 One pretty surefire way of getting to know what's happening is to grab a
 core from the butc and buserver processes while they are hanging ('gcore
 pid'). You'll need a developer to look at that to say what's going on,
 and those cores will contain sensitive information. But if there is
 someone you trust enough with it, that will let you know what's
 happening.

There won't be anything sensitive in just a stack trace, and on Solaris
one is easily generated by running pstack, so it can be shared with
impunity. It may or may not be enough information, but it's certainly a
place to start.
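
Concretely, something like the following, with an illustrative PID; gcore
grabs a core without killing the process, and pstack prints the user-level
stacks:

$ pgrep -x buserver
1234
$ gcore 1234          # writes core.1234 in the current directory
$ pstack 1234 > buserver.pstack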

-- 
Derrick


Re: [OpenAFS] Re: 1.6.2 buserver + butc

2013-03-27 Thread Eric Sturdivant

On Wed, 27 Mar 2013, Andrew Deason wrote:


On Wed, 27 Mar 2013 14:53:26 -0400
Derrick Brashear sha...@gmail.com wrote:


There won't be anything sensitive in just a stack trace, and on
Solaris one is easily generated by running pstack, so it can be shared
with impunity. It may or may not be enough information, but it's
certainly a place to start.


These are LWP, so I was expecting all we'd see is IOMGR. I suppose it's
_possible_ it could be something useful; worth a shot, at least.

Info from the buserver and butc:

Core was generated by `/usr/afs/bin/buserver'.
#0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
(gdb) info threads
  Id   Target Id Frame
* 2Thread 1 (LWP 1)  0xff18e4a4 in __pollsys () from /lib/libc.so.1
  1LWP1  0xff18e4a4 in __pollsys () from /lib/libc.so.1

(gdb) thread 1
[Switching to thread 1 (LWP1)]
#0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
(gdb) where
#0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
#1  0xff17e190 in _pollsys () from /lib/libc.so.1
#2  0xff12a150 in pselect_large_fdset () from /lib/libc.so.1
#3  0xff12a598 in select_large_fdset () from /lib/libc.so.1
#4  0x0007473c in ?? ()
#5  0x0007473c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) thread 2
[Switching to thread 2 (Thread 1 (LWP 1))]
#0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
(gdb) where
#0  0xff18e4a4 in __pollsys () from /lib/libc.so.1
#1  0xff17e190 in _pollsys () from /lib/libc.so.1
#2  0xff12a150 in pselect_large_fdset () from /lib/libc.so.1
#3  0xff12a598 in select_large_fdset () from /lib/libc.so.1
#4  0x0007473c in ?? ()
#5  0x0007473c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Core was generated by `/usr/afsws/etc/butc 3106 -localauth'.
#0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
(gdb) info threads
  Id   Target Id Frame
  16   Thread 8  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  15   Thread 7  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  14   Thread 6  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  13   Thread 5  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  12   Thread 4 (LWP 4)  0xff17dacc in ___nanosleep () from /lib/libc.so.1
  11   Thread 3  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  10   Thread 2 (LWP 2)  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
* 9Thread 1  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  8LWP8  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  7LWP7  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  6LWP6  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  5LWP5  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  4LWP4  0xff17dacc in ___nanosleep () from /lib/libc.so.1
  3LWP3  0xff17ac78 in __lwp_park () from /lib/libc.so.1
  2LWP2  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
  1LWP1  0xff17ac78 in __lwp_park () from /lib/libc.so.1

(gdb) thread 1
[Switching to thread 1 (LWP1)]
#0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
(gdb) where
#0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
#1  0xff174b04 in cond_sleep_queue () from /lib/libc.so.1
#2  0xff174cc8 in cond_wait_queue () from /lib/libc.so.1
#3  0xff175210 in cond_wait () from /lib/libc.so.1
#4  0xff17524c in pthread_cond_wait () from /lib/libc.so.1
#5  0x0007df9c in rx_GetCall ()
#6  0x0007d568 in rxi_ServerProc ()
#7  0x00079e54 in rx_ServerProc ()
#8  0x0007b850 in rx_StartServer ()
#9  0x0002a8e8 in ?? ()
#10 0x0002a8e8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) thread 2
[Switching to thread 2 (LWP2)]
#0  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
(gdb) where
#0  0xff17dd64 in _so_recvmsg () from /lib/libc.so.1
#1  0x00095b44 in rxi_ReadPacket ()
#2  0x00079c08 in ?? ()
#3  0x00079c08 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


(gdb) thread 3
[Switching to thread 3 (LWP3)]
#0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
(gdb) where
#0  0xff17ac78 in __lwp_park () from /lib/libc.so.1
#1  0xff174b04 in cond_sleep_queue () from /lib/libc.so.1
#2  0xff174cc8 in cond_wait_queue () from /lib/libc.so.1
#3  0xff17510c in cond_wait_common () from /lib/libc.so.1
#4  0xff1752a0 in _cond_timedwait () from /lib/libc.so.1
#5  0xff175394 in cond_timedwait () from /lib/libc.so.1
#6  0xff1753d4 in pthread_cond_timedwait () from /lib/libc.so.1
#7  0x00079a14 in ?? ()
#8  0x00079a14 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


(gdb) thread 4
[Switching to thread 4 (LWP4)]
#0  0xff17dacc in ___nanosleep () from /lib/libc.so.1
(gdb) where
#0  0xff17dacc in ___nanosleep () from /lib/libc.so.1
#1  0xff16d408 in sleep () from /lib/libc.so.1
#2  0x0002333c in