I've got a box (freebuild.mit.edu) running master from around 12 November that does not seem to fail over when it hits an unresponsive db server. In particular, if I run:
cd /afs/grand.central.org
ls

The 'ls' process hangs indefinitely; rxdebug seems to indicate that I'm only trying to talk to 130.237.48.87 (andrew.e.kth.se), which is known to be not very responsive these days.

My main question is whether anyone else is running a recent master on some flavor of Unix, so I can work out whether this is FreeBSD-specific behavior or not. Of course, if you want to help debug, that'd be fine, too (some more information below).

Thanks,

Ben


The kernel stack of the 'ls' process looks like:
  2368 100179 ls
  mi_switch+0x1ea
  sleepq_switch+0x123
  sleepq_wait+0x4d
  _sleep+0x369
  rxi_ReadProc+0x3ef
  rx_ReadProc32+0xc1
  xdrrx_getint32+0x19
  afs_xdr_char+0x41
  afs_xdr_vector+0x44
  xdr_uvldbentry+0x30
  VL_GetEntryByNameU+0x7b
  afs_NewVolumeByName+0x237
  afs_GetVolumeByName+0x13c
  EvalMountData+0x316
  EvalMountPoint+0x93
  afs_EvalFakeStat_int+0x12b
  afs_EvalFakeStat+0xe
  afs_lookup+0x101

rxdebug:
freebuild# rxdebug localhost 7001
Trying 127.0.0.1 (port 7001):
Free packets: 235/243, packet reclaims: 0, calls: 178, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
0 calls have waited for a thread
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316a8
   serial 15,  natMTU 520, flags DESTROYED, security index 0, client conn
     call 0: # 2, state not initialized
     call 1: # 2, state not initialized
     call 2: # 1, state dally, mode: receiving
     call 3: # 0, state not initialized
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316ac
   serial 6,  natMTU 520, security index 0, client conn
     call 0: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
     call 1: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Done.
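
In case anyone wants to compare against their own box, here is a rough Python sketch that tallies active calls per peer from rxdebug output like the above. The regexes are guesses based only on the output shown here, not on the rxdebug source, and the function names (parse_rxdebug, active_calls_by_host) are made up for illustration.

```python
import re
from collections import defaultdict

# Patterns inferred from the rxdebug output in this message (an assumption,
# not taken from the rxdebug source).
CONN_RE = re.compile(r"^Connection from host (\S+), port (\d+), Cuid (\S+)")
CALL_RE = re.compile(r"^\s*call (\d+): # (\d+), state ([^,]+)(?:, (.*))?$")

SAMPLE = """\
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316a8
   serial 15,  natMTU 520, flags DESTROYED, security index 0, client conn
     call 0: # 2, state not initialized
     call 1: # 2, state not initialized
     call 2: # 1, state dally, mode: receiving
     call 3: # 0, state not initialized
Connection from host 130.237.48.87, port 7003, Cuid a3ee818f/3bb316ac
   serial 6,  natMTU 520, security index 0, client conn
     call 0: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
     call 1: # 1, state active, mode: receiving, flags: reader_wait, has_output_packets
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Done.
"""


def parse_rxdebug(text):
    """Return a list of connections, each with its per-call states."""
    conns = []
    current = None
    for line in text.splitlines():
        m = CONN_RE.match(line)
        if m:
            current = {
                "host": m.group(1),
                "port": int(m.group(2)),
                "cuid": m.group(3),
                "calls": [],
            }
            conns.append(current)
            continue
        m = CALL_RE.match(line)
        if m and current is not None:
            current["calls"].append({
                "index": int(m.group(1)),
                "number": int(m.group(2)),
                "state": m.group(3).strip(),
                "detail": (m.group(4) or "").strip(),
            })
    return conns


def active_calls_by_host(conns):
    """Count calls in state 'active' for each (host, port) peer."""
    counts = defaultdict(int)
    for conn in conns:
        for call in conn["calls"]:
            if call["state"] == "active":
                counts[(conn["host"], conn["port"])] += 1
    return dict(counts)


if __name__ == "__main__":
    for peer, n in active_calls_by_host(parse_rxdebug(SAMPLE)).items():
        print(peer, n)
```

If every active call maps to a single db server, as it does here, that would be consistent with the client never moving on to another server.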

I got a kernel core from a previous hang (a 'cp' process), and there wasn't anything that looked like it was going to deadlock; nobody held the glock, either.
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel