From jhutz, try this:
--- rx.c        30 May 2005 04:55:26 -0000      1.82
+++ rx.c        28 Aug 2005 20:30:00 -0000
@@ -1146,7 +1146,11 @@

     /* Client is initially in send mode */
     call->state = RX_STATE_ACTIVE;
-    call->mode = RX_MODE_SENDING;
+    call->error = conn->error;
+    if (call->error)
+       call->mode = RX_MODE_ERROR;
+    else
+       call->mode = RX_MODE_SENDING;

     /* remember start time for call in case we have hard dead time limit */
     call->queueTime = queueTime;
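
With that patch, a call created on a connection that has already seen
an abort starts out in RX_MODE_ERROR, so the first write fails
immediately instead of the call hanging. A minimal client-side sketch
of what that would look like (try_call and the payload are made up;
rx_NewCall, rx_Write and rx_EndCall are the normal rx entry points):

#include <stdio.h>
#include <rx/rx.h>

static void
try_call(struct rx_connection *conn)
{
    char buf[4] = "ping";
    struct rx_call *call = rx_NewCall(conn);

    /* with the patch, a call on an aborted connection is born in
     * RX_MODE_ERROR, so this write fails instead of hanging */
    if (rx_Write(call, buf, sizeof(buf)) != sizeof(buf)) {
        afs_int32 code = rx_EndCall(call, 0); /* returns the stored error */
        fprintf(stderr, "call failed immediately with error %d\n", code);
    }
}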


On Sat, 27 Aug 2005, Harald Barth wrote:

Except you missed the abort from the server to the client 2 minutes earlier:

05:43:40.773551 IP (tos 0x0, ttl  64, id 6836, offset 0, flags [none],
length: 60) 192.168.18.2.7000 > 192.168.18.39.7001: [udp sum ok]  rx abort
cid 1dd424ec call# 0 seq 0 ser 13 (32)


I had a look at this with ethereal (in recent ethereal you have to
disable rudp in Analyze->Enabled Protocols). So far I have found:

The abort code is 151725569 = ntohl(0x01260b09), which should be
htonl(call->error) from the server. Unfortunately that is only a
"big number" to me :-( and does not ring a bell.

I have been reading rx.c. In the client, after receiving this abort
packet, we should end up in:

   /* Check for connection-only requests (i.e. not call specific). */
   if (np->header.callNumber == 0) {
       switch (np->header.type) {
       case RX_PACKET_TYPE_ABORT:
           /* What if the supplied error is zero? */
           rxi_ConnectionError(conn, ntohl(rx_GetInt32(np, 0)));

Then ...

void
rxi_ConnectionError(register struct rx_connection *conn,
                   register afs_int32 error)
{
   if (error) {
       register int i;
       MUTEX_ENTER(&conn->conn_data_lock);
       /* cancel any pending challenge or reachability-check events */
       if (conn->challengeEvent)
           rxevent_Cancel(conn->challengeEvent, (struct rx_call *)0, 0);
       if (conn->checkReachEvent) {
           rxevent_Cancel(conn->checkReachEvent, (struct rx_call *)0, 0);
           conn->checkReachEvent = 0;
           conn->flags &= ~RX_CONN_ATTACHWAIT;
           conn->refCount--;
       }
       MUTEX_EXIT(&conn->conn_data_lock);
       /* propagate the error to every call on this connection */
       for (i = 0; i < RX_MAXCALLS; i++) {
           struct rx_call *call = conn->call[i];
           if (call) {
               MUTEX_ENTER(&call->lock);
               rxi_CallError(call, error);
               MUTEX_EXIT(&call->lock);
           }
       }
       conn->error = error;
       MUTEX_ENTER(&rx_stats_mutex);
       rx_stats.fatalErrors++;
       MUTEX_EXIT(&rx_stats_mutex);
   }
}

I think in this case error != 0, but we should take care of
error == 0 somehow (if that can happen at all).
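
If it can happen, one way to guard at the dispatch site quoted above
might be something like this (just a sketch; substituting RX_CALL_DEAD
is my assumption, not necessarily the right error to use):

   /* sketch: map a zero abort code to a generic error so the
    * connection still ends up flagged */
   afs_int32 code = ntohl(rx_GetInt32(np, 0));
   rxi_ConnectionError(conn, code ? code : RX_CALL_DEAD);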

Then we call rxi_CallError(call, error), which sets the call's error
status if it was not already set. The call is also reset, unless it is
flagged RX_CALL_TQ_BUSY (in the kernel global-lock build):

void
rxi_CallError(register struct rx_call *call, afs_int32 error)
{
   /* an error already recorded on the call takes precedence */
   if (call->error)
       error = call->error;
#ifdef RX_GLOBAL_RXLOCK_KERNEL
   /* don't reset the call while its transmit queue is busy */
   if (!(call->flags & RX_CALL_TQ_BUSY)) {
       rxi_ResetCall(call, 0);
   }
#else
   rxi_ResetCall(call, 0);
#endif
   call->error = error;
   call->mode = RX_MODE_ERROR;
}

It does not seem that the connection is taken down, however. The
client seems to try to use the connection again later, but is then
somewhat out of sync and sends aborts itself. If you filter in
ethereal with "rx.cid == 500442348" you'll see it.

Well, what does this mean? I'm no RX expert...

I don't know either ;-) but I think the connections should be reaped
a bit more aggressively after an abort, and the rx code should
establish new ones in that case, shouldn't it?
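
Something along these lines on the client side, perhaps (just a
sketch; host, port, service, secobj and secindex are placeholders,
while rx_DestroyConnection and rx_NewConnection are the normal rx
entry points):

   /* sketch: if the connection has gone into an error state, drop it
    * and open a fresh one instead of reusing it */
   if (conn->error) {
       rx_DestroyConnection(conn);
       conn = rx_NewConnection(host, port, service, secobj, secindex);
   }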

Or could there be some confusion about whether this packet is
encrypted or not? Security index == 2 (i.e. rxkad)?

I think I had a similar condition this afternoon on my laptop, but I
didn't have tcpdump running at the time, so I can't tell for sure.

Harald.
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel

