Re: [OpenAFS-devel] "Lost contact with file server" problems

Roland Kuhn Mon, 22 Aug 2005 05:11:48 -0700

Hi again!

On Mon, 22 Aug 2005, Roland Kuhn wrote:

Hi Jeffrey!

On Mon, 22 Aug 2005, Jeffrey Altman wrote:

Roland Kuhn wrote:

Hi folks!

On Sun, 21 Aug 2005, Derrick J Brashear wrote:

it needs to include the first error packet, e.g. the window where it
loses contact, to be useful

Okay, it happened again, and I have a full trace:

http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace.cap
http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace-end.cap

The latter contains only the last 81 frames and begins a few frames
before the request which fails. The former is 10MB in size. If you need
more history, I also have the last 1GB of the connection available.
192.168.18.2 is the server, 192.168.18.39 the client. The access is for
big files typically.

Ciao,
                    Roland


The Abort code is RXKADEXPIRED (19270409L).   Would you verify that you
still have a valid token and that your system clocks are in sync?
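
For reference, a minimal sketch of how an abort code like the one above
can be mapped back to its symbolic rxkad name. Only RXKADEXPIRED =
19270409 comes from this thread; the base value and the other names are
listed on the assumption that they match the rxkad error table shipped
with OpenAFS, and the helper name rxkad_error_name is hypothetical.

```python
# Assumed base of the rxkad error table (first entry: RXKADINCONSISTENCY).
RXKAD_ERROR_BASE = 19270400

# Names in table order; assumed to match the rxkad error table in OpenAFS.
RXKAD_ERROR_NAMES = [
    "RXKADINCONSISTENCY",
    "RXKADPACKETSHORT",
    "RXKADLEVELFAIL",
    "RXKADTICKETLEN",
    "RXKADOUTOFSEQUENCE",
    "RXKADNOAUTH",
    "RXKADBADKEY",
    "RXKADBADTICKET",
    "RXKADUNKNOWNKEY",
    "RXKADEXPIRED",
    "RXKADSEALEDINCON",
    "RXKADDATALEN",
    "RXKADILLEGALLEVEL",
]


def rxkad_error_name(code):
    """Return the symbolic name for an rxkad abort code, or None if unknown."""
    offset = code - RXKAD_ERROR_BASE
    if 0 <= offset < len(RXKAD_ERROR_NAMES):
        return RXKAD_ERROR_NAMES[offset]
    return None


if __name__ == "__main__":
    print(rxkad_error_name(19270409))
```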

The clocks are perfectly synchronized and I'm pretty sure that the batch
jobs have valid tokens, otherwise I would see other failures as well.
Also, wouldn't it be very nasty to effectively disable a complete client
because one connection has no valid token?

The other thing is: it is the _client_ which sends the first ABORT in
response to a challenge....
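
To check which side sends the first ABORT, the Rx header of each UDP
payload in the trace can be decoded. The following is only a rough
sketch: it assumes the usual 28-byte Rx wire header and the packet type
numbers ABORT = 4 and CHALLENGE = 6, and the function name
parse_rx_packet is made up for illustration. Extracting the UDP payloads
from the capture file is left out.

```python
import struct

# Assumed Rx wire header: epoch, cid, callNumber, seq, serial (32 bit each),
# type, flags, userStatus, securityIndex (8 bit each), spare, serviceId (16 bit).
RX_HEADER = struct.Struct("!IIIIIBBBBHH")

RX_PACKET_TYPE_ABORT = 4      # assumed type number for ABORT
RX_PACKET_TYPE_CHALLENGE = 6  # assumed type number for CHALLENGE


def parse_rx_packet(udp_payload):
    """Decode the Rx header and, for ABORT packets, the 32-bit abort code."""
    if len(udp_payload) < RX_HEADER.size:
        raise ValueError("payload too short for an Rx header")
    (epoch, cid, call_number, seq, serial,
     ptype, flags, user_status, security_index,
     spare, service_id) = RX_HEADER.unpack_from(udp_payload)
    info = {
        "cid": cid,
        "call_number": call_number,
        "serial": serial,
        "type": ptype,
        "security_index": security_index,
        "service_id": service_id,
    }
    # An ABORT packet carries the abort code as the first 4 payload bytes.
    if ptype == RX_PACKET_TYPE_ABORT and len(udp_payload) >= RX_HEADER.size + 4:
        info["abort_code"] = struct.unpack_from("!i", udp_payload, RX_HEADER.size)[0]
    return info
```

Comparing the source address of the first packet with type 4 against
192.168.18.2 and 192.168.18.39 then shows whether the server or the
client aborted first.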

I've also captured the 'self-healing' of the client state, although I
can't make anything of it myself. The full trace is at


http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap

It seems that 118 minutes after the failure the client makes a get-time
call which succeeds, and then everything is happy again.


Ciao,
                                        Roland

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
