Hi again!

On Mon, 22 Aug 2005, Roland Kuhn wrote:

> Hi Jeffrey!
>
> On Mon, 22 Aug 2005, Jeffrey Altman wrote:
>
>> Roland Kuhn wrote:
>>> Hi folks!
>>>
>>> On Sun, 21 Aug 2005, Derrick J Brashear wrote:
>>>
>>>> it needs to include the first error packet, e.g. the window where it
>>>> loses contact, to be useful
>>>
>>> Okay, it happened again, and I have a full trace:
>>>
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace.cap
>>> http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs-fail-trace-end.cap
>>>
>>> The latter contains only the last 81 frames and begins a few frames
>>> before the request which fails. The former is 10MB in size. If you need
>>> more history, I also have the last 1GB of the connection available.
>>> 192.168.18.2 is the server, 192.168.18.39 the client. The access is for
>>> big files typically.
>>>
>>> Ciao,
>>>                     Roland
>>
>> The Abort code is RXKADEXPIRED (19270409L). Would you verify that you
>> still have a valid token and that your system clocks are in sync?
>
> The clocks are perfectly synchronized and I'm pretty sure that the batch
> jobs have valid tokens, otherwise I would see other failures as well.
> Also, wouldn't it be very nasty to effectively disable a complete client
> because one connection has no valid token?
>
> The other thing is: it is the _client_ which sends the first ABORT in
> response to a challenge...

I've also captured the 'self-healing' of the client state, although I can't make sense of it myself. The full trace is at

http://www.e18.physik.tu-muenchen.de/~rkuhn/openafs.cap

It seems that 118 minutes after the failure, the client makes a get-time call that succeeds, and from then on everything works again.

Ciao,
                                        Roland

_______________________________________________
OpenAFS-devel mailing list
https://lists.openafs.org/mailman/listinfo/openafs-devel
