Carsten Jacobi wrote:
Hello,

Starting last month, we have been facing "Lost contact with fileserver"
situations on one of our zLinux systems (Novell SLES-9 distribution).
After further investigation we found that the cause of the
"Lost contact" hang seems to be our AFS client (version 1.4.5) not
replying to whoareyou() calls from the fileserver.
We have used tcpdump to record all packets we hope are essential to
track down the problem. For example, in normal operation we see the
whoareyou() call answered by our AFS client in about 40 to 100 µsec:

10:30:00.945453 IP fs13.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:30:00.945499 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs13.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
10:30:08.941373 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:30:08.941455 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
10:30:08.952207 IP fs25.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:30:08.952266 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs25.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
10:30:24.173003 IP fs13.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:30:24.173042 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs13.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)
10:30:24.176168 IP fs11.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:30:24.176213 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs11.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)

mclinx is our AFS client and fsxx are AFS fileservers. We see those
whoareyou() calls and replies all the time, but sometimes our client
does not respond:

At 10:31 AM, the first whoareyou from fs15 goes unanswered:
10:31:22.808760 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (244)
10:31:22.809183 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs20.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (88)
10:31:22.809368 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
1802411095/5/1330193 afsuuid [|cb] (52)
10:31:22.809602 IP mclinx.xxx.xx.xx.xx.afs3-callback >
fs15.xxx.xx.xx.xx.afs3-fileserver:  rx data fs call give-cbs (244) (see
below at 10:35 AM)
10:31:22.810046 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:31:23.134195 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
1802410300/17405/631375 afsuuid [|cb] (52)
10:31:23.163772 IP fs20.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
1802411095/5/1330193 afsuuid [|cb] (52)
10:31:23.166077 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)
10:31:23.489312 IP fs15.xxx.xx.xx.xx.afs3-fileserver >
mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call callback fid
1802410300/17405/631375 afsuuid [|cb] (52)

Here, fs15 sends a whoareyou() which doesn't get a reply, and about
a third of a second later another whoareyou() is sent to the AFS client
on mclinx. Neither of them gets an answer.
To make a long story short, the fileserver fs15 will send initcb() to
the AFS client two minutes later, and another two minutes after that
we see the first rx abort packet sent to the AFS client, which
makes the AFS client report "Lost contact" with fs15 in the
system log (at least this is my interpretation).
Unfortunately, the AFS client won't respond to any whoareyou()
from other fileservers until 10:45 AM in our log, which ends
in "Lost contact" with all the fileservers and all
AFS activity freezing for about a quarter of an hour until the
connections are reported "back up" again.
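
The pattern described above (whoareyou() calls answered within
microseconds in normal operation, then calls from one fileserver left
without any reply) could be pulled out of a long capture mechanically.
The following is only a minimal sketch, not part of the original
report. It assumes tcpdump's text output with one packet per line (the
trace quoted in this mail is wrapped by the mailer), and since that
output shows no rx call number, it assumes a reply belongs to the
oldest outstanding call from the same fileserver. The names LINE_RE
and analyze are made up for illustration.

```python
import re
from datetime import datetime

# One unwrapped tcpdump text line for an rx callback whoareyou call or reply.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP "
    r"(?P<src>\S+)\.afs3-(?:fileserver|callback) > "
    r"(?P<dst>\S+)\.afs3-(?:fileserver|callback):\s+"
    r"rx data cb (?P<kind>call|reply) whoareyou\b"
)


def analyze(lines):
    """Pair whoareyou calls with replies.

    Returns (latencies, unanswered):
      latencies:  list of (fileserver, seconds) for each answered call
      unanswered: list of (fileserver, timestamp string) for calls that
                  never got a reply within the capture
    Assumption: a reply answers the oldest pending call from that server.
    Timestamps carry no date, so a capture crossing midnight is not handled.
    """
    pending = {}    # fileserver -> list of (datetime, raw timestamp)
    latencies = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue    # some other packet type, ignore it
        ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
        if m["kind"] == "call":
            # Call goes fileserver -> client, so the source is the fileserver.
            pending.setdefault(m["src"], []).append((ts, m["ts"]))
        else:
            # Reply goes client -> fileserver, so the destination is the fileserver.
            queue = pending.get(m["dst"])
            if queue:
                sent, _ = queue.pop(0)
                latencies.append((m["dst"], (ts - sent).total_seconds()))
    unanswered = [(srv, raw) for srv, q in pending.items() for _, raw in q]
    return latencies, unanswered


if __name__ == "__main__":
    sample = [
        "10:30:00.945453 IP fs13.xxx.xx.xx.xx.afs3-fileserver > "
        "mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)",
        "10:30:00.945499 IP mclinx.xxx.xx.xx.xx.afs3-callback > "
        "fs13.xxx.xx.xx.xx.afs3-fileserver:  rx data cb reply whoareyou (460)",
        "10:31:22.810046 IP fs15.xxx.xx.xx.xx.afs3-fileserver > "
        "mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)",
        "10:31:23.166077 IP fs15.xxx.xx.xx.xx.afs3-fileserver > "
        "mclinx.xxx.xx.xx.xx.afs3-callback:  rx data cb call whoareyou (32)",
    ]
    latencies, unanswered = analyze(sample)
    for server, seconds in latencies:
        print(f"{server}: answered after {seconds * 1e6:.0f} usec")
    for server, raw in unanswered:
        print(f"{server}: no reply to call sent at {raw}")
```

Fed the four lines above, taken from the traces in this mail, it
reports one answered call from fs13 and two unanswered calls from
fs15. Any call still pending at the end of the capture is counted as
unanswered, so the last few lines of a capture need a second look.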

My question is: What can block an AFS-Client from answering
whoareyou() for several minutes? Are there any limits or restrictions
that can lead an AFS client to a situation where it is internally blocked?
Are there parameters one can adjust for tuning in order to avoid this
situation?
Lately we have had these "Lost contact" periods once every two days,
and they are painful for users who are logged on to our system
during that time. I would be happy to get rid of them somehow ...

This may be a silly question, but are there any firewalls or NAT running on the fileserver, client, or in between?

If a firewall blocks the whoareyou() message, that would explain why there is no reply.

Sincerely,
Jason
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
