Hi Harald!

On 8 Sep 2005, at 18:09, Harald Barth wrote:

Examine the capture you took yesterday when things were not working.
Look for the AFS kvno in one of the messages from the client to the server. Since you are using some variant of the Kerberos 5 based tokens, the kvno should always be reported as 213 or 256. If it is anything else, then the
client is confused.
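
A minimal sketch of that check, assuming the kvno values have already been read out of the capture by hand. The function name and the sample list are made up for illustration; only the two expected values come from the text above.

```python
# kvno values expected for Kerberos 5 based tokens, as stated above.
EXPECTED_KRB5_KVNOS = {213, 256}

def unexpected_kvnos(observed):
    """Return the sorted, de-duplicated kvnos that are neither 213 nor 256."""
    return sorted(set(observed) - EXPECTED_KRB5_KVNOS)

# Hypothetical values copied out of a capture, not real data.
sample = [256, 256, 213, 4]
print(unexpected_kvnos(sample))
```

An empty result would mean every kvno seen in the capture is one of the two expected values.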


Either my snapshot is not good enough or my ethereal driving license is not
adequate.

But I see more symptoms that indicate that we may have hunted but not
completely killed that bug:

Sep 8 17:05:30 d10n03 kxd[1204]: from fjell.pdc.kth.se (130.237.221.161): [EMAIL PROTECTED] -> lama

Sep 8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)
Sep 8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)

Sep 8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired

Sep 8 17:12:50 d10n03 kernel: afs: file server 130.237.232.195 in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)
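
To put a number on how long the server stayed flagged down, the log lines above can be paired up mechanically. This is only a sketch: it assumes the syslog format shown here, takes the year as a parameter because syslog does not record it, and the helper names are invented for illustration.

```python
from datetime import datetime

# Log lines copied from the excerpt above (one "Lost contact" line kept).
LOG = """\
Sep 8 17:05:32 d10n03 kernel: afs: Lost contact with file server 130.237.232.195 in cell pdc.kth.se (all multi-homed ip addresses down for the server)
Sep 8 17:05:35 d10n03 kernel: afs: Tokens for user of AFS id 12020 for cell pdc.kth.se have expired
Sep 8 17:12:50 d10n03 kernel: afs: file server 130.237.232.195 in cell pdc.kth.se is back up (multi-homed address; other same-host interfaces may still be down)
"""

def parse_time(line, year=2005):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp of a line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def outage_seconds(log_text, server, year=2005):
    """Return the length in seconds of each down/back-up interval for one server."""
    down_since = None
    outages = []
    for line in log_text.splitlines():
        if server not in line:
            continue
        if "Lost contact with file server" in line:
            # Keep the first "down" timestamp if the message is repeated.
            if down_since is None:
                down_since = parse_time(line, year)
        elif "is back up" in line and down_since is not None:
            outages.append((parse_time(line, year) - down_since).total_seconds())
            down_since = None
    return outages

print(outage_seconds(LOG, "130.237.232.195"))
```

For the excerpt above this gives one interval, from 17:05:32 to 17:12:50.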

1. User logs in, which in this case probably means that an expired
   ticket is used as a token.

AFAICT this did not happen here, no tickets involved. The batch job gets its token via AFS library from password.

2. Client complains that the server which has the user's $HOME
   is all down

Here it didn't affect /afs but only the fileserver which hosts the big data files.

3. Client discovers that the token has expired

I've never seen the 'Tokens expired' log message in connection with the "Lost contact" one; they were mutually exclusive. The only message logged between down and back was 'afs: failed to store file (110)' (110 -> Connection timed out), sometimes several times.
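
For anyone who wants to double-check the 110 mapping on their own client, here is a small sketch using Python's standard errno table. The numbers are platform-specific, so the 110 lookup is only meaningful on the Linux box that wrote the log; the helper name is made up.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# 110 is the value from the 'failed to store file (110)' message.
print(describe_errno(110))
```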

4. Some minutes later the client recovers. The problem is: if a batch
job tried to start between 17:06 and 17:11, it would crash because AFS
is not available at that very moment.

Well, that sounds familiar, but here it took almost two hours in all cases.

So how can I prevent the server from being flagged down because of an
expired token? It still seems to me like a timing issue - sometimes the
server is flagged down first (which gives great grief) and sometimes
the client discovers that this one connection was no big deal and
nukes the connection first.

AFS version is 1.3.87, which has patch
checkservers-set-back-deadtime-correctly-20050804, and I added patch
rx-propagate-error-20050902 too.

Roland: And you don't see this any more? In this case: Lucky you.

If only I had the time to try the multi-threaded fileserver, _then_ I would be lucky ;-) Using the single-threaded one with 40 clients reading simultaneously isn't fun at all :-(

Ciao,
                    Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str. 85747 Garching
Telefon 089/289-12592; Telefax 089/289-12570
--
A mouse is a device used to point at
the xterm you want to type in.
Kim Alm on a.s.r.
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P-(+) L+++ E(+) W+ !N K- w--- M + !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++>++++ h---- y+++
------END GEEK CODE BLOCK------

