I would be interested in seeing a tcpdump of the traffic between the affected client and the file server, covering the period from a few minutes before the build failure until the client marks the server up again.
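Since the failures are intermittent, one way to get such a capture might be to leave tcpdump running on the client in a rotating set of files and stop it once the failure shows up. This is only a sketch: the interface name (eth0), the output path, and the file size and count are assumptions to adapt, and the server address is the one from the log in the original message.

```shell
# Hypothetical values; adjust for the affected client.
IFACE="eth0"                      # assumed interface name
SERVER="192.168.15.33"            # file server address from the log
OUT="/var/tmp/afs-client.pcap"    # assumed output path (must be writable)

# AFS Rx traffic runs over UDP (typically 7000 on the file server and
# 7001 for the client cache manager), so capture all UDP to/from the server.
FILTER="host ${SERVER} and udp"

capture_afs() {
    # -s 0       capture full packets
    # -C 100     start a new file after roughly 100 million bytes
    # -W 10      keep at most 10 files, overwriting the oldest
    tcpdump -i "$IFACE" -s 0 -C 100 -W 10 -w "$OUT" "$FILTER"
}
```

Run capture_afs as root before starting a build, interrupt it after the "is back up" message appears in /var/log/messages, and keep the files whose timestamps cover the failure.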
-----Original Message-----
From: Mark Henry
Sent: Friday, July 08, 2011 5:46 PM
To: [email protected]
Subject: [OpenAFS] errors in afs when multiple tasks are running

We have been getting periodic build failures when building in afs. Here is /var/log/messages at the time of the failure:

Jul 6 20:58:01 hostname /usr/sbin/cron[10555]: pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: exit (ignore)
Jul 6 20:58:01 hostname /usr/sbin/cron[10556]: (userid1) CMD (${K5S_USERID1} -- /bin/sh -c "/a/p/cpui/build/dir/check_for_bld_request.sh > /a/p/cpui/build/dir/hostname_check.out 2>&1" >> /home/userid1/k5start.out 2>&1)
Jul 6 20:58:02 hostname kernel: afs: Lost contact with file server 192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the server)
Jul 6 20:58:02 hostname kernel: afs: Lost contact with file server 192.168.15.33 in cell cellname.com (all multi-homed ip addresses down for the server)
Jul 6 20:58:04 hostname kernel: afs: Tokens for user of AFS id -1 for cell cellname.com have expired
Jul 6 20:58:07 hostname kernel: afs: failed to store file (110)
Jul 6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul 6 20:58:08 hostname kernel: afs: failed to store file (110)
Jul 6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell cellname.com is back up (multi-homed address; other same-host interfaces may still be down)
Jul 6 20:58:23 hostname kernel: afs: file server 192.168.15.33 in cell cellname.com is back up (multi-homed address; other same-host interfaces may still be down)
Jul 6 21:00:01 hostname /usr/sbin/cron[10588]: pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)
Jul 6 21:00:01 hostname /usr/sbin/cron[10587]: pam_krb5_compiled(crond:account): pam_sm_acct_mgmt: entry (0x8000)

I don't find any errors on the file server that it loses connection to. Our afs servers are on AIX and the client system running the build is opensuse 11.1 with the afs client at 1.4.11.

Every minute a simple script (check_for_bld_request.sh, which writes hostname_check.out) goes out to afs and looks for a file. Most times there are no problems. Occasionally this harmless script seems to mess up the connection to the file server for the running build (the "Lost contact" error always occurs a few seconds after the minute). When we moved the script to run at 20 seconds after the minute, the errors followed the same pattern, only 20 seconds later. This has happened with multiple scripts that access afs, so the scripts themselves don't seem to be the problem.

The build uses k5start for creds, which seems fine. The errors occur on different systems (all opensuse) at random times, so they are hard to trace. We also increased the size of the afs cache to 5g hoping that would help, but it didn't seem to.

Any ideas?

Mark Henry
Advisory Software Engineer
Ricoh Production Print Solutions, LLC

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
