There is something strange about the way selfcheck timeouts are generated. I've set ctimeout to 120 in amanda.conf on the server, rebooted all the hosts, and run amcheck multiple times. Selfcheck requests still time out sporadically, and clearly earlier than ctimeout. There are lower-level timeouts logged by amandad on the server and on the client. Do I need to tune something in the lower-level protocols?

Here is an amandad.*.debug (edited down) from the server when a client timed out:

amandad: debug 1 pid 2294 ruid 33224 euid 33224 start time Wed Jun 5 10:42:31 2002
amandad: version 2.4.2p2
[...snip...]
amandad: dgram_recv: timeout after 30 seconds
amandad: error receiving message: timeout
error receiving message: timeout
amandad: pid 2294 finish time Wed Jun 5 10:43:01 2002

Problems are logged wrt dgram_recv at the client too, sometimes overcome but sometimes leading to "giving up", even when the client is the server machine itself. Here is a (slightly condensed) amandad.*.debug from a timed out client:

amandad: debug 1 pid 1143 ruid 33224 euid 33224 start time Wed Jun 5 10:41:56 2002
[...snip...]
amandad: version 2.4.2p2
got packet:
--------
Amanda 2.4 REQ HANDLE 006-10068D10 SEQ 1023288140
SECURITY USER amanda
SERVICE selfcheck
OPTIONS ;
GNUTAR /smith 0 OPTIONS |;bsd-auth;compress-fast;index;exclude-list=/usr/local/lib/amanda/exclude.gtar;
[...snip out various other options...]
GNUTAR /dev/hda2 0 OPTIONS |;bsd-auth;compress-fast;index;exclude-list=/usr/local/lib/amanda/exclude.gtar;
--------

sending ack:
----
Amanda 2.4 ACK HANDLE 006-10068D10 SEQ 1023288140
----

bsd security: remote host raleigh.cpt.afip.org user amanda local user amanda
amandahosts security check passed
amandad: running service "/usr/local/amanda_client/libexec/selfcheck"
amandad: sending REP packet:
----
Amanda 2.4 REP HANDLE 006-10068D10 SEQ 1023288140
OPTIONS ;
OK /smith
[...snip out lot of other OK's...]
OK /etc has more than 64 KB available.
----

amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: got packet:
----
Amanda 2.4 REQ HANDLE 006-10068CA0 SEQ 1023288156
SECURITY USER amanda
SERVICE selfcheck
OPTIONS ;
GNUTAR /smith 0 OPTIONS |;bsd-auth;compress-fast;index;exclude-list=/usr/local/lib/amanda/exclude.gtar;
[...snip out options...]
GNUTAR /dev/hda2 0 OPTIONS |;bsd-auth;compress-fast;index;exclude-list=/usr/local/lib/amanda/exclude.gtar;
----

amandad: weird, it's not a proper ack
addr: peer 10.20.30.55 dup 10.20.30.55, port: peer 559 dup 582
amandad: dgram_recv: recvfrom() failed: Connection refused
amandad: waiting for ack: Connection refused, retrying
amandad: dgram_recv: recvfrom() failed: Connection refused
amandad: waiting for ack: Connection refused, retrying
amandad: dgram_recv: recvfrom() failed: Connection refused
amandad: waiting for ack: Connection refused, retrying
amandad: dgram_recv: recvfrom() failed: Connection refused
amandad: waiting for ack: Connection refused, giving up!
amandad: pid 1143 finish time Wed Jun 5 10:42:27 2002

Somewhat condensed results from multiple runs of amcheck are given below, but I doubt that these are relevant -- it seems the problem is at a lower level than amcheck reports. Clearly one client (optimas3) times out most often, though it succeeded on the first five checks. Two other clients "come and go" with timeouts. One of these (raleigh) is the server itself. Other clients are occasionally affected (in other checks).

I am still wondering why clients come and go on the failure list, and especially why the timeouts happen so fast. I don't see anything in the selfcheck.*.debug logs on affected hosts (it looks like no such file is created for the amcheck instances that time out).

Thanks.

Robert L. Becker, Jr.
Col, USAF, MC
Department of Cellular Pathology
Armed Forces Institute of Pathology
Washington, DC 20306-6000
301-319-0300

raleigh 23% amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily

Amanda Backup Client Hosts Check
--------------------------------
Client check: 7 hosts checked in 10.034 seconds, 0 problems found

(brought to you by Amanda 2.4.2p2)

[snip... 4 more sets of output, all in =<31sec, all with 0 problems found...]

Amanda Backup Client Hosts Check
--------------------------------
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 29.644 seconds, 1 problem found

(brought to you by Amanda 2.4.2p2)

[...ok, run it again...]

raleigh 24% !!
amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily

Amanda Backup Client Hosts Check
--------------------------------
WARNING: charlotte.cpt.afip.org: selfcheck request timed out. Host down?
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 30.170 seconds, 2 problems found

(brought to you by Amanda 2.4.2p2)

Amanda Backup Client Hosts Check
--------------------------------
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 29.743 seconds, 1 problem found

(brought to you by Amanda 2.4.2p2)

Amanda Backup Client Hosts Check
--------------------------------
WARNING: raleigh.cpt.afip.org: selfcheck request timed out. Host down?
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 30.097 seconds, 2 problems found

(brought to you by Amanda 2.4.2p2)

Amanda Backup Client Hosts Check
--------------------------------
NFS version 3 mount failed, trying NFS version 2
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 29.652 seconds, 1 problem found

(brought to you by Amanda 2.4.2p2)

[...snip out two more sets of output, both flagging optimas3 timeout; logout, login again and test once more...]

raleigh 112% exit
raleigh 113% logout
Connection closed.
wesser 12% rsh raleigh
Password:
You have mail.
raleigh 1% amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily ; amcheck -c Daily

Amanda Backup Client Hosts Check
--------------------------------
WARNING: raleigh.cpt.afip.org: selfcheck request timed out. Host down?
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 30.215 seconds, 2 problems found

(brought to you by Amanda 2.4.2p2)

Amanda Backup Client Hosts Check
--------------------------------
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 30.037 seconds, 1 problem found

(brought to you by Amanda 2.4.2p2)

[...snip out two more sets of output flagging optimas3 only...]

Amanda Backup Client Hosts Check
--------------------------------
WARNING: charlotte.cpt.afip.org: selfcheck request timed out. Host down?
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 30.016 seconds, 2 problems found

(brought to you by Amanda 2.4.2p2)

Amanda Backup Client Hosts Check
--------------------------------
WARNING: optimas3.cpt.afip.org: selfcheck request timed out. Host down?
Client check: 7 hosts checked in 31.370 seconds, 1 problem found

(brought to you by Amanda 2.4.2p2)

raleigh 2%
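
For what it's worth, here is a rough check of the "earlier than ctimeout" claim, using only the start and finish timestamps from the two amandad debug excerpts above. It is a minimal sketch: the helper name elapsed_seconds and the hard-coded strings are just for illustration and are not part of Amanda.

```python
from datetime import datetime

# Timestamp layout used in the amandad debug lines, e.g. "Wed Jun 5 10:42:31 2002"
FMT = "%a %b %d %H:%M:%S %Y"

# Value I set in amanda.conf on the server
CTIMEOUT = 120


def elapsed_seconds(start, finish):
    """Seconds between an amandad 'start time' and 'finish time' stamp."""
    delta = datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Stamps copied from the two debug excerpts (pid 2294 and pid 1143)
server = elapsed_seconds("Wed Jun 5 10:42:31 2002", "Wed Jun 5 10:43:01 2002")
client = elapsed_seconds("Wed Jun 5 10:41:56 2002", "Wed Jun 5 10:42:27 2002")

for name, secs in (("server amandad (pid 2294)", server),
                   ("client amandad (pid 1143)", client)):
    print(f"{name}: ran {secs}s before finishing, ctimeout is {CTIMEOUT}s")
```

Both instances finish after about 30 seconds, which matches the roughly 30-second amcheck runs that report timeouts and is far short of the 120 seconds I configured.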