My RW server went bump in the night last night. After rebooting, everything came back up as normal but attempting to access either /afs/icequake.net or /afs/.icequake.net would result in "connection timed out".
I have restarted all fileservers and all clients with only the following to note: after the client is restarted, the first request to /afs will pause for a few seconds before returning the timeout error, then subsequent requests return timeout immediately. fs checks/checkv had no effect except to introduce the pause on the first request again.

Needless to say this is baffling. There is nothing interesting in the logs or udebug output, but maybe someone else might disagree. 10.0.1.230 is the ubik master and 10.0.1.232 is the RW fileserver.

# udebug 10.0.1.230 7003
Host's addresses are: 10.0.1.230 65.38.17.159
Host's 10.0.1.230 time is Sat Sep 17 10:19:38 2011
Local time is Sat Sep 17 10:19:38 2011 (time differential 0 secs)
Last yes vote for 10.0.1.230 was 6 secs ago (sync site);
Last vote started 6 secs ago (at Sat Sep 17 10:19:32 2011)
Local db version is -1438751922.1777336322
I am sync site until 53 secs from now (at Sat Sep 17 10:20:31 2011) (3 servers)
Recovery state 1f
Sync site's db version is -1438751922.1777336322
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
         1145824 secs ago (at Sun Sep 4 04:02:34 2011)

Server (10.0.1.233 65.38.17.160): (db -1438751922.1777336322)
    last vote rcvd 7 secs ago (at Sat Sep 17 10:19:31 2011),
    last beacon sent 6 secs ago (at Sat Sep 17 10:19:32 2011), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

Server (10.0.1.232 65.38.17.158): (db -1438751922.1777336322)
    last vote rcvd 7 secs ago (at Sat Sep 17 10:19:31 2011),
    last beacon sent 6 secs ago (at Sat Sep 17 10:19:32 2011), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

# udebug 10.0.1.230 7002
Host's addresses are: 10.0.1.230 65.38.17.159
Host's 10.0.1.230 time is Sat Sep 17 10:19:37 2011
Local time is Sat Sep 17 10:19:39 2011 (time differential 2 secs)
Last yes vote for 10.0.1.230 was 7 secs ago (sync site);
Last vote started 7 secs ago (at Sat Sep 17 10:19:32 2011)
Local db version is 1313883291.5
I am sync site until 50 secs from now (at Sat Sep 17 10:20:29 2011) (3 servers)
Recovery state 1f
Sync site's db version is 1313883291.5
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
         2389486 secs ago (at Sat Aug 20 18:34:53 2011)

Server (10.0.1.233 65.38.17.160): (db 1313883291.5)
    last vote rcvd 8 secs ago (at Sat Sep 17 10:19:31 2011),
    last beacon sent 7 secs ago (at Sat Sep 17 10:19:32 2011), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

Server (10.0.1.232 65.38.17.158): (db 1313883291.5)
    last vote rcvd 10 secs ago (at Sat Sep 17 10:19:29 2011),
    last beacon sent 7 secs ago (at Sat Sep 17 10:19:32 2011), last vote was yes
    dbcurrent=1, up=1 beaconSince=1

# cat FileLog
Sat Sep 17 10:04:45 2011 File server starting (/usr/lib/openafs/dafileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind -cb 4000000 -vattachpar 128 -vlruthresh 1440 -vlrumax 8 -vhashsize 11)
Sat Sep 17 10:04:45 2011 afs_krb_get_lrealm failed, using icequake.net.
Sat Sep 17 10:04:46 2011 VLRU: starting scanner with the following configuration parameters:
Sat Sep 17 10:04:46 2011 VLRU: offlining volumes after minimum of 86400 seconds of inactivity
Sat Sep 17 10:04:46 2011 VLRU: running VLRU soft detach pass every 120 seconds
Sat Sep 17 10:04:46 2011 VLRU: taking up to 8 volumes offline per pass
Sat Sep 17 10:04:46 2011 VLRU: scanning generation 0 for inactive volumes every 10800 seconds
Sat Sep 17 10:04:46 2011 VLRU: scanning for promotion/demotion between generations 0 and 1 every 172800 seconds
Sat Sep 17 10:04:46 2011 VLRU: scanning for promotion/demotion between generations 1 and 2 every 345600 seconds
Sat Sep 17 10:04:46 2011 Set thread id 3 for FSYNC_sync
Sat Sep 17 10:04:46 2011 VInitVolumePackage: beginning parallel fileserver startup
Sat Sep 17 10:04:46 2011 VInitVolumePackage: using 1 threads to pre-attach volumes on 1 partitions
Sat Sep 17 10:04:46 2011 Scanning partitions on thread 1 of 1
Sat Sep 17 10:04:46 2011 Partition /vicepa: pre-attaching volumes
Sat Sep 17 10:04:46 2011 Partition scan thread 1 of 1 ended
Sat Sep 17 10:04:46 2011 fs_stateRestore: commencing fileserver state restore
Sat Sep 17 10:04:46 2011 fs_stateRestore: host table restored
Sat Sep 17 10:04:46 2011 fs_stateRestore: FileEntry and CallBack tables restored
Sat Sep 17 10:04:46 2011 fs_stateRestore: host table indices remapped
Sat Sep 17 10:04:46 2011 fs_stateRestore: FileEntry and CallBack indices remapped
Sat Sep 17 10:04:46 2011 fs_stateRestore: restore phase complete
Sat Sep 17 10:04:46 2011 fs_stateRestore: beginning state verification phase
Sat Sep 17 10:04:46 2011 h_stateVerifyUuidHash: warning: uuid hash entry points to different host struct (1, 0)
Sat Sep 17 10:04:46 2011 fs_stateRestore: fileserver state verification complete
Sat Sep 17 10:04:46 2011 fs_stateRestore: restore was successful
Sat Sep 17 10:04:46 2011 Set thread id 0000007E for 'FiveMinuteCheckLWP'
Sat Sep 17 10:04:46 2011 Getting FileServer name...
Sat Sep 17 10:04:46 2011 Set thread id 00000081 for 'HostCheckLWP'
Sat Sep 17 10:04:46 2011 FileServer host name is 'valhalla'
Sat Sep 17 10:04:46 2011 Getting FileServer address...
Sat Sep 17 10:04:46 2011 Set thread id 00000083 for 'FsyncCheckLWP'
Sat Sep 17 10:04:46 2011 FileServer valhalla has address 10.0.1.232 (0xe801000a or 0xa0001e8 in host byte order)
Sat Sep 17 10:04:46 2011 File Server started Sat Sep 17 10:04:46 2011

-- 
Ryan C. Underwood, <neme...@icequake.net>