I have a small OpenAFS 1.8.6 setup using the Debian and Ubuntu packages. Last night everything was working fine; this morning, machines were timing out trying to talk to volume servers. Database replication was also stuck.
While there is a single backup database and file server, databases and volumes are primarily on a single server. I logged in to that server ("afs1"), made it the only machine in the cell by editing the client and server CellServDB, and set out to restore things.

afs1 is running Debian bullseye. Kernel 5.8 (running at the time when things broke) and 5.10 result in an equally non-functional system. There are no iptables rules on the system.

OpenAFS is almost 100% dead for no apparent reason:

- "pts listentries" and "vos listvldb localhost" work. udebug shows both servers in recovery state 1f, site is sync site and there are no replicas (as expected at this point).

- After restarting services, vos status -localauth -server localhost prints the following:

    Could not access status information about the server
    Possible communication failure
    Error in vos status command.
    Possible communication failure

- After a while, vos status no longer prints anything; it just hangs. All AFS client access times out.

- There is almost nothing in the logs. Starting vlserver/ptserver/dafileserver with -d 125 doesn't lead to any extra output. Nothing out of the ordinary (except AFS client errors) appears in dmesg or journalctl -b. After starting dafileserver -L, the following log appears:

    Thu Jan 14 11:59:54 2021 File server starting (/usr/lib/openafs/dafileserver -L)
    Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
    Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
    Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
    [the last message keeps repeating]

- dasalvager appears to run successfully. I'm currently running a voldump to recover data and it's running fine so far. There is plenty of disk space.

- Kerberos appears to be working. kinit works, aklog works, and pts/vos commands without -localauth work when a superuser token is present. The KDC (Samba) doesn't show any problems related to the afs principal. Clocks are accurate.

- Rebooting the whole system (a qemu VM) makes no difference.

After four hours of debugging, I'm at my wits' end. Even temporarily removing all databases, restarting ptserver and vlserver, and touching NoAuth won't make fileserver/volserver happy. It seems like RX communication is failing somehow, but I have no idea why.

Any ideas what's going on here?

-Valtteri

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
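For reference, the dafileserver log messages quoted earlier in this message can be tallied by message and code with a short script, which makes it easier to see which failure is the one that keeps repeating. This is only a minimal sketch: the function name tally_codes, the regular expressions, and the SAMPLE list are illustrative assumptions and not part of OpenAFS. The sample lines are copied from the log quoted above.

```python
import re
from collections import Counter

# Matches "code=<integer>" as it appears in the log lines quoted above.
CODE_RE = re.compile(r"code=(-?\d+)")
# Matches a leading timestamp such as "Thu Jan 14 11:59:54 2021 ".
TS_RE = re.compile(r"^\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4} ")


def tally_codes(lines):
    """Count (message, code) pairs for every line that carries a code= field."""
    counts = Counter()
    for line in lines:
        match = CODE_RE.search(line)
        if match is None:
            continue  # e.g. the "File server starting" line has no code
        message = TS_RE.sub("", line).strip()
        # Use the text before the first ';' as a short message key.
        key = message.split(";")[0]
        counts[(key, int(match.group(1)))] += 1
    return counts


# Log lines copied from the message above.
SAMPLE = [
    "Thu Jan 14 11:59:54 2021 File server starting (/usr/lib/openafs/dafileserver -L)",
    "Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)",
    "Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
]

if __name__ == "__main__":
    for (message, code), n in tally_codes(SAMPLE).items():
        print(f"{n}x code={code}: {message}")
```

To run it against a real log instead of SAMPLE, the lines of the file can be passed in directly, for example tally_codes(open(path)), where path is wherever the fileserver log lives on the system in question.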