Hi, I was wondering if anyone has seen something like this, or has suggestions about how I could debug the issue should it happen again.
We are moving our desktop environment from SL7 to Ubuntu 20.04 LTS. After a couple of weeks of trouble-free performance, on Monday two different users on different machines (KVM guests, if that makes any difference) suffered cache corruption in their home directories within a couple of hours of each other. The messages in syslog looked like:

  Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13859.4201870 [inf.ed.ac.uk] @ffffb303425613c8, pos 0)
  Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13995.4201950 [inf.ed.ac.uk] @ffffb303423b7ec8, pos 0)
  Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13997.4201995 [inf.ed.ac.uk] @ffffb303423b75c8, pos 0)
  Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13737.4201771 [inf.ed.ac.uk] @ffffb303423b69c8, pos 0)

One user also saw input/output errors when trying to access some files. There were a number of byte-range locking warnings in both syslogs, but none referred to anything in the corrupted directories. The effect of the corruption was the appearance of one or more entries of the form

  -????????? ? ? ? ? ? registrymodifications.xcu

when doing an ls of the affected directory. fs flush cleared up all but one of the issues. The remaining one required halting afsd and manually deleting the cache files to get things working again.

Both users were very near the upper limits of their quotas when this happened, but there was plenty of space in the file server partition and in both cache partitions. Both home volumes are on the same server and partition, but there’s no evidence of anything going wrong in the server logs, and none of our SL7 users have reported similar issues.

The Ubuntu machines are running openafs 1.8.4~pre1-1ubuntu2-debian. The server is running SL7.6, kernel 3.10.0-1062.4.3.el7.x86_64 and openafs-server-1.8.4-1.el7.x86_64.
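In case it helps anyone looking at similar logs, here is a minimal Python sketch for pulling the identifiers out of the "Corrupt directory" messages quoted above and counting them per volume. It assumes the four dotted numbers are cell index, volume ID, vnode and uniquifier, in that order; that reading is my guess from the look of the numbers, not something I have confirmed against the client source. The function and variable names are made up for illustration.

```python
import re
from collections import Counter

# Assumed field order inside the parentheses: cell.volume.vnode.uniquifier
CORRUPT_RE = re.compile(
    r"afs: Corrupt directory "
    r"\((\d+)\.(\d+)\.(\d+)\.(\d+) \[([^\]]+)\] @([0-9a-f]+), pos (\d+)\)"
)


def parse_corrupt_dirs(lines):
    """Return one dict per 'Corrupt directory' message found in lines."""
    hits = []
    for line in lines:
        m = CORRUPT_RE.search(line)
        if m is None:
            continue  # not a corruption message, skip it
        cell, volume, vnode, uniq, cellname, addr, pos = m.groups()
        hits.append({
            "cell": int(cell),
            "volume": int(volume),
            "vnode": int(vnode),
            "uniquifier": int(uniq),
            "cellname": cellname,
            "address": addr,
            "pos": int(pos),
        })
    return hits


# Two of the syslog lines from the report, used as sample input.
SAMPLE = [
    "Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory "
    "(5.536870965.13859.4201870 [inf.ed.ac.uk] @ffffb303425613c8, pos 0)",
    "Aug 3 15:38:52 gazebo kernel: afs: Corrupt directory "
    "(5.536870965.13995.4201950 [inf.ed.ac.uk] @ffffb303423b7ec8, pos 0)",
]

hits = parse_corrupt_dirs(SAMPLE)
by_volume = Counter(h["volume"] for h in hits)

for h in hits:
    print(h["cellname"], h["volume"], h["vnode"], h["uniquifier"])
```

Feeding it the whole of syslog (for example `parse_corrupt_dirs(open("/var/log/syslog"))`) should show quickly whether every corrupted directory belongs to the same volume.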
fs getcacheparms returns

  AFS using 51% of cache blocks (1068658 of 2097152 1k blocks)
            95% of the cache files (62256 of 65536 files)
  afs_cacheFiles: 65536
  IFFree: 3280
  IFEverUsed: 9654
  IFDataMod: 1
  IFDirtyPages: 0
  IFAnyPages: 0
  IFDiscarded: 1
  DCentries: 9998
    0k- 4K: 9087
    4k- 16k: 460
    16k- 64k: 70
    64k- 256k: 21
    256k- 1M: 6
    >=1M: 354
  [cache file usage over 90%, consider increasing '-files' argument to afsd]

on one machine and

  AFS using 29% of cache blocks (1783025 of 6098259 1k blocks)
            3% of the cache files (5900 of 190570 files)
  afs_cacheFiles: 190570
  IFFree: 184670
  IFEverUsed: 2270
  IFDataMod: 50
  IFDirtyPages: 0
  IFAnyPages: 0
  IFDiscarded: 0
  DCentries: 9998
    0k- 4K: 5639
    4k- 16k: 1638
    16k- 64k: 606
    64k- 256k: 308
    256k- 1M: 262
    >=1M: 1545

on the other.

Does anyone have any idea what might be going on, or any further steps I can take to investigate the problem if it happens again? All suggestions welcome!

Thanks in advance,

Craig.

---
Craig Strachan, Computing Officer, School of Informatics, University of Edinburgh

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.