Hi,

I was wondering if anyone has seen something like this, or has suggestions 
about how I could debug the issue should it happen again.

We are moving our desktop environment from SL7 to Ubuntu 20.04 LTS. After a 
couple of weeks of trouble-free performance, on Monday two different users on 
different machines (KVM guests, if that makes any difference) suffered problems 
with cache corruption in their home directories within a couple of hours of 
each other. The messages in syslog looked like:

Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13859.4201870 [inf.ed.ac.uk] @ffffb303425613c8, pos 0)
Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13995.4201950 [inf.ed.ac.uk] @ffffb303423b7ec8, pos 0)
Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13997.4201995 [inf.ed.ac.uk] @ffffb303423b75c8, pos 0)
Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory (5.536870965.13737.4201771 [inf.ed.ac.uk] @ffffb303423b69c8, pos 0)
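
For anyone wanting to tally messages like these, here is a minimal Python sketch that pulls the identifiers out of them. It assumes the four dotted numbers are cell.volume.vnode.unique, which is an interpretation of the message rather than something confirmed here, and the function names are made up for illustration. The pattern tolerates the message being wrapped before the opening parenthesis.

```python
import re
from collections import Counter

# Assumption: the dotted numbers are <cell>.<volume>.<vnode>.<unique>.
CORRUPT_RE = re.compile(
    r"afs: Corrupt directory\s+"
    r"\((\d+)\.(\d+)\.(\d+)\.(\d+) \[([^\]]+)\] @([0-9a-f]+), pos (\d+)\)"
)


def parse_corrupt_dirs(text):
    """Return one dict per 'Corrupt directory' message found in text."""
    return [
        {
            "cell_num": int(m.group(1)),
            "volume": int(m.group(2)),
            "vnode": int(m.group(3)),
            "unique": int(m.group(4)),
            "cell": m.group(5),
            "pos": int(m.group(7)),
        }
        for m in CORRUPT_RE.finditer(text)
    ]


def volumes_affected(entries):
    """Count how many corrupt-directory messages refer to each volume."""
    return Counter(e["volume"] for e in entries)


# Two of the messages quoted above: one on a single line, one wrapped.
SAMPLE = (
    "Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory "
    "(5.536870965.13859.4201870 [inf.ed.ac.uk] @ffffb303425613c8, pos 0)\n"
    "Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory \n"
    "(5.536870965.13995.4201950 [inf.ed.ac.uk] @ffffb303423b7ec8, pos 0)\n"
)

if __name__ == "__main__":
    entries = parse_corrupt_dirs(SAMPLE)
    for volume, count in volumes_affected(entries).items():
        print(volume, count)
```

If the second field really is a volume ID, it could be passed to vos examine to confirm which volume the affected directories belong to.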

One user also saw input/output errors when trying to access some files.

There were a number of byte-range locking warnings in both syslogs, but none 
of them referred to anything in the corrupted directories. The effect of the 
corruption was the appearance of one or more entries of the form

-????????? ? ?       ?         ?            ? registrymodifications.xcu

when doing an ls of the affected directory. Running fs flush cleared up all but 
one of the issues; the remaining one required halting afsd and manually deleting 
the cache files to get things working again.

Both users were very near the upper limits of their quotas when this happened, 
but there was plenty of space in the file server partition and in both cache 
partitions. Both home volumes are on the same server and partition, but there’s 
no evidence of anything going wrong in the server logs, and none of our SL7 
users have reported similar issues. The Ubuntu machines are running openafs 
1.8.4~pre1-1ubuntu2-debian; the server is running SL7.6 with kernel 
3.10.0-1062.4.3.el7.x86_64 and openafs-server-1.8.4-1.el7.x86_64. Running fs 
getcacheparms returns

AFS using    51% of cache blocks (1068658 of 2097152 1k blocks)
             95% of the cache files (62256 of 65536 files)
afs_cacheFiles:      65536
IFFree:               3280
IFEverUsed:           9654
IFDataMod:               1
IFDirtyPages:            0
IFAnyPages:              0
IFDiscarded:             1
DCentries:        9998
  0k-   4K:       9087
  4k-  16k:        460
 16k-  64k:         70
 64k- 256k:         21
256k-   1M:          6
      >=1M:        354
[cache file usage over 90%, consider increasing '-files' argument to afsd]

on one machine and

AFS using    29% of cache blocks (1783025 of 6098259 1k blocks)
              3% of the cache files (5900 of 190570 files)
afs_cacheFiles:     190570
IFFree:             184670
IFEverUsed:           2270
IFDataMod:              50
IFDirtyPages:            0
IFAnyPages:              0
IFDiscarded:             0
DCentries:        9998
  0k-   4K:       5639
  4k-  16k:       1638
 16k-  64k:        606
 64k- 256k:        308
256k-   1M:        262
      >=1M:       1545

on the other.
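
The percentages in both outputs can be recomputed from the raw counts, which may be useful for watching cache file usage over time. A minimal Python sketch, assuming the two summary lines keep the "(N of M 1k blocks)" and "(N of M files)" shape shown above; the function names are hypothetical and the 90% default simply mirrors the warning printed on the first machine.

```python
import re

# Matches "(1068658 of 2097152 1k blocks)" and "(62256 of 65536 files)".
USAGE_RE = re.compile(r"\((\d+) of (\d+) (1k blocks|files)\)")


def cache_usage(text):
    """Return used/total/percent for cache blocks and cache files."""
    usage = {}
    for used, total, kind in USAGE_RE.findall(text):
        key = "blocks" if kind == "1k blocks" else "files"
        used, total = int(used), int(total)
        usage[key] = {
            "used": used,
            "total": total,
            "percent": 100.0 * used / total,
        }
    return usage


def files_nearly_exhausted(text, threshold=90.0):
    """True when cache file usage exceeds threshold percent."""
    return cache_usage(text)["files"]["percent"] > threshold


# Summary lines copied from the two outputs above.
FIRST = (
    "AFS using    51% of cache blocks (1068658 of 2097152 1k blocks)\n"
    "             95% of the cache files (62256 of 65536 files)\n"
)
SECOND = (
    "AFS using    29% of cache blocks (1783025 of 6098259 1k blocks)\n"
    "              3% of the cache files (5900 of 190570 files)\n"
)

if __name__ == "__main__":
    for name, text in (("first", FIRST), ("second", SECOND)):
        print(name, files_nearly_exhausted(text))
```

On these figures only the first machine crosses the threshold, so cache file exhaustion on its own would not obviously explain both incidents.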

Does anyone have any idea what might be going on or any further steps I can 
take to investigate the problem if it happens again? All suggestions welcome!

Thanks in advance,
Craig.
---
Craig Strachan, Computing Officer,
School of Informatics, University of Edinburgh




The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.