A few days ago I encountered a very strange OpenAFS client issue that manifests in two ways:
* either the processes accessing the file-system get "stuck" reading (or perhaps opening) the files; (if one waits "long" enough, sometimes those processes finally complete their job; in this case the CPU doesn't go to 100%;)
* or, if one tries to `SIGTERM` the stuck processes, the CPU goes to 100% (on multiple cores) in kernel mode; (again, sometimes if one waits long enough, the system settles;)

The usage pattern is as follows:

* it is a typical "build" scenario, where a `make`-like tool (in this case `ninja`) heavily stats all the files it knows about to find changed or missing ones; (in my case there are about 90k files, all hosted on AFS; moreover I suspect `ninja` tries to stat these on multiple threads;)
* there are a few processes that do CPU-bound tasks, each reading a file (from AFS) and writing the output to another one (also on AFS); (the concurrency level doesn't seem to matter much: the issue appears with anything from 128 parallel processes down to 4;)

I was able to replicate this issue each time I ran the build and sent `SIGTERM`. When I let the whole build process run overnight, it eventually completed.
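The stat-heavy part of the pattern above can be approximated outside the build with a small script. This is only a sketch under assumptions: it is not how `ninja` actually works internally (I only suspect it stats on multiple threads), and the names `list_files`, `try_stat` and `stat_all`, as well as the default of 8 workers, are made up for illustration.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Collect the paths of all files found under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def try_stat(path):
    """Return True if os.stat() succeeds on path, False on any OSError."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def stat_all(paths, workers=8):
    """Stat every path from a thread pool (assumed, not confirmed, to
    resemble the build tool's access pattern).

    Returns (succeeded, failed, elapsed_seconds).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(try_stat, paths))
    elapsed = time.monotonic() - start
    succeeded = sum(results)
    return succeeded, len(results) - succeeded, elapsed
```

Calling `stat_all(list_files(root), workers=n)` on a directory hosted on AFS (with `root` and `n` chosen by whoever runs it) and comparing the elapsed time across runs might show whether stat calls alone are enough to trigger the stall, without the CPU-bound processes involved.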
My setup is as follows:

* OpenSUSE Tumbleweed, kernel 5.3.9-1-default, client packages `openafs-client` and `openafs-kmp-default` at `1.8.5_k5.3.9_1-1.3` as provided by OpenSUSE;
* `afsd` parameters (neither a memory cache (on `tmpfs`) nor a disk cache seems to help; nor does reducing the daemons from 4 to 1; encryption is off):

~~~~
-verbose -blocks 7864320 -chunksize 17 -files 524288 -files_per_subdir 128 -dcache 524288 -stat 524288 -volumes 128 -splitcache 90/10 -afsdb -dynroot-sparse -fakestat-all -inumcalc md5 -backuptree -daemons 1 -rxmaxfrags 8 -rxmaxmtu 1500 -rxpck 4096 -nosettime
~~~~

~~~~
-verbose -memcache -blocks 1048576 -chunksize 17 -stat 524288 -volumes 128 -splitcache 90/10 -afsdb -dynroot-sparse -fakestat-all -inumcalc md5 -backuptree -daemons 1 -rxmaxfrags 8 -rxmaxmtu 1500 -rxpck 4096 -nosettime
~~~~

* the server is also on OpenSUSE (Leap 15.0), with the `openafs-server` package at `1.8.0-lp150.2.2.1` as provided by OpenSUSE;
* I suspect the issue may be due to the latest kernel version, because I ran similar patterns a few weeks ago on an older kernel (still from the `5.x` family) without hitting this, but I can't say for sure;

I also tried the following:

* `fs flushall` seems to block just like the processes accessing the file-system;
* the only way to "kill" the stuck processes is to disconnect the network and let them time out;

Any pointers on how to diagnose this?

Thanks,
Ciprian.

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info