On Fri, 22 Nov 2013 16:28:04 -0500 Chris Garrison <ecgar...@iu.edu> wrote:
> The hosts' /usr/vice/etc/cacheinfo files look like this:
>
> /afs:/usr/vice/cache:7500000

[...]

> Something has been locking up the openafs client in the past month or
> so. The cache will show as more and more full in "df" and then at
> some point, AFS stops answering, and any attempt to do a directory
> listing or to access a file results in a zombie process.

Sorry if you haven't received any information on this yet; I can't look
at this for too long right now, but I can try to provide a little
information.

Is /usr/vice/cache its own partition? Do you mean cache usage fills up
the partition the cache is on, or that it just fills up to about the
size the cache is configured to? That is, does it fill up the disk, or
do you just mean it fills up the configured ~7.5G?

> What could cause that lockup? It's usually only on one host at a time,
> and seems like it will "move" from host to host, even returning to the
> same host in the same day after reboot once in awhile.

Presumably all accesses are waiting for something to get kicked out of
the cache, since the cache is full. But for whatever reason, the thread
responsible for kicking stuff out of the cache is not doing that.

> To me, it feels like maybe someone is forcing a huge file through and
> running the machine out of cache. Though if that's so, I wonder why it
> only just started happening after all these years. If nothing else, it
> seems like something new is going on with the user end that's causing
> it.

It's either someone reading or writing a bunch of data. At various
points in the past there have been problems when the cache is full of
data and we can't evict stuff from the cache because it's "in use" or
something like that. More recently there were some fixes to cache
eviction processing, but I'm not clear on whether that's relevant,
since I haven't seen a description of the problem those fixes were
addressing. They are included in 1.6.6pre1, though, if you wanted to
try that.
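One way to answer the partition-versus-configured-size question is to
compare the client's own accounting against "df". A small sketch: the
`cache_pct` helper and the sample numbers are illustrative (not from the
reporter's hosts), but the quoted line is the format `fs getcacheparms`
prints:

```shell
# On an affected host, compare the two views of the cache:
#   fs getcacheparms       # "AFS using N of the cache's available M 1K byte blocks."
#   df -k /usr/vice/cache  # does the partition itself fill up?
#
# Illustrative helper: turn a getcacheparms line into percent-used.
cache_pct() {
    # $1 is a line in the format fs getcacheparms prints
    echo "$1" | awk '{ printf "%d\n", ($3 * 100) / $8 }'
}
cache_pct "AFS using 7400000 of the cache's available 7500000 1K byte blocks."
```

If `fs getcacheparms` reports essentially full while "df" on the cache
partition still shows free space, the client is stuck at its configured
~7.5G limit rather than actually out of disk.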
> Any help would be appreciated, anything from a fix by limiting
> something in the openafs client or the cache, to ideas as to what
> someone could be doing. Because at this point, it's like a denial of
> service attack that's making lots of problems for us.

What you could get is an "fstrace" of the client while this problem is
going on (there are instructions on the list and elsewhere for how to
collect this, but ask if you need to), or a stack trace of the
CacheTruncateDaemon process. The latter you can get by installing the
kernel debuginfo package and then running 'crash' on the machine as the
problem is happening. Find the PID of the 'afs_cachetrim' process, and
run inside 'crash':

set <pid>
bt > /tmp/somefile

Or, if you don't want to bother or can't find the PID, or if you want
to be sure to capture _all_ possibly relevant information, just run

foreach bt > /tmp/somefile

instead, which will capture the stack trace of every process; that'll
take a little more time and CPU.

You can also just run 'cmdebug localhost' to see what processes are
hanging on, but I assume that will just show that they are hanging
waiting for cache items to be evicted. And running 'cmdebug' may never
complete if the client is wedged hard enough.

-- 
Andrew Deason
adea...@sinenomine.net
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
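The 'crash' steps above can be packaged into a command file so the
session is repeatable across hosts. A sketch, assuming the RHEL-style
crash(8) utility with the matching kernel debuginfo installed; the
output file names and the `make_crash_input` helper are illustrative,
not part of any standard tooling:

```shell
# Generate a crash(8) command file from the steps described above.
# As root, while the hang is in progress, feed it to crash, e.g.:
#   crash < /tmp/crash.in
make_crash_input() {
    # $1: PID of the afs_cachetrim thread (e.g. from: ps ax | grep afs_cachetrim)
    printf 'set %s\nbt > /tmp/cachetrim-bt.txt\nforeach bt > /tmp/all-bt.txt\nexit\n' "$1"
}
make_crash_input 1234 > /tmp/crash.in
cat /tmp/crash.in
```

This captures both the targeted backtrace of the cache-trimming thread
and the full "foreach bt" dump in one pass, so nothing is lost if the
machine has to be rebooted.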