*Summary:* getcwd() fails intermittently. When it fails, d_unlinked(pwd.dentry) <http://lxr.free-electrons.com/source/fs/dcache.c#L3250> is most likely returning true (i.e. the cwd dentry looks unhashed) for reasons still unknown; hence the ENOENT return from getcwd().
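
For comparison, here is a minimal sketch of the one legitimate way I know to get ENOENT out of getcwd() on Linux: removing the current directory out from under the process on a local filesystem. It has nothing to do with AFS, and the helper name getcwd_errno_after_rmdir is made up for illustration. In the AFS case nothing removed the directory, which is why the unhashing looks wrong.

```python
import errno
import os
import tempfile


def getcwd_errno_after_rmdir():
    """Return the errno getcwd() fails with once the cwd has been removed.

    Hypothetical helper for illustration only; returns 0 if getcwd()
    unexpectedly succeeds.
    """
    original = os.getcwd()
    doomed = tempfile.mkdtemp()  # absolute path to a fresh, empty directory
    try:
        os.chdir(doomed)
        os.rmdir(doomed)  # the cwd's dentry is now genuinely unlinked
        try:
            os.getcwd()
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        # Always go back so the caller is not left in a dead directory.
        os.chdir(original)


if __name__ == "__main__":
    err = getcwd_errno_after_rmdir()
    print(errno.errorcode.get(err, err))
```

On a local Linux filesystem I would expect this to report ENOENT, the same error git hits in the AFS case.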
*Status:* I ran an strace, filtering on open(), for a 'git log' command which seemed to reliably repeat it after the 3 minute script from Stephan.

strace on git log w/ and w/out cwd error
with cwd error: http://hastebin.com/dezowowoto.coffee
w/o cwd error: http://hastebin.com/nayojoqomo.coffee

The only substantive differences are line 45 (an extra mprotect() in the without cwd error case) and then the meat at line ~91-95 at the close().

96 getcwd("/vagrant/systap", 4096) = 16 | getcwd(0x7da600, 4096) = -1 ENOENT (No such file or directory)

The left-hand side above was done on a non-AFS fs, so the path is used as the identifier. The right-hand side is the failed AFS case. When that same getcwd() call succeeds with AFS, the arguments are exactly the same, but the return value is 0x16 and git reports that the directory is "Not a git repository".

*Next steps:*

1. Get SystemTap to recognize the OpenAFS kernel module and userland debug symbols.
2. Start probing the AFS lookups for this directory and the incorrect unhashing of it, particularly in osi_TryEvictVCache.

Any hints on SystemTap with custom built OpenAFS are solicited and welcome.

Cheers,
Joe

On Sat, Jul 16, 2016 at 11:39 PM, Benjamin Kaduk <ka...@mit.edu> wrote:

> On Fri, 15 Jul 2016, Joe Gorse wrote:
>
> > Stephan,
> >
> > Picking this back up, I am having difficulties repeating it consistently.
> > Debian 8.4, kernel 4.4.15, OpenAFS master
> > f14d263a73f0be75e4de92f62e836fb2e55680dd. I see the gerrit for reverting on
> > master is not in yet, so that's not it. Tried increasing the frequency of
> > afs_ShakeLooseVCaches.
> >
> > A smaller git repo (e.g. openafs-robotest) never seems to trip the CWD bug.
>
> Presumably because it does not create enough vcaches to go over a limit
> somewhere.
>
> > Test method:
> > [vagrant@openafs-debian-dev:/afs/.robotest/test] $ mkdir g; cd g; git clone git://gerrit.openafs.org/openafs.git;sleep 180;git log
> > Cloning into 'openafs'...
> > remote: Counting objects: 192945, done.
> > remote: Compressing objects: 100% (46882/46882), done.
> > remote: Total 192945 (delta 159381), reused 177218 (delta 145040)
> > Receiving objects: 100% (192945/192945), 71.80 MiB | 7.31 MiB/s, done.
> > Resolving deltas: 100% (159381/159381), done.
> > Checking connectivity... done.
> > Checking out files: 100% (5563/5563), done.
> > fatal: Unable to read current working directory: No such file or directory
> > [vagrant@openafs-debian-dev:/afs/.robotest/test/g] 3m31s 128 $ date
> > Fri Jul 15 21:39:16 UTC 2016
> > [vagrant@openafs-debian-dev:/afs/.robotest/test/g] $
> >
> > I don't see anything obvious in the log files, though I am not sure what I
> > would be looking for.
>
> I do not expect us to be logging anything useful in the current code; you
> will probably have to add some logging.
>
> -Ben
>

-- 
Joe Gorse
C: 440-552-0730
LI: Joe Gorse <http://www.linkedin.com/pub/joe-gorse/7/12/397>