Stephan,

Picking this back up, I am having difficulties repeating it consistently.

Debian 8.4, kernel 4.4.15, OpenAFS master f14d263a73f0be75e4de92f62e836fb2e55680dd.

I see the gerrit for reverting on master is not in yet, so that's not it.
Tried increasing the frequency of afs_ShakeLooseVCaches.

A smaller git repo (e.g. openafs-robotest) never seems to trip the CWD bug.

Test method:

[vagrant@openafs-debian-dev:/afs/.robotest/test] $ mkdir g; cd g; git clone git://gerrit.openafs.org/openafs.git;sleep 180;git log
Cloning into 'openafs'...
remote: Counting objects: 192945, done.
remote: Compressing objects: 100% (46882/46882), done.
remote: Total 192945 (delta 159381), reused 177218 (delta 145040)
Receiving objects: 100% (192945/192945), 71.80 MiB | 7.31 MiB/s, done.
Resolving deltas: 100% (159381/159381), done.
Checking connectivity... done.
Checking out files: 100% (5563/5563), done.
fatal: Unable to read current working directory: No such file or directory
[vagrant@openafs-debian-dev:/afs/.robotest/test/g] 3m31s 128 $ date
Fri Jul 15 21:39:16 UTC 2016
[vagrant@openafs-debian-dev:/afs/.robotest/test/g] $

I don't see anything obvious in the log files, though I am not sure what I
would be looking for.

FileLog:
==> /usr/afs/logs/FileLog <==
Fri Jul 15 21:37:43 2016 SAFS_GiveUpCallBacks (Noffids=32)
Fri Jul 15 21:37:43 2016 SAFS_GiveUpCallBacks (Noffids=32)
Fri Jul 15 21:37:43 2016 SAFS_GiveUpCallBacks (Noffids=32)
Fri Jul 15 21:37:43 2016 SAFS_GiveUpCallBacks (Noffids=17)
Fri Jul 15 21:38:51 2016 SRXAFS_FetchData, Fid = 536870915.1.1
Fri Jul 15 21:38:51 2016 SAFS_FetchStatus, Fid = 536870915.4.3, Host 10.0.2.15:7001, Id 1
Fri Jul 15 21:38:51 2016 SRXAFS_FetchData, Fid = 536870915.4.3
Fri Jul 15 21:38:51 2016 SRXAFS_FetchData, Fid = 536870918.1.1
Fri Jul 15 21:38:51 2016 SRXAFS_FetchData, Fid = 536870918.3.12879
Fri Jul 15 21:39:33 2016 SAFS_GiveUpCallBacks (Noffids=2)

So how do we repeat this more reliably?

Cheers,
Joe

On Sat, Jun 18, 2016 at 2:27 PM, Stephan Wiesand <stephan.wies...@desy.de> wrote:

> Joe,
>
> thanks for the feedback.
>
> On Jun 17, 2016, at 23:39 , Joe Gorse wrote:
> > FWIW, I am able to reproduce a "cwd" message for "git log" command on
> > Fedora 23, 4.5.6-200.fc23.x86_64. "git log" reads:
> >
> > fatal: Unable to read current working directory: No such file or directory
> >
> > Though it should read:
> >
> > fatal: Not a git repository (or any of the parent directories): .git
>
> Exactly the problem I'm seeing.
>
> > However, I am not having any trouble with the git checkout. It seems to
> > consistently work on Fedora 23. Even the "git checkout
> > openafs-stable-1_6_18". Perhaps try on 4.5.6 on Fedora?
>
> That's where I started, see the first message in this thread. And in all
> cases, the git checkout actually works. I'm just using it to trigger the
> cwd problem - there are probably many other ways. Note that also "git log"
> is just one way of exposing the problem.
>
> > Though I have seen some more of this issue on Debian 8 with kernel 4.6.0.
> > Three of three tests failed to checkout the openafs tree on this system. I
> > will test some other kernels on this system later and note anything
> > interesting.
>
> Sounds even considerably worse :-(
>
> Any errors logged? Does the client actually have some variant of gerrit
> 12228 applied?
>
> Cheers,
> Stephan
>
> > Cheers,
> > Joe
> >
> > On Fri, Jun 17, 2016 at 11:30 AM, Stephan Wiesand <stephan.wies...@desy.de>
> > wrote:
> >
> >>
> >> On Jun 17, 2016, at 04:45 , Benjamin Kaduk wrote:
> >>
> >>> On Thu, 16 Jun 2016, Stephan Wiesand wrote:
> >>>
> >>>> I smoke tested what was planned to be OpenAFS 1.6.18.1, as discussed in
> >>>> yesterday's release team meeting, on a Fedora 23 x86_64 VM with kernel
> >>>> 4.5.6-200 today. The result was disappointing:
> >>>>
> >>>> git clone git://gerrit.openafs.org/openafs.git
> >>>
> >>> Is the pwd the root of a volume?
> >>
> >> No, everything happens at least one level below.
> >>
> >>>> cd openafs
> >>>> git log
> >>>> # scrolled through a few dozen changes, took a couple of seconds
> >>>> git checkout openafs-stable-1_6_18
> >>>>
> >>>> At this point I got the following error:
> >>>>
> >>>> fatal: Unable to read current working directory: No such file or directory
> >>>>
> >>>> A "cd; cd -" cures this for a while, and there's no apparent data
> >>>> corruption. I'm still worried. The problem isn't 100% reproducible, but it
> >>>> doesn't take too many tries checking out random tags or branches.
> >>>>
> >>>> This was plain 1.6.18 + gerrit 12300 12301 12302 12274.
> >>>>
> >>>> Cache is on ext4, no separate partition, default size as set by our RPM
> >>>> (I think 100MB, but I don't have access to the VM right now to check).
> >>>>
> >>>> The small cache size may contribute to the problem. But I found no
> >>>> errors logged anywhere, and this shouldn't happen no matter how small the
> >>>> cache is.
> >>>
> >>> Please check if the cmdebug output is empty (I expect it is, but it is
> >>> good to check).
> >>
> >> It is empty.
> >>
> >>>> NB we have a user report of exactly this problem happening frequently
> >>>> while just editing files in a local git repo in AFS space. The data is a
> >>>> bit sketchy, but it's probably Ubuntu 14.04 with its current default kernel
> >>>> and the openafs packages from Anders' ppa. I'll try to get us more data.
> >>>>
> >>>>
> >>>> Any thoughts? For the time being I'm considering this a showstopper for
> >>>> 1.6.18.1, and it looks like we're not quite there yet regarding Linux
> >>>> 4.5, let alone 4.6 or the 4.7 due in a few weeks :-(
> >>>
> >>> Can you run the same test on a 4.4 kernel for comparison?
> >>
> >> I tried under the last F22 kernel, 4.4.6-200.fc22. And ok, it's not 4.5
> >> specific, though it seems to happen more frequently with 4.5.2 than with
> >> 4.4.6.
> >>
> >> By chance I found a pretty reliable reproducer:
> >>
> >> cd /vol/ume/root
> >> mkdir g; cd g
> >> git clone git://gerrit.openafs.org/openafs.git; sleep 180; git log
> >>
> >> Note indeed no "cd openafs". Of course this should complain about the cwd
> >> not being a git repo. But most of the time it will complain about the cwd
> >> issue instead.
> >>
> >> I'm planning to verify that plain 1.6.18 behaves the same on 4.4.6, and if
> >> it does I'll proceed with the 1.6.18.1 release.
> >>
> >> I couldn't reproduce this with any EL clients, but those have larger
> >> caches (it's indeed 100 MB on that Fedora VM), so there's more to test.
> >> Help welcome...
> >>
> >>
> >> _______________________________________________
> >> OpenAFS-devel mailing list
> >> OpenAFS-devel@openafs.org
> >> https://lists.openafs.org/mailman/listinfo/openafs-devel
> >>
> >
> >
> >
> > --
> > Joe Gorse
> >
> > C: 440-552-0730
> > LI: Joe Gorse <http://www.linkedin.com/pub/joe-gorse/7/12/397>
>
> --
> Stephan Wiesand
> DESY -DV-
> Platanenallee 6
> 15738 Zeuthen, Germany
>

--
Joe Gorse
C: 440-552-0730
LI: Joe Gorse <http://www.linkedin.com/pub/joe-gorse/7/12/397>
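
On the question of repeating this more reliably: one option is to loop the reproducer from this thread and count how often "git log" reports the cwd error instead of the expected "not a git repository" message. Below is a minimal sketch, assuming bash, git and GNU mktemp. The function names (classify, run_once, main) and the command-line arguments are made up for illustration and are not part of OpenAFS or git; only the clone URL, the 180 second sleep and the two error messages come from the thread. Whether running the steps from a subshell trips the bug as often as an interactive shell does is untested.

```shell
#!/bin/bash
# Sketch only: helper names and arguments are hypothetical.

# Map the first line of "git log" stderr output to a short label.
classify() {
    case "$1" in
        *"Unable to read current working directory"*) echo cwd-bug ;;
        *[Nn]"ot a git repository"*) echo expected ;;
        *) echo other ;;
    esac
}

# One pass of the reproducer in a fresh directory below $1:
# clone, wait, then run "git log" from the parent of the clone.
# The directory is left in place so a bad state can be inspected.
run_once() {
    local base=$1 repo=$2 delay=$3 dir msg
    dir=$(mktemp -d "$base/g.XXXXXX") || return 1
    (
        cd "$dir" || exit 1
        git clone -q "$repo" >/dev/null 2>&1 || { echo clone-failed; exit 0; }
        sleep "$delay"
        # Keep stderr, discard stdout, look at the first line only.
        msg=$(git log 2>&1 >/dev/null | head -n 1)
        classify "$msg"
    )
}

# Repeat the reproducer and print how many runs hit the cwd error.
main() {
    local base=$1 runs=${2:-10} delay=${3:-180}
    local repo=${4:-git://gerrit.openafs.org/openafs.git}
    local i result hits=0
    for ((i = 1; i <= runs; i++)); do
        result=$(run_once "$base" "$repo" "$delay")
        echo "run $i: $result"
        if [ "$result" = cwd-bug ]; then
            hits=$((hits + 1))
        fi
    done
    echo "cwd-bug in $hits of $runs runs"
}

# Only start cloning when a base directory is given on the command line.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Usage would be something like "bash cwd-repro.sh /afs/.robotest/test 5", where the script name is hypothetical. The base directory must not itself sit inside a git working tree, otherwise "git log" finds a repository, prints nothing on stderr, and the run is counted as "other".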