Andrew Deason <adea...@sinenomine.net> writes
] John Tang Boyland <boyl...@pabst.cs.uwm.edu> wrote:
]
] > [mdb]
]
] Thanks for those. I'm not sure myself what's going on, but perhaps some
] discussion will help...
]
] You appear to be running out of cache files, though, by the way. If you
] increase the size of your cache (or maybe even just the number of
] files), it may make this less likely to occur.

OK. I'll do that the next time we reboot. The cacheinfo is rather
small (25000K). (In fact, I guess that's why other people haven't
noticed the problem. Running with a 25MB disk cache is pretty
ridiculous.)

] > BTW:
] > process 17679 is the one writing the LONG file that seemed to
] > initiate the deadlock. I notice it is inside "FetchWholeEnchilada".
]
] It appears to have unlinked the file while it was open; does that sound
] correct?

Possibly: process 17679 is listed as "make test". I'm guessing the
user was noticing things were going slow and control-C'ed the make
process, and "make" decided to delete the output file. But I don't
know for sure.

] > fffffe8003244cb0 FetchWholeEnchilada+0xf4()
] > fffffe8003244d80 afs_remove+0x7eb()
]
] Can someone explain this, by the way? If I'm reading this correctly, we
] fetch/cache the entire file contents of a file if it's unlinked from
] under a process... Why?
]
] > fffffe8002fda5d0 swtch+0x110()
] > fffffe8002fda5f0 cv_wait+0x68()
] > fffffe8002fda640 afs_osi_Sleep+0x99()
] > fffffe8002fda6c0 Afs_Lock_Obtain+0x1cb()
] > fffffe8002fda780 afs_putpage+0x14a()
] > fffffe8002fda7f0 osi_VM_GetDownD+0xe8()
] > fffffe8002fda9c0 afs_GetDownD+0x7ed()
] > fffffe8002fdab90 afs_GetDCache+0x713()
]
] So, all of these are waiting to free up a dcache entry. I'm not in this
] code very much, but here's a guess... someone tell me if this makes any
] sense.
]
] What looks like may be possible is that some process locks vcache V1,
] and tries to get a dcache entry for it; it tries to create a new dcache
] entry and tries to free up a dcache entry (D1) because we're out. D1 has
] mapped pages (or whatever IFAnyPages means), and we need to invalidate
] the pages, so we need to lock D1's vcache. If D1's vcache is the same as
] vcache V1, we have deadlock. This makes sense to me to see while
] FetchWholeEnchilada is running, since fetching the later chunks may be
] trying to free up the earlier chunks fetched in the same file...
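
To make the guessed cycle above concrete, here is a toy user-space model in C. It is only a sketch of the hypothesis, not OpenAFS code: every name in it (toy_vcache, toy_dcache, toy_trylock, toy_evict, toy_get_chunk) is made up for illustration, and the lock is modelled as a plain flag with a non-blocking acquire so that the "would sleep forever" case returns false instead of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; none of these are real OpenAFS types or functions. */
struct toy_vcache {
    bool locked;               /* models a non-recursive exclusive lock */
};

struct toy_dcache {
    struct toy_vcache *owner;  /* file that this cached chunk belongs to */
    bool has_pages;            /* chunk still has mapped pages */
};

/* Non-blocking acquire: returns false where a real lock would sleep. */
static bool toy_trylock(struct toy_vcache *v)
{
    if (v->locked)
        return false;
    v->locked = true;
    return true;
}

static void toy_unlock(struct toy_vcache *v)
{
    assert(v->locked);
    v->locked = false;
}

/*
 * Evict one chunk. A chunk with mapped pages needs its owner's lock so the
 * pages can be invalidated first. Returns false if that lock is unavailable.
 */
static bool toy_evict(struct toy_dcache *d)
{
    if (d->has_pages) {
        if (!toy_trylock(d->owner))
            return false;      /* real code would sleep here */
        d->has_pages = false;  /* "invalidate" the pages */
        toy_unlock(d->owner);
    }
    d->owner = NULL;           /* slot is now free */
    return true;
}

/*
 * Models the path in the hypothesis: lock vcache v, then make room for a
 * new chunk by evicting victim while still holding v's lock.
 */
static bool toy_get_chunk(struct toy_vcache *v, struct toy_dcache *victim)
{
    bool ok;

    if (!toy_trylock(v))
        return false;
    ok = toy_evict(victim);    /* fails if victim->owner == v */
    toy_unlock(v);
    return ok;
}
```

In this model, evicting a chunk that belongs to a different file succeeds, while evicting a chunk of the same file that still has pages fails, because the lock it needs is already held by the caller. With a blocking lock, that second case is the self-deadlock described in the guess.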
]
] If that is plausible, I think potential solutions include dropping the
] V1 lock before GetDownD (I assume this isn't possible, or a lot of
] things assume this doesn't happen and is a lot of work to make right,
] etc)... or, passing the avc into GetDownD, and have GetDownD skip
] dcaches that need page invalidation that have the same vcache as the one
] passed in. That way we sleep and retry (although still while holding the
] V1 lock...)
]
] --
] Andrew Deason
] adea...@sinenomine.net

BTW: Is there any more useful information I could get from the machine
or can we reboot it? Please reply by email to boyl...@cs.uwm.edu.
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
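
For reference, the second suggestion in Andrew's message (pass the held vcache into the eviction scan and skip chunks that need page invalidation and belong to that same vcache) could look roughly like the following. This is a hedged sketch with invented names (toy_vcache, toy_dcache, toy_pick_victim), written as a stand-alone C function. It does not reflect the real GetDownD signature or its actual victim-selection logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the real OpenAFS structures. */
struct toy_vcache {
    int id;
};

struct toy_dcache {
    const struct toy_vcache *owner;  /* NULL means the slot is already free */
    bool has_pages;                  /* eviction would need owner's lock */
};

/*
 * Scan the slots for a chunk that is safe to evict while the caller holds
 * the lock on `held` (may be NULL if no vcache lock is held).
 * Returns the index of the first safe victim, or -1 if there is none, in
 * which case the caller would have to sleep and retry.
 */
static int toy_pick_victim(const struct toy_dcache *slots, size_t n,
                           const struct toy_vcache *held)
{
    assert(slots != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (slots[i].owner == NULL)
            continue;  /* nothing to evict in a free slot */
        if (slots[i].has_pages && slots[i].owner == held)
            continue;  /* would need the lock we already hold: skip */
        return (int)i;
    }
    return -1;
}
```

As the quoted message notes, when the scan finds nothing the caller still sleeps and retries while holding the V1 lock, so a sketch like this only avoids the self-deadlock. It does not guarantee progress when every remaining chunk belongs to the same file.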