Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.

Pondering further (with an assumption about the workload), isn't this
path also likely to involve a lease break in progress on that file?

Thanks,
Andy

--
Andrew W. Elble
awe...@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 8, 2016, at 1:22 PM, Jeff Layton wrote:

> On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
>> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
>>
>>> A simple way to confirm that might be to convert all of the read locks
>>> on the st_rwsem to write locks. That will serialize all of the open
>>> operations and should prevent that particular race from occurring.
>>>
>>> If that works, we'd probably want to fix it in a less heavy-handed way,
>>> but I'd have to think about how best to do that.
>>
>> So I looked at the call sites for nfs4_get_vfs_file(), how about
>> something like this:
>>
>> after we grab the fp->fi_lock, we can do
>> test_access(open->op_share_access, stp);
>>
>> If that returns true - just drop the spinlock and return EAGAIN.
>>
>> The call site in nfs4_upgrade_open() would handle that by retesting the
>> access map again and either coming back in or more likely reusing the
>> now updated stateid (synchronised by the fi_lock again).
>> We probably need to convert the whole access map testing there to be
>> under fi_lock.
>> Something like:
>>
>> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct
>> svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
>> {
>> 	__be32 status;
>> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
>>
>> again:
>> +	spin_lock(&fp->fi_lock);
>> 	if (!test_access(open->op_share_access, stp)) {
>> +		spin_unlock(&fp->fi_lock);
>> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
>> +		if (status == -EAGAIN)
>> +			goto again;
>> +		return status;
>> +	}
>>
>> 	/* test and set deny mode */
>> -	spin_lock(&fp->fi_lock);
>> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
>>
>> The call in nfsd4_process_open2() I think cannot hit this condition,
>> right? probably can add a WARN_ON there? BUG_ON? a more sensible
>> approach?
>>
>> Alternatively we can probably always call nfs4_get_vfs_file() under
>> this spinlock, just have it drop that for the open and then reobtain
>> (already done), not as transparent I guess.
>>
>
> Yeah, I think that might be best. It looks like things could change
> after you drop the spinlock with the patch above. Since we have to
> retake it anyway in nfs4_get_vfs_file, we can just do it there.
>
>> Or the fi_lock might be converted to say a mutex, so we can sleep with
>> it held and then we can hold it across the whole invocation of
>> nfs4_get_vfs_file() and access testing and stuff.
>
> I think we'd be better off taking the st_rwsem for write (maybe just
> turning it into a mutex). That would at least be per-stateid instead of
> per-inode. That's a fine fix for now.
>
> It might slow down a client slightly that is sending two stateid
> morphing operations in parallel, but they shouldn't affect each other.
> I'm liking that solution more and more here.
>
> Longer term, I think we need to further simplify OPEN handling. It has
> gotten better, but it's still really hard to follow currently (and is
> obviously error-prone).

The conversion to always take the rwsem for write is holding up nicely
so far (no other WARNs have triggered yet either).

I guess I'll do a patch converting it to a mutex, but also separately a
patch that just holds fi_lock more - I'll test that other one and if all
is well, submit it too, and let you choose which one you like the most ;)
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
>
> > A simple way to confirm that might be to convert all of the read locks
> > on the st_rwsem to write locks. That will serialize all of the open
> > operations and should prevent that particular race from occurring.
> >
> > If that works, we'd probably want to fix it in a less heavy-handed way,
> > but I'd have to think about how best to do that.
>
> So I looked at the call sites for nfs4_get_vfs_file(), how about
> something like this:
>
> after we grab the fp->fi_lock, we can do
> test_access(open->op_share_access, stp);
>
> If that returns true - just drop the spinlock and return EAGAIN.
>
> The call site in nfs4_upgrade_open() would handle that by retesting the
> access map again and either coming back in or more likely reusing the
> now updated stateid (synchronised by the fi_lock again).
> We probably need to convert the whole access map testing there to be
> under fi_lock.
> Something like:
>
> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct
> svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
> {
> 	__be32 status;
> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
>
> again:
> +	spin_lock(&fp->fi_lock);
> 	if (!test_access(open->op_share_access, stp)) {
> +		spin_unlock(&fp->fi_lock);
> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +		if (status == -EAGAIN)
> +			goto again;
> +		return status;
> +	}
>
> 	/* test and set deny mode */
> -	spin_lock(&fp->fi_lock);
> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
>
> The call in nfsd4_process_open2() I think cannot hit this condition,
> right? probably can add a WARN_ON there? BUG_ON? a more sensible
> approach?
>
> Alternatively we can probably always call nfs4_get_vfs_file() under
> this spinlock, just have it drop that for the open and then reobtain
> (already done), not as transparent I guess.
>

Yeah, I think that might be best. It looks like things could change
after you drop the spinlock with the patch above. Since we have to
retake it anyway in nfs4_get_vfs_file, we can just do it there.

> Or the fi_lock might be converted to say a mutex, so we can sleep with
> it held and then we can hold it across the whole invocation of
> nfs4_get_vfs_file() and access testing and stuff.

I think we'd be better off taking the st_rwsem for write (maybe just
turning it into a mutex). That would at least be per-stateid instead of
per-inode. That's a fine fix for now.

It might slow down a client slightly that is sending two stateid
morphing operations in parallel, but they shouldn't affect each other.
I'm liking that solution more and more here.

Longer term, I think we need to further simplify OPEN handling. It has
gotten better, but it's still really hard to follow currently (and is
obviously error-prone).
--
Jeff Layton
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.
>
> If that works, we'd probably want to fix it in a less heavy-handed way,
> but I'd have to think about how best to do that.

So I looked at the call sites for nfs4_get_vfs_file(), how about
something like this:

after we grab the fp->fi_lock, we can do
test_access(open->op_share_access, stp);

If that returns true - just drop the spinlock and return EAGAIN.

The call site in nfs4_upgrade_open() would handle that by retesting the
access map again and either coming back in or more likely reusing the
now updated stateid (synchronised by the fi_lock again).
We probably need to convert the whole access map testing there to be
under fi_lock.
Something like:

nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct
svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
{
	__be32 status;
	unsigned char old_deny_bmap = stp->st_deny_bmap;

again:
+	spin_lock(&fp->fi_lock);
	if (!test_access(open->op_share_access, stp)) {
+		spin_unlock(&fp->fi_lock);
+		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
+		if (status == -EAGAIN)
+			goto again;
+		return status;
+	}

	/* test and set deny mode */
-	spin_lock(&fp->fi_lock);
	status = nfs4_file_check_deny(fp, open->op_share_deny);

The call in nfsd4_process_open2() I think cannot hit this condition,
right? probably can add a WARN_ON there? BUG_ON? a more sensible
approach?

Alternatively we can probably always call nfs4_get_vfs_file() under this
spinlock, just have it drop that for the open and then reobtain (already
done), not as transparent I guess.

Or the fi_lock might be converted to say a mutex, so we can sleep with
it held and then we can hold it across the whole invocation of
nfs4_get_vfs_file() and access testing and stuff.
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
>>
>>> That said, this code is quite subtle. I'd need to look over it in more
>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>> else wants to sanity check my analysis there.
>>>
>>> Yeah, I think you're right. It's fine since r/w opens have a distinct
>>> slot, even though the refcounting just tracks the number of read and
>>> write references. So yeah, the leak probably is in an error path
>>> someplace, or maybe a race someplace.
>>
>> So I noticed that set_access is always called locked, but clear_access
>> is not, this does not sound right.
>>
>> So I placed this strategic WARN_ON:
>>
>> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>> 		goto out_put_access;
>> 	spin_lock(&fp->fi_lock);
>> 	if (!fp->fi_fds[oflag]) {
>> +		WARN_ON(!test_access(open->op_share_access, stp));
>> 		fp->fi_fds[oflag] = filp;
>> 		filp = NULL;
>>
>> This is right in the place where nfsd set the access flag already,
>> discovered that the file is not opened and went on to open it, yet
>> some parallel thread came in and cleared the flag by the time we got
>> the file opened.
>> It did trigger (but there are 30 minutes left till the test finishes,
>> so I don't know yet if this will correspond to the problem at hand;
>> below is speculation).
>>
>> Now, at the exit from this function, the state will not have access
>> for this file set and the file would be leaked, since the matching
>> call, probably in release_all_access(), does:
>>
>> 	if (test_access(i, stp))
>> 		nfs4_file_put_access(stp->st_stid.sc_file, i);
>> 	clear_access(i, stp);
>>
>> Meaning that the refcount is not dropped due to no access flag set.
>>
>> I imagine we can just set the flag back right after setting
>> fp->fi_fds[oflag] and that particular race would be gone?
>>
>> Would that be ok, or is it just wrong that somebody can come and drop
>> (delegation downgrade seems to be the most likely suspect) access bits
>> like this while they are in the middle of being acquired?
>>
>
> I think you're on the right track...
>
> Operations that morph the stateid's seqid should be serialized via the
> st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
> rwsem, but we do allow OPENs to run in parallel with each other, under
> the assumption that on success you always get at least what you
> requested. So, I don't think that OPEN can really race with an
> OPEN_DOWNGRADE or CLOSE like you're thinking.
>
> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.

It looks like the race plays in reverse to what I thought. My assert
when setting fi_fds to not NULL never hit, but I also had this warning
below, and that DID hit and is a 100% correlation to the later leak:

set_access(u32 access, struct nfs4_ol_stateid *stp)
{
	unsigned char mask = 1 << access;

	WARN_ON_ONCE(access > NFS4_SHARE_ACCESS_BOTH);
	WARN_ON(stp->st_access_bmap & mask);
	stp->st_access_bmap |= mask;
}

So we are setting an access bit that is already set (this comes from
nfs4_get_vfs_file() via nfs4_upgrade_open()) twice, then one of the
callers drops theirs, and so the subsequent state is not dropping its
file reference in release_all_access() because it doesn't see the
access bit - something like that. Though I don't see why it cannot play
like I originally envisioned too.

Hm... So if it always hits when the set_access warning triggers, and
there's no error on either path, then there's no explicit call to
nfs4_file_put_access(), and that's a leak since release_all_access()
would only release the file once?

I wonder if instead, on the success exit path, we can check the
old_access_bmap and, if our current access is also set there, release
it? This is ok, right? If the other caller succeeded, they still keep
the reference. If the other caller failed, they drop theirs and restore
their idea of the access bmap, which presumably does not have the bit
set, and drop their reference anyway.

The question is whether both fail, or only they fail. If both fail and
we win the race on restoring old access - we might set access wrongly,
because theirs had it correct. Similarly, if they fail and we don't,
they can reset our access.

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
>
> > That said, this code is quite subtle. I'd need to look over it in more
> > detail before I offer up any fixes. I'd also appreciate it if anyone
> > else wants to sanity check my analysis there.
> >
> > Yeah, I think you're right. It's fine since r/w opens have a distinct
> > slot, even though the refcounting just tracks the number of read and
> > write references. So yeah, the leak probably is in an error path
> > someplace, or maybe a race someplace.
>
> So I noticed that set_access is always called locked, but clear_access
> is not, this does not sound right.
>
> So I placed this strategic WARN_ON:
>
> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
> 		goto out_put_access;
> 	spin_lock(&fp->fi_lock);
> 	if (!fp->fi_fds[oflag]) {
> +		WARN_ON(!test_access(open->op_share_access, stp));
> 		fp->fi_fds[oflag] = filp;
> 		filp = NULL;
>
> This is right in the place where nfsd set the access flag already,
> discovered that the file is not opened and went on to open it, yet some
> parallel thread came in and cleared the flag by the time we got the
> file opened.
> It did trigger (but there are 30 minutes left till the test finishes,
> so I don't know yet if this will correspond to the problem at hand;
> below is speculation).
>
> Now, at the exit from this function, the state will not have access for
> this file set and the file would be leaked, since the matching call,
> probably in release_all_access(), does:
>
> 	if (test_access(i, stp))
> 		nfs4_file_put_access(stp->st_stid.sc_file, i);
> 	clear_access(i, stp);
>
> Meaning that the refcount is not dropped due to no access flag set.
>
> I imagine we can just set the flag back right after setting
> fp->fi_fds[oflag] and that particular race would be gone?
>
> Would that be ok, or is it just wrong that somebody can come and drop
> (delegation downgrade seems to be the most likely suspect) access bits
> like this while they are in the middle of being acquired?

I think you're on the right track...

Operations that morph the stateid's seqid should be serialized via the
st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
rwsem, but we do allow OPENs to run in parallel with each other, under
the assumption that on success you always get at least what you
requested. So, I don't think that OPEN can really race with an
OPEN_DOWNGRADE or CLOSE like you're thinking.

But...suppose we have two racing OPEN calls. They're both in
nfs4_get_vfs_file. One opens the file and succeeds and the other fails
and ends up in out_put_access. At that point, you could end up
clobbering the successful update to st_access_bmap from the other task,
and we'd end up not putting the file access references in
release_all_access.

A simple way to confirm that might be to convert all of the read locks
on the st_rwsem to write locks. That will serialize all of the open
operations and should prevent that particular race from occurring.

If that works, we'd probably want to fix it in a less heavy-handed way,
but I'd have to think about how best to do that.
--
Jeff Layton
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 7, 2016, at 10:22 PM, Oleg Drokin wrote: > > On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote: > That said, this code is quite subtle. I'd need to look over it in more detail before I offer up any fixes. I'd also appreciate it if anyone else wants to sanity check my analysis there. >> Yeah, I think you're right. It's fine since r/w opens have a distinct >> slot, even though the refcounting just tracks the number of read and >> write references. So yeah, the leak probably is in an error path >> someplace, or maybe a race someplace. > > So I noticed that set_access is always called locked, but clear_access is not, > this does not sound right. > > So I placed this strategic WARN_ON: > @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, > struct nfs4_file *fp, >goto out_put_access; >spin_lock(&fp->fi_lock); >if (!fp->fi_fds[oflag]) { > +WARN_ON(!test_access(open->op_share_access, stp)); >fp->fi_fds[oflag] = filp; >filp = NULL; > > This is right in the place where nfsd set the access flag already, discovered > that the file is not opened and went on to open it, yet some parallel thread > came in and cleared the flag by the time we got the file opened. > It did trigger (but there are 30 minutes left till test finish, so I don't > know yet if this will correspond to the problem at hand yet, so below is > speculation). Duh, I looked for a warning, but did not cross reference, and it was not this one that hit yet. Though apparently I am hitting some of the "impossible" warnings, so you might want to look into that anyway. status = nfsd4_process_open2(rqstp, resfh, open); WARN(status && open->op_created, "nfsd4_process_open2 failed to open newly-created file! status=%u\n", be32_to_cpu(status)); and filp = find_readable_file(fp); if (!filp) { /* We should always have a readable file here */ WARN_ON_ONCE(1); locks_free_lock(fl); return -EBADF; }
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote: >>> That said, this code is quite subtle. I'd need to look over it in more >>> detail before I offer up any fixes. I'd also appreciate it if anyone >>> else wants to sanity check my analysis there. >>> > Yeah, I think you're right. It's fine since r/w opens have a distinct > slot, even though the refcounting just tracks the number of read and > write references. So yeah, the leak probably is in an error path > someplace, or maybe a race someplace. So I noticed that set_access is always called locked, but clear_access is not, this does not sound right. So I placed this strategic WARN_ON: @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp, goto out_put_access; spin_lock(&fp->fi_lock); if (!fp->fi_fds[oflag]) { +WARN_ON(!test_access(open->op_share_access, stp)); fp->fi_fds[oflag] = filp; filp = NULL; This is right in the place where nfsd set the access flag already, discovered that the file is not opened and went on to open it, yet some parallel thread came in and cleared the flag by the time we got the file opened. It did trigger (but there are 30 minutes left till test finish, so I don't know yet if this will correspond to the problem at hand yet, so below is speculation). Now, at the exit from this function, the state will not have access for this file set and the file would be leaked, since the matching call, probably in release_all_access() does: if (test_access(i, stp)) nfs4_file_put_access(stp->st_stid.sc_file, i); clear_access(i, stp); Meaning that the refcount is not dropped due to no access flag set. I imagine we can just set the flag back right after setting fp->fi_fds[oflag] and that particular race would be gone? Would that be ok, or is it just wrong that somebody can come and drop (delegation downgrade seems to be the most likely suspect) access bits like this while they are in the middle of being acquired?
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote: > On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote: >> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote: >>> So, the way this is supposed to work is that the stateids each hold a >>> reference to the nfs4_file. They also take an fi_access reference for >>> the open modes that they represent. This is how we know when it's OK to >>> close one of the files in the fd[] slots. It might be interesting to >>> also print out the fi_access counters if the slot hasn't been zeroed >>> out if you feel up to it. >> Ok, I have primed that and will let you know next time it triggers. > Thanks. Maybe that will give us some idea of where to look… Well, the counters are: 0 1 0 at the warning site, which is what makes sense, really. >>> I think though that the open upgrade accounting may be wrong. >>> __nfs4_file_get_access does an unconditional atomic_inc, so if the >>> client does a open for write and then opens again for read/write, then >>> we may end up with too many write fi_access references taken. >> >> The read-write access is all fine I think, because when dropping read or >> write >> it also checks if the readwrite is filled? >> Also whoever did that atomic_inc for get_access would eventually do put, >> and that should free the file, no? >> I was mostly thinking there's an error exit path somewhere that forgets to do >> the put. >> This also would explain the other pieces leaked that are not directly >> attached >> into the nfs4_file. >> >>> >>> That said, this code is quite subtle. I'd need to look over it in more >>> detail before I offer up any fixes. I'd also appreciate it if anyone >>> else wants to sanity check my analysis there. >>> > > Yeah, I think you're right. It's fine since r/w opens have a distinct > slot, even though the refcounting just tracks the number of read and > write references. So yeah, the leak probably is in an error path > someplace, or maybe a race someplace. 
I guess I'll try to put some tracers in the code to try and match refcount takers vs refcount droppers for lack of better ideas. Hopefully that'll lead us to see who got the count that was not dropped.
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote: > On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote: > > > > > On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote: > > > > > > On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote: > > > > > > > > > > > > > > > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote: > > > > > > > > > > > > > > > Hello! > > > > > > > > > > I've been trying to better understand this problem I was having > > > > > where sometimes > > > > > a formerly NFS-exported mountpoint becomes unmountable (after nfsd > > > > > stop). > > > > > > > > > > I finally traced it to a leaked filedescriptor that was allocated > > > > > from > > > > > > > > > > nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open(). > > > > > > > > > > Also together with it we see leaked credentials allocated along > > > > > the same path from > > > > > fh_verify() and groups allocated from > > > > > svcauth_unix_accept()->groups_alloc() that > > > > > are presumably used by the credentials. > > > > > > > > > > Unfortunately I was not able to make total sense out of the state > > > > > handling in nfsd, > > > > > but it's clear that one of the file descriptors inside struct > > > > > nfs4_file is > > > > > lost. 
I added a patch like this (always a good idea, so surprised > > > > > it was not > > > > > there already): > > > > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head > > > > > *rcu) > > > > > { > > > > > struct nfs4_file *fp = container_of(rcu, struct nfs4_file, > > > > > fi_rcu); > > > > > > > > > > + WARN_ON(fp->fi_fds[0]); > > > > > + WARN_ON(fp->fi_fds[1]); > > > > > + WARN_ON(fp->fi_fds[2]); > > > > > kmem_cache_free(file_slab, fp); > > > > > } > > > > > > > > > > And when the problem is hit, I am also triggering (Always this one > > > > > which is fd[1]) > > > > > [ 3588.143002] [ cut here ] > > > > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at > > > > > /home/green/bk/linux/fs/nfsd/nfs4state.c:278 > > > > > nfsd4_free_file_rcu+0x65/0x80 [nfsd] > > > > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev > > > > > acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm > > > > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm > > > > > floppy serio_raw virtio_blk > > > > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted > > > > > 4.7.0-rc1-vm-nfs+ #120 > > > > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > > > > [ 3588.153830] 0286 e2d5ccdf 88011965bd50 > > > > > 814a11a5 > > > > > [ 3588.153832] 88011965bd90 > > > > > 8108806b > > > > > [ 3588.153834] 0116 8800c476a0b8 8800c476a048 > > > > > c0110fc0 > > > > > [ 3588.153834] Call Trace: > > > > > [ 3588.153839] [] dump_stack+0x86/0xc1 > > > > > [ 3588.153841] [] __warn+0xcb/0xf0 > > > > > [ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd] > > > > > [ 3588.153853] [] warn_slowpath_null+0x1d/0x20 > > > > > [ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd] > > > > > [ 3588.153861] [] rcu_nocb_kthread+0x335/0x510 > > > > > [ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510 > > > > > [ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0 > > > > > [ 3588.153866] [] kthread+0x101/0x120 > > > > > [ 3588.153868] [] ? 
trace_hardirqs_on_caller+0xf4/0x1b0 > > > > > [ 3588.153871] [] ret_from_fork+0x1f/0x40 > > > > > [ 3588.153872] [] ? kthread_create_on_node+0x250/0x250 > > > > > > > > > > > > > > > release_all_access() seems to be doing correct job of all that > > > > > cleaning, so > > > > > there must be some other path that I do not quite see. > > > > > > > > > > Hopefully you are more familiar with the code and can see the > > > > > problem right away ;) > > > > Hmm...well I'm not seeing it right offhand, and haven't been able to > > > > reproduce the problem so far after a couple of attempts by hand. What > > > > sort of workload are you running before you see that warning pop? > > > Well, the workload is a crazed twisty conflicting operations set in a > > > very limited > > > namespace (called racer.sh, > > > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer > > > ) > > > > > > It does all sorts of crazy unimaginable stuff that is hard to predict, > > > and typically I > > > run it on several mountpoints all pointing to the same spot > > > (now 3-ways: one instance on direct mount point in ext4, one on nfs mount > > > mounted as > > > mount -o nfs and one on mount -o nfs4). > > > It must be hitting some rare race or a race + operations combination that > > > leads to > > > the leak. > > > > > > To reproduce I run it like this until it stops (and it usually stops in > > > 2 or 3 iterations, so it takes some time): > > > $ cat ~/stress.sh > > >
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote: > On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote: >> On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote: >> >>> >>> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote: Hello! I've been trying to better understand this problem I was having where sometimes a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop). I finally traced it to a leaked filedescriptor that was allocated from nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open(). Also together with it we see leaked credentials allocated along the same path from fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that are presumably used by the credentials. Unfortunately I was not able to make total sense out of the state handling in nfsd, but it's clear that one of the file descriptors inside struct nfs4_file is lost. I added a patch like this (always a good idea, so surprised it was not there already): @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu) { struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu); + WARN_ON(fp->fi_fds[0]); + WARN_ON(fp->fi_fds[1]); + WARN_ON(fp->fi_fds[2]); kmem_cache_free(file_slab, fp); } And when the problem is hit, I am also triggering (Always this one which is fd[1]) [ 3588.143002] [ cut here ] [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd] [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120 [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 3588.153830] 0286 e2d5ccdf 88011965bd50 814a11a5 [ 3588.153832] 88011965bd90 8108806b [ 3588.153834] 0116 8800c476a0b8 8800c476a048 c0110fc0 [ 3588.153834] Call 
Trace: [ 3588.153839] [] dump_stack+0x86/0xc1 [ 3588.153841] [] __warn+0xcb/0xf0 [ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd] [ 3588.153853] [] warn_slowpath_null+0x1d/0x20 [ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd] [ 3588.153861] [] rcu_nocb_kthread+0x335/0x510 [ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510 [ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0 [ 3588.153866] [] kthread+0x101/0x120 [ 3588.153868] [] ? trace_hardirqs_on_caller+0xf4/0x1b0 [ 3588.153871] [] ret_from_fork+0x1f/0x40 [ 3588.153872] [] ? kthread_create_on_node+0x250/0x250 release_all_access() seems to be doing correct job of all that cleaning, so there must be some other path that I do not quite see. Hopefully you are more familiar with the code and can see the problem right away ;) >>> Hmm...well I'm not seeing it right offhand, and haven't been able to >>> reproduce the problem so far after a couple of attempts by hand. What >>> sort of workload are you running before you see that warning pop? >> Well, the workload is a crazed twisty conflicting operations set in a very >> limited >> namespace (called racer.sh, >> http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer >> ) >> >> It does all sorts of crazy unimaginable stuff that is hard to predict, and >> typically I >> run it on several mountpoints all pointing to the same spot >> (now 3-ways: one instance on direct mount point in ext4, one on nfs mount >> mounted as >> mount -o nfs and one on mount -o nfs4). >> It must be hitting some rare race or a race + operations combination that >> leads to >> the leak. 
>> >> To reproduce I run it like this until it stops (and it usually stops in >> 2 or 3 iterations, so it takes some time): >> $ cat ~/stress.sh >> #!/bin/bash >> >> set -x >> >> cd /home/green/git/lustre-release/lustre/tests/racer >> dd if=/dev/zero of=/tmp/loop bs=1024k count=1024 >> mkfs.ext4 /tmp/loop >> service rpcbind start >> mount none /var/lib/nfs -t tmpfs >> touch /var/lib/nfs/etab >> service nfs-mountd start >> >> while :; do >> >> mount /tmp/loop /mnt/lustre -o loop || exit 1 >> mkdir /mnt/lustre/racer >> service nfs-server start || exit 2 >> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3 >> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4 >> DURATION=3600 sh racer.sh /mnt/nfs/racer & >> DURATION=3600
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote: > On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote: > > > > > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote: > > > > > > Hello! > > > > > > I've been trying to better understand this problem I was having where > > > sometimes > > > a formerly NFS-exported mountpoint becomes unmountable (after nfsd > > > stop). > > > > > > I finally traced it to a leaked filedescriptor that was allocated from > > > nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open(). > > > > > > Also together with it we see leaked credentials allocated along the > > > same path from > > > fh_verify() and groups allocated from > > > svcauth_unix_accept()->groups_alloc() that > > > are presumably used by the credentials. > > > > > > Unfortunately I was not able to make total sense out of the state > > > handling in nfsd, > > > but it's clear that one of the file descriptors inside struct > > > nfs4_file is > > > lost. I added a patch like this (always a good idea, so surprised it > > > was not > > > there already): > > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu) > > > { > > > struct nfs4_file *fp = container_of(rcu, struct nfs4_file, > > > fi_rcu); > > > > > > + WARN_ON(fp->fi_fds[0]); > > > + WARN_ON(fp->fi_fds[1]); > > > + WARN_ON(fp->fi_fds[2]); > > > kmem_cache_free(file_slab, fp); > > > } > > > > > > And when the problem is hit, I am also triggering (Always this one > > > which is fd[1]) > > > [ 3588.143002] [ cut here ] > > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at > > > /home/green/bk/linux/fs/nfsd/nfs4state.c:278 > > > nfsd4_free_file_rcu+0x65/0x80 [nfsd] > > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev > > > acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm > > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy > > > serio_raw virtio_blk > > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ > > > #120 > 
> > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > > [ 3588.153830] 0286 e2d5ccdf 88011965bd50 > > > 814a11a5 > > > [ 3588.153832] 88011965bd90 > > > 8108806b > > > [ 3588.153834] 0116 8800c476a0b8 8800c476a048 > > > c0110fc0 > > > [ 3588.153834] Call Trace: > > > [ 3588.153839] [] dump_stack+0x86/0xc1 > > > [ 3588.153841] [] __warn+0xcb/0xf0 > > > [ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd] > > > [ 3588.153853] [] warn_slowpath_null+0x1d/0x20 > > > [ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd] > > > [ 3588.153861] [] rcu_nocb_kthread+0x335/0x510 > > > [ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510 > > > [ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0 > > > [ 3588.153866] [] kthread+0x101/0x120 > > > [ 3588.153868] [] ? trace_hardirqs_on_caller+0xf4/0x1b0 > > > [ 3588.153871] [] ret_from_fork+0x1f/0x40 > > > [ 3588.153872] [] ? kthread_create_on_node+0x250/0x250 > > > > > > > > > release_all_access() seems to be doing correct job of all that > > > cleaning, so > > > there must be some other path that I do not quite see. > > > > > > Hopefully you are more familiar with the code and can see the problem > > > right away ;) > > Hmm...well I'm not seeing it right offhand, and haven't been able to > > reproduce the problem so far after a couple of attempts by hand. What > > sort of workload are you running before you see that warning pop? > Well, the workload is a crazed twisty conflicting operations set in a very > limited > namespace (called racer.sh, > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer > ) > > It does all sorts of crazy unimaginable stuff that is hard to predict, and > typically I > run it on several mountpoints all pointing to the same spot > (now 3-ways: one instance on direct mount point in ext4, one on nfs mount > mounted as > mount -o nfs and one on mount -o nfs4). 
> It must be hitting some rare race or a race + operations combination that > leads to > the leak. > > To reproduce I run it like this until it stops (and it usually stops in > 2 or 3 iterations, so it takes some time): > $ cat ~/stress.sh > #!/bin/bash > > set -x > > cd /home/green/git/lustre-release/lustre/tests/racer > dd if=/dev/zero of=/tmp/loop bs=1024k count=1024 > mkfs.ext4 /tmp/loop > service rpcbind start > mount none /var/lib/nfs -t tmpfs > touch /var/lib/nfs/etab > service nfs-mountd start > > while :; do > > mount /tmp/loop /mnt/lustre -o loop || exit 1 > mkdir /mnt/lustre/racer > service nfs-server start || exit 2 > mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3 > mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4 > DURATION=3600 sh racer.sh /mnt/nfs/racer & > DURATION=3600 sh racer.sh
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote: > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote: >> Hello! >> >>I've been trying to better understand this problem I was having where >> sometimes >>a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop). >> >>I finally traced it to a leaked filedescriptor that was allocated from >>nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open(). >> >>Also together with it we see leaked credentials allocated along the same >> path from >>fh_verify() and groups allocated from >> svcauth_unix_accept()->groups_alloc() that >>are presumably used by the credentials. >> >>Unfortunately I was not able to make total sense out of the state >> handling in nfsd, >>but it's clear that one of the file descriptors inside struct nfs4_file is >>lost. I added a patch like this (always a good idea, so surprised it was >> not >>there already): >> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu) >> { >> struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu); >> >> + WARN_ON(fp->fi_fds[0]); >> + WARN_ON(fp->fi_fds[1]); >> + WARN_ON(fp->fi_fds[2]); >> kmem_cache_free(file_slab, fp); >> } >> >>And when the problem is hit, I am also triggering (Always this one which >> is fd[1]) >> [ 3588.143002] [ cut here ] >> [ 3588.143662] WARNING: CPU: 5 PID: 9 at >> /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 >> [nfsd] >> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq >> tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper >> syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk >> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120 >> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 >> [ 3588.153830] 0286 e2d5ccdf 88011965bd50 >> 814a11a5 >> [ 3588.153832] 88011965bd90 >> 8108806b >> [ 3588.153834] 0116 8800c476a0b8 8800c476a048 >> c0110fc0 >> [ 
3588.153834] Call Trace: >> [ 3588.153839] [] dump_stack+0x86/0xc1 >> [ 3588.153841] [] __warn+0xcb/0xf0 >> [ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd] >> [ 3588.153853] [] warn_slowpath_null+0x1d/0x20 >> [ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd] >> [ 3588.153861] [] rcu_nocb_kthread+0x335/0x510 >> [ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510 >> [ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0 >> [ 3588.153866] [] kthread+0x101/0x120 >> [ 3588.153868] [] ? trace_hardirqs_on_caller+0xf4/0x1b0 >> [ 3588.153871] [] ret_from_fork+0x1f/0x40 >> [ 3588.153872] [] ? kthread_create_on_node+0x250/0x250 >> >> >> release_all_access() seems to be doing correct job of all that cleaning, so >> there must be some other path that I do not quite see. >> >> Hopefully you are more familiar with the code and can see the problem >> right away ;) > > Hmm...well I'm not seeing it right offhand, and haven't been able to > reproduce the problem so far after a couple of attempts by hand. What > sort of workload are you running before you see that warning pop? Well, the workload is a crazed twisty conflicting operations set in a very limited namespace (called racer.sh, http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer ) It does all sorts of crazy unimaginable stuff that is hard to predict, and typically I run it on several mountpoints all pointing to the same spot (now 3-ways: one instance on direct mount point in ext4, one on nfs mount mounted as mount -o nfs and one on mount -o nfs4). It must be hitting some rare race or a race + operations combination that leads to the leak. 
To reproduce I run it like this until it stops (and it usually stops in
2 or 3 iterations, so it takes some time):

$ cat ~/stress.sh
#!/bin/bash

set -x

cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start

while :; do

mount /tmp/loop /mnt/lustre -o loop || exit 1
mkdir /mnt/lustre/racer
service nfs-server start || exit 2
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
DURATION=3600 sh racer.sh /mnt/nfs/racer &
DURATION=3600 sh racer.sh /mnt/nfs2/racer &
DURATION=3600 sh racer.sh /mnt/lustre/racer &
wait %1 %2 %3
umount /mnt/nfs || exit 5
umount /mnt/nfs2 || exit 6
service nfs-server stop || exit 7
umount /mnt/lustre || exit 8
e2fsck -n -f /tmp/loop || exit 9
done

Please ignore all the Lustre references, the test is generic, it's just I am
running off an nfs-readonly-root setup, so I want to reuse existing
mountpoint locations,
Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote: > Hello! > > I've been trying to better understand this problem I was having where > sometimes > a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop). > > I finally traced it to a leaked filedescriptor that was allocated from > nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open(). > > Also together with it we see leaked credentials allocated along the same > path from > fh_verify() and groups allocated from > svcauth_unix_accept()->groups_alloc() that > are presumably used by the credentials. > > Unfortunately I was not able to make total sense out of the state handling > in nfsd, > but it's clear that one of the file descriptors inside struct nfs4_file is > lost. I added a patch like this (always a good idea, so surprised it was > not > there already): > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu) > { > struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu); > > + WARN_ON(fp->fi_fds[0]); > + WARN_ON(fp->fi_fds[1]); > + WARN_ON(fp->fi_fds[2]); > kmem_cache_free(file_slab, fp); > } > > And when the problem is hit, I am also triggering (Always this one which > is fd[1]) > [ 3588.143002] [ cut here ] > [ 3588.143662] WARNING: CPU: 5 PID: 9 at > /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 > [nfsd] > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq > tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper > syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120 > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 3588.153830] 0286 e2d5ccdf 88011965bd50 > 814a11a5 > [ 3588.153832] 88011965bd90 > 8108806b > [ 3588.153834] 0116 8800c476a0b8 8800c476a048 > c0110fc0 > [ 3588.153834] Call Trace: > [ 3588.153839] [] dump_stack+0x86/0xc1 > [ 3588.153841] [] 
__warn+0xcb/0xf0 > [ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd] > [ 3588.153853] [] warn_slowpath_null+0x1d/0x20 > [ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd] > [ 3588.153861] [] rcu_nocb_kthread+0x335/0x510 > [ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510 > [ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0 > [ 3588.153866] [] kthread+0x101/0x120 > [ 3588.153868] [] ? trace_hardirqs_on_caller+0xf4/0x1b0 > [ 3588.153871] [] ret_from_fork+0x1f/0x40 > [ 3588.153872] [] ? kthread_create_on_node+0x250/0x250 > > > release_all_access() seems to be doing correct job of all that cleaning, so > there must be some other path that I do not quite see. > > Hopefully you are more familiar with the code and can see the problem right > away ;) > > Bye, > Oleg Hmm...well I'm not seeing it right offhand, and haven't been able to reproduce the problem so far after a couple of attempts by hand. What sort of workload are you running before you see that warning pop? -- Jeff Layton
Files leak from nfsd in 4.7.1-rc1 (and more?)
Hello!

I've been trying to better understand this problem I was having where
sometimes a formerly NFS-exported mountpoint becomes unmountable (after
nfsd stop).

I finally traced it to a leaked filedescriptor that was allocated from
nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().

Also together with it we see leaked credentials allocated along the same
path from fh_verify() and groups allocated from
svcauth_unix_accept()->groups_alloc() that are presumably used by the
credentials.

Unfortunately I was not able to make total sense out of the state handling
in nfsd, but it's clear that one of the file descriptors inside struct
nfs4_file is lost. I added a patch like this (always a good idea, so
surprised it was not there already):

@@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 {
 	struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);

+	WARN_ON(fp->fi_fds[0]);
+	WARN_ON(fp->fi_fds[1]);
+	WARN_ON(fp->fi_fds[2]);
 	kmem_cache_free(file_slab, fp);
 }

And when the problem is hit, I am also triggering this (always this one,
which is fd[1]):

[ 3588.143002] [ cut here ]
[ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
[ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
[ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
[ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3588.153830] 0286 e2d5ccdf 88011965bd50 814a11a5
[ 3588.153832] 88011965bd90 8108806b
[ 3588.153834] 0116 8800c476a0b8 8800c476a048 c0110fc0
[ 3588.153834] Call Trace:
[ 3588.153839] [] dump_stack+0x86/0xc1
[ 3588.153841] [] __warn+0xcb/0xf0
[ 3588.153852] [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
[ 3588.153853] [] warn_slowpath_null+0x1d/0x20
[ 3588.153859] [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
[ 3588.153861] [] rcu_nocb_kthread+0x335/0x510
[ 3588.153862] [] ? rcu_nocb_kthread+0x27f/0x510
[ 3588.153863] [] ? rcu_cpu_notify+0x3e0/0x3e0
[ 3588.153866] [] kthread+0x101/0x120
[ 3588.153868] [] ? trace_hardirqs_on_caller+0xf4/0x1b0
[ 3588.153871] [] ret_from_fork+0x1f/0x40
[ 3588.153872] [] ? kthread_create_on_node+0x250/0x250

release_all_access() seems to be doing the correct job of all that
cleaning, so there must be some other path that I do not quite see.

Hopefully you are more familiar with the code and can see the problem
right away ;)

Bye,
Oleg