Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-09 Thread Andrew W Elble

> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.


Pondering further (with an assumption about the workload), isn't this path also likely to involve a lease break in progress on that file?

Thanks,

Andy

-- 
Andrew W. Elble
awe...@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912


Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-08 Thread Oleg Drokin

On Jun 8, 2016, at 1:22 PM, Jeff Layton wrote:

> On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
>> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
>> 
>>> A simple way to confirm that might be to convert all of the read locks
>>> on the st_rwsem to write locks. That will serialize all of the open
>>> operations and should prevent that particular race from occurring.
>>> 
>>> If that works, we'd probably want to fix it in a less heavy-handed way,
>>> but I'd have to think about how best to do that.
>> 
>> So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:
>> 
>> after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);
>> 
>> If that returns true - just drop the spinlock and return EAGAIN.
>> 
>> The callsite in nfs4_upgrade_open() would handle that by retesting the access map again and either coming back in or, more likely, reusing the now-updated stateid (synchronised by the fi_lock again).
>> We probably need to convert the whole access map testing there to be under fi_lock.
>> Something like:
>> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
>> {
>> 	__be32 status;
>> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
>> 
>> again:
>> +	spin_lock(&fp->fi_lock);
>> 	if (!test_access(open->op_share_access, stp)) {
>> +		spin_unlock(&fp->fi_lock);
>> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
>> +		if (status == -EAGAIN)
>> +			goto again;
>> +		return status;
>> +	}
>> 
>> 	/* test and set deny mode */
>> -	spin_lock(&fp->fi_lock);
>> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
>> 
>> The call in nfsd4_process_open2() I think cannot hit this condition, right? We could probably add a WARN_ON there? BUG_ON? A more sensible approach?
>> 
>> Alternatively, we can probably always call nfs4_get_vfs_file() under this spinlock and just have it drop that for the open and then reobtain it (already done) - not as transparent, I guess.
>> 
> 
> Yeah, I think that might be best. It looks like things could change
> after you drop the spinlock with the patch above. Since we have to
> retake it anyway in nfs4_get_vfs_file, we can just do it there.
> 
>> Or the fi_lock might be converted to, say, a mutex, so we can sleep with it held, and then we can hold it across the whole invocation of nfs4_get_vfs_file() and the access testing and stuff.
> 
> I think we'd be better off taking the st_rwsem for write (maybe just
> turning it into a mutex). That would at least be per-stateid instead of
> per-inode. That's a fine fix for now.
> 
> It might slow down a client slightly that is sending two stateid
> morphing operations in parallel, but they shouldn't affect each other.
> I'm liking that solution more and more here.
> Longer term, I think we need to further simplify OPEN handling. It has
> gotten better, but it's still really hard to follow currently (and is
> obviously error-prone).

The conversion to always taking the rwsem for write is holding up nicely so far (also, no other WARNs have triggered yet).

I guess I'll do a patch converting it to a mutex, but also, separately, a patch that just holds fi_lock longer - I'll test that other one and, if all is well, submit it too, and let you choose which one you like the most ;)




Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-08 Thread Jeff Layton
On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
> 
> > A simple way to confirm that might be to convert all of the read locks
> > on the st_rwsem to write locks. That will serialize all of the open
> > operations and should prevent that particular race from occurring.
> > 
> > If that works, we'd probably want to fix it in a less heavy-handed way,
> > but I'd have to think about how best to do that.
> 
> So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:
> 
> after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);
> 
> If that returns true - just drop the spinlock and return EAGAIN.
> 
> The callsite in nfs4_upgrade_open() would handle that by retesting the access map again and either coming back in or, more likely, reusing the now-updated stateid (synchronised by the fi_lock again).
> We probably need to convert the whole access map testing there to be under fi_lock.
> Something like:
> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
> {
> 	__be32 status;
> 	unsigned char old_deny_bmap = stp->st_deny_bmap;
> 
> again:
> +	spin_lock(&fp->fi_lock);
> 	if (!test_access(open->op_share_access, stp)) {
> +		spin_unlock(&fp->fi_lock);
> +		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +		if (status == -EAGAIN)
> +			goto again;
> +		return status;
> +	}
> 
> 	/* test and set deny mode */
> -	spin_lock(&fp->fi_lock);
> 	status = nfs4_file_check_deny(fp, open->op_share_deny);
> 
> The call in nfsd4_process_open2() I think cannot hit this condition, right? We could probably add a WARN_ON there? BUG_ON? A more sensible approach?
> 
> Alternatively, we can probably always call nfs4_get_vfs_file() under this spinlock and just have it drop that for the open and then reobtain it (already done) - not as transparent, I guess.
> 

Yeah, I think that might be best. It looks like things could change
after you drop the spinlock with the patch above. Since we have to
retake it anyway in nfs4_get_vfs_file, we can just do it there.

> Or the fi_lock might be converted to, say, a mutex, so we can sleep with it held, and then we can hold it across the whole invocation of nfs4_get_vfs_file() and the access testing and stuff.

I think we'd be better off taking the st_rwsem for write (maybe just
turning it into a mutex). That would at least be per-stateid instead of
per-inode. That's a fine fix for now.

It might slow down a client slightly that is sending two stateid
morphing operations in parallel, but they shouldn't affect each other.
I'm liking that solution more and more here.
Longer term, I think we need to further simplify OPEN handling. It has
gotten better, but it's still really hard to follow currently (and is
obviously error-prone).

-- 
Jeff Layton 


Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-08 Thread Oleg Drokin

On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.
> 
> If that works, we'd probably want to fix it in a less heavy-handed way,
> but I'd have to think about how best to do that.

So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:

after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);

If that returns true - just drop the spinlock and return EAGAIN.

The callsite in nfs4_upgrade_open() would handle that by retesting the access map again and either coming back in or, more likely, reusing the now-updated stateid (synchronised by the fi_lock again).
We probably need to convert the whole access map testing there to be under fi_lock.
Something like:
nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
{
	__be32 status;
	unsigned char old_deny_bmap = stp->st_deny_bmap;

again:
+	spin_lock(&fp->fi_lock);
	if (!test_access(open->op_share_access, stp)) {
+		spin_unlock(&fp->fi_lock);
+		status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
+		if (status == -EAGAIN)
+			goto again;
+		return status;
+	}

	/* test and set deny mode */
-	spin_lock(&fp->fi_lock);
	status = nfs4_file_check_deny(fp, open->op_share_deny);

The call in nfsd4_process_open2() I think cannot hit this condition, right? We could probably add a WARN_ON there? BUG_ON? A more sensible approach?

Alternatively, we can probably always call nfs4_get_vfs_file() under this spinlock and just have it drop that for the open and then reobtain it (already done) - not as transparent, I guess.

Or the fi_lock might be converted to, say, a mutex, so we can sleep with it held, and then we can hold it across the whole invocation of nfs4_get_vfs_file() and the access testing and stuff.




Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-08 Thread Oleg Drokin

On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
>> 
> That said, this code is quite subtle. I'd need to look over it in more
> detail before I offer up any fixes. I'd also appreciate it if anyone
> else wants to sanity check my analysis there.
> 
>>> Yeah, I think you're right. It's fine since r/w opens have a distinct
>>> slot, even though the refcounting just tracks the number of read and
>>> write references. So yeah, the leak probably is in an error path
>>> someplace, or maybe a race someplace.
>> 
>> So I noticed that set_access is always called locked, but clear_access is not; this does not sound right.
>> 
>> So I placed this strategic WARN_ON:
>> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
>> 			goto out_put_access;
>> 		spin_lock(&fp->fi_lock);
>> 		if (!fp->fi_fds[oflag]) {
>> +			WARN_ON(!test_access(open->op_share_access, stp));
>> 			fp->fi_fds[oflag] = filp;
>> 			filp = NULL;
>> This is right in the place where nfsd set the access flag already, discovered that the file is not opened and went on to open it, yet some parallel thread came in and cleared the flag by the time we got the file opened.
>> It did trigger (but there are 30 minutes left till the test finishes, so I don't know yet if this corresponds to the problem at hand, so below is speculation).
>> 
>> Now, at the exit from this function, the state will not have access for this file set and the file would be leaked, since the matching call, probably in release_all_access(), does:
>> 	if (test_access(i, stp))
>> 		nfs4_file_put_access(stp->st_stid.sc_file, i);
>> 	clear_access(i, stp);
>> 
>> Meaning that the refcount is not dropped because no access flag is set.
>> 
>> I imagine we can just set the flag back right after setting fp->fi_fds[oflag] and that particular race would be gone?
>> 
>> Would that be ok, or is it just wrong that somebody can come and drop (delegation downgrade seems to be the most likely suspect) access bits like this while they are in the middle of being acquired?
>> 
> 
> I think you're on the right track...
> 
> Operations that morph the stateid's seqid should be serialized via the
> st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
> rwsem, but we do allow OPENs to run in parallel with each other, under
> the assumption that on success you always get at least what you
> requested. So, I don't think that OPEN can really race with an
> OPEN_DOWNGRADE or CLOSE like you're thinking.
> 
> But...suppose we have two racing OPEN calls. They're both in
> nfs4_get_vfs_file. One opens the file and succeeds and the other fails
> and ends up in out_put_access. At that point, you could end up
> clobbering the successful update to st_access_bmap from the other task,
> and we'd end up not putting the file access references in
> release_all_access.

It looks like the race plays out in reverse of what I thought.
My assert when setting fi_fds to non-NULL never hit, but I also had the warning below, and that DID hit, with 100% correlation to the later leak.

set_access(u32 access, struct nfs4_ol_stateid *stp)
{
	unsigned char mask = 1 << access;

	WARN_ON_ONCE(access > NFS4_SHARE_ACCESS_BOTH);
	WARN_ON(stp->st_access_bmap & mask);
	stp->st_access_bmap |= mask;
}

So we are setting an access bit that is already set (this comes from nfs4_get_vfs_file() via nfs4_upgrade_open()) twice; then one of the callers drops theirs, and so the subsequent state teardown does not drop its file reference in release_all_access(), because it doesn't see the access bit - something like that.

Though I don't see why it cannot play out like I originally envisioned, too.

Hm…
So if it always hits when the set_access warning triggers, and there's no error on either path, then there's no explicit call to nfs4_file_put_access(), and that's a leak, since release_all_access() would only release the file once?

I wonder if instead, on the success exit path, we could check old_access_bmap and, if our current access is also set there, release it?
This is ok, right? If the other caller succeeded, they still keep their reference. If the other caller failed, they drop theirs and restore their idea of the access bmap, which presumably does not have the bit set, and they drop their reference anyway.

The question is whether both of us fail, or only they fail.
If both fail and we win the race on restoring the old access map, we might set the access bits wrongly, because their copy was the correct one.
Similarly, if they fail and we don't, they can reset our access.

> A simple way to confirm that might be to convert all of the read locks
> on the st_rwsem to write locks. That will serialize all of the open
> operations and should prevent that particular race from occurring.

Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-08 Thread Jeff Layton
On Tue, 2016-06-07 at 22:22 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
> 
> > > > That said, this code is quite subtle. I'd need to look over it in more
> > > > detail before I offer up any fixes. I'd also appreciate it if anyone
> > > > else wants to sanity check my analysis there.
> > > > 
> > Yeah, I think you're right. It's fine since r/w opens have a distinct
> > slot, even though the refcounting just tracks the number of read and
> > write references. So yeah, the leak probably is in an error path
> > someplace, or maybe a race someplace.
> 
> So I noticed that set_access is always called locked, but clear_access is not; this does not sound right.
> 
> So I placed this strategic WARN_ON:
> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
> 			goto out_put_access;
> 		spin_lock(&fp->fi_lock);
> 		if (!fp->fi_fds[oflag]) {
> +			WARN_ON(!test_access(open->op_share_access, stp));
> 			fp->fi_fds[oflag] = filp;
> 			filp = NULL;
> This is right in the place where nfsd set the access flag already, discovered that the file is not opened and went on to open it, yet some parallel thread came in and cleared the flag by the time we got the file opened.
> It did trigger (but there are 30 minutes left till the test finishes, so I don't know yet if this corresponds to the problem at hand, so below is speculation).
> 
> Now, at the exit from this function, the state will not have access for this file set and the file would be leaked, since the matching call, probably in release_all_access(), does:
> 	if (test_access(i, stp))
> 		nfs4_file_put_access(stp->st_stid.sc_file, i);
> 	clear_access(i, stp);
> 
> Meaning that the refcount is not dropped because no access flag is set.
> 
> I imagine we can just set the flag back right after setting fp->fi_fds[oflag] and that particular race would be gone?
> 
> Would that be ok, or is it just wrong that somebody can come and drop (delegation downgrade seems to be the most likely suspect) access bits like this while they are in the middle of being acquired?
> 

I think you're on the right track...

Operations that morph the stateid's seqid should be serialized via the
st_rwsem. An OPEN_DOWNGRADE or CLOSE takes an exclusive lock on the
rwsem, but we do allow OPENs to run in parallel with each other, under
the assumption that on success you always get at least what you
requested. So, I don't think that OPEN can really race with an
OPEN_DOWNGRADE or CLOSE like you're thinking.

But...suppose we have two racing OPEN calls. They're both in
nfs4_get_vfs_file. One opens the file and succeeds and the other fails
and ends up in out_put_access. At that point, you could end up
clobbering the successful update to st_access_bmap from the other task,
and we'd end up not putting the file access references in
release_all_access.

A simple way to confirm that might be to convert all of the read locks
on the st_rwsem to write locks. That will serialize all of the open
operations and should prevent that particular race from occurring.

If that works, we'd probably want to fix it in a less heavy-handed way,
but I'd have to think about how best to do that.

-- 
Jeff Layton 




Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin

On Jun 7, 2016, at 10:22 PM, Oleg Drokin wrote:

> 
> On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:
> 
>>>> That said, this code is quite subtle. I'd need to look over it in more
>>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>>> else wants to sanity check my analysis there.
>>>> 
>> Yeah, I think you're right. It's fine since r/w opens have a distinct
>> slot, even though the refcounting just tracks the number of read and
>> write references. So yeah, the leak probably is in an error path
>> someplace, or maybe a race someplace.
> 
> So I noticed that set_access is always called locked, but clear_access is not,
> this does not sound right.
> 
> So I placed this strategic WARN_ON:
> @@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
> 		goto out_put_access;
> 	spin_lock(&fp->fi_lock);
> 	if (!fp->fi_fds[oflag]) {
> +		WARN_ON(!test_access(open->op_share_access, stp));
> 		fp->fi_fds[oflag] = filp;
> 		filp = NULL;
> 
> This is right in the place where nfsd set the access flag already, discovered
> that the file is not opened and went on to open it, yet some parallel thread
> came in and cleared the flag by the time we got the file opened.
> It did trigger (but there are 30 minutes left till test finish, so I don't
> know yet if this will correspond to the problem at hand yet, so below is 
> speculation).

Duh, I looked for a warning, but did not cross reference, and it was not
this one that hit yet.

Though apparently I am hitting some of the "impossible" warnings, so you
might want to look into that anyway.

	status = nfsd4_process_open2(rqstp, resfh, open);
	WARN(status && open->op_created,
	     "nfsd4_process_open2 failed to open newly-created file! status=%u\n",
	     be32_to_cpu(status));

and

	filp = find_readable_file(fp);
	if (!filp) {
		/* We should always have a readable file here */
		WARN_ON_ONCE(1);
		locks_free_lock(fl);
		return -EBADF;
	}



Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin

On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:

>>> That said, this code is quite subtle. I'd need to look over it in more
>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>> else wants to sanity check my analysis there.
>>> 
> Yeah, I think you're right. It's fine since r/w opens have a distinct
> slot, even though the refcounting just tracks the number of read and
> write references. So yeah, the leak probably is in an error path
> someplace, or maybe a race someplace.

So I noticed that set_access is always called locked, but clear_access is not,
this does not sound right.

So I placed this strategic WARN_ON:
@@ -3991,6 +4030,7 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp,
		goto out_put_access;
	spin_lock(&fp->fi_lock);
	if (!fp->fi_fds[oflag]) {
+		WARN_ON(!test_access(open->op_share_access, stp));
		fp->fi_fds[oflag] = filp;
		filp = NULL;

This is right in the place where nfsd set the access flag already, discovered
that the file is not opened and went on to open it, yet some parallel thread
came in and cleared the flag by the time we got the file opened.
It did trigger (but there are 30 minutes left till test finish, so I don't
know yet if this will correspond to the problem at hand yet, so below is 
speculation).

Now, at the exit from this function, the state will not have access for
this file set, and the file would be leaked, since the matching call,
probably in release_all_access(), does:
	if (test_access(i, stp))
		nfs4_file_put_access(stp->st_stid.sc_file, i);
	clear_access(i, stp);

Meaning that the refcount is not dropped due to no access flag set.

I imagine we can just set the flag back right after setting fp->fi_fds[oflag]
and that particular race would be gone?

Would that be ok, or is it just wrong that somebody can come and drop
(delegation downgrade seems to be the most likely suspect) access bits like this
while they are in the middle of being acquired?





Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin

On Jun 7, 2016, at 8:03 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:
>>> So, the way this is supposed to work is that the stateids each hold a
>>> reference to the nfs4_file. They also take an fi_access reference for
>>> the open modes that they represent. This is how we know when it's OK to
>>> close one of the files in the fd[] slots. It might be interesting to
>>> also print out the fi_access counters if the slot hasn't been zeroed
>>> out if you feel up to it.
>> Ok, I have primed that and will let you know next time it triggers.
> Thanks. Maybe that will give us some idea of where to look…

Well, the counters are: 0 1 0 at the warning site, which is what
makes sense, really.

>>> I think though that the open upgrade accounting may be wrong.
>>> __nfs4_file_get_access does an unconditional atomic_inc, so if the
>>> client does a open for write and then opens again for read/write, then
>>> we may end up with too many write fi_access references taken.
>> 
>> The read-write access is all fine I think, because when dropping read
>> or write it also checks if the readwrite is filled?
>> Also whoever did that atomic_inc for get_access would eventually do put,
>> and that should free the file, no?
>> I was mostly thinking there's an error exit path somewhere that forgets
>> to do the put.
>> This also would explain the other pieces leaked that are not directly
>> attached into the nfs4_file.
>> 
>>> 
>>> That said, this code is quite subtle. I'd need to look over it in more
>>> detail before I offer up any fixes. I'd also appreciate it if anyone
>>> else wants to sanity check my analysis there.
>>> 
> 
> Yeah, I think you're right. It's fine since r/w opens have a distinct
> slot, even though the refcounting just tracks the number of read and
> write references. So yeah, the leak probably is in an error path
> someplace, or maybe a race someplace.

I guess I'll try to put some tracers in the code to try and match
refcount takers vs refcount droppers for lack of better ideas.

Hopefully that'll lead us to see who got the count that was not dropped.




Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Jeff Layton
On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:
> 
> > 
> > On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
> > > 
> > > On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
> > > 
> > > > 
> > > > 
> > > > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> > > > > 
> > > > > 
> > > > > Hello!
> > > > > 
> > > > >    I've been trying to better understand this problem I was having 
> > > > > where sometimes
> > > > >    a formerly NFS-exported mountpoint becomes unmountable (after nfsd 
> > > > > stop).
> > > > > 
> > > > >    I finally traced it to a leaked filedescriptor that was allocated 
> > > > > from
> > > > >    
> > > > > nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> > > > > 
> > > > >    Also together with it we see leaked credentials allocated along 
> > > > > the same path from
> > > > >    fh_verify() and groups allocated from 
> > > > > svcauth_unix_accept()->groups_alloc() that
> > > > >    are presumably used by the credentials.
> > > > > 
> > > > >    Unfortunately I was not able to make total sense out of the state 
> > > > > handling in nfsd,
> > > > >    but it's clear that one of the file descriptors inside struct 
> > > > > nfs4_file is
> > > > >    lost. I added a patch like this (always a good idea, so surprised 
> > > > > it was not
> > > > >    there already):
> > > > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head 
> > > > > *rcu)
> > > > >  {
> > > > > struct nfs4_file *fp = container_of(rcu, struct nfs4_file, 
> > > > > fi_rcu);
> > > > >  
> > > > > +   WARN_ON(fp->fi_fds[0]);
> > > > > +   WARN_ON(fp->fi_fds[1]);
> > > > > +   WARN_ON(fp->fi_fds[2]);
> > > > > kmem_cache_free(file_slab, fp);
> > > > >  }
> > > > > 
> > > > >    And when the problem is hit, I am also triggering (Always this one 
> > > > > which is fd[1])
> > > > > [ 3588.143002] [ cut here ]
> > > > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at 
> > > > > /home/green/bk/linux/fs/nfsd/nfs4state.c:278 
> > > > > nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev 
> > > > > acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm 
> > > > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm 
> > > > > floppy serio_raw virtio_blk
> > > > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 
> > > > > 4.7.0-rc1-vm-nfs+ #120
> > > > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > > > [ 3588.153830]  0286 e2d5ccdf 88011965bd50 
> > > > > 814a11a5
> > > > > [ 3588.153832]    88011965bd90 
> > > > > 8108806b
> > > > > [ 3588.153834]  0116 8800c476a0b8 8800c476a048 
> > > > > c0110fc0
> > > > > [ 3588.153834] Call Trace:
> > > > > [ 3588.153839]  [] dump_stack+0x86/0xc1
> > > > > [ 3588.153841]  [] __warn+0xcb/0xf0
> > > > > [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> > > > > [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> > > > > [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> > > > > [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> > > > > [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> > > > > [ 3588.153866]  [] kthread+0x101/0x120
> > > > > [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> > > > > [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> > > > > [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> > > > > 
> > > > > 
> > > > >   release_all_access() seems to be doing correct job of all that 
> > > > > cleaning, so
> > > > >   there must be some other path that I do not quite see.
> > > > > 
> > > > >   Hopefully you are more familiar with the code and can see the 
> > > > > problem right away ;)
> > > > Hmm...well I'm not seeing it right offhand, and haven't been able to
> > > > reproduce the problem so far after a couple of attempts by hand. What
> > > > sort of workload are you running before you see that warning pop?
> > > Well, the workload is a crazed twisty conflicting operations set in
> > > a very limited namespace (called racer.sh,
> > > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
> > > 
> > > It does all sorts of crazy unimaginable stuff that is hard to
> > > predict, and typically I run it on several mountpoints all pointing
> > > to the same spot (now 3-ways: one instance on direct mount point in
> > > ext4, one on nfs mount mounted as mount -o nfs and one on mount -o
> > > nfs4).
> > > It must be hitting some rare race or a race + operations combination
> > > that leads to the leak.
> > > 
> > > To reproduce I run it like this until it stops (and it usually stops
> > > in 2 or 3 iterations, so it takes some time):
> > > $ cat ~/stress.sh 
> > > 


Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin

On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
>> On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
>> 
>>> 
>>> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
>>>> 
>>>> Hello!
>>>> 
>>>>    I've been trying to better understand this problem I was having
>>>>    where sometimes a formerly NFS-exported mountpoint becomes
>>>>    unmountable (after nfsd stop).
>>>> 
>>>>    I finally traced it to a leaked filedescriptor that was allocated
>>>>    from
>>>>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
>>>> 
>>>>    Also together with it we see leaked credentials allocated along
>>>>    the same path from fh_verify() and groups allocated from
>>>>    svcauth_unix_accept()->groups_alloc() that are presumably used by
>>>>    the credentials.
>>>> 
>>>>    Unfortunately I was not able to make total sense out of the state
>>>>    handling in nfsd, but it's clear that one of the file descriptors
>>>>    inside struct nfs4_file is lost. I added a patch like this (always
>>>>    a good idea, so surprised it was not there already):
>>>> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>>>  {
>>>> 	struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>>>>  
>>>> +	WARN_ON(fp->fi_fds[0]);
>>>> +	WARN_ON(fp->fi_fds[1]);
>>>> +	WARN_ON(fp->fi_fds[2]);
>>>>  	kmem_cache_free(file_slab, fp);
>>>>  }
>>>> 
>>>>    And when the problem is hit, I am also triggering (Always this one
>>>>    which is fd[1])
>>>> [ 3588.143002] [ cut here ]
>>>> [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>>>> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
>>>> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
>>>> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>> [ 3588.153830]  0286 e2d5ccdf 88011965bd50 814a11a5
>>>> [ 3588.153832]    88011965bd90 8108806b
>>>> [ 3588.153834]  0116 8800c476a0b8 8800c476a048 c0110fc0
>>>> [ 3588.153834] Call Trace:
>>>> [ 3588.153839]  [] dump_stack+0x86/0xc1
>>>> [ 3588.153841]  [] __warn+0xcb/0xf0
>>>> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
>>>> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
>>>> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>>>> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
>>>> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
>>>> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
>>>> [ 3588.153866]  [] kthread+0x101/0x120
>>>> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
>>>> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
>>>> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
>>>> 
>>>> 
>>>>   release_all_access() seems to be doing correct job of all that
>>>>   cleaning, so there must be some other path that I do not quite see.
>>>> 
>>>>   Hopefully you are more familiar with the code and can see the
>>>>   problem right away ;)
>>> Hmm...well I'm not seeing it right offhand, and haven't been able to
>>> reproduce the problem so far after a couple of attempts by hand. What
>>> sort of workload are you running before you see that warning pop?
>> Well, the workload is a crazed twisty conflicting operations set in a
>> very limited namespace (called racer.sh,
>> http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
>> 
>> It does all sorts of crazy unimaginable stuff that is hard to predict,
>> and typically I run it on several mountpoints all pointing to the same
>> spot (now 3-ways: one instance on direct mount point in ext4, one on
>> nfs mount mounted as mount -o nfs and one on mount -o nfs4).
>> It must be hitting some rare race or a race + operations combination
>> that leads to the leak.
>> 
>> To reproduce I run it like this until it stops (and it usually stops in
>> 2 or 3 iterations, so it takes some time):
>> $ cat ~/stress.sh 
>> #!/bin/bash
>> 
>> set -x
>> 
>> cd /home/green/git/lustre-release/lustre/tests/racer
>> dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
>> mkfs.ext4 /tmp/loop
>> service rpcbind start
>> mount none /var/lib/nfs -t tmpfs
>> touch /var/lib/nfs/etab
>> service nfs-mountd start
>> 
>> while :; do
>> 
>> mount /tmp/loop /mnt/lustre -o loop || exit 1
>> mkdir /mnt/lustre/racer
>> service nfs-server start || exit 2
>> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
>> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
>> DURATION=3600 sh racer.sh /mnt/nfs/racer &
>> DURATION=3600 

>> service rpcbind start
>> mount none /var/lib/nfs -t tmpfs
>> touch /var/lib/nfs/etab
>> service nfs-mountd start
>> 
>> while :; do
>> 
>> mount /tmp/loop /mnt/lustre -o loop || exit 1
>> mkdir /mnt/lustre/racer
>> service nfs-server start || exit 2
>> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
>> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
>> DURATION=3600 sh racer.sh /mnt/nfs/racer &
>> DURATION=3600 

Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Jeff Layton
On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
> 
> > 
> > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> > > 
> > > Hello!
> > > 
> > >    I've been trying to better understand this problem I was having where 
> > > sometimes
> > >    a formerly NFS-exported mountpoint becomes unmountable (after nfsd 
> > > stop).
> > > 
> > >    I finally traced it to a leaked filedescriptor that was allocated from
> > >    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> > > 
> > >    Also together with it we see leaked credentials allocated along the 
> > > same path from
> > >    fh_verify() and groups allocated from 
> > > svcauth_unix_accept()->groups_alloc() that
> > >    are presumably used by the credentials.
> > > 
> > >    Unfortunately I was not able to make total sense out of the state 
> > > handling in nfsd,
> > >    but it's clear that one of the file descriptors inside struct 
> > > nfs4_file is
> > >    lost. I added a patch like this (always a good idea, so surprised it 
> > > was not
> > >    there already):
> > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> > >  {
> > > struct nfs4_file *fp = container_of(rcu, struct nfs4_file, 
> > > fi_rcu);
> > >  
> > > +   WARN_ON(fp->fi_fds[0]);
> > > +   WARN_ON(fp->fi_fds[1]);
> > > +   WARN_ON(fp->fi_fds[2]);
> > > kmem_cache_free(file_slab, fp);
> > >  }
> > > 
> > >    And when the problem is hit, I am also triggering (Always this one 
> > > which is fd[1])
> > > [ 3588.143002] [ cut here ]
> > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at 
> > > /home/green/bk/linux/fs/nfsd/nfs4state.c:278 
> > > nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev 
> > > acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm 
> > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy 
> > > serio_raw virtio_blk
> > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ 
> > > #120
> > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [ 3588.153830]  0286 e2d5ccdf 88011965bd50 
> > > 814a11a5
> > > [ 3588.153832]    88011965bd90 
> > > 8108806b
> > > [ 3588.153834]  0116 8800c476a0b8 8800c476a048 
> > > c0110fc0
> > > [ 3588.153834] Call Trace:
> > > [ 3588.153839]  [] dump_stack+0x86/0xc1
> > > [ 3588.153841]  [] __warn+0xcb/0xf0
> > > [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> > > [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> > > [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> > > [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> > > [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> > > [ 3588.153866]  [] kthread+0x101/0x120
> > > [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> > > [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> > > [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> > > 
> > > 
> > >   release_all_access() seems to be doing correct job of all that 
> > > cleaning, so
> > >   there must be some other path that I do not quite see.
> > > 
> > >   Hopefully you are more familiar with the code and can see the problem 
> > > right away ;)
> > Hmm...well I'm not seeing it right offhand, and haven't been able to
> > reproduce the problem so far after a couple of attempts by hand. What
> > sort of workload are you running before you see that warning pop?
> Well, the workload is a crazed twisty conflicting operations set in a very 
> limited
> namespace (called racer.sh,
> http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer
>  )
> 
> It does all sorts of crazy unimaginable stuff that is hard to predict, and 
> typically I
> run it on several mountpoints all pointing to the same spot
> (now 3-ways: one instance on direct mount point in ext4, one on nfs mount 
> mounted as
> mount -o nfs and one on mount -o nfs4).
> It must be hitting some rare race or a race + operations combination that 
> leads to
> the leak.
> 
> To reproduce I run it like this until it stops (and it usually stops in
> 2 or 3 iterations, so it takes some time):
> $ cat ~/stress.sh 
> #!/bin/bash
> 
> set -x
> 
> cd /home/green/git/lustre-release/lustre/tests/racer
> dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
> mkfs.ext4 /tmp/loop
> service rpcbind start
> mount none /var/lib/nfs -t tmpfs
> touch /var/lib/nfs/etab
> service nfs-mountd start
> 
> while :; do
> 
> mount /tmp/loop /mnt/lustre -o loop || exit 1
> mkdir /mnt/lustre/racer
> service nfs-server start || exit 2
> mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
> mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
> DURATION=3600 sh racer.sh /mnt/nfs/racer &
> DURATION=3600 sh racer.sh 


Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin

On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:

> On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
>> Hello!
>> 
>>I've been trying to better understand this problem I was having where 
>> sometimes
>>a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
>> 
>>I finally traced it to a leaked filedescriptor that was allocated from
>>nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
>> 
>>Also together with it we see leaked credentials allocated along the same 
>> path from
>>fh_verify() and groups allocated from 
>> svcauth_unix_accept()->groups_alloc() that
>>are presumably used by the credentials.
>> 
>>Unfortunately I was not able to make total sense out of the state 
>> handling in nfsd,
>>but it's clear that one of the file descriptors inside struct nfs4_file is
>>lost. I added a patch like this (always a good idea, so surprised it was 
>> not
>>there already):
>> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>>  {
>> struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>>  
>> +   WARN_ON(fp->fi_fds[0]);
>> +   WARN_ON(fp->fi_fds[1]);
>> +   WARN_ON(fp->fi_fds[2]);
>> kmem_cache_free(file_slab, fp);
>>  }
>> 
>>And when the problem is hit, I am also triggering (Always this one which 
>> is fd[1])
>> [ 3588.143002] [ cut here ]
>> [ 3588.143662] WARNING: CPU: 5 PID: 9 at 
>> /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 
>> [nfsd]
>> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq 
>> tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper 
>> syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
>> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
>> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [ 3588.153830]  0286 e2d5ccdf 88011965bd50 
>> 814a11a5
>> [ 3588.153832]    88011965bd90 
>> 8108806b
>> [ 3588.153834]  0116 8800c476a0b8 8800c476a048 
>> c0110fc0
>> [ 3588.153834] Call Trace:
>> [ 3588.153839]  [] dump_stack+0x86/0xc1
>> [ 3588.153841]  [] __warn+0xcb/0xf0
>> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
>> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
>> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
>> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
>> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
>> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
>> [ 3588.153866]  [] kthread+0x101/0x120
>> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
>> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
>> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
>> 
>> 
>>   release_all_access() seems to be doing correct job of all that cleaning, so
>>   there must be some other path that I do not quite see.
>> 
>>   Hopefully you are more familiar with the code and can see the problem 
>> right away ;)
> 
> Hmm...well I'm not seeing it right offhand, and haven't been able to
> reproduce the problem so far after a couple of attempts by hand. What
> sort of workload are you running before you see that warning pop?

Well, the workload is a crazed twisty conflicting operations set in a very 
limited
namespace (called racer.sh,
http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer
 )

It does all sorts of crazy unimaginable stuff that is hard to predict, and 
typically I
run it on several mountpoints all pointing to the same spot
(now 3-ways: one instance on direct mount point in ext4, one on nfs mount 
mounted as
mount -o nfs and one on mount -o nfs4).
It must be hitting some rare race or a race + operations combination that leads 
to
the leak.

To reproduce I run it like this until it stops (and it usually stops in
2 or 3 iterations, so it takes some time):
$ cat ~/stress.sh 
#!/bin/bash

set -x

cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start

while :; do

mount /tmp/loop /mnt/lustre -o loop || exit 1
mkdir /mnt/lustre/racer
service nfs-server start || exit 2
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
DURATION=3600 sh racer.sh /mnt/nfs/racer &
DURATION=3600 sh racer.sh /mnt/nfs2/racer &
DURATION=3600 sh racer.sh /mnt/lustre/racer &
wait %1 %2 %3
umount /mnt/nfs || exit 5
umount /mnt/nfs2 || exit 6
service nfs-server stop || exit 7
umount /mnt/lustre || exit 8
e2fsck -n -f /tmp/loop || exit 9
done

Please ignore all the Lustre references, the test is generic, it's just I am 
running
off nfs-readonly-root setup, so want to reuse existing mountpoint locations,


Re: Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Jeff Layton
On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> Hello!
> 
>    I've been trying to better understand this problem I was having where 
> sometimes
>    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
> 
>    I finally traced it to a leaked filedescriptor that was allocated from
>    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> 
>    Also together with it we see leaked credentials allocated along the same 
> path from
>    fh_verify() and groups allocated from 
> svcauth_unix_accept()->groups_alloc() that
>    are presumably used by the credentials.
> 
>    Unfortunately I was not able to make total sense out of the state handling 
> in nfsd,
>    but it's clear that one of the file descriptors inside struct nfs4_file is
>    lost. I added a patch like this (always a good idea, so surprised it was 
> not
>    there already):
> @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
>  {
> struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
>  
> +   WARN_ON(fp->fi_fds[0]);
> +   WARN_ON(fp->fi_fds[1]);
> +   WARN_ON(fp->fi_fds[2]);
> kmem_cache_free(file_slab, fp);
>  }
> 
>    And when the problem is hit, I am also triggering (Always this one which 
> is fd[1])
> [ 3588.143002] [ cut here ]
> [ 3588.143662] WARNING: CPU: 5 PID: 9 at 
> /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 
> [nfsd]
> [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq 
> tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
> [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
> [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 3588.153830]  0286 e2d5ccdf 88011965bd50 
> 814a11a5
> [ 3588.153832]    88011965bd90 
> 8108806b
> [ 3588.153834]  0116 8800c476a0b8 8800c476a048 
> c0110fc0
> [ 3588.153834] Call Trace:
> [ 3588.153839]  [] dump_stack+0x86/0xc1
> [ 3588.153841]  [] __warn+0xcb/0xf0
> [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> [ 3588.153866]  [] kthread+0x101/0x120
> [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> 
> 
>   release_all_access() seems to be doing correct job of all that cleaning, so
>   there must be some other path that I do not quite see.
> 
>   Hopefully you are more familiar with the code and can see the problem right 
> away ;)
> 
> Bye,
> Oleg

Hmm...well I'm not seeing it right offhand, and haven't been able to
reproduce the problem so far after a couple of attempts by hand. What
sort of workload are you running before you see that warning pop?

-- 
Jeff Layton 




Files leak from nfsd in 4.7.1-rc1 (and more?)

2016-06-07 Thread Oleg Drokin
Hello!

   I've been trying to better understand this problem I was having where 
sometimes
   a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).

   I finally traced it to a leaked filedescriptor that was allocated from
   nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().

   Also together with it we see leaked credentials allocated along the same 
path from
   fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() 
that
   are presumably used by the credentials.

   Unfortunately I was not able to make total sense out of the state handling 
in nfsd,
   but it's clear that one of the file descriptors inside struct nfs4_file is
   lost. I added a patch like this (always a good idea, so surprised it was not
   there already):
@@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 {
struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
 
+   WARN_ON(fp->fi_fds[0]);
+   WARN_ON(fp->fi_fds[1]);
+   WARN_ON(fp->fi_fds[2]);
kmem_cache_free(file_slab, fp);
 }

   And when the problem is hit, I am also triggering (Always this one which is 
fd[1])
[ 3588.143002] [ cut here ]
[ 3588.143662] WARNING: CPU: 5 PID: 9 at 
/home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 
[nfsd]
[ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq 
tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
[ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
[ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3588.153830]  0286 e2d5ccdf 88011965bd50 
814a11a5
[ 3588.153832]    88011965bd90 
8108806b
[ 3588.153834]  0116 8800c476a0b8 8800c476a048 
c0110fc0
[ 3588.153834] Call Trace:
[ 3588.153839]  [] dump_stack+0x86/0xc1
[ 3588.153841]  [] __warn+0xcb/0xf0
[ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 
[nfsd]
[ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
[ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
[ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
[ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
[ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
[ 3588.153866]  [] kthread+0x101/0x120
[ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
[ 3588.153871]  [] ret_from_fork+0x1f/0x40
[ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250


  release_all_access() seems to be doing correct job of all that cleaning, so
  there must be some other path that I do not quite see.

  Hopefully you are more familiar with the code and can see the problem right 
away ;)

Bye,
Oleg

