Re: [PATCH for-next V6 3/5] IB/uverbs: Enable device removal when there are active user space applications

2015-07-06 Thread Yishai Hadas

On 6/30/2015 9:40 PM, Jason Gunthorpe wrote:
> On Tue, Jun 30, 2015 at 01:26:05PM +0300, Yishai Hadas wrote:
>>   struct ib_uverbs_device {
>> -   struct kref ref;
>> +   struct kref comp_ref;
>> +   struct kref free_ref;
>
> So.. I was looking at this, and there is something wrong with the
> existing code.
>
> This old code:
>
> cdev_del(&uverbs_dev->cdev);
> [..]
> wait_for_completion(&uverbs_dev->comp);
> -   kfree(uverbs_dev);
>
> Has built in to it an assumption that when cdev_del returns there can
> be no possible open() running. Which doesn't appear to be true, cdev
> calls open unlocked and relies on refcounting to make everything work
> out.


The patch that introduces this bug was added 5 years ago by Alex Chiang 
and Signed-off-by: Roland Dreier.


Look at commit ID:2a72f212263701b927559f6850446421d5906c41, it can be 
seen also at: 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a72f212263701b 



Before this commit there was a device look-up table protected by a
spin_lock that was taken by both ib_uverbs_open and ib_uverbs_remove_one.
When the table was dropped and container_of was used instead, it opened
a race with remove_one, as dev might be freed just after:
dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev)
but before the kref_get.


In addition, this buggy patch added some dead code: container_of(x,y,z)
can never be NULL, so dev can never be NULL.
As a result, the comment above ib_uverbs_open, which says the open method
"will either immediately run -ENXIO", is wrong: that path can never be taken.
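
To illustrate the dead-code point, here is a minimal userspace sketch. The container_of macro below is a simplified stand-in for the kernel's, and struct cdev_model, struct uverbs_device_model and null_check_can_fire() are hypothetical names used only for this example. It shows that the result is just the member address minus a fixed offset, so it cannot be NULL for a valid embedded member.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down stand-ins for the real kernel structures. */
struct cdev_model {
    int dummy;
};

struct uverbs_device_model {
    int refcount;
    struct cdev_model cdev;   /* embedded in the device, not a pointer */
};

/*
 * Mirrors the shape of the check in ib_uverbs_open(): returns 1 if the
 * "dev == NULL" branch would be taken for this cdev, 0 otherwise.
 */
int null_check_can_fire(struct cdev_model *i_cdev)
{
    struct uverbs_device_model *dev =
        container_of(i_cdev, struct uverbs_device_model, cdev);

    return dev == NULL;
}
```

Because the check can never fail, it gives no protection against dev having already been freed by the remove path.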


 static int ib_uverbs_open(struct inode *inode, struct file *filp)
 {
@@ -631,13 +628,10 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
     struct ib_uverbs_file *file;
     int ret;

-    spin_lock(&map_lock);
-    dev = dev_table[iminor(inode) - IB_UVERBS_BASE_MINOR];
+    dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev);
     if (dev)
         kref_get(&dev->ref);
-    spin_unlock(&map_lock);
-
-    if (!dev)
+    else
         return -ENXIO;


Doug/Jason,
AFAIK V6 addressed all open comments raised by Jason, including the
last one, which asked for 2 separate krefs for complete and free.
It didn't introduce the problem above.


I believe that we should go forward and take the series. Please consider
that this series fixes an existing oops in patch #1 and adds missing
functionality to the kernel: enabling device removal when there are
active user space clients.


To fix the existing 5-year-old bug, an orthogonal patch that fixes the
buggy patch should be sent.


Alex/Roland:
Please review the above. Is there any chance you could contribute a patch
that solves this problem? Any comments?




--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V6 3/5] IB/uverbs: Enable device removal when there are active user space applications

2015-07-06 Thread Jason Gunthorpe
On Mon, Jul 06, 2015 at 05:08:08PM +0300, Yishai Hadas wrote:

> The patch that introduces this bug was added 5 years ago by Alex Chiang and
> Signed-off-by: Roland Dreier.
>
> Look at commit ID:2a72f212263701b927559f6850446421d5906c41, it can be seen
> also at:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a72f212263701b

Perhaps; this one looks involved as well:

commit 055422ddbb0a7610c5f57a056743d7336a39e90f
Author: Alexander Chiang <achi...@hp.com>
Date:   Tue Feb 2 19:07:49 2010 +

    IB/uverbs: Convert *cdev to cdev in struct ib_uverbs_device

    Instead of storing a pointer to a cdev, embed the entire struct cdev.

Embedding the cdev without using a parent kobject looks like the root
mistake.

> AFAIK V6 addressed all open comments raised by Jason, including the
> last one, which asked for 2 separate krefs for complete and free.
> It didn't introduce the problem above.

It does make it worse though, previously the module locking would make
it unlikely to ever hit any problem here, but now we have a naked
fully exposed race where release races with kfree resulting in
use-after-free. I'd think hitting it is quite likely if the new
feature is being used, and subtle memory corruption is not something
we want to see in the kernel.

So, I'd say, yes, it is an old bug, but one that is unlikely to be hit.
This patch series makes it much more likely, so it needs to be fixed.

In any event, I'm not sure what you are complaining about - this
series absolutely reworks the lifetime model of ib_uverbs_device, why
on earth do you think it is OK to have a buggy new implementation just
because the previous version was buggy? *Especially* when someone
takes the time to point out the mistake and tells you exactly how to
fix it, and it is *trivial* to do?

Even worse: I went through and audited the lifetime of V6's new model,
and I think that is *absolutely* something you should have done before
sending V1 :(

Jason


Re: [PATCH for-next V6 3/5] IB/uverbs: Enable device removal when there are active user space applications

2015-06-30 Thread Jason Gunthorpe
On Tue, Jun 30, 2015 at 01:26:05PM +0300, Yishai Hadas wrote:
>   struct ib_uverbs_device {
> - struct kref ref;
> + struct kref comp_ref;
> + struct kref free_ref;

So.. I was looking at this, and there is something wrong with the
existing code. 

This old code:

cdev_del(&uverbs_dev->cdev);
[..]
wait_for_completion(&uverbs_dev->comp);
-   kfree(uverbs_dev);

Has built in to it an assumption that when cdev_del returns there can
be no possible open() running. Which doesn't appear to be true, cdev
calls open unlocked and relies on refcounting to make everything work
out.

Even other places in the rdma core work this way, eg user_mad.

Which means open can be running concurrently with the rest of that
stuff, which creates several obvious problems.

I *think* (and I am not totally sure) that when you use cdev with a
dynamic structure, it *must* be chained off of a kobject for the
containing structure. Certainly, other examples in the kernel I've
looked at recently do this. (Typically the cdev will be part of the

Ie it should look like this:

  struct ib_uverbs_device {
        struct kobject  kobj;
        struct cdev     cdev;

cdev_init(&uverbs_dev->cdev, NULL);
uverbs_dev->cdev.kobj.parent = &uverbs_dev->kobj;
cdev_add(..)

The cdev will hold a kref on the parent (the containing structure) and
only when that kref is released is it guaranteed that open will never
be called again.

So, kobj becomes your free_ref, and cdev properly chains off it to
close that little hole with kref.
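
The chaining idea can be modelled in userspace as follows. This is only a single-threaded sketch under stated assumptions: struct kobj_model and its helpers are hypothetical names, plain ints stand in for the kernel's atomic refcounts, and setting a "released" flag stands in for the release callback that would kfree the structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a refcounted object that pins its parent. */
struct kobj_model {
    int refs;
    struct kobj_model *parent;
    bool released;            /* stands in for the release callback */
};

void kobj_model_init(struct kobj_model *k, struct kobj_model *parent)
{
    k->refs = 1;
    k->parent = parent;
    k->released = false;
    if (parent)
        parent->refs++;       /* the child holds a reference on its parent */
}

void kobj_model_get(struct kobj_model *k)
{
    k->refs++;
}

void kobj_model_put(struct kobj_model *k)
{
    /* When an object dies, drop the reference it held on its parent. */
    while (k && --k->refs == 0) {
        k->released = true;
        k = k->parent;
    }
}
```

In this model, if an in-flight open holds a reference on the cdev object, the containing device object is not released when the remove path drops its own reference. It is released only after the open path drops the cdev reference.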

---

The next problem is that open can run concurrently with
wait_for_completion, so the waiting scheme is wrong too.

This is a great example of why you should never use a kref for an
active count. It seems like the right thing, but it is subtly wrong.

krefs have this special property:

kref_get()
WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);

So when the code did this:

-   kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
-   wait_for_completion(&uverbs_dev->comp);
-   kfree(uverbs_dev);

There is a race where another CPU may be in ib_uverbs_open
about to do kref_get, which will trigger the above WARN_ON, or a
use after free race with the kfree
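
The WARN_ON case can be reproduced with a small userspace model. This is a sketch, not kernel code: struct kref_model and its helpers are hypothetical names built on C11 atomics, and kref_model_get() simply reports whether the warning condition quoted above would have fired.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a kref. */
struct kref_model {
    atomic_int refcount;
};

void kref_model_init(struct kref_model *k)
{
    atomic_init(&k->refcount, 1);
}

/*
 * Takes a reference. Returns true when the count after the increment
 * is below 2, i.e. the count was already 0 and the warning would fire.
 */
bool kref_model_get(struct kref_model *k)
{
    return atomic_fetch_add(&k->refcount, 1) + 1 < 2;
}

/* Drops a reference. Returns true when this was the last one. */
bool kref_model_put(struct kref_model *k)
{
    return atomic_fetch_sub(&k->refcount, 1) == 1;
}

/* Takes a reference only if the count has not already reached 0. */
bool kref_model_get_unless_zero(struct kref_model *k)
{
    int old = atomic_load(&k->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&k->refcount, &old, old + 1))
            return true;
    }
    return false;
}
```

A get that arrives after the final put sees the count go from 0 to 1, which is the warning case. The "unless zero" variant refuses instead, but it still reads the counter, so it only helps while the memory holding the counter has not been freed.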

A good way to implement this pattern is to use an atomic with a
bias. See how kernfs_get_active/kernfs_put_active/kernfs_drain work
for a very good example of this scheme.
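
A minimal userspace sketch of an active count with a bias, loosely modelled on the kernfs scheme named above. The names (struct active_count, DEACTIVATED_BIAS and the active_* helpers) are hypothetical, C11 atomics stand in for the kernel's atomic_t, and the sleeping and wake-up that a real drain would need are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A large negative value added on removal so the count goes negative. */
#define DEACTIVATED_BIAS (INT_MIN + 1)

struct active_count {
    atomic_int active;
};

void active_init(struct active_count *a)
{
    atomic_init(&a->active, 0);
}

/* open() side: take an active reference only while not deactivated. */
bool active_get(struct active_count *a)
{
    int old = atomic_load(&a->active);

    while (old >= 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&a->active, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Drop an active reference. Returns true when the count has reached
 * exactly the bias, i.e. a waiting drainer should be woken.
 */
bool active_put(struct active_count *a)
{
    return atomic_fetch_sub(&a->active, 1) - 1 == DEACTIVATED_BIAS;
}

/* Remove side: add the bias so that every later active_get() fails. */
void active_deactivate(struct active_count *a)
{
    atomic_fetch_add(&a->active, DEACTIVATED_BIAS);
}

/* Remove side: true once all earlier active references are gone. */
bool active_drained(struct active_count *a)
{
    return atomic_load(&a->active) == DEACTIVATED_BIAS;
}
```

Unlike a kref, the counter here never has to be incremented from zero by a late caller: once the bias is added, new users are refused, and the remover only waits for users that got in before it.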

This is an existing bug, I think a dedicated patch which
 - adds the kobj and moves the kfree(uverbs_dev) into it
 - Fixes the active count scheme to use an atomic not a kref

Would be appropriate. Once done the disassociate patch doesn't have to
really do anything with this stuff.

I would also recommend looking at other uses of cdev_add in the rdma
core, they may be similarly off..

Jason