We have found a race condition in sysfs.c which occurs when unloading low-level 
modules
(e.g., mlx4_ib) in the driver.
Specifically:

Although the kernel takes reference counts on sysfs files, it does not take 
such counts
on modules which implement attribute reads.

For example, we have:
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
                              char *buf)
{
        struct port_table_attribute *tab_attr =
                container_of(attr, struct port_table_attribute, attr);
        u16 pkey;
        ssize_t ret;
====>race condition HERE *****
        ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
        if (ret)
                return ret;

        return sprintf(buf, "0x%04x\n", pkey);
}

The sysfs file /sys/class/infiniband/<device>/ports/1/pkey/<pkey number> is 
protected
from destruction while we are in show_port_pkey.
However, the underlying module which implements ib_query_pkey (in this case, 
mlx4_ib) is not.

Thus, if another process is busy unloading mlx4_ib, and the time-slice of the 
process
which is reading sysfs expires at the point indicated above in the code, 
ib_query_pkey()
will fail with a page-fault (kernel panic), since it will not find the code 
page which implements
ib_query_pkey() (inlined to the query_pkey() function in the low-level driver 
virtual function table).

Now, when a low-level driver is unloaded, the following procedure (in sysfs.c) 
is called:
void ib_device_unregister_sysfs(struct ib_device *device)
{
        struct kobject *p, *t;
        struct ib_port *port;

        list_for_each_entry_safe(p, t, &device->port_list, entry) {
                list_del(&p->entry);
                port = container_of(p, struct ib_port, kobj);
                mutex_lock(&port->mutex);
                port->valid = 0;
                sysfs_remove_group(p, &pma_group);
                sysfs_remove_group(p, &port->pkey_group);
                sysfs_remove_group(p, &port->gid_group);
                mutex_unlock(&port->mutex);
                kobject_put(p);
        }

        kobject_put(device->ports_parent);
        device_unregister(&device->dev);
}

After this call, the kernel continues with unloading the low-level module.
However, until device_unregister(&device->dev) is invoked, the
sysfs attribute path for the low-level device is still valid.  Hence the race 
condition -- 

Process A                           Process B
---------                       ---------------
1. starts unloading low-level mod
                                2. cat /sys/class/infiniband/...
                                3. Time slice runs out just before accessing 
low-level
                                   module for requested info.
4. Low-level module is fully unloaded
                                5. Page-fault panic when trying to access a 
procedure in
                                   the just-unloaded module.

Some attempt was made for some (but not all) of the "show" procedures to check 
if the module is alive:
        if (!ibdev_is_alive(p->ibdev))
                return -ENODEV;

This narrows the race window considerably, but does not eliminate it. (I put 
this fix in show_port_pkey(),
and was still able to generate the kernel panic).

The only way I was able to eliminate the kernel panic entirely was via a mutex 
(declaration and init not shown):
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
                              char *buf)
{
        struct port_table_attribute *tab_attr =
                container_of(attr, struct port_table_attribute, attr);
        u16 pkey;
        ssize_t ret;

        mutex_lock(&p->mutex);
==>     if (p->valid)
                ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, 
&pkey);
        else
                ret = -EINVAL;
==>     mutex_unlock(&p->mutex);
        if (ret)
                return ret;

        return sprintf(buf, "0x%04x\n", pkey);
}

and:
void ib_device_unregister_sysfs(struct ib_device *device)
{
        struct kobject *p, *t;
        struct ib_port *port;

        list_for_each_entry_safe(p, t, &device->port_list, entry) {
                list_del(&p->entry);
                port = container_of(p, struct ib_port, kobj);
==>             mutex_lock(&port->mutex);
                port->valid = 0;
                sysfs_remove_group(p, &pma_group);
                sysfs_remove_group(p, &port->pkey_group);
                sysfs_remove_group(p, &port->gid_group);
==>             mutex_unlock(&port->mutex);
                kobject_put(p);
        }

        kobject_put(device->ports_parent);
        device_unregister(&device->dev);
}

This is approach is fine for the port-based groups.  What about class-device 
attributes themselves?
I believe that the best approach is to add a sysfs_mutex to ib_device, and lock 
that for ALL "show" methods
in this file.

Opinions?

- Jack
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Reply via email to