from:"Li Zhong"

[RFC PATCH v6 2/2] Implement lock_device_hotplug_sysfs() by breaking active protection

2014-05-09 Thread Li Zhong

This patch tries to solve the device hot remove locking issues in a
different way from commit 5e33bc41, as kernfs already has a mechanism
to break active protection.

The active protection is there to keep the file alive by blocking
deletion while operations are on-going in the file.  This blocking
creates a dependency loop when an operation running off a sysfs knob
ends up grabbing a lock which may be held while removing the said sysfs
knob.

So the problem here is the order of s_active, and the series of hotplug
related locks.

commit 5e33bc41 solves it by taking out the first of the series of
hoplug related locks, device_hotplug_lock, with a try lock. And if that
try lock fails, it returns a restart syscall error, and drops s_active
temporarily to allow the other process to remove the sysfs knob.

This doesn't help with lockdep warnings reported against s_active and
other hotplug related locks in the series.

This patch instead tries to take s_active out of the lock dependency
graph using the active protection breaking mechanism.

lock_device_hotplug_sysfs() function name is kept here, three more
arguments are added, dev, attr, to help find the kernfs_node
corresponding to the sysfs knob (which should always be able to be
found, WARN if not); kn_out is used to record the found kernfs_node for
use in unlock. unlock_device_hotplug_sysfs() is created to help
cleanup the operations(get reference, break active, etc) done in
lock_device_hotplug_sysfs(), as well as unlock the lock.

As we break the active protection here, the device removing process
might remove the sysfs entries after that point. So after we grab the
device_hotplug_lock, we can check whether that really happens. If so,
we should abort and not invoke the online/offline callbacks anymore.
lockdep assertion is added in try_offline_node() to make sure
device_hotplug_lock is grabbed while removing cpu or memory.

As devcie_hotplug_lock is used to protect hotplug operations with multiple
subsystems. So there might be devices that won't be removed together with
devices of other subsystems, and thus device_hotplug_lock is not needed.
In this case, their specific online/offline callbacks needs to check whether
the device has been removed.

Cc: Tejun Heo 
Cc: Rafael J. Wysocki 
Cc: Greg Kroah-Hartman 
Cc: Toshi Kani 
Signed-off-by: Li Zhong 
---
 drivers/base/core.c| 93 +-
 drivers/base/memory.c  |  5 +--
 include/linux/device.h |  7 +++-
 mm/memory_hotplug.c|  2 ++
 4 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 20da3ad..329f3b4 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -61,14 +61,87 @@ void unlock_device_hotplug(void)
mutex_unlock(&device_hotplug_lock);
 }
 
-int lock_device_hotplug_sysfs(void)
+void device_hotplug_assert_held(void)
 {
-   if (mutex_trylock(&device_hotplug_lock))
-   return 0;
+   lockdep_assert_held(&device_hotplug_lock);
+}
+
+/*
+ * device_hotplug_lock is used to protect hotplugs which may involve multiple
+ * subsystems. This _sysfs() version is created to solve the deadlock with
+ * s_active of the device attribute file, which could be used to online/offline
+ * the device.
+ *
+ * Returns 0 on success. -ENODEV if the @dev is removed after we break the
+ * active protection. (We should always be able to find the kernfs_node
+ * related to the attribute, WARN if not, and also return -ENODEV).
+ *
+ * When success(return 0), the found kernfs_node is passed out in @kn_out,
+ * else pass NULL to @kn_out.
+ */
+int lock_device_hotplug_sysfs(struct device *dev,
+ struct device_attribute *attr,
+ struct kernfs_node **kn_out)
+{
+   struct kernfs_node *kn;
+
+   *kn_out = kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
+
+   if (WARN_ON_ONCE(!kn))
+   return -ENODEV;
+
+   /*
+* Break active protection here to avoid deadlocks with device
+* removing process, which tries to remove sysfs entries including this
+* "online" attribute while holding some hotplug related locks.
+*
+* @dev needs to be protected here, or it could go away any time after
+* dropping active protection.
+*/
+   get_device(dev);
+   kernfs_break_active_protection(kn);
+
+   lock_device_hotplug();
+
+   /*
+* For devices which need device_hoplug_lock to protect the
+* online/offline operation, i.e. it might be removed together with
+* devices in some other subsystems, we can check here whether the
+* device has been removed, so we don't call device_{on|off}line
+* against removed device. lockdep assertion is added in
+* try_offline_node() to make sure the lock is held to remove cpu
+* or memory.
+*
+* If some devices won&

[RFC PATCH v6 1/2 ] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-05-09 Thread Li Zhong

While auditing the usage of lock_device_hotplug_sysfs() for implementing
it in another way in following patch, it seems to me that the code here
is to add/remove device, and the files probe/release themselves won't be
removed.

lock_device_hotplug_sysfs() is used to solve the deadlocks of s_active of
the "online" file and the hotplug related locks.

So if one process trying to remove the device, the sysfs files for this device,
including "online" will be removed after device_hotplug_lock is grabbed.
So for another process which might concurrently write to the "online" file,
after it grabs s_active of the "online", it couldn't directly lock
device_hotplug_lock, instead, in *_sysfs() version, try lock is used, and if
it fails, the whole syscall is restarted.

But here the s_active of "probe/release" are not involved in the above process,
as they are not to be removed. So we could use lock_device_hotplug() here.

Cc: Tejun Heo 
Cc: Rafael J. Wysocki 
Cc: Greg Kroah-Hartman 
Cc: Toshi Kani 
Signed-off-by: Li Zhong 
---
 drivers/base/cpu.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 006b1bc..9483225 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev,
ssize_t cnt;
int ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   lock_device_hotplug();
 
cnt = arch_cpu_probe(buf, count);
 
@@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev,
ssize_t cnt;
int ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   lock_device_hotplug();
 
cnt = arch_cpu_release(buf, count);
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-05-09 Thread Li Zhong

viOn Tue, 2014-04-22 at 11:34 +0800, Li Zhong wrote:
> On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote:
> > 
> > Proper /** function comment would be nice.
> 
> Ok, will try to write some in next version.
> 
> > 
> > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev,
> > > +struct device_attribute *attr)
> > 
> > I can see why you did this but let's please not require the user of
> > this function to see how the thing is working internally.  Let's
> > return int and keep track of (or look up again) the kernfs_node
> > internally.
> 
> Ok, it also makes the prototype of lock and unlock look more consistent
> and comfortable. 

When trying to do an new version of the patch, I find that if the device
is really removed, then we couldn't look up using the parent, and
attribute name again in unlock. So I guess maybe I could add one more
argument, e.g. kn_out,:q to track this kernfs node. 

Code will be posted soon for your review. 

Thanks, Zhong

> 
> > 
> > >  {
> > ...
> > > + /*
> > > +  * We assume device_hotplug_lock must be acquired before removing
> > 
> > Is this assumption true?  If so, can we add lockdep assertions in
> > places to verify and enforce this?  If not, aren't we just feeling
> > good when the reality is broken?
> 
> It seems not true ... I think there are devices that don't have the
> online/offline concept, we just need to add it, remove it, like ethernet
> cards. 
> 
> Maybe we could change the comments above, like:
>   /* We assume device_hotplug_lock must be acquired before 
>* removing devices, which have online/offline sysfs knob, 
>* and some locks are needed to serialize the online/offline
>* callbacks and device removing. ...
> ? 
> 
> And we could add lockdep assertions in cpu and memory related code? e.g.
> remove_memory(), unregister_cpu()
> 
> Currently, remove_memory() has comments for the function:
> 
>  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>  * and online/offline operations before this call, as required by
>  * try_offline_node().
>  */
> 
> maybe it could be removed with the lockdep assertion.
> 
> > ...
> > 
> > Function comment please.
> 
> OK. 
> 
> Thanks, Zhong
> 
> > > +void unlock_device_hotplug_sysfs(struct device *dev,
> > > +  struct kernfs_node *kn)
> > > +{
> > > + unlock_device_hotplug();
> > > + kernfs_unbreak_active_protection(kn);
> > > + put_device(dev);
> > > + kernfs_put(kn);
> > >  }
> > 
> > Thanks.
> > 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-27 Thread Li Zhong

On Fri, 2014-04-25 at 09:59 -0400, Alan Stern wrote:
> On Fri, 25 Apr 2014, Li Zhong wrote:
> 
> > > No, this isn't self removal. The driver-attribute (not device-attribute)
> > > store operation simply grabs a lock that is also held while the driver
> > > is being deregistered at module unload. Taking a reference to the module
> > > in this case will prevent deregistration while store is running.
> > > 
> > > But it seems like this can be solved for usb-serial by simply not
> > > holding the lock while deregistering.
> > 
> > I didn't look carefully about this lock. 
> > 
> > But I'm not sure whether there are such requirements for driver
> > attributes:
> > 
> > some lock needs be grabbed in the driver attributes store callbacks, and
> > the same lock also needs to be grabbed during driver unregister. 
> 
> In this case, the lock does _not_ need to be grabbed during driver 
> unregister.  The driver grabs the lock, but it doesn't need to.

OK.

> 
> > If we have such requirements currently or in the future, I think they
> > could all be solved by breaking active protection after get the module
> > reference.
> 
> No!  That would be very bad.
> 
> Unloading modules is quite different from unbinding drivers.  After the
> driver is unbound, its attribute callback routines can continue to run.  
> But after a driver module has been unloaded, its attribute callback 
> routines _cannot_ run because they aren't present in memory any more.
> 
> If we allowed a module to be unloaded while one of its callbacks was 
> running (because active protection was broken), imagine what would 
> happen...

I don't think the module could be unloaded after we increased the module
reference counter. 

Thanks, Zhong

> 
> Alan Stern
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-27 Thread Li Zhong

On Fri, 2014-04-25 at 09:54 -0400, Alan Stern wrote:
> On Fri, 25 Apr 2014, Li Zhong wrote:
> 
> > > I don't get why try_module_get() matters here.  We can't call into
> > > ->store if the object at hand is already destroyed and the underlying
> > > module can't go away if the target device is still alive.
> > > try_module_get() doesn't actually protect the object.  Why does that
> > > matter?  This is self removal, right?  Can you please take a look at
> > > kernfs_remove_self()?
> > 
> > This is about one process writing something to driver attributes, and
> > one process trying to unload this driver. 
> > 
> > I think try_module_get() could detect whether the driver is being
> > unloaded, and if not, prevent it from being unloaded, so it could
> > protect the object here by not allow the driver to be unloaded.
> 
> That isn't how try_module_get() works.  If the module is being 
> unloaded, try_module_get() simply fails.  It does not prevent the 
> module from being unloaded -- that's why its name begins with "try".

Yes, I know that. What I said above is for the case when
try_module_get() detects the driver is NOT being unloaded, and increases
the reference counter. 

Thanks, Zhong
> 
> Alan Stern
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-27 Thread Li Zhong

On Fri, 2014-04-25 at 14:47 +0200, Rafael J. Wysocki wrote:
> On 4/25/2014 3:46 AM, Li Zhong wrote:
> > On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote:
> >> On 4/24/2014 10:59 AM, Li Zhong wrote:
> >>> On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote:
> >>>> On 4/23/2014 4:23 PM, Tejun Heo wrote:
> >>>>> Hello, Rafael.
> >>>> Hi,
> >>>>
> >>>>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote:
> >>>>>> Can you please elaborate a bit?
> >>>>> Because it can get involved in larger locking dependency issues by
> >>>>> joining dependency graphs of two otherwise largely disjoint
> >>>>> subsystems.  It has potential to create possible deadlocks which don't
> >>>>> need to exist.
> >>>> Well, I do my best not to add unnecessary locks if that's what you mean.
> >>>>
> >>>>>> It is there to protect hotplug operations involving multiple devices
> >>>>>> (in different subsystems) from racing with each other.  Why exactly
> >>>>>> is it bad?
> >>>>> But why would different subsystems, say cpu and memory, use the same
> >>>>> lock?  Wouldn't those subsystems already have proper locking inside
> >>>>> their own subsystems?
> >>>> That locking is not sufficient.
> >>>>
> >>>>> Why add this additional global lock across multiple subsystems?
> >>>> That basically is because of how eject works when it is triggered via 
> >>>> ACPI.
> >>>>
> >>>> It is signaled for a device at the top of a subtree.  It may be a
> >>>> container of some sort and the eject involves everything below that
> >>>> device in the ACPI namespace.  That may involve multiple subsystem
> >>>> (CPUs, memory, PCI host bridge, etc.).
> >>>>
> >>>> We do that in two steps, offline (which can fail) and eject proper
> >>>> (which can't fail and makes all of the involved devices go away). All
> >>>> that has to be done in one go with respect to the sysfs-triggered
> >>>> offline/online and that's why the lock is there.
> >>> Thank you for the education. It's more clear to me now why we need this
> >>> lock.
> >>>
> >>> I still have some small questions about when this lock is needed:
> >>>
> >>> I could understand sysfs-triggered online is not acceptable when
> >>> removing devices in multiple subsystems. But maybe concurrent offline
> >>> and remove(with proper per subsystem locks) seems not harmful?
> >>>
> >>> And if we just want to remove some devices in a specific subsystem, e.g.
> >>> like writing /cpu/release, if it just wants to offline and remove some
> >>> cpus, then maybe we don't require the device_hotplug_lock to be taken?
> >> No and no.
> >>
> >> If the offline phase fails for any device in the subtree, we roll back
> >> the operation
> >> and online devices that have already been offlined by it.  Also the ACPI
> >> hot-addition
> >> needs to acquire device_hotplug_lock so that it doesn't race with ejects
> >> and so
> >> that lock needs to be taken by sysfs-triggered offline too.
> > I can understand that hot-addition needs the device_hotplug lock, but
> > still not very clear about the offline.
> >
> > I guess your are describing following scenario:
> >
> > user A: (trying remove cpu 1 and memory 1-10)
> >
> > lock_device_hotplug
> > offline cpu with cpu locks  -- successful
> > offline memories with memory locks  -- failed, e.g. for memory8
> > online cpu and memory with their locks
> > unlock_device_hotplug
> 
> What about if all is successful and CPU1 is gone before 
> device_hotplug_lock is released?

You mean user B will try to offline an already removed cpu1? But I think
the cpu subsys locks should be able to handle such situation?

> 
> > user B: (trying offline cpu 1)
> >
> > offline cpu with cpu locks
> >
> > But I don't see any problem for user B not taking the device_hotplug
> > lock. The result may be different for user B to take or not take the
> > lock. But I think it could be seen as concurrent online/offline for cpu1
> > under cpu hotplug locks, which just depends on which is executed last?
> >
> > Or

Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-27 Thread Li Zhong

On Fri, 2014-04-25 at 08:28 -0400, Tejun Heo wrote:
> On Fri, Apr 25, 2014 at 09:56:10AM +0800, Li Zhong wrote:
> > > A nests cpu subsys mutex under s_active of the online node.  B nests
> > > s_active of the online node under the cpu subsys mutex.  What am I
> > > missing?
> > 
> > From the above two chain, I think the problem is cpu_subsys_mutex and
> > s_active(online), which is the deadlock we are trying to solve in patch
> 
> Yeap, that one was what I was thinking about.
> 
> > #2. I seems to me s_active(release) here doesn't have lock issues? 
> 
> And your patch doesn't change the situation for online.  

I'm a little confused here...

To clarify for this patch(#1): 

Currently(before patch#2), the s_active(online) and cpu subsys mutex
deadlock is solved by lock_device_hotplug_sysfs(). (lockdep warnings are
still there, so we want to change the implementation of
lock_device_hotplug_sysfs() in patch #2 to take out of s_active_online,
instead of device_hotplug_lock, which is grabbed before cpu subsys
mutex). 

But here, for the probe/release, I think we really don't need to use
this _sysfs() version. 

So it is replaced with lock_device_hotplug() directly to avoid two
unnecessary conversions of lock_device_hotplug_sysfs(), that otherwise
need to be done in patch #2.

Thanks, Zhong

> Alright,
> thanks a lot for the explanation.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-27 Thread Li Zhong

On Fri, 2014-04-25 at 12:15 +0200, Johan Hovold wrote:
> On Fri, Apr 25, 2014 at 10:16:57AM +0800, Li Zhong wrote:
> > On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote:
> 
> > > No, this isn't self removal. The driver-attribute (not device-attribute)
> > > store operation simply grabs a lock that is also held while the driver
> > > is being deregistered at module unload. Taking a reference to the module
> > > in this case will prevent deregistration while store is running.
> > > 
> > > But it seems like this can be solved for usb-serial by simply not
> > > holding the lock while deregistering.
> > 
> > I didn't look carefully about this lock. 
> > 
> > But I'm not sure whether there are such requirements for driver
> > attributes:
> > 
> > some lock needs be grabbed in the driver attributes store callbacks, and
> > the same lock also needs to be grabbed during driver unregister. 
> > 
> > If we have such requirements currently or in the future, I think they
> > could all be solved by breaking active protection after get the module
> > reference.
> 
> Yes, if we find any such cases, but I don't think it should be done
> generally (as in one of your earlier posts).

OK, maybe we only need to reconsider this only if we see some more
similar lockdep warnings in the future. 

> 
> Active protection is usually exactly what prevents the driver from being
> deregistered, and would only need to be broken to avoid any ABBA
> deadlocks from being reported by lockdep (the module reference would be
> enough to prevent the actual deadlock).

Yes, maybe try to get the module reference is not bad before writing to
driver attributes, as it doesn't make much sense to really call the
callbacks for the driver attribute if the driver is being unload.

And after we get the reference, it is safe for us to break the active.
But if we don't have such real cases(lockdep warnings), we actually
don't need to break it. 

Thanks, Zhong

> 
> Thanks,
> Johan
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-24 Thread Li Zhong

On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote:
> On Thu, Apr 24, 2014 at 10:35:17AM -0400, Tejun Heo wrote:
> > On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote:
> > > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote:
> > > > cc'ing Li Zhong who's working on a simliar issue in the following
> > > > thread and quoting whole body.
> > > > 
> > > >   http://thread.gmane.org/gmane.linux.kernel/1680706
> > > > 
> > > > Li, this is another variation of the same problem.  Maybe this can be
> > > > covered by your work too?
> > > 
> > > It seems to me that it is about write something to driver attribute, and
> > > driver unloading. If so, maybe it's not easy to reuse the help functions
> > > created for device attribute, and device removing.
> > > 
> > > But I guess the idea to break the active protection could still be
> > > applied here:
> > > 
> > > Maybe we could try_module_get() here (like the other option suggested by
> > > Johan?), and break active protection if we could get the module,
> > > something like below? 
> > 
> > I don't get why try_module_get() matters here.  We can't call into
> > ->store if the object at hand is already destroyed and the underlying
> > module can't go away if the target device is still alive.
> > try_module_get() doesn't actually protect the object.  Why does that
> > matter?  This is self removal, right?  Can you please take a look at
> > kernfs_remove_self()?
> 
> No, this isn't self removal. The driver-attribute (not device-attribute)
> store operation simply grabs a lock that is also held while the driver
> is being deregistered at module unload. Taking a reference to the module
> in this case will prevent deregistration while store is running.
> 
> But it seems like this can be solved for usb-serial by simply not
> holding the lock while deregistering.

I didn't look carefully about this lock. 

But I'm not sure whether there are such requirements for driver
attributes:

some lock needs be grabbed in the driver attributes store callbacks, and
the same lock also needs to be grabbed during driver unregister. 

If we have such requirements currently or in the future, I think they
could all be solved by breaking active protection after get the module
reference.

Thanks, Zhong

> 
> Johan
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-24 Thread Li Zhong

On Thu, 2014-04-24 at 10:35 -0400, Tejun Heo wrote:
> On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote:
> > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote:
> > > cc'ing Li Zhong who's working on a simliar issue in the following
> > > thread and quoting whole body.
> > > 
> > >   http://thread.gmane.org/gmane.linux.kernel/1680706
> > > 
> > > Li, this is another variation of the same problem.  Maybe this can be
> > > covered by your work too?
> > 
> > It seems to me that it is about write something to driver attribute, and
> > driver unloading. If so, maybe it's not easy to reuse the help functions
> > created for device attribute, and device removing.
> > 
> > But I guess the idea to break the active protection could still be
> > applied here:
> > 
> > Maybe we could try_module_get() here (like the other option suggested by
> > Johan?), and break active protection if we could get the module,
> > something like below? 
> 
> I don't get why try_module_get() matters here.  We can't call into
> ->store if the object at hand is already destroyed and the underlying
> module can't go away if the target device is still alive.
> try_module_get() doesn't actually protect the object.  Why does that
> matter?  This is self removal, right?  Can you please take a look at
> kernfs_remove_self()?

This is about one process writing something to driver attributes, and
one process trying to unload this driver. 

I think try_module_get() could detect whether the driver is being
unloaded, and if not, prevent it from being unloaded, so it could
protect the object here by not allow the driver to be unloaded.

And if the driver is being unloaded, we could abort the write operation
directly. 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-24 Thread Li Zhong

On Thu, 2014-04-24 at 10:32 -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 24, 2014 at 04:37:23PM +0800, Li Zhong wrote:
> > On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote:
> > After thinking it harder, I still couldn't see ABBA here ... 
> > 
> > the active protection taken here is for "probe/release" which will not
> > be waited for removing something like "online" under cpu#? Or my
> > assumption that s_active for different files are different locks are
> > completely wrong? Or I missed something else? 
> 
> I'm probably confused about the locking.  I was thinking a scenario
> like the following.
> 
> A. CPU on/offline
> 
>grabs s_active protection of online node
>grabs cpu subsys mutex
>perform on/offline
>releases cpu subsys mutex
>releases s_active protection of online node

so the chain here is s_active(online) -> cpu_subsys_mutex

> 
> B. CPU release
> 
>grabs s_active protection of release node
>grabs cpu subsys mutex
>performs removal of the CPU
>removes the online node
>releases cpu subsys mutex
>releases s_active protection of release node

and the chain here is s_active(release) -> cpu_subsys_mutex ->
s_active(online)

> 
> A nests cpu subsys mutex under s_active of the online node.  B nests
> s_active of the online node under the cpu subsys mutex.  What am I
> missing?

>From the above two chain, I think the problem is cpu_subsys_mutex and
s_active(online), which is the deadlock we are trying to solve in patch
#2. I seems to me s_active(release) here doesn't have lock issues? 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-24 Thread Li Zhong

On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote:
> On 4/24/2014 10:59 AM, Li Zhong wrote:
> > On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote:
> >> On 4/23/2014 4:23 PM, Tejun Heo wrote:
> >>> Hello, Rafael.
> >> Hi,
> >>
> >>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote:
> >>>> Can you please elaborate a bit?
> >>> Because it can get involved in larger locking dependency issues by
> >>> joining dependency graphs of two otherwise largely disjoint
> >>> subsystems.  It has potential to create possible deadlocks which don't
> >>> need to exist.
> >> Well, I do my best not to add unnecessary locks if that's what you mean.
> >>
> >>>> It is there to protect hotplug operations involving multiple devices
> >>>> (in different subsystems) from racing with each other.  Why exactly
> >>>> is it bad?
> >>> But why would different subsystems, say cpu and memory, use the same
> >>> lock?  Wouldn't those subsystems already have proper locking inside
> >>> their own subsystems?
> >> That locking is not sufficient.
> >>
> >>> Why add this additional global lock across multiple subsystems?
> >> That basically is because of how eject works when it is triggered via ACPI.
> >>
> >> It is signaled for a device at the top of a subtree.  It may be a
> >> container of some sort and the eject involves everything below that
> >> device in the ACPI namespace.  That may involve multiple subsystem
> >> (CPUs, memory, PCI host bridge, etc.).
> >>
> >> We do that in two steps, offline (which can fail) and eject proper
> >> (which can't fail and makes all of the involved devices go away). All
> >> that has to be done in one go with respect to the sysfs-triggered
> >> offline/online and that's why the lock is there.
> > Thank you for the education. It's more clear to me now why we need this
> > lock.
> >
> > I still have some small questions about when this lock is needed:
> >
> > I could understand sysfs-triggered online is not acceptable when
> > removing devices in multiple subsystems. But maybe concurrent offline
> > and remove(with proper per subsystem locks) seems not harmful?
> >
> > And if we just want to remove some devices in a specific subsystem, e.g.
> > like writing /cpu/release, if it just wants to offline and remove some
> > cpus, then maybe we don't require the device_hotplug_lock to be taken?
> 
> No and no.
> 
> If the offline phase fails for any device in the subtree, we roll back 
> the operation
> and online devices that have already been offlined by it.  Also the ACPI 
> hot-addition
> needs to acquire device_hotplug_lock so that it doesn't race with ejects 
> and so
> that lock needs to be taken by sysfs-triggered offline too.

I can understand that hot-addition needs the device_hotplug lock, but
still not very clear about the offline. 

I guess your are describing following scenario: 

user A: (trying remove cpu 1 and memory 1-10)

lock_device_hotplug
offline cpu with cpu locks  -- successful
offline memories with memory locks  -- failed, e.g. for memory8
online cpu and memory with their locks
unlock_device_hotplug

user B: (trying offline cpu 1)

offline cpu with cpu locks

But I don't see any problem for user B not taking the device_hotplug
lock. The result may be different for user B to take or not take the
lock. But I think it could be seen as concurrent online/offline for cpu1
under cpu hotplug locks, which just depends on which is executed last? 

Or did I miss something here? 

Thanks, Zhong

> Thanks,
> Rafael
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-24 Thread Li Zhong

On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote:
> On 4/23/2014 4:23 PM, Tejun Heo wrote:
> > Hello, Rafael.
> 
> Hi,
> 
> > On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote:
> >> Can you please elaborate a bit?
> > Because it can get involved in larger locking dependency issues by
> > joining dependency graphs of two otherwise largely disjoint
> > subsystems.  It has potential to create possible deadlocks which don't
> > need to exist.
> 
> Well, I do my best not to add unnecessary locks if that's what you mean.
> 
> >> It is there to protect hotplug operations involving multiple devices
> >> (in different subsystems) from racing with each other.  Why exactly
> >> is it bad?
> > But why would different subsystems, say cpu and memory, use the same
> > lock?  Wouldn't those subsystems already have proper locking inside
> > their own subsystems?
> 
> That locking is not sufficient.
> 
> > Why add this additional global lock across multiple subsystems?
> 
> That basically is because of how eject works when it is triggered via ACPI.
> 
> It is signaled for a device at the top of a subtree.  It may be a 
> container of some sort and the eject involves everything below that 
> device in the ACPI namespace.  That may involve multiple subsystem 
> (CPUs, memory, PCI host bridge, etc.).
> 
> We do that in two steps, offline (which can fail) and eject proper 
> (which can't fail and makes all of the involved devices go away). All 
> that has to be done in one go with respect to the sysfs-triggered 
> offline/online and that's why the lock is there.

Thank you for the education. It's more clear to me now why we need this
lock. 

I still have some small questions about when this lock is needed:

I could understand sysfs-triggered online is not acceptable when
removing devices in multiple subsystems. But maybe concurrent offline
and remove(with proper per subsystem locks) seems not harmful? 

And if we just want to remove some devices in a specific subsystem, e.g.
like writing /cpu/release, if it just wants to offline and remove some
cpus, then maybe we don't require the device_hotplug_lock to be taken? 

Thanks, Zhong
> 
> Thanks,
> Rafael
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-24 Thread Li Zhong

On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote:
> On Wed, Apr 23, 2014 at 10:00:26AM +0800, Li Zhong wrote:
> > If you remove cpu0, then the cpu0 directory will be removed, together
> > with the "online" file in the directory, while some other process might
> > be writing 0 or 1 to it.
> > 
> > But here, for the probe/release, take "release" for example, if user
> > writes something that stands for cpu0 to it, the cpu0 will be removed, 
> > and the cpu0 directory and the files under it will be removed. But
> > "release" itself is not removed.
> > 
> > They are attributes of cpu_subsys, not of some specific cpus.
> 
> OIC, so the file itself which triggers removal doesn't get removed.
> Hmmm... but can't you still fall into deadlock?  If on/offline takes
> the same lock under active protection which is also taken while
> removing the cpu files, it doesn't really matter whether the release
> file itself is removed in the process or not.  You can still have ABBA
> deadlock.  What am I missing here?

After thinking it harder, I still couldn't see ABBA here ... 

the active protection taken here is for "probe/release" which will not
be waited for removing something like "online" under cpu#? Or my
assumption that s_active for different files are different locks are
completely wrong? Or I missed something else? 

Thanks, Zhong
> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock

2014-04-24 Thread Li Zhong

On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote:
> cc'ing Li Zhong who's working on a simliar issue in the following
> thread and quoting whole body.
> 
>   http://thread.gmane.org/gmane.linux.kernel/1680706
> 
> Li, this is another variation of the same problem.  Maybe this can be
> covered by your work too?

It seems to me that it is about write something to driver attribute, and
driver unloading. If so, maybe it's not easy to reuse the help functions
created for device attribute, and device removing.

But I guess the idea to break the active protection could still be
applied here:

Maybe we could try_module_get() here (like the other option suggested by
Johan?), and break active protection if we could get the module,
something like below? 


diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 83e910a..6ce27e0 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -69,9 +69,24 @@ static ssize_t drv_attr_store(struct kobject *kobj, struct 
attribute *attr,
struct driver_attribute *drv_attr = to_drv_attr(attr);
struct driver_private *drv_priv = to_driver(kobj);
ssize_t ret = -EIO;
+   struct kernfs_node *kn;
+
+   if (!try_module_get(drv_priv->driver->owner))
+   return ret;
+
+   kn = kernfs_find_and_get(kobj->sd, attr->name);
+   if (WARN_ON_ONCE(!kn))
+   return ret;
+
+   kernfs_break_active_protection(kn);
 
if (drv_attr->store)
ret = drv_attr->store(drv_priv->driver, buf, count);
+
+   kernfs_unbreak_active_protection(kn);
+   kernfs_put(kn);
+
+   module_put(drv_priv->driver->owner);
return ret;
 }
 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-23 Thread Li Zhong

On Wed, 2014-04-23 at 12:58 +0200, Rafael J. Wysocki wrote:
> On Wednesday, April 23, 2014 01:03:42 PM Li Zhong wrote:
> > On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote:
> > > > > Is this assumption true?  If so, can we add lockdep assertions in
> > > > > places to verify and enforce this?  If not, aren't we just feeling
> > > > > good when the reality is broken?
> > > > 
> > > > It seems not true ... I think there are devices that don't have the
> > > > online/offline concept, we just need to add it, remove it, like ethernet
> > > > cards. 
> > > > 
> > > > Maybe we could change the comments above, like:
> > > > /* We assume device_hotplug_lock must be acquired before 
> > > >  * removing devices, which have online/offline sysfs knob, 
> > > >  * and some locks are needed to serialize the online/offline
> > > >  * callbacks and device removing. ...
> > > > ? 
> > > > 
> > > > And we could add lockdep assertions in cpu and memory related code? e.g.
> > > > remove_memory(), unregister_cpu()
> > > > 
> > > > Currently, remove_memory() has comments for the function:
> > > > 
> > > >  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > >  * and online/offline operations before this call, as required by
> > > >  * try_offline_node().
> > > >  */
> > > > 
> > > > maybe it could be removed with the lockdep assertion.
> > > 
> > > I'm confused about the overall locking scheme.  What's the role of
> > > device_hotplug_lock?  Is that solely to prevent the sysfs deadlock
> > > issue?  Or does it serve other synchronization purposes depending on
> > > the specific subsystem?  If the former, the lock no longer needs to
> > > exist.  The only thing necessary would be synchronization between
> > > device_del() deleting the sysfs file and the unbreak helper invoking
> > > device-specific callback.  If the latter, we probably should change
> > > that.  Sharing hotplug lock across multiple subsystems through driver
> > > core sounds like a pretty bad idea.
> > 
> > I think it's the latter.
> 
> Actually, no, this is not the case if I understand you correctly.

Oh, Sorry, I didn't read carefully. Yes, it's not specific subsystem.
After seeing your reply, I understand it is for protecting device hot
remove involving multiple subsystems.

> 
> > I think device_{on|off}line is better to be
> > done in some sort of lock which prevents the device from being removed,
> > including some preparation work that needs be done before device_del().
> 
> Quite frankly, you should be confident that you understand the code you're
> trying to modify or please don't touch it.
> 
> I'll have a deeper look at this issue later today or tomorrow and will get
> back to you then.

Ok, thank you, 
Zhong

> 
> Thanks!
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-23 Thread Li Zhong

On Wed, 2014-04-23 at 12:54 +0200, Rafael J. Wysocki wrote:
> On Wednesday, April 23, 2014 09:50:32 AM Li Zhong wrote:
> > On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote:
> > > > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote:
> > > > > 
> > > > > Proper /** function comment would be nice.
> > > > 
> > > > Ok, will try to write some in next version.
> > > > 
> > > > > 
> > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev,
> > > > > > +  struct device_attribute 
> > > > > > *attr)
> > > > > 
> > > > > I can see why you did this but let's please not require the user of
> > > > > this function to see how the thing is working internally.  Let's
> > > > > return int and keep track of (or look up again) the kernfs_node
> > > > > internally.
> > > > 
> > > > Ok, it also makes the prototype of lock and unlock look more consistent
> > > > and comfortable. 
> > > > 
> > > > > 
> > > > > >  {
> > > > > ...
> > > > > > +   /*
> > > > > > +* We assume device_hotplug_lock must be acquired before 
> > > > > > removing
> > > > > 
> > > > > Is this assumption true?  If so, can we add lockdep assertions in
> > > > > places to verify and enforce this?  If not, aren't we just feeling
> > > > > good when the reality is broken?
> > > > 
> > > > It seems not true ... I think there are devices that don't have the
> > > > online/offline concept, we just need to add it, remove it, like ethernet
> > > > cards. 
> > > 
> > > Well, I haven't been following this closely (I was travelling, sorry), but
> > > there certainly are devices without online/offline.  That currently is 
> > > only
> > > present for CPUs, memory blocks and ACPI containers (if I remember 
> > > correctly).
> > > 
> > > > 
> > > > Maybe we could change the comments above, like:
> > > > /* We assume device_hotplug_lock must be acquired before 
> > > >  * removing devices, which have online/offline sysfs knob, 
> > > >  * and some locks are needed to serialize the online/offline
> > > >  * callbacks and device removing. ...
> > > > ? 
> > > 
> > > Lockdep assertions would be better than this in my opinion.
> > 
> > This is talking about the lock required in the other process, the device
> > removing process, e.g. that in remove_memory() below. So I guess no
> > lockdep assertions needed here. Or I misunderstand your point? 
> 
> I mean if you assume certain lock to be held somewhere, it is better to use
> lockdep annotations to express that assumption, because that will cause users
> to *see* the problem when it happens.

OK, I see, I think you were suggesting the same thing as Tejun, just I
misunderstood it. 

Thanks, Zhong

> 
> Thanks!
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-22 Thread Li Zhong

On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote:
> > > Is this assumption true?  If so, can we add lockdep assertions in
> > > places to verify and enforce this?  If not, aren't we just feeling
> > > good when the reality is broken?
> > 
> > It seems not true ... I think there are devices that don't have the
> > online/offline concept, we just need to add it, remove it, like ethernet
> > cards. 
> > 
> > Maybe we could change the comments above, like:
> > /* We assume device_hotplug_lock must be acquired before 
> >  * removing devices, which have online/offline sysfs knob, 
> >  * and some locks are needed to serialize the online/offline
> >  * callbacks and device removing. ...
> > ? 
> > 
> > And we could add lockdep assertions in cpu and memory related code? e.g.
> > remove_memory(), unregister_cpu()
> > 
> > Currently, remove_memory() has comments for the function:
> > 
> >  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> >  * and online/offline operations before this call, as required by
> >  * try_offline_node().
> >  */
> > 
> > maybe it could be removed with the lockdep assertion.
> 
> I'm confused about the overall locking scheme.  What's the role of
> device_hotplug_lock?  Is that solely to prevent the sysfs deadlock
> issue?  Or does it serve other synchronization purposes depending on
> the specific subsystem?  If the former, the lock no longer needs to
> exist.  The only thing necessary would be synchronization between
> device_del() deleting the sysfs file and the unbreak helper invoking
> device-specific callback.  If the latter, we probably should change
> that.  Sharing hotplug lock across multiple subsystems through driver
> core sounds like a pretty bad idea.

I think it's the latter. I think device_{on|off}line is better to be
done in some sort of lock which prevents the device from being removed,
including some preparation work that needs be done before device_del().

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-22 Thread Li Zhong

On Tue, 2014-04-22 at 16:40 -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 22, 2014 at 10:29:37AM +0800, Li Zhong wrote:
> > The probe/release files are attribute files for cpu subsys, looks like
> > following in sysfs dirs
> > 
> > $ cd /sys/devices/system/cpu/
> > $ ls -l
> > total 0
> > drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0
> > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1
> > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10
> > ..
> > drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq
> > drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle
> > -rw---. 1 root root 65536 Apr 21 00:28 dscr_default
> > -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max
> > -r--r--r--. 1 root root 65536 Apr 21 00:28 offline
> > -r--r--r--. 1 root root 65536 Sep  4  2014 online
> > -r--r--r--. 1 root root 65536 Apr 21 00:28 possible
> > drwxr-xr-x. 2 root root 0 Apr 20 08:00 power
> > -r--r--r--. 1 root root 65536 Apr 17 20:46 present
> > --w---. 1 root root 65536 Apr 21 00:28 probe<-  
> > --w---. 1 root root 65536 Apr 21 00:28 release  <-
> > -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core
> > -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent
> > 
> > From the code, it seems cpu subsys won't be unregistered, and it doesn't
> > make sense to remove all the cpus in the system. 
> 
> I don't think I'm following you.  Are you saying that no files which
> are protected under the hotplug lock that cpu subsys uses are removed
> during offlining?

Sorry, Maybe I didn't say it clearly. 

There are files under cpu#, e.g. 

$ cd /sys/devices/system/cpu/cpu0
$ ls
cache dscr   node1pmc1  pmc5   smt_snooze_delay
uevent
cpuidle   mmcr0  online   pmc2  pmc6   spurr
crash_notes   mmcr1  physical_id  pmc3  power  subsystem
crash_notes_size  mmcra  pir  pmc4  purr   topology

If you remove cpu0, then the cpu0 directory will be removed, together
with the "online" file in the directory, while some other process might
be writing 0 or 1 to it.

But here, for the probe/release, take "release" for example, if user
writes something that stands for cpu0 to it, the cpu0 will be removed, 
and the cpu0 directory and the files under it will be removed. But
"release" itself is not removed.

They are attributes of cpu_subsys, not of some specific cpus.

Hopes the above makes things a bit clearer. 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-22 Thread Li Zhong

On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote:
> On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote:
> > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote:
> > > 
> > > Proper /** function comment would be nice.
> > 
> > Ok, will try to write some in next version.
> > 
> > > 
> > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev,
> > > > +  struct device_attribute 
> > > > *attr)
> > > 
> > > I can see why you did this but let's please not require the user of
> > > this function to see how the thing is working internally.  Let's
> > > return int and keep track of (or look up again) the kernfs_node
> > > internally.
> > 
> > Ok, it also makes the prototype of lock and unlock look more consistent
> > and comfortable. 
> > 
> > > 
> > > >  {
> > > ...
> > > > +   /*
> > > > +* We assume device_hotplug_lock must be acquired before 
> > > > removing
> > > 
> > > Is this assumption true?  If so, can we add lockdep assertions in
> > > places to verify and enforce this?  If not, aren't we just feeling
> > > good when the reality is broken?
> > 
> > It seems not true ... I think there are devices that don't have the
> > online/offline concept, we just need to add it, remove it, like ethernet
> > cards. 
> 
> Well, I haven't been following this closely (I was travelling, sorry), but
> there certainly are devices without online/offline.  That currently is only
> present for CPUs, memory blocks and ACPI containers (if I remember correctly).
> 
> > 
> > Maybe we could change the comments above, like:
> > /* We assume device_hotplug_lock must be acquired before 
> >  * removing devices, which have online/offline sysfs knob, 
> >  * and some locks are needed to serialize the online/offline
> >  * callbacks and device removing. ...
> > ? 
> 
> Lockdep assertions would be better than this in my opinion.

This is talking about the lock required in the other process, the device
removing process, e.g. that in remove_memory() below. So I guess no
lockdep assertions needed here. Or I misunderstand your point? 

> > 
> > And we could add lockdep assertions in cpu and memory related code? e.g.
> > remove_memory(), unregister_cpu()
> > 
> > Currently, remove_memory() has comments for the function:
> > 
> >  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> >  * and online/offline operations before this call, as required by
> >  * try_offline_node().
> >  */
> > 
> > maybe it could be removed with the lockdep assertion.
> 
> No, please don't remove it.  It is there to explain where the locking 
> requirement
> comes from.

OK, I see. I think I'll just add lockdep assertions, and keep the
comments there. 

Thanks, Zhong

> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-21 Thread Li Zhong

On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote:
> 
> Proper /** function comment would be nice.

Ok, will try to write some in next version.

> 
> > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev,
> > +  struct device_attribute *attr)
> 
> I can see why you did this but let's please not require the user of
> this function to see how the thing is working internally.  Let's
> return int and keep track of (or look up again) the kernfs_node
> internally.

Ok, it also makes the prototype of lock and unlock look more consistent
and comfortable. 

> 
> >  {
> ...
> > +   /*
> > +* We assume device_hotplug_lock must be acquired before removing
> 
> Is this assumption true?  If so, can we add lockdep assertions in
> places to verify and enforce this?  If not, aren't we just feeling
> good when the reality is broken?

It seems not true ... I think there are devices that don't have the
online/offline concept, we just need to add it, remove it, like ethernet
cards. 

Maybe we could change the comments above, like:
/* We assume device_hotplug_lock must be acquired before 
 * removing devices, which have online/offline sysfs knob, 
 * and some locks are needed to serialize the online/offline
 * callbacks and device removing. ...
? 

And we could add lockdep assertions in cpu and memory related code? e.g.
remove_memory(), unregister_cpu()

Currently, remove_memory() has comments for the function:

 * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
 * and online/offline operations before this call, as required by
 * try_offline_node().
 */

maybe it could be removed with the lockdep assertion.

> ...
> 
> Function comment please.

OK. 

Thanks, Zhong

> > +void unlock_device_hotplug_sysfs(struct device *dev,
> > +struct kernfs_node *kn)
> > +{
> > +   unlock_device_hotplug();
> > +   kernfs_unbreak_active_protection(kn);
> > +   put_device(dev);
> > +   kernfs_put(kn);
> >  }
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-21 Thread Li Zhong

On Mon, 2014-04-21 at 18:38 -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 21, 2014 at 05:20:59PM +0800, Li Zhong wrote:
> > While auditing the usage of lock_device_hotplug_sysfs() for implementing
> > it in another way in following patch, it seems to me that the code here
> > is to add/remove device, and the files probe/release for cpu bus
> > themselves won't be removed. 
> > 
> > So it seems to me there is no s_active related deadlock here, and we
> > could just use lock_device_hotplug().
> 
> It may still cause issue if offlining ends up removing sysfs files or
> gets involved with the same lock used during cpu hot[un]plug
> operations (e.g. offlining racing against cpu hotunplug) and offlining
> a cpu does remove files.  Has this change been tested?

The probe/release files are attribute files for cpu subsys, looks like
following in sysfs dirs

$ cd /sys/devices/system/cpu/
$ ls -l
total 0
drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0
drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1
drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10
..
drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq
drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle
-rw---. 1 root root 65536 Apr 21 00:28 dscr_default
-r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max
-r--r--r--. 1 root root 65536 Apr 21 00:28 offline
-r--r--r--. 1 root root 65536 Sep  4  2014 online
-r--r--r--. 1 root root 65536 Apr 21 00:28 possible
drwxr-xr-x. 2 root root 0 Apr 20 08:00 power
-r--r--r--. 1 root root 65536 Apr 17 20:46 present
--w---. 1 root root 65536 Apr 21 00:28 probe<-  
--w---. 1 root root 65536 Apr 21 00:28 release  <-
-rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core
-rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent

>From the code, it seems cpu subsys won't be unregistered, and it doesn't
make sense to remove all the cpus in the system. 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-21 Thread Li Zhong

This patch tries to solve the device hot remove locking issues in a
different way from commit 5e33bc41, as kernfs already has a mechanism 
to break active protection. 

The active protection is there to keep the file alive by blocking
deletion while operations are on-going in the file.  This blocking
creates a dependency loop when an operation running off a sysfs knob
ends up grabbing a lock which may be held while removing the said sysfs
knob.

So the problem here is the order of s_active, and the series of hotplug
related locks. 

commit 5e33bc41 solves it by taking out the first of the series of
hoplug related locks, device_hotplug_lock, with a try lock. And if that
try lock fails, it returns a restart syscall error, and drops s_active
temporarily to allow the other process to remove the sysfs knob.

This doesn't help with lockdep warnings reported against s_active and
other hotplug related locks in the series.

This patch instead tries to take s_active out of the lock dependency
graph using the active protection breaking mechanism. 

lock_device_hotplug_sysfs() function name is kept here, two more
arguments are added, dev, attr, to help find the kernfs_node
corresponding to the sysfs knob (which should always be able to be
found, WARN if not). The found kernfs_node is recorded as the return
value, to be released in unlock_device_hotplug_sysfs().

As we break the active protection here, the device removing process
might remove the sysfs entries after that point. So after we grab the
device_hotplug_lock, we need check whether that really happens. If so,
we should abort and not invoke the online/offline callbacks anymore. In
this case, NULL is returned to the caller, so it could return with
ENODEV.

Signed-off-by: Li Zhong 
---
 drivers/base/core.c| 63 ++
 drivers/base/memory.c  |  9 
 include/linux/device.h |  5 +++-
 3 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..4d37a2b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -61,14 +61,55 @@ void unlock_device_hotplug(void)
mutex_unlock(&device_hotplug_lock);
 }
 
-int lock_device_hotplug_sysfs(void)
+struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev,
+  struct device_attribute *attr)
 {
-   if (mutex_trylock(&device_hotplug_lock))
-   return 0;
+   struct kernfs_node *kn;
+
+   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
+
+   if (WARN_ON_ONCE(!kn))
+   return NULL;
+
+   /*
+* Break active protection here to avoid deadlocks with device
+* removing process, which tries to remove sysfs entries including this
+* "online" attribute while holding some hotplug related locks.
+*
+* @dev needs to be protected here, or it could go away any time after
+* dropping active protection.
+*/
+
+   get_device(dev);
+   kernfs_break_active_protection(kn);
+
+   lock_device_hotplug();
+
+   /*
+* We assume device_hotplug_lock must be acquired before removing
+* device, we can check here whether the device has been removed, so
+* we don't call device_{on|off}line against removed device.
+*/
 
-   /* Avoid busy looping (5 ms of sleep should do). */
-   msleep(5);
-   return restart_syscall();
+   if (!dev->kobj.sd) {
+   /* device_del() already called on @dev, we should also abort */
+   unlock_device_hotplug();
+   kernfs_unbreak_active_protection(kn);
+   put_device(dev);
+   kernfs_put(kn);
+   return NULL;
+   }
+
+   return kn;
+}
+
+void unlock_device_hotplug_sysfs(struct device *dev,
+struct kernfs_node *kn)
+{
+   unlock_device_hotplug();
+   kernfs_unbreak_active_protection(kn);
+   put_device(dev);
+   kernfs_put(kn);
 }
 
 #ifdef CONFIG_BLOCK
@@ -439,17 +480,19 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
 {
bool val;
int ret;
+   struct kernfs_node *kn;
 
ret = strtobool(buf, &val);
if (ret < 0)
return ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   kn = lock_device_hotplug_sysfs(dev, attr);
+   if (!kn)
+   return -ENODEV;
 
ret = val ? device_online(dev) : device_offline(dev);
-   unlock_device_hotplug();
+   unlock_device_hotplug_sysfs(dev, kn);
+
return ret < 0 ? ret : count;
 }
 static DEVICE_ATTR_RW(online);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bece691..c2b66d4 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -320,10 +320,11 @@ store_mem_state(struct device *dev,
 {

[RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()

2014-04-21 Thread Li Zhong

While auditing the usage of lock_device_hotplug_sysfs() for implementing
it in another way in following patch, it seems to me that the code here
is to add/remove device, and the files probe/release for cpu bus
themselves won't be removed. 

So it seems to me there is no s_active related deadlock here, and we
could just use lock_device_hotplug().

Signed-off-by: Li Zhong 
---
 drivers/base/cpu.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 006b1bc..9483225 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev,
ssize_t cnt;
int ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   lock_device_hotplug();
 
cnt = arch_cpu_probe(buf, count);
 
@@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev,
ssize_t cnt;
int ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   lock_device_hotplug();
 
cnt = arch_cpu_release(buf, count);
 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks

2014-04-18 Thread Li Zhong

On Thu, 2014-04-17 at 11:17 -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 17, 2014 at 02:50:44PM +0800, Li Zhong wrote:
> > This patch tries to solve the device hot remove locking issues in a
> > different way from commit 5e33bc41, as kernfs already has a mechanism 
> > to break active protection. 
> > 
> > The problem here is the order of s_active, and series of hotplug related
> > lock. 
> 
> It prolly deservse more detailed explanation of the deadlock along
> with how 5e33bc41 ("$SUBJ") tried to solve it.  The active protetion
> is there to keep the file alive by blocking deletion while operations
> are on-going in the file.  This blocking creates a dependency loop
> when an operation running off a sysfs knob ends up grabbing a lock
> which may be held while removing the said sysfs knob.

OK, I'll try to add these and something more in next version.

> 
> > +   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
> > +   if (WARN_ON_ONCE(!kn))
> > +   return -ENODEV;
> > +
> > +   /*
> > +* Break active protection here to avoid deadlocks with device
> > +* removing process, which tries to remove sysfs entries including this
> > +* "online" attribute while holding some hotplug related locks.
> > +*
> > +* @dev needs to be protected here, or it could go away any time after
> > +* dropping active protection. But it is still unreasonable/unsafe to
> > +* online/offline a device after it being removed. Fortunately, there
> 
> I think this is something driver layer proper should provide
> synchronization for.  It shouldn't be difficult to synchronize this
> function against device_del(), right?  And, please note that @dev is
> guaranteed to have not been removed (at least hasn't gone through attr
> removal) upto this point.

Ok, I think what we need here is the check below, after getting
device_hotplug_lock, and abort this function if device already removed.
We should allow device_del() to remove the device in the other process,
which is why we are breaking the active protection. 

> 
> > +* are some checks in online/offline knobs. Like cpu, it checks cpu
> > +* present/online mask before doing the real work.
> > +*/
> > +
> > +   get_device(dev);
> > +   kernfs_break_active_protection(kn);
> > +
> > +   lock_device_hotplug();
> > +
> > +   /*
> > +* If we assume device_hotplug_lock must be acquired before removing
> > +* device, we may try to find a way to check whether the device has
> > +* been removed here, so we don't call device_{on|off}line against
> > +* removed device.
> > +*/
> 
> Yeah, let's please fix this.

OK, I guess we can check whether dev->kobj.sd becomes NULL. If so, it
means the device has already been deleted by device_del().

> 
> > ret = val ? device_online(dev) : device_offline(dev);
> > unlock_device_hotplug();
> > +
> > +   kernfs_unbreak_active_protection(kn);
> > +   put_device(dev);
> > +
> > +   kernfs_put(kn);
> > +
> > return ret < 0 ? ret : count;
> >  }
> >  static DEVICE_ATTR_RW(online);
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index bece691..0d2f3a5 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -320,10 +320,17 @@ store_mem_state(struct device *dev,
> >  {
> > struct memory_block *mem = to_memory_block(dev);
> > int ret, online_type;
> > +   struct kernfs_node *kn;
> >  
> > -   ret = lock_device_hotplug_sysfs();
> > -   if (ret)
> > -   return ret;
> > +   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
> > +   if (WARN_ON_ONCE(!kn))
> > +   return -ENODEV;
> > +
> > +   /* refer to comments in online_store() for more information */
> > +   get_device(dev);
> > +   kernfs_break_active_protection(kn);
> > +
> > +   lock_device_hotplug();
> >  
> > if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
> > online_type = ONLINE_KERNEL;
> > @@ -362,6 +369,11 @@ store_mem_state(struct device *dev,
> >  err:
> > unlock_device_hotplug();
> >  
> > +   kernfs_unbreak_active_protection(kn);
> > +   put_device(dev);
> > +
> > +   kernfs_put(kn);
> 
> There are other users of lock_device_hotplug_sysfs().  We probably
> want to audit them and convert them too, preferably with helper
> routines so that they don't end up duplicating the complexity?

I see, I guess I could keep lock_device_hotplug_sysfs(), just replace it
with the implementation here; and provide a new
unlock_device_hotplug_sysfs(), which will do the unlock, unbreak, and
puts. 

I'll try to get the code ready sometime next week for your review.

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Li Zhong

This patch tries to solve the device hot remove locking issues in a
different way from commit 5e33bc41, as kernfs already has a mechanism 
to break active protection. 

The problem here is the order of s_active, and series of hotplug related
lock. 

This patch takes s_active out of the lock dependency graph, so it won't
dead lock with any hotplug realted locks.

Signed-off-by: Li Zhong 
---
 drivers/base/core.c   | 37 ++---
 drivers/base/memory.c | 18 +++---
 2 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..1b96cb9 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -439,17 +439,48 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
 {
bool val;
int ret;
+   struct kernfs_node *kn;
 
ret = strtobool(buf, &val);
if (ret < 0)
return ret;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
+   if (WARN_ON_ONCE(!kn))
+   return -ENODEV;
+
+   /*
+* Break active protection here to avoid deadlocks with device
+* removing process, which tries to remove sysfs entries including this
+* "online" attribute while holding some hotplug related locks.
+*
+* @dev needs to be protected here, or it could go away any time after
+* dropping active protection. But it is still unreasonable/unsafe to
+* online/offline a device after it being removed. Fortunately, there
+* are some checks in online/offline knobs. Like cpu, it checks cpu
+* present/online mask before doing the real work.
+*/
+
+   get_device(dev);
+   kernfs_break_active_protection(kn);
+
+   lock_device_hotplug();
+
+   /*
+* If we assume device_hotplug_lock must be acquired before removing
+* device, we may try to find a way to check whether the device has
+* been removed here, so we don't call device_{on|off}line against
+* removed device.
+*/
 
ret = val ? device_online(dev) : device_offline(dev);
unlock_device_hotplug();
+
+   kernfs_unbreak_active_protection(kn);
+   put_device(dev);
+
+   kernfs_put(kn);
+
return ret < 0 ? ret : count;
 }
 static DEVICE_ATTR_RW(online);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bece691..0d2f3a5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -320,10 +320,17 @@ store_mem_state(struct device *dev,
 {
struct memory_block *mem = to_memory_block(dev);
int ret, online_type;
+   struct kernfs_node *kn;
 
-   ret = lock_device_hotplug_sysfs();
-   if (ret)
-   return ret;
+   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
+   if (WARN_ON_ONCE(!kn))
+   return -ENODEV;
+
+   /* refer to comments in online_store() for more information */
+   get_device(dev);
+   kernfs_break_active_protection(kn);
+
+   lock_device_hotplug();
 
if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
online_type = ONLINE_KERNEL;
@@ -362,6 +369,11 @@ store_mem_state(struct device *dev,
 err:
unlock_device_hotplug();
 
+   kernfs_unbreak_active_protection(kn);
+   put_device(dev);
+
+   kernfs_put(kn);
+
if (ret)
return ret;
return count;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-16 Thread Li Zhong

On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote:
> > > If so, that is
> > > an actually possible deadlock, no?
> > 
> > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
> > lock_device_hotplug_sysfs() to return a restart syscall error if not
> > able to try lock the device_hotplug_lock. That also requires the device
> > removing code path to take the device_hotplug_lock. 
> 
> But that patch only takes out device_hotplug_lock out of the
> dependency graph and does nothing for cpu_add_remove_lock.  It seems
> to be that there still is a deadlock condition involving s_active and
> cpu_add_remove_lock.  Am I missing something here?

It seems to me cpu_add_remove_lock is always taken after
device_hotplug_lock.

So if cpu_add_remove_lock has been acquired by device removing process,
then it means the other online/offline process couldn't successfully try
lock device_hotplug_lock, and will release s_active with a restart
syscall error;

if cpu_add_remove_lock has been acquired by online/offline process, then
it should already hold device_hotlug_lock, and keeps the device removing
process waiting at device_hotplug_lock. So online/offline process could
release the lock, and finally release s_active soon. 

But after some further thinking, I seem to understand your point.
s_active has lock order problem with the other series of hotplug related
locks, so it's better to take s_active out of the dependency chain,
rather than the first of the other series of locks? like you suggested
below.

> 
> Now that kernfs has a proper mechanism to deal with it, wouldn't it
> make more sense to replace 5e33bc41 with prper s_active protection
> breaking?

I'll try this way and send you the code for review.

Thanks,
Zhong

> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index

2014-04-15 Thread Li Zhong

Seems we all agree that information about SECTION, e.g. section size,
sections per memory block should be kept as kernel internals, and not
exposed to userspace.

This patch updates Documentation/memory-hotplug.txt to refer to memory
blocks instead of memory sections where appropriate and added a
paragraph to explain that memory blocks are made of memory sections.
The documentation update is mostly provided by Nathan.

Also, as end_phys_index in code is actually not the end section id, but
the end memory block id, which should always be the same as phys_index.
So it is removed here.

Signed-off-by: Li Zhong 
Reviewed-by: Zhang Yanfei 
---
v3: memort->memory, pointed out by Yanfei

 Documentation/memory-hotplug.txt |  125 +++---
 drivers/base/memory.c|   12 
 2 files changed, 61 insertions(+), 76 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 58340d5..1aa239f 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -88,16 +88,21 @@ phase by hand.
 
 1.3. Unit of Memory online/offline operation
 
-Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
-into chunks of the same size. The chunk is called a "section". The size of
-a section is architecture dependent. For example, power uses 16MiB, ia64 uses
-1GiB. The unit of online/offline operation is "one section". (see Section 3.)
+Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
+into chunks of the same size. These chunks are called "sections". The size of
+a memory section is architecture dependent. For example, power uses 16MiB, ia64
+uses 1GiB.
 
-To determine the size of sections, please read this file:
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and represents the logical
+unit upon which memory online/offline operations are to be performed. The
+default size of a memory block is the same as memory section size unless an
+architecture specifies otherwise. (see Section 3.)
+
+To determine the size (in bytes) of a memory block please read this file:
 
 /sys/devices/system/memory/block_size_bytes
 
-This file shows the size of sections in byte.
 
 ---
 2. Kernel Configuration
@@ -123,42 +128,35 @@ config options.
 (CONFIG_ACPI_CONTAINER).
 This option can be kernel module too.
 
+
 
-4 sysfs files for memory hotplug
+3 sysfs files for memory hotplug
 
-All sections have their device information in sysfs.  Each section is part of
-a memory block under /sys/devices/system/memory as
+All memory blocks have their device information in sysfs.  Each memory block
+is described under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is the section id.)
+(XXX is the memory block id.)
 
-Now, XXX is defined as (start_address_of_section / section_size) of the first
-section contained in the memory block.  The files 'phys_index' and
-'end_phys_index' under each directory report the beginning and end section id's
-for the memory block covered by the sysfs directory.  It is expected that all
+For the memory block covered by the sysfs directory.  It is expected that all
 memory sections in this range are present and no memory holes exist in the
 range. Currently there is no way to determine if there is a memory hole, but
 the existence of one should not affect the hotplug capabilities of the memory
 block.
 
-For example, assume 1GiB section size. A device for a memory starting at
+For example, assume 1GiB memory block size. A device for a memory starting at
 0x1 is /sys/device/system/memory/memory4
 (0x1 / 1Gib = 4)
 This device covers address range [0x1 ... 0x14000)
 
-Under each section, you can see 4 or 5 files, the end_phys_index file being
-a recent addition and not present on older kernels.
+Under each memory block, you can see 4 files:
 
-/sys/devices/system/memory/memoryXXX/start_phys_index
-/sys/devices/system/memory/memoryXXX/end_phys_index
+/sys/devices/system/memory/memoryXXX/phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index'  : read-only and contains section id of the first section
-   in the memory block, same as XXX.
-'end_phys_index'  : read-only and contains section id of the last section
-   in the memory block.
+'phys_index'  : read-only and contains memory block id, same as XXX.
 'state'   : read-write
 at read:  contains online/offline state of memory.
 at write: user can specify "online_kernel",
@@ -185,6

Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-15 Thread Li Zhong

On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote:
> > / *
> >   * This process might deadlock with another process trying to 
> >   * remove this device:
> >   * This process holding the s_active of "online" attribute, and tries 
> >   * to online/offline the device with some locks protecting hotplug.
> >   * Device removing process holding some locks protecting hotplug, and 
> >   * tries to remove the "online" attribute, waiting for the s_active to
> >   * be released. 
> >   *
> >   * The deadlock described above should be solved with
> >   * lock_device_hotplug_sysfs(). We temporarily drop the active 
> >   * protection here to avoid some lockdep warnings. 
> >   *
> >   * If device_hotplug_lock is forgotten to be used when removing
> >   * device(possibly some very simple device even don't need this lock?),
> >   * @dev could go away any time after dropping the active protection. 
> >   * So increase its ref count before dropping active protection. 
> >   * Though invoking device_{on|off}line() on a removed device seems
> >   * unreasonable, it should be less disastrous than playing with freed
> >   * @dev. Also, we might be able to have some mechanism abort 
> >   * device_{on|off}line() if @dev already removed.
> >   */
> 
> Hmmm... I'm not sure I fully understand the problem.  Does the code
> ever try to remove "online" while holding cpu_add_remove_lock and,
> when written 0, online knob grabs cpu_add_remove_lock?  

Yes.

In acpi_processor_remove(), cpu_maps_update_begin() is called to hold
cpu_add_remove_lock, and then arch_unregister_cpu calls
unregister_cpu(), which will try to remove dir cpu1 including "online".

while written 0 to online, cpu_down() will also try to grab
cpu_add_remove_lock with cpu_maps_update_begin().

> If so, that is
> an actually possible deadlock, no?

Yes, but it seems to me that it is solved in commit 5e33bc41, which uses
lock_device_hotplug_sysfs() to return a restart syscall error if not
able to try lock the device_hotplug_lock. That also requires the device
removing code path to take the device_hotplug_lock. 

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index

2014-04-14 Thread Li Zhong

On Mon, 2014-04-14 at 17:13 +0800, Zhang Yanfei wrote:
> On 04/14/2014 04:43 PM, Li Zhong wrote:
> > Seems we all agree that information about SECTION, e.g. section size,
> > sections per memory block should be kept as kernel internals, and not
> > exposed to userspace.
> > 
> > This patch updates Documentation/memory-hotplug.txt to refer to memory
> > blocks instead of memory sections where appropriate and added a
> > paragraph to explain that memory blocks are made of memory sections.
> > The documentation update is mostly provided by Nathan.
> > 
> > Also, as end_phys_index in code is actually not the end section id, but
> > the end memory block id, which should always be the same as phys_index.
> > So it is removed here.
> > 
> > Signed-off-by: Li Zhong 
> 
> Reviewed-by: Zhang Yanfei 
> 
> Still the nitpick there.

Ao.. Will fix it in next version.

Thanks, Zhong

> 
> > ---
> >  Documentation/memory-hotplug.txt |  125 
> > +++---
> >  drivers/base/memory.c|   12 
> >  2 files changed, 61 insertions(+), 76 deletions(-)
> > 
> > diff --git a/Documentation/memory-hotplug.txt 
> > b/Documentation/memory-hotplug.txt
> > index 58340d5..1aa239f 100644
> > --- a/Documentation/memory-hotplug.txt
> > +++ b/Documentation/memory-hotplug.txt
> > @@ -88,16 +88,21 @@ phase by hand.
> >  
> >  1.3. Unit of Memory online/offline operation
> >  
> > -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole 
> > memory
> > -into chunks of the same size. The chunk is called a "section". The size of
> > -a section is architecture dependent. For example, power uses 16MiB, ia64 
> > uses
> > -1GiB. The unit of online/offline operation is "one section". (see Section 
> > 3.)
> > +Memory hotplug uses SPARSEMEM memory model which allows memory to be 
> > divided
> > +into chunks of the same size. These chunks are called "sections". The size 
> > of
> > +a memory section is architecture dependent. For example, power uses 16MiB, 
> > ia64
> > +uses 1GiB.
> >  
> > -To determine the size of sections, please read this file:
> > +Memory sections are combined into chunks referred to as "memory blocks". 
> > The
> > +size of a memory block is architecture dependent and represents the logical
> > +unit upon which memory online/offline operations are to be performed. The
> > +default size of a memory block is the same as memory section size unless an
> > +architecture specifies otherwise. (see Section 3.)
> > +
> > +To determine the size (in bytes) of a memory block please read this file:
> >  
> >  /sys/devices/system/memory/block_size_bytes
> >  
> > -This file shows the size of sections in byte.
> >  
> >  ---
> >  2. Kernel Configuration
> > @@ -123,42 +128,35 @@ config options.
> >  (CONFIG_ACPI_CONTAINER).
> >  This option can be kernel module too.
> >  
> > +
> >  
> > -4 sysfs files for memory hotplug
> > +3 sysfs files for memory hotplug
> >  
> > -All sections have their device information in sysfs.  Each section is part 
> > of
> > -a memory block under /sys/devices/system/memory as
> > +All memory blocks have their device information in sysfs.  Each memory 
> > block
> > +is described under /sys/devices/system/memory as
> >  
> >  /sys/devices/system/memory/memoryXXX
> > -(XXX is the section id.)
> > +(XXX is the memory block id.)
> >  
> > -Now, XXX is defined as (start_address_of_section / section_size) of the 
> > first
> > -section contained in the memory block.  The files 'phys_index' and
> > -'end_phys_index' under each directory report the beginning and end section 
> > id's
> > -for the memory block covered by the sysfs directory.  It is expected that 
> > all
> > +For the memory block covered by the sysfs directory.  It is expected that 
> > all
> >  memory sections in this range are present and no memory holes exist in the
> >  range. Currently there is no way to determine if there is a memory hole, 
> > but
> >  the existence of one should not affect the hotplug capabilities of the 
> > memory
> >  block.
> >  
> > -For example, assume 1GiB section size. A device for a memory starting at
> > +For example, assume 1GiB memory block size. A device for a memory starting 
> > a

Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong

On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote:
> > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
> > device_attribute *attr,
> >  {
> > bool val;
> > int ret;
> > +   struct kernfs_node *kn;
> >  
> > ret = strtobool(buf, &val);
> > if (ret < 0)
> > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
> > device_attribute *attr,
> > if (ret)
> > return ret;
> >  
> > +   kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name);
> > +   if (WARN_ON_ONCE(!kn))
> > +   goto out;
> > +
> > +   get_device(dev);
> > +   kernfs_break_active_protection(kn);
> > ret = val ? device_online(dev) : device_offline(dev);
> > +   kernfs_unbreak_active_protection(kn);
> > +   put_device(dev);
> > +
> > +   kernfs_put(kn);
> > +
> > +out:
> > unlock_device_hotplug();
> > return ret < 0 ? ret : count;
> >  }
> 
> Can you please add comment explainin why this is being down?  As it
> currently stands, it's quite a bit of complexity without any
> indication what's going on why.  Also, if device_hotplug is locked, is
> it really necessary to get @dev?  Can it go away inbetween?  The code
> snippet looks weird because getting @dev indicates that the device
> might go away without doing it but then it proceeds to invoke
> device_{on|off}line() which probably isn't safe to invoke on a removed
> device.

Hi Tejun, 

I tried to write some draft words here... (I'm not good at writing
it...) Could you please help to have a review and comment? thanks.

/ *
  * This process might deadlock with another process trying to 
  * remove this device:
  * This process holding the s_active of "online" attribute, and tries 
  * to online/offline the device with some locks protecting hotplug.
  * Device removing process holding some locks protecting hotplug, and 
  * tries to remove the "online" attribute, waiting for the s_active to
  * be released. 
  *
  * The deadlock described above should be solved with
  * lock_device_hotplug_sysfs(). We temporarily drop the active 
  * protection here to avoid some lockdep warnings. 
  *
  * If device_hotplug_lock is forgotten to be used when removing
  * device(possibly some very simple device even don't need this lock?),
  * @dev could go away any time after dropping the active protection. 
  * So increase its ref count before dropping active protection. 
  * Though invoking device_{on|off}line() on a removed device seems
  * unreasonable, it should be less disastrous than playing with freed
  * @dev. Also, we might be able to have some mechanism abort 
  * device_{on|off}line() if @dev already removed.
  */

Thanks, Zhong
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index

2014-04-14 Thread Li Zhong

Seems we all agree that information about SECTION, e.g. section size,
sections per memory block should be kept as kernel internals, and not
exposed to userspace.

This patch updates Documentation/memory-hotplug.txt to refer to memory
blocks instead of memory sections where appropriate and added a
paragraph to explain that memory blocks are made of memory sections.
The documentation update is mostly provided by Nathan.

Also, as end_phys_index in code is actually not the end section id, but
the end memory block id, which should always be the same as phys_index.
So it is removed here.

Signed-off-by: Li Zhong 
---
 Documentation/memory-hotplug.txt |  125 +++---
 drivers/base/memory.c|   12 
 2 files changed, 61 insertions(+), 76 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 58340d5..1aa239f 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -88,16 +88,21 @@ phase by hand.
 
 1.3. Unit of Memory online/offline operation
 
-Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
-into chunks of the same size. The chunk is called a "section". The size of
-a section is architecture dependent. For example, power uses 16MiB, ia64 uses
-1GiB. The unit of online/offline operation is "one section". (see Section 3.)
+Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
+into chunks of the same size. These chunks are called "sections". The size of
+a memory section is architecture dependent. For example, power uses 16MiB, ia64
+uses 1GiB.
 
-To determine the size of sections, please read this file:
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and represents the logical
+unit upon which memory online/offline operations are to be performed. The
+default size of a memory block is the same as memory section size unless an
+architecture specifies otherwise. (see Section 3.)
+
+To determine the size (in bytes) of a memory block please read this file:
 
 /sys/devices/system/memory/block_size_bytes
 
-This file shows the size of sections in byte.
 
 ---
 2. Kernel Configuration
@@ -123,42 +128,35 @@ config options.
 (CONFIG_ACPI_CONTAINER).
 This option can be kernel module too.
 
+
 
-4 sysfs files for memory hotplug
+3 sysfs files for memory hotplug
 
-All sections have their device information in sysfs.  Each section is part of
-a memory block under /sys/devices/system/memory as
+All memory blocks have their device information in sysfs.  Each memory block
+is described under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is the section id.)
+(XXX is the memory block id.)
 
-Now, XXX is defined as (start_address_of_section / section_size) of the first
-section contained in the memory block.  The files 'phys_index' and
-'end_phys_index' under each directory report the beginning and end section id's
-for the memory block covered by the sysfs directory.  It is expected that all
+For the memory block covered by the sysfs directory.  It is expected that all
 memory sections in this range are present and no memory holes exist in the
 range. Currently there is no way to determine if there is a memory hole, but
 the existence of one should not affect the hotplug capabilities of the memory
 block.
 
-For example, assume 1GiB section size. A device for a memory starting at
+For example, assume 1GiB memory block size. A device for a memory starting at
 0x1 is /sys/device/system/memory/memory4
 (0x1 / 1Gib = 4)
 This device covers address range [0x1 ... 0x14000)
 
-Under each section, you can see 4 or 5 files, the end_phys_index file being
-a recent addition and not present on older kernels.
+Under each memory block, you can see 4 files:
 
-/sys/devices/system/memory/memoryXXX/start_phys_index
-/sys/devices/system/memory/memoryXXX/end_phys_index
+/sys/devices/system/memory/memoryXXX/phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index'  : read-only and contains section id of the first section
-   in the memory block, same as XXX.
-'end_phys_index'  : read-only and contains section id of the last section
-   in the memory block.
+'phys_index'  : read-only and contains memory block id, same as XXX.
 'state'   : read-write
 at read:  contains online/offline state of memory.
 at write: user can specify "online_kernel",
@@ -185,6 +183,7 @@ For example:
 A backlink will also be created:
 /sys/devices/syst

[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks

2014-04-14 Thread Li Zhong

I noticed following lockdep warning when trying acpi hot-remove cpus:

[84154.204080] ==
[84154.204080] [ INFO: possible circular locking dependency detected ]
[84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW
[84154.204080] ---
[84154.204080] bash/777 is trying to acquire lock:
[84154.204080]  (cpu_add_remove_lock){+.+.+.}, at: []
cpu_maps_update_begin+0x17/0x20
[84154.213203] 
[84154.213203] but task is already holding lock:
[84154.213203]  (s_active#79){.+}, at: []
kernfs_fop_write+0xe4/0x190
[84154.213203] 
[84154.213203] which lock already depends on the new lock.
[84154.213203] 
[84154.213203] 
[84154.213203] the existing dependency chain (in reverse order) is:
[84154.213203] 
-> #3 (s_active#79){.+}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] __kernfs_remove+0x250/0x310
[84154.213203][] kernfs_remove_by_name_ns
+0x50/0xc0
[84154.213203][] sysfs_remove_file_ns
+0x15/0x20
[84154.213203][] device_remove_file+0x19/0x20
[84154.213203][] device_remove_attrs+0x33/0x80
[84154.213203][] device_del+0x127/0x1c0
[84154.213203][] device_unregister+0x22/0x60
[84154.213203][] unregister_cpu+0x1e/0x40
[84154.213203][] arch_unregister_cpu+0x23/0x30
[84154.213203][] acpi_processor_remove
+0x8d/0xb2
[84154.213203][] acpi_bus_trim+0x5a/0x8d
[84154.213203][] acpi_device_hotplug
+0x1a8/0x3ec
[84154.213203][] acpi_hotplug_work_fn
+0x1f/0x2b
[84154.213203][] process_one_work+0x1eb/0x6b0
[84154.213203][] worker_thread+0x11b/0x370
[84154.213203][] kthread+0xe4/0x100
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #2 (cpu_hotplug.lock#2){+.+.+.}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_hotplug_begin+0x4f/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #1 (cpu_hotplug.lock){++}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] cpu_hotplug_begin+0x41/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #0 (cpu_add_remove_lock){+.+.+.}:
[84154.213203][] __lock_acquire+0x1f2a/0x1f60
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_maps_update_begin
+0x17/0x20
[84154.213203][] cpu_down+0x1d/0x50
[84154.213203][] cpu_subsys_offline+0x14/0x20
[84154.213203][] device_offline+0xad/0xd0
[84154.213203][] online_store+0x42/0x80
[84154.213203][] dev_attr_store+0x18/0x30
[84154.213203][] sysfs_kf_write+0x49/0x60
[84154.213203][] kernfs_fop_write+0x109/0x190
[84154.213203][] vfs_write+0xbe/0x1c0
[84154.213203][] SyS_write+0x52/0xb0
[84154.213203][] tracesys+0xd0/0xd5
[84154.213203] 
[84154.213203] other info that might help us debug this:
[84154.213203] 
[84154.213203] Chain exists of:
  cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79

[84154.213203]  Possible unsafe locking scenario:
[84154.213203] 
[84154.213203]CPU0CPU1
[84154.213203]
[84154.213203]   lock(s_active#79);
[84154.213203]lock(cpu_hotplug.lock#2);
[84154.213203]lock(s_active#79);
[84154.213203]   lock(cpu_add_remove_lock);
[84154.213203] 
[84154.213203]  *** DEADLOCK ***
.

The deadlock itself seems already fixed in commit 5e33bc41. 

As Tejun suggested, to avoid this lockdep warning,
kernfs_break_active_protection() is used before online/offline callbacks
to take the s_active lock out of the dependency chain during
online/offline operations. 

Signed-off-by: Li Zhong 
---
 drivers/base/core.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..b313ad7 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
 {
bool val;
int ret;
+   struct kernfs_node *kn;
 
ret = strtobool(buf, &val);
if (ret < 0)
@@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
if (re

[RFC PATCH v2] Use kernfs_break_active_protection() for device online store callbacks

2014-04-10 Thread Li Zhong

I noticed following lockdep warning when trying acpi hot-remove cpus:

[84154.204080] ==
[84154.204080] [ INFO: possible circular locking dependency detected ]
[84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW
[84154.204080] ---
[84154.204080] bash/777 is trying to acquire lock:
[84154.204080]  (cpu_add_remove_lock){+.+.+.}, at: []
cpu_maps_update_begin+0x17/0x20
[84154.213203] 
[84154.213203] but task is already holding lock:
[84154.213203]  (s_active#79){.+}, at: []
kernfs_fop_write+0xe4/0x190
[84154.213203] 
[84154.213203] which lock already depends on the new lock.
[84154.213203] 
[84154.213203] 
[84154.213203] the existing dependency chain (in reverse order) is:
[84154.213203] 
-> #3 (s_active#79){.+}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] __kernfs_remove+0x250/0x310
[84154.213203][] kernfs_remove_by_name_ns
+0x50/0xc0
[84154.213203][] sysfs_remove_file_ns
+0x15/0x20
[84154.213203][] device_remove_file+0x19/0x20
[84154.213203][] device_remove_attrs+0x33/0x80
[84154.213203][] device_del+0x127/0x1c0
[84154.213203][] device_unregister+0x22/0x60
[84154.213203][] unregister_cpu+0x1e/0x40
[84154.213203][] arch_unregister_cpu+0x23/0x30
[84154.213203][] acpi_processor_remove
+0x8d/0xb2
[84154.213203][] acpi_bus_trim+0x5a/0x8d
[84154.213203][] acpi_device_hotplug
+0x1a8/0x3ec
[84154.213203][] acpi_hotplug_work_fn
+0x1f/0x2b
[84154.213203][] process_one_work+0x1eb/0x6b0
[84154.213203][] worker_thread+0x11b/0x370
[84154.213203][] kthread+0xe4/0x100
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #2 (cpu_hotplug.lock#2){+.+.+.}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_hotplug_begin+0x4f/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #1 (cpu_hotplug.lock){++}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] cpu_hotplug_begin+0x41/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable
+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #0 (cpu_add_remove_lock){+.+.+.}:
[84154.213203][] __lock_acquire+0x1f2a/0x1f60
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_maps_update_begin
+0x17/0x20
[84154.213203][] cpu_down+0x1d/0x50
[84154.213203][] cpu_subsys_offline+0x14/0x20
[84154.213203][] device_offline+0xad/0xd0
[84154.213203][] online_store+0x42/0x80
[84154.213203][] dev_attr_store+0x18/0x30
[84154.213203][] sysfs_kf_write+0x49/0x60
[84154.213203][] kernfs_fop_write+0x109/0x190
[84154.213203][] vfs_write+0xbe/0x1c0
[84154.213203][] SyS_write+0x52/0xb0
[84154.213203][] tracesys+0xd0/0xd5
[84154.213203] 
[84154.213203] other info that might help us debug this:
[84154.213203] 
[84154.213203] Chain exists of:
  cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79

[84154.213203]  Possible unsafe locking scenario:
[84154.213203] 
[84154.213203]CPU0CPU1
[84154.213203]
[84154.213203]   lock(s_active#79);
[84154.213203]lock(cpu_hotplug.lock#2);
[84154.213203]lock(s_active#79);
[84154.213203]   lock(cpu_add_remove_lock);
[84154.213203] 
[84154.213203]  *** DEADLOCK ***
.

The deadlock itself seems already fixed in commit 5e33bc41. 

As Tejun suggested, to avoid this lockdep warning,
kernfs_break_active_protection() is used before online/offline callbacks
to take the s_active lock out of the dependency chain during
online/offline operations. 

Signed-off-by: Li Zhong 
---
 drivers/base/core.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..2b9f68e 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
 {
bool val;
int ret;
+   struct kernfs_node *kn;
 
ret = strtobool(buf, &val);
if (ret < 0)
@@ -448,7 +449,15 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
if (ret)

[RFC PATCH] Suppress a device hot remove related lockdep warning

2014-04-10 Thread Li Zhong

I noticed following lockdep warning when trying acpi hot-remove cpus:

[84154.204080] ==
[84154.204080] [ INFO: possible circular locking dependency detected ]
[84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW
[84154.204080] ---
[84154.204080] bash/777 is trying to acquire lock:
[84154.204080]  (cpu_add_remove_lock){+.+.+.}, at: [] 
cpu_maps_update_begin+0x17/0x20
[84154.213203] 
[84154.213203] but task is already holding lock:
[84154.213203]  (s_active#79){.+}, at: [] 
kernfs_fop_write+0xe4/0x190
[84154.213203] 
[84154.213203] which lock already depends on the new lock.
[84154.213203] 
[84154.213203] 
[84154.213203] the existing dependency chain (in reverse order) is:
[84154.213203] 
-> #3 (s_active#79){.+}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] __kernfs_remove+0x250/0x310
[84154.213203][] kernfs_remove_by_name_ns+0x50/0xc0
[84154.213203][] sysfs_remove_file_ns+0x15/0x20
[84154.213203][] device_remove_file+0x19/0x20
[84154.213203][] device_remove_attrs+0x33/0x80
[84154.213203][] device_del+0x127/0x1c0
[84154.213203][] device_unregister+0x22/0x60
[84154.213203][] unregister_cpu+0x1e/0x40
[84154.213203][] arch_unregister_cpu+0x23/0x30
[84154.213203][] acpi_processor_remove+0x8d/0xb2
[84154.213203][] acpi_bus_trim+0x5a/0x8d
[84154.213203][] acpi_device_hotplug+0x1a8/0x3ec
[84154.213203][] acpi_hotplug_work_fn+0x1f/0x2b
[84154.213203][] process_one_work+0x1eb/0x6b0
[84154.213203][] worker_thread+0x11b/0x370
[84154.213203][] kthread+0xe4/0x100
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #2 (cpu_hotplug.lock#2){+.+.+.}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_hotplug_begin+0x4f/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #1 (cpu_hotplug.lock){++}:
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] cpu_hotplug_begin+0x41/0x80
[84154.213203][] _cpu_up+0x3f/0x160
[84154.213203][] cpu_up+0x69/0x80
[84154.213203][] smp_init+0x60/0x8c
[84154.213203][] kernel_init_freeable+0x126/0x23b
[84154.213203][] kernel_init+0xe/0xf0
[84154.213203][] ret_from_fork+0x7c/0xb0
[84154.213203] 
-> #0 (cpu_add_remove_lock){+.+.+.}:
[84154.213203][] __lock_acquire+0x1f2a/0x1f60
[84154.213203][] lock_acquire+0x9b/0x1d0
[84154.213203][] mutex_lock_nested+0x50/0x3c0
[84154.213203][] cpu_maps_update_begin+0x17/0x20
[84154.213203][] cpu_down+0x1d/0x50
[84154.213203][] cpu_subsys_offline+0x14/0x20
[84154.213203][] device_offline+0xad/0xd0
[84154.213203][] online_store+0x42/0x80
[84154.213203][] dev_attr_store+0x18/0x30
[84154.213203][] sysfs_kf_write+0x49/0x60
[84154.213203][] kernfs_fop_write+0x109/0x190
[84154.213203][] vfs_write+0xbe/0x1c0
[84154.213203][] SyS_write+0x52/0xb0
[84154.213203][] tracesys+0xd0/0xd5
[84154.213203] 
[84154.213203] other info that might help us debug this:
[84154.213203] 
[84154.213203] Chain exists of:
  cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79

[84154.213203]  Possible unsafe locking scenario:
[84154.213203] 
[84154.213203]CPU0CPU1
[84154.213203]
[84154.213203]   lock(s_active#79);
[84154.213203]lock(cpu_hotplug.lock#2);
[84154.213203]lock(s_active#79);
[84154.213203]   lock(cpu_add_remove_lock);
[84154.213203] 
[84154.213203]  *** DEADLOCK ***
.

The deadlock itself seems already fixed in commit 5e33bc41. 

This patch uses DEVICE_ATTR_IGNORE_LOCKDEP for "online" attr to suppress
this lockdep warning. But I'm a little afraid it might also hide
(future) potential real dead lock scenarios? 

Signed-off-by: Li Zhong 
---
 drivers/base/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0dd6528..98e3e09 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -452,7 +452,9 @@ static ssize_t online_store(struct device *dev, struct 
device_attribute *attr,
unlock_device_hotplug();
return ret < 0 ? ret : count;
 }
-static DEVICE_ATTR_RW(online);
+
+static DEVICE_ATTR_IGNORE_LOCKDEP(online, (S_IWUSR | S_IRUGO),
+ online_show, online_store);
 
 int device_add_gro

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-09 Thread Li Zhong

On Wed, 2014-04-09 at 12:39 -0500, Nathan Fontenot wrote:
> On 04/08/2014 02:47 PM, Dave Hansen wrote:
> > 
> > That document really needs to be updated to stop referring to sections
> > (at least in the descriptions of the user interface).  We can not change
> > the units of phys_index/end_phys_index without also changing
> > block_size_bytes.
> > 
> 
> Here is a first pass at updating the documentation.
> 
> I have tried to update the documentation to refer to memory blocks instead
> of memory sections where appropriate and added a paragraph to explain
> that memory blocks are mode of memory sections.
> 
> Thoughts?

If we all agree to hide the information about sections, then I think we
also need to update the section id's used for phys_index/end_phys_index,
something like following on top of yours?

--
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 92d15e2..9fbb025 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -138,10 +138,7 @@ is described under /sys/devices/system/memory as
 /sys/devices/system/memory/memoryXXX
 (XXX is the memory block id.)
 
-Now, XXX is defined as (start_address_of_section / section_size) of the first
-section contained in the memory block.  The files 'phys_index' and
-'end_phys_index' under each directory report the beginning and end section id's
-for the memory block covered by the sysfs directory.  It is expected that all
+For the memory block covered by the sysfs directory.  It is expected that all
 memory sections in this range are present and no memory holes exist in the
 range. Currently there is no way to determine if there is a memory hole, but
 the existence of one should not affect the hotplug capabilities of the memory
@@ -155,16 +152,14 @@ This device covers address range [0x1 ... 
0x14000)
 Under each memory block, you can see 4 or 5 files, the end_phys_index file
 being a recent addition and not present on older kernels.
 
-/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/phys_index
 /sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index'  : read-only and contains section id of the first section
-   in the memory block, same as XXX.
-'end_phys_index'  : read-only and contains section id of the last section
-   in the memory block.
+'phys_index'  : read-only and contains memory block id, same as XXX.
+'end_phys_index'  : read-only and contains memory block id, same as XXX.
 'state'   : read-write
 at read:  contains online/offline state of memory.
 at write: user can specify "online_kernel",
--

Not sure whether it is proper to remove end_phys_index, too? 

Thanks, 
Zhong

 

 
> 
> -Nathan
> ---
>  Documentation/memory-hotplug.txt |  113 
> ---
>  1 file changed, 59 insertions(+), 54 deletions(-)
> 
> Index: linux/Documentation/memory-hotplug.txt
> ===
> --- linux.orig/Documentation/memory-hotplug.txt
> +++ linux/Documentation/memory-hotplug.txt
> @@ -88,16 +88,21 @@ phase by hand.
> 
>  1.3. Unit of Memory online/offline operation
>  
> -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole 
> memory
> -into chunks of the same size. The chunk is called a "section". The size of
> -a section is architecture dependent. For example, power uses 16MiB, ia64 uses
> -1GiB. The unit of online/offline operation is "one section". (see Section 3.)
> +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
> +into chunks of the same size. These chunks are called "sections". The size of
> +a memory section is architecture dependent. For example, power uses 16MiB, 
> ia64
> +uses 1GiB.
> +
> +Memory sections are combined into chunks referred to as "memory blocks". The
> +size of a memory block is architecture dependent and represents the logical
> +unit upon which memory online/offline operations are to be performed. The
> +default size of a memory block is the same as memory section size unless an
> +architecture specifies otherwise. (see Section 3.)
> 
> -To determine the size of sections, please read this file:
> +To determine the size (in bytes) of a memory block please read this file:
> 
>  /sys/devices/system/memory/block_size_bytes
> 
> -This file shows the size of sections in byte.
> 
>  ---
>  2. Kernel Configuration
> @@ -123,14 +128,15 @@ config options.
>  (CONFIG_ACPI_CONTAINER).
>  This option can be kernel module too.
> 
> +
>  
> -4 sysfs files for memory hotplug
> +3 sysfs files for memory hotplug
>  
> -All sections have their device information in sys

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-09 Thread Li Zhong

On Wed, 2014-04-09 at 08:49 -0700, Dave Hansen wrote:
> On 04/09/2014 02:20 AM, Li Zhong wrote:
> > Or do you mean we don't need to expose any information related to
> > SECTION to userspace? 
> 
> Right, we don't need to expose sections themselves to userspace.  Do we?
> 
OK, I agree with that. 

Yanfei, I recall you once expressed your preference for section
numbers? 

Thanks, Zhong


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-09 Thread Li Zhong

On Tue, 2014-04-08 at 12:47 -0700, Dave Hansen wrote:
> On 04/08/2014 11:23 AM, Nathan Fontenot wrote:
> > On 04/08/2014 11:13 AM, Dave Hansen wrote:
> >> On 04/08/2014 01:27 AM, Li Zhong wrote:
> >>> If Dave and others don't have further objections, it seems this small
> >>> userspace incompatibility could be accepted by most of us, and I don't
> >>> need to make a version 2. 
> >>
> >> Let me ask another question then.  What are the units of
> >> phys_index/end_phys_index?  How do we expose those units to userspace?
> >>
> > 
> > The documentation for these files just states that the files contain
> > the first and last section id of memory in the memory block for
> > phys_index and end_phys_index respectively.
> > 
> > I'm not sure the values have ever been units of anything, at least not
> > that I remember.
> 
> 
> 
> There are two units.  SECTION_SIZE, which is completely internal to the
> kernel, and block_size_bytes which used to be the same as SECTION_SIZE,
> but is not now.  Which one of those two is phys_index/end_phys_index in,
> and if it is in terms of SECTION_SIZE like this patch proposes, how do
> we tell userspace how large SECTION_SIZE is?

With this patch, I think we could still tell how large SECTION_SIZE is.
block_size_bytes tells us the size of the block, and
end_phys_index-phys_index+1, tells us the numbers of sections in the
block, and then we could know the SECTION_SIZE by a division. 

Not that obvious, but if needed, we could easily add a file telling us
section_size or sections_per_block. 

> 
> block_size_bytes is supposed to tell you how large the sections are.  In
> the case where we lumped a bunch of sections together, we also bumped up
> block_size_bytes.  That's why we currently divide the *ACTUAL* section
> number in phys_index/end_phys_index by block_size_bytes.
> 
> That document really needs to be updated to stop referring to sections
> (at least in the descriptions of the user interface).  We can not change
> the units of phys_index/end_phys_index without also changing
> block_size_bytes.

How about adding a new section_size_bytes together with this patch? 

Or do you mean we don't need to expose any information related to
SECTION to userspace? 

Thanks, 
Zhong


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-08 Thread Li Zhong

On Fri, 2014-04-04 at 10:29 +0900, Yasuaki Ishimatsu wrote:
> (2014/04/02 17:56), Li Zhong wrote:
> > I noticed the phys_index and end_phys_index under
> > /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
> > (for the test machine, one memory block has 8 sections, that is
> >   sections_per_block equals 8)
> >
> > # cd /sys/devices/system/memory/memory100/
> > # cat phys_index end_phys_index
> > 0064
> > 0064
> >
> > Seems they should reflect the start/end section number respectively, which
> > also matches what is said in Documentation/memory-hotplug.txt
> >
> > This patch tries to modify that so the two files could show the start/end
> > section number of the memory block.
> >
> > After this change, output of the above example looks like:
> >
> > # cat phys_index end_phys_index
> > 0320
> > 0327
> >
> > Signed-off-by: Li Zhong 
> > ---
> 
> Good catch! This is a bug. So I think the bug should be fixed.
> 
> Reviewed-by: Yasuaki Ishimatsu 

Thank you all for the review. 

If Dave and others don't have further objections, it seems this small
userspace incompatibility could be accepted by most of us, and I don't
need to make a version 2. 

Thanks, Zhong 

> Thanks,
> Yasuaki Ishimatsu
> 
> >   drivers/base/memory.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index bece691..b10f2fa 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device 
> > *dev,
> > struct memory_block *mem = to_memory_block(dev);
> > unsigned long phys_index;
> >
> > -   phys_index = mem->start_section_nr / sections_per_block;
> > +   phys_index = mem->start_section_nr;
> > return sprintf(buf, "%08lx\n", phys_index);
> >   }
> >
> > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device 
> > *dev,
> > struct memory_block *mem = to_memory_block(dev);
> > unsigned long phys_index;
> >
> > -   phys_index = mem->end_section_nr / sections_per_block;
> > +   phys_index = mem->end_section_nr;
> > return sprintf(buf, "%08lx\n", phys_index);
> >   }
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Li Zhong

On Thu, 2014-04-03 at 11:06 +0800, Zhang Yanfei wrote:
> On 04/03/2014 10:37 AM, Li Zhong wrote:
> > On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote:
> >> Add ccing
> >>
> >> On 04/02/2014 04:56 PM, Li Zhong wrote:
> >>> I noticed the phys_index and end_phys_index under 
> >>> /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
> >>> (for the test machine, one memory block has 8 sections, that is 
> >>>  sections_per_block equals 8)
> >>>
> >>> # cd /sys/devices/system/memory/memory100/
> >>> # cat phys_index end_phys_index 
> >>> 0064
> >>> 0064
> >>>
> >>> Seems they should reflect the start/end section number respectively, 
> >>> which 
> >>> also matches what is said in Documentation/memory-hotplug.txt
> >>
> > Hi Yanfei, 
> > 
> > Thanks for the review. 
> > 
> >> Indeed. I've noticed this before. The value in 'end_phys_index' doesn't
> >> match what it really means. But, the name itself is vague, it looks like
> >> it is the index of some page frame. (we keep this name for compatibility?)
> > 
> > I guess so, Dave just reminded me that the RFC would also break
> > userspace..
> > 
> > And now my plan is: 
> >  leave the code unchanged
> >  update the document, state the end_phys_index/phys_index are the same,
> > and means the memory block index
> 
> Ah. I doubt whether there is userspace tool which is using the two sysfiles?
> for example, the memory100 directory itself can tell us which block it is.
> So why there is the two files under it give the same meaning.
> 
> If there is userspace tool using the two files, does it use 'end_phys_index'
> in the correct way? That said, if a userspace tool knows what the 
> 'end_phys_index'
> really mean, does it still need it since we have 'phys_index' with the same 
> value?

For the end_phys_index, I totally agree with you. If somebody tries to
use it, say as the end section number, he should finally be able to find
that the value is not what he expects. But who knows ...

For phys_index, I guess it is also reasonable for some userspace tool to
just loop for each memoryXXX directory under /sys/devices/system/memory,
and use the phys_index as the memory block index for the directory,
instead of extracting the XXX from the directory name memoryXXX

Don't know whether there are some generic rules for handling such kind
of compatibility issues...


> >  [optional] create two new files start_sec_nr, end_sec_nr if needed
> 
> These two are the really meaningful sysfiles for userspace, IMO.

OK, I see. 

> 
> > 
> > Do you have any other suggestions? 
> 
> No. I think we should wait for other guys to comment more.

OK, let's wait.

Thanks, Zhong

> 
> Thanks.
> 
> > 
> > Thanks, Zhong
> > 
> >>
> >> The corresponding member in struct memory_block is:
> >>
> >> struct memory_block {
> >>     unsigned long start_section_nr;
> >> unsigned long end_section_nr;
> >> ...
> >>
> >> The two members seem to have the right name, and have the right value in 
> >> kernel.
> >>
> >>
> >>>
> >>> This patch tries to modify that so the two files could show the start/end
> >>> section number of the memory block.
> >>>
> >>> After this change, output of the above example looks like:
> >>>
> >>> # cat phys_index end_phys_index 
> >>> 0320
> >>> 0327
> >>>
> >>> Signed-off-by: Li Zhong 
> >>> ---
> >>>  drivers/base/memory.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >>> index bece691..b10f2fa 100644
> >>> --- a/drivers/base/memory.c
> >>> +++ b/drivers/base/memory.c
> >>> @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct 
> >>> device *dev,
> >>>   struct memory_block *mem = to_memory_block(dev);
> >>>   unsigned long phys_index;
> >>>  
> >>> - phys_index = mem->start_section_nr / sections_per_block;
> >>> + phys_index = mem->start_section_nr;
> >>>   return sprintf(buf, "%08lx\n", phys_index);
> >>>  }
> >>>  
> >>> @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device 
> >>> *dev,
> >>>   struct memory_block *mem = to_memory_block(dev);
> >>>   unsigned long phys_index;
> >>>  
> >>> - phys_index = mem->end_section_nr / sections_per_block;
> >>> + phys_index = mem->end_section_nr;
> >>>   return sprintf(buf, "%08lx\n", phys_index);
> >>>  }
> >>>  
> >>>
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>> the body of a message to majord...@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> Please read the FAQ at  http://www.tux.org/lkml/
> >>>
> >>
> >>
> > 
> > 
> > .
> > 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Li Zhong

On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote:
> Add ccing
> 
> On 04/02/2014 04:56 PM, Li Zhong wrote:
> > I noticed the phys_index and end_phys_index under 
> > /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
> > (for the test machine, one memory block has 8 sections, that is 
> >  sections_per_block equals 8)
> > 
> > # cd /sys/devices/system/memory/memory100/
> > # cat phys_index end_phys_index 
> > 0064
> > 0064
> > 
> > Seems they should reflect the start/end section number respectively, which 
> > also matches what is said in Documentation/memory-hotplug.txt
> 
Hi Yanfei, 

Thanks for the review. 

> Indeed. I've noticed this before. The value in 'end_phys_index' doesn't
> match what it really means. But, the name itself is vague, it looks like
> it is the index of some page frame. (we keep this name for compatibility?)

I guess so, Dave just reminded me that the RFC would also break
userspace..

And now my plan is: 
 leave the code unchanged
 update the document, state the end_phys_index/phys_index are the same,
and means the memory block index
 [optional] create two new files start_sec_nr, end_sec_nr if needed

Do you have any other suggestions? 

Thanks, Zhong

> 
> The corresponding member in struct memory_block is:
> 
> struct memory_block {
> unsigned long start_section_nr;
> unsigned long end_section_nr;
> ...
> 
> The two members seem to have the right name, and have the right value in 
> kernel.
> 
> 
> > 
> > This patch tries to modify that so the two files could show the start/end
> > section number of the memory block.
> > 
> > After this change, output of the above example looks like:
> > 
> > # cat phys_index end_phys_index 
> > 0320
> > 0327
> > 
> > Signed-off-by: Li Zhong 
> > ---
> >  drivers/base/memory.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index bece691..b10f2fa 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device 
> > *dev,
> > struct memory_block *mem = to_memory_block(dev);
> > unsigned long phys_index;
> >  
> > -   phys_index = mem->start_section_nr / sections_per_block;
> > +   phys_index = mem->start_section_nr;
> > return sprintf(buf, "%08lx\n", phys_index);
> >  }
> >  
> > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device 
> > *dev,
> > struct memory_block *mem = to_memory_block(dev);
> > unsigned long phys_index;
> >  
> > -   phys_index = mem->end_section_nr / sections_per_block;
> > +   phys_index = mem->end_section_nr;
> > return sprintf(buf, "%08lx\n", phys_index);
> >  }
> >  
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Li Zhong

On Wed, 2014-04-02 at 09:09 -0700, Dave Hansen wrote:
> On 04/02/2014 01:56 AM, Li Zhong wrote:
> > I noticed the phys_index and end_phys_index under 
> > /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
> > (for the test machine, one memory block has 8 sections, that is 
> >  sections_per_block equals 8)
> > 
> > # cd /sys/devices/system/memory/memory100/
> > # cat phys_index end_phys_index 
> > 0064
> > 0064
> > 
> > Seems they should reflect the start/end section number respectively, which 
> > also matches what is said in Documentation/memory-hotplug.txt
> 
> This changes a user-visible interface.  Won't this break userspace?

Hi Dave, 

Hmm, I think so... Thank you for the reminder

Do you have some better ideas to fix this? Maybe we should leave the
code unchanged, just change the corresponding document (just it seems
that the end_phys_index will always be the same as phys_index)? And if
the section numbers are really needed, then we could create two new
files later, e.g. start_section_nr, end_section_nr? 

Thanks, 
Zhong
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Li Zhong

I noticed the phys_index and end_phys_index under 
/sys/devices/system/memory/memoryXXX/ have the same value, e.g.
(for the test machine, one memory block has 8 sections, that is 
 sections_per_block equals 8)

# cd /sys/devices/system/memory/memory100/
# cat phys_index end_phys_index 
0064
0064

Seems they should reflect the start/end section number respectively, which 
also matches what is said in Documentation/memory-hotplug.txt

This patch tries to modify that so the two files could show the start/end
section number of the memory block.

After this change, output of the above example looks like:

# cat phys_index end_phys_index 
0320
0327

Signed-off-by: Li Zhong 
---
 drivers/base/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bece691..b10f2fa 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev,
struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
-   phys_index = mem->start_section_nr / sections_per_block;
+   phys_index = mem->start_section_nr;
return sprintf(buf, "%08lx\n", phys_index);
 }
 
@@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev,
struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
-   phys_index = mem->end_section_nr / sections_per_block;
+   phys_index = mem->end_section_nr;
return sprintf(buf, "%08lx\n", phys_index);
 }
 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH ] workqueue: add args to workqueue lockdep name

2014-02-13 Thread Li Zhong

Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in
process_one_work().

Maybe #fmt plus #args could be used as the lock_name to give some more
information for some fmt string like the above. 

__builtin_constant_p() check is removed (as there seems no good way to
check all the variables in args list). However, by removing the check,
it only adds two additional "s for those constants.

Some lockdep name examples printed out after the change:

lockdep namewq->name

"events_long"   events_long
"%s"("khelper") khelper
"xfs-data/%s"mp->m_fsname   xfs-data/dm-3

Signed-off-by: Li Zhong 
---
 include/linux/workqueue.h |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 594521b..704f4f6 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, 
int max_active,
static struct lock_class_key __key; \
const char *__lock_name;\
\
-   if (__builtin_constant_p(fmt))  \
-   __lock_name = (fmt);\
-   else\
-   __lock_name = #fmt; \
+   __lock_name = #fmt#args;\
\
__alloc_workqueue_key((fmt), (flags), (max_active), \
  &__key, __lock_name, ##args); \



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: lockdep: strange %s#5 lock name

2014-02-12 Thread Li Zhong

On Tue, 2014-02-11 at 10:27 -0500, Tejun Heo wrote:
> On Tue, Feb 11, 2014 at 12:00:36PM +0100, Peter Zijlstra wrote:
> > > Looks good to me.  Can you please post the patch with SOB?
> > 
> > ---
> > Subject: workqueue: Fix workqueue lockdep name
> > 
> > Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in
> > process_one_work(). It turns out that commit b196be89cdc14 forgot to
> > change the lockdep_init_map() when it changed the @lock_name argument
> > from a string to a format.
> > 
> > Fixes: b196be89cdc14 ("workqueue: make alloc_workqueue() take printf fmt 
> > and args for name")
> > Reported-by: Tommi Rantala 
> > Signed-off-by: Peter Zijlstra 
> 
> Applied to wq/for-3.14-fixes.

Hi Tejun, Peter,

I found following lockdep warning caused by this commit in next-tree:

[5.251993] [ cut here ]
[5.252019] WARNING: CPU: 0 PID: 221 at kernel/locking/lockdep.c:710 
__lock_acquire+0x1761/0x1f60()
[5.252019] Modules linked in: e1000
[5.252019] CPU: 0 PID: 221 Comm: lvm Not tainted 3.14.0-rc2-next-20140212 #1
[5.252019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[5.252019]  0009 880118e91938 8155fe12 
880118e91978
[5.252019]  8105c195 880118e91958 81eb33d0 
0002
[5.252019]  880118dd2318   
880118e91988
[5.252019] Call Trace:
[5.252019]  [] dump_stack+0x19/0x1b
[5.252019]  [] warn_slowpath_common+0x85/0xb0
[5.252019]  [] warn_slowpath_null+0x1a/0x20
[5.252019]  [] __lock_acquire+0x1761/0x1f60
[5.252019]  [] ? mark_held_locks+0xae/0x120
[5.252019]  [] ? debug_check_no_locks_freed+0x8e/0x160
[5.252019]  [] ? lockdep_init_map+0xac/0x600
[5.252019]  [] lock_acquire+0x9a/0x120
[5.252019]  [] ? flush_workqueue+0x5/0x750
[5.252019]  [] flush_workqueue+0x109/0x750
[5.252019]  [] ? flush_workqueue+0x5/0x750
[5.252019]  [] ? _raw_spin_unlock_irq+0x30/0x40
[5.252019]  [] ? srcu_reschedule+0xe0/0xf0
[5.252019]  [] dm_suspend+0xe9/0x1e0
[5.252019]  [] dev_suspend+0x1e3/0x270
[5.252019]  [] ? table_load+0x350/0x350
[5.252019]  [] ctl_ioctl+0x26c/0x510
[5.252019]  [] ? __lock_acquire+0x41c/0x1f60
[5.252019]  [] ? vtime_account_user+0x98/0xb0
[5.252019]  [] dm_ctl_ioctl+0x13/0x20
[5.252019]  [] do_vfs_ioctl+0x88/0x570
[5.252019]  [] ? __fget_light+0x129/0x150
[5.252019]  [] SyS_ioctl+0x91/0xb0
[5.252019]  [] tracesys+0xcf/0xd4
[5.252019] ---[ end trace ff1fa506f34be3bc ]---

It seems to me that when the second time alloc_workqueue() is called
from the same code path, it would have two locks with the same key, but
not the same &wq->name, which doesn't meet lockdep's assumption.

Maybe we could use the #fmt plus #args as the lock_name to avoid the
above issue? It could also give some more information for some string
like %s#5. Draft code attached below.

I removed __builtin_constant_p() check (I don't know how to check all
the args). However, by removing the check, it only adds two additional
two "s for those constants.

Some lockdep name examples I printed out after the change:

lockdep namewq->name

"events_long"   events_long
"%s"("khelper") khelper
"xfs-data/%s"mp->m_fsname   xfs-data/dm-3
...

A little bit ugly, not sure whether we could generate some better names
here?

Thanks, 
Zhong

---
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 594521b..704f4f6 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, 
int max_active,
static struct lock_class_key __key; \
const char *__lock_name;\
\
-   if (__builtin_constant_p(fmt))  \
-   __lock_name = (fmt);\
-   else\
-   __lock_name = #fmt; \
+   __lock_name = #fmt#args;\
\
__alloc_workqueue_key((fmt), (flags), (max_active), \
  &__key, __lock_name, ##args); \
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 861d8dd..82ef9f3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4202,7 +4202,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
INIT_LIST_HEAD(&wq->flusher_overflow);
INIT_LIST_HEAD(&wq->maydays);
 
-   lockdep_init_map(&wq->lockdep_map, wq->name, key, 0);
+   lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEA

Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early

2013-08-22 Thread Li Zhong

On Thu, 2013-08-22 at 16:30 +0930, Rusty Russell wrote:
> Greg KH  writes:
> > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote:
> > >  struct kobj_type module_ktype = {
> > > + .release   =module_kobj_release,
> > >   .sysfs_ops =&module_sysfs_ops,
> > >  };
> >
> > Wait, as there is no release function here for the kobject (a different
> > problem), why is the deferred release function causing any problems?
> > There is no release function to call, so what is causing the oops?
> 
> Because DEBUG_KOBJECT_RELEASE does the kobject_put() sometime later,
> which is what causes the oops.
> 
> Since kobjects don't have an owner field, AFAICT someone *could* grab
> one in a module which is unloading, then put it after unload.   So this
> fixes a real bug, albeit not one seen in the real world.
> 
> Applied,

Oh, thank you, Rusty. 

I just sent out another version... which fix it in another way as Greg
suggested, could you please also help to take a look at it?

Thanks, Zhong

> Rusty.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 next]module: Fix mod->mkobj.kobj potentially freed too early

2013-08-22 Thread Li Zhong

DEBUG_KOBJECT_RELEASE helps to find the issue attached below.

After some investigation, it seems the reason is:
The mod->mkobj.kobj(a01600d0 below) is freed together with mod
itself by module_free(mod, mod->module_core) in free_module(). However,
its children still hold references to it, as the delay caused by
DEBUG_KOBJECT_RELEASE. So when the child('holders' below) tries to
decrease the reference count to its parent in kobject_del(), BUG happens
as it tries to access already freed memory.

v2: 
Greg suggested to make the kobject as a pointer. But it seems a little
hard to make kobj(kobject) embedded in mkobj(module_kobject) a pointer,
as that seems to cause getting the mkobj from the kobj impossible --
to_module_kobject() is used in several places to derive mkobj from its
member kobj. 

So in this version, I made the whole mkobj(module_kobject) as a pointer
in mod(module).

The mkobj is then allocated in mod_sysfs_init(), and freed in its member
kobj's release function -- it seems that there is no mkobj usage after
its kobj is released? 

[ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent 
a01600d0 (delayed)
[ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent 
a01600d0 (delayed)
[ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent 
a01600d0
[ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del
[ 1845.184120] BUG: unable to handle kernel paging request at a01601d0
[ 1845.185026] IP: [] kobject_put+0x11/0x60
[ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0
[ 1845.185026] Oops:  [#1] PREEMPT 
[ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example]
[ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G   O 
3.11.0-rc6-next-20130819+ #1
[ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 1845.185026] Workqueue: events kobject_delayed_cleanup
[ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 
88007ca5c000
[ 1845.185026] RIP: 0010:[]  [] 
kobject_put+0x11/0x60
[ 1845.185026] RSP: 0018:88007ca5dd08  EFLAGS: 00010282
[ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638
[ 1845.185026] RDX: 88007ca5dc18 RSI:  RDI: a01600d0
[ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: 
[ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040
[ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: 
[ 1845.185026] FS:  () GS:81a23000() 
knlGS:
[ 1845.185026] CS:  0010 DS:  ES:  CR0: 8005003b
[ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0
[ 1845.185026] Stack:
[ 1845.185026]  88007c1f1600 88007c1f1600 88007ca5dd38 
812cdb7e
[ 1845.185026]   88007c1f1640 88007ca5dd68 
812cdbfe
[ 1845.185026]  88007c974800 88007c1f1640 88007ff61a00 

[ 1845.185026] Call Trace:
[ 1845.185026]  [] kobject_del+0x2e/0x40
[ 1845.185026]  [] kobject_delayed_cleanup+0x6e/0x1d0
[ 1845.185026]  [] process_one_work+0x1e5/0x670
[ 1845.185026]  [] ? process_one_work+0x183/0x670
[ 1845.185026]  [] worker_thread+0x113/0x370
[ 1845.185026]  [] ? rescuer_thread+0x290/0x290
[ 1845.185026]  [] kthread+0xda/0xe0
[ 1845.185026]  [] ? _raw_spin_unlock_irq+0x30/0x60
[ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
[ 1845.185026]  [] ret_from_fork+0x7a/0xb0
[ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
[ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff 
ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d  87 
00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 
[ 1845.185026] RIP  [] kobject_put+0x11/0x60
[ 1845.185026]  RSP 
[ 1845.185026] CR2: a01601d0
[ 1845.185026] ---[ end trace 49a70afd109f5653 ]---

Signed-off-by: Li Zhong 
---
 drivers/base/core.c|  2 +-
 drivers/base/module.c  |  4 ++--
 include/linux/module.h |  2 +-
 kernel/module.c| 37 +
 kernel/params.c| 19 +--
 5 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 09a99d6..b8a1fc6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1629,7 +1629,7 @@ struct device *__root_device_register(const char *name, 
struct module *owner)
 
 #ifdef CONFIG_MODULES  /* gotta find a "cleaner" way to do this */
if (owner) {
-   struct module_kobject *mk = &owner->mkobj;
+   struct module_kobject *mk = owner->mkobj;
 
err = sysfs_create_link(&root->dev.kobj, &mk->kobj, "module");
if (err) {
diff --git a/driv

Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early

2013-08-21 Thread Li Zhong

On Wed, 2013-08-21 at 21:03 -0700, Greg KH wrote:
> On Thu, Aug 22, 2013 at 10:34:06AM +0800, Li Zhong wrote:
> > On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote:
> > > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote:
> > > > DEBUG_KOBJECT_RELEASE helps to find the issue attached below.
> > > > 
> > > > After some investigation, it seems the reason is:
> > > > The mod->mkobj.kobj(a01600d0 below) is freed together with mod
> > > > itself in free_module(). However, its children still hold references to
> > > > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the
> > > > child(holders below) tries to decrease the reference count to its parent
> > > > in kobject_del(), BUG happens as it tries to access already freed 
> > > > memory.
> > > 
> > > Ah, thanks for tracking this down.  I had seen this in my local testing,
> > > but wasn't able to figure out the offending code.
> > > 
> > > > This patch tries to fix it by waiting for the mod->mkobj.kobj to be
> > > > really released in the module removing process (and some error code
> > > > paths).
> > > 
> > > Nasty, we should just be freeing the structure in the release function,
> > > why doesn't that work?
> > 
> > Hi Greg,
> > 
> > It seems I didn't describe it clearly in the previous mail. I'm trying
> > to do it better below: 
> > 
> > mod->mkobj.kobj is embedded in module_kobject(not a pointer):
> > struct module_kobject {
> > struct kobject kobj;
> > ...
> > 
> > and allocated with the module memory. So we could see the parent below
> > a01600d0 is between MODULES_VADDR (a000) and
> > MODULES_END(ff00). 
> > 
> > It seem to me that the mkobj.kobj is freed by module_free(mod,
> > mod->module_core). 
> 
> Ick, you are right.  If a kobject is being embedded in an object, it
> should control the lifespan of the object, not somewhere else like is
> happening here.
> 
> The best solution for this is to make the kobject a pointer, not
> embedded in the structure, that will fix this issue, right?

Yes, I think so. I'll try to write a fix using this way, thanks for your
suggestion.

Thanks, Zhong

> 
> thanks,
> 
> greg k-h
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early

2013-08-21 Thread Li Zhong

On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote:
> On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote:
> > DEBUG_KOBJECT_RELEASE helps to find the issue attached below.
> > 
> > After some investigation, it seems the reason is:
> > The mod->mkobj.kobj(a01600d0 below) is freed together with mod
> > itself in free_module(). However, its children still hold references to
> > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the
> > child(holders below) tries to decrease the reference count to its parent
> > in kobject_del(), BUG happens as it tries to access already freed memory.
> 
> Ah, thanks for tracking this down.  I had seen this in my local testing,
> but wasn't able to figure out the offending code.
> 
> > This patch tries to fix it by waiting for the mod->mkobj.kobj to be
> > really released in the module removing process (and some error code
> > paths).
> 
> Nasty, we should just be freeing the structure in the release function,
> why doesn't that work?

Hi Greg,

It seems I didn't describe it clearly in the previous mail. I'm trying
to do it better below: 

mod->mkobj.kobj is embedded in module_kobject(not a pointer):
struct module_kobject {
struct kobject kobj;
...

and allocated with the module memory. So we could see the parent below
a01600d0 is between MODULES_VADDR (a000) and
MODULES_END(ff00). 

It seem to me that the mkobj.kobj is freed by module_free(mod,
mod->module_core). 

So in free_module(), we need delay the call of module_free(), until
mkobj.kobj finishes its cleanup. 

> > [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, 
> > parent a01600d0 (delayed)
> > [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent 
> > a01600d0 (delayed)
> > [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, 
> > parent a01600d0
> > [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup 
> > kobject_del
> > [ 1845.184120] BUG: unable to handle kernel paging request at 
> > a01601d0
> > [ 1845.185026] IP: [] kobject_put+0x11/0x60
> > [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0
> > [ 1845.185026] Oops:  [#1] PREEMPT 
> > [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: 
> > kprobe_example]
> > [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G   O 
> > 3.11.0-rc6-next-20130819+ #1
> > [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > [ 1845.185026] Workqueue: events kobject_delayed_cleanup
> > [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 
> > 88007ca5c000
> > [ 1845.185026] RIP: 0010:[]  [] 
> > kobject_put+0x11/0x60
> > [ 1845.185026] RSP: 0018:88007ca5dd08  EFLAGS: 00010282
> > [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 
> > 8177d638
> > [ 1845.185026] RDX: 88007ca5dc18 RSI:  RDI: 
> > a01600d0
> > [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: 
> > 
> > [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 
> > 81a95040
> > [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: 
> > 
> > [ 1845.185026] FS:  () GS:81a23000() 
> > knlGS:
> > [ 1845.185026] CS:  0010 DS:  ES:  CR0: 8005003b
> > [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 
> > 06b0
> > [ 1845.185026] Stack:
> > [ 1845.185026]  88007c1f1600 88007c1f1600 88007ca5dd38 
> > 812cdb7e
> > [ 1845.185026]   88007c1f1640 88007ca5dd68 
> > 812cdbfe
> > [ 1845.185026]  88007c974800 88007c1f1640 88007ff61a00 
> > 
> > [ 1845.185026] Call Trace:
> > [ 1845.185026]  [] kobject_del+0x2e/0x40
> > [ 1845.185026]  [] kobject_delayed_cleanup+0x6e/0x1d0
> > [ 1845.185026]  [] process_one_work+0x1e5/0x670
> > [ 1845.185026]  [] ? process_one_work+0x183/0x670
> > [ 1845.185026]  [] worker_thread+0x113/0x370
> > [ 1845.185026]  [] ? rescuer_thread+0x290/0x290
> > [ 1845.185026]  [] kthread+0xda/0xe0
> > [ 1845.185026]  [] ? _raw_spin_unlock_irq+0x30/0x60
> > [ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
> > [ 1845.185026]  [] ret_from_fork+0x7a/0xb0
> > [ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
> > [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9

[RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early

2013-08-21 Thread Li Zhong

DEBUG_KOBJECT_RELEASE helps to find the issue attached below.

After some investigation, it seems the reason is:
The mod->mkobj.kobj(a01600d0 below) is freed together with mod
itself in free_module(). However, its children still hold references to
it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the
child(holders below) tries to decrease the reference count to its parent
in kobject_del(), BUG happens as it tries to access already freed memory.

This patch tries to fix it by waiting for the mod->mkobj.kobj to be
really released in the module removing process (and some error code
paths).

[ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent 
a01600d0 (delayed)
[ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent 
a01600d0 (delayed)
[ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent 
a01600d0
[ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del
[ 1845.184120] BUG: unable to handle kernel paging request at a01601d0
[ 1845.185026] IP: [] kobject_put+0x11/0x60
[ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0
[ 1845.185026] Oops:  [#1] PREEMPT 
[ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example]
[ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G   O 
3.11.0-rc6-next-20130819+ #1
[ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 1845.185026] Workqueue: events kobject_delayed_cleanup
[ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 
88007ca5c000
[ 1845.185026] RIP: 0010:[]  [] 
kobject_put+0x11/0x60
[ 1845.185026] RSP: 0018:88007ca5dd08  EFLAGS: 00010282
[ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638
[ 1845.185026] RDX: 88007ca5dc18 RSI:  RDI: a01600d0
[ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: 
[ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040
[ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: 
[ 1845.185026] FS:  () GS:81a23000() 
knlGS:
[ 1845.185026] CS:  0010 DS:  ES:  CR0: 8005003b
[ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0
[ 1845.185026] Stack:
[ 1845.185026]  88007c1f1600 88007c1f1600 88007ca5dd38 
812cdb7e
[ 1845.185026]   88007c1f1640 88007ca5dd68 
812cdbfe
[ 1845.185026]  88007c974800 88007c1f1640 88007ff61a00 

[ 1845.185026] Call Trace:
[ 1845.185026]  [] kobject_del+0x2e/0x40
[ 1845.185026]  [] kobject_delayed_cleanup+0x6e/0x1d0
[ 1845.185026]  [] process_one_work+0x1e5/0x670
[ 1845.185026]  [] ? process_one_work+0x183/0x670
[ 1845.185026]  [] worker_thread+0x113/0x370
[ 1845.185026]  [] ? rescuer_thread+0x290/0x290
[ 1845.185026]  [] kthread+0xda/0xe0
[ 1845.185026]  [] ? _raw_spin_unlock_irq+0x30/0x60
[ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
[ 1845.185026]  [] ret_from_fork+0x7a/0xb0
[ 1845.185026]  [] ? kthread_create_on_node+0x130/0x130
[ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff 
ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d  87 
00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 
[ 1845.185026] RIP  [] kobject_put+0x11/0x60
[ 1845.185026]  RSP 
[ 1845.185026] CR2: a01601d0
[ 1845.185026] ---[ end trace 49a70afd109f5653 ]---

Signed-off-by: Li Zhong 
---
 include/linux/module.h |  1 +
 kernel/module.c| 14 +++---
 kernel/params.c|  7 +++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 504035f..05f2447 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -42,6 +42,7 @@ struct module_kobject {
struct module *mod;
struct kobject *drivers_dir;
struct module_param_attrs *mp;
+   struct completion *kobj_completion;
 };
 
 struct module_attribute {
diff --git a/kernel/module.c b/kernel/module.c
index 2069158..b5e2baa 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1611,6 +1611,14 @@ static void module_remove_modinfo_attrs(struct module 
*mod)
kfree(mod->modinfo_attrs);
 }
 
+static void mod_kobject_put(struct module *mod)
+{
+   DECLARE_COMPLETION_ONSTACK(c);
+   mod->mkobj.kobj_completion = &c;
+   kobject_put(&mod->mkobj.kobj);
+   wait_for_completion(&c);
+}
+
 static int mod_sysfs_init(struct module *mod)
 {
int err;
@@ -1638,7 +1646,7 @@ static int mod_sysfs_init(struct module *mod)
err = kobject_init_and_add(&mod->mkobj.kobj, &module_ktype, NULL,
   "%s", mod->name);
if (err)
-   kobject_put(&mod

[PATCH] nohz: fix compile warning in tick_nohz_init()

2013-07-15 Thread Li Zhong

cpu is not used after commit 5b8621a68fdcd2baf1d3b413726f913a5254d46a

Signed-off-by: Li Zhong 
---
 kernel/time/tick-sched.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e80183f..8145860 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -343,8 +343,6 @@ static int tick_nohz_init_all(void)
 
 void __init tick_nohz_init(void)
 {
-   int cpu;
-
if (!have_nohz_full_mask) {
if (tick_nohz_init_all() < 0)
return;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] vfs: partial revoke implementation suggested by Al Viro

2013-07-11 Thread Li Zhong

This patch tries to partially implement what Al Viro suggested in
https://lkml.org/lkml/2013/4/5/15

Code also mostly copied from the above link

Signed-off-by: Li Zhong 
---
 fs/Makefile|2 +-
 fs/revoke.c|  133 
 include/linux/fs.h |2 +
 include/linux/revoke.h |   50 ++
 4 files changed, 186 insertions(+), 1 deletion(-)
 create mode 100644 fs/revoke.c
 create mode 100644 include/linux/revoke.h

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..af0a622 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o revoke.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/revoke.c b/fs/revoke.c
new file mode 100644
index 000..bcda9ba
--- /dev/null
+++ b/fs/revoke.c
@@ -0,0 +1,133 @@
+#include 
+#include 
+#include 
+
+bool __start_using(struct revoke *revoke)
+{
+   struct revokable *r;
+   rcu_read_lock();
+   r = rcu_dereference(revoke->revokable);
+   if (unlikely(!r)) {
+   rcu_read_unlock();
+   return false;   /* revoked */
+   }
+
+   if (likely(atomic_inc_unless_negative(&r->in_use))) {
+   rcu_read_unlock();
+   return true;/* we are using it now */
+   }
+
+   rcu_read_unlock();
+   return false;   /* it's being revoked right now */
+}
+
+#define BIAS (-1U<<31)
+
+void __stop_using(struct revoke *revoke)
+{
+   struct revokable *r;
+   r = rcu_dereference_protected(revoke->revokable, 1);
+   BUG_ON(!r);
+   if (atomic_dec_return(&r->in_use) == BIAS)
+   complete(r->c);
+}
+
+/* called with r->lock held by caller, unlocks it */
+static void __release_revoke(struct revokable *r, struct revoke *revoke)
+{
+   if (revoke->closing) {
+   DECLARE_COMPLETION_ONSTACK(c);
+   revoke->c = &c;
+   spin_unlock(&r->lock);
+   wait_for_completion(&c);
+   } else {
+   struct file *file;
+   revoke->closing = 1;
+   spin_unlock(&r->lock);
+   file = revoke->file;
+   if (file->f_op->release)
+   file->f_op->release(file_inode(file), file);
+   spin_lock(&r->lock);
+   hlist_del_init(&revoke->node);
+   rcu_assign_pointer(revoke->revokable, NULL);
+   rcu_read_lock();/* prevent freeing of r */
+   if (revoke->c)
+   complete(revoke->c);
+   spin_unlock(&r->lock);
+   rcu_read_unlock();
+   }
+}
+
+void release_revoke(struct revoke *revoke)
+{
+   struct revokable *r;
+   rcu_read_lock();
+   r = rcu_dereference(revoke->revokable);
+   if (!r) {
+   /* already closed by revokation */
+   rcu_read_unlock();
+   goto out;
+   }
+
+   spin_lock(&r->lock);
+   if (unlikely(hlist_unhashed(&revoke->node))) {
+   /* just been revoked */
+   spin_unlock(&r->lock);
+   rcu_read_unlock();
+   goto out;
+   }
+
+   /*
+* Ok, revoke_it() couldn't have been finished yet
+* it'll have to get r->lock before it's through, so
+* we can drop rcu_read_lock
+*/
+   rcu_read_unlock();
+   __release_revoke(r, revoke);
+out:
+   kfree(revoke);
+}
+
+void revoke_it(struct revokable *r)
+{
+   DECLARE_COMPLETION_ONSTACK(c);
+   r->c = &c;
+   if (atomic_add_return(BIAS, &r->in_use) != BIAS) {
+   if (r->kick)
+   r->kick(r);
+   wait_for_completion(&c);
+   }
+
+   while (1) {
+   struct hlist_node *p;
+   spin_lock(&r->lock);
+   p = r->list.first;
+   if (!p)
+   break;
+   __release_revoke(r, hlist_entry(p, struct revoke, node));
+   }
+   spin_unlock(&r->lock);
+}
+
+int make_revokable(struct file *f, struct revokable *r)
+{
+   struct revoke *revoke = kzalloc(sizeof(struct revoke), GFP_KERNEL);
+   if (!revoke)
+   return -ENOMEM;
+
+   if (!atomic_inc_unless_negative(&r->in_use)) {
+   kfree(revoke);
+   return -ENOENT;
+   }
+
+   revoke->file = f;
+   revoke->revokable = r;
+   f->f_re

[PATCH 2/2] proc: covert procfs to use the general revoke implementation

2013-07-11 Thread Li Zhong

This patch tries to replace the current revoke logic in procfs with the
implementation suggested by Al Viro in
https://lkml.org/lkml/2013/4/5/15

Below is the replacement guideline copied from that mail:

procfs would have struct revokable embedded into proc_dir_entry, with freeing
of those guys RCUd.  It would set ->f_op to ->proc_fops and call
make_revokable(file, &pde->revokable) in proc_reg_open(); no wrappers for other
methods needed anymore.  All file_operations instances fed to
proc_create() et.al. would lose ->owner - it's already not needed for those,
actually.  remove_proc_entry()/remove_proc_subtree() would call revoke_it()
on everything we are removing.

Signed-off-by: Li Zhong 
---
 fs/compat_ioctl.c  |8 +-
 fs/eventpoll.c |   10 ++-
 fs/file_table.c|   13 ++-
 fs/ioctl.c |7 +-
 fs/proc/generic.c  |   12 +--
 fs/proc/inode.c|  229 +---
 fs/proc/internal.h |9 +--
 fs/read_write.c|   30 +--
 fs/select.c|   11 ++-
 mm/mmap.c  |8 +-
 mm/nommu.c |   16 +++-
 11 files changed, 111 insertions(+), 242 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5d19acf..48da3bf 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -110,6 +110,7 @@
 #include 
 
 #include 
+#include 
 
 #ifdef CONFIG_SPARC
 #include 
@@ -1584,7 +1585,12 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, 
unsigned int cmd,
 
default:
if (f.file->f_op && f.file->f_op->compat_ioctl) {
-   error = f.file->f_op->compat_ioctl(f.file, cmd, arg);
+   if (likely(start_using(f.file))) {
+   error = f.file->f_op->compat_ioctl(f.file,
+   cmd, arg);
+   stop_using(f.file);
+   } else
+   error = -ENOTTY;
if (error != -ENOIOCTLCMD)
goto out_fput;
}
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 9ad17b15..c73e8af 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * LOCKING:
@@ -776,9 +777,16 @@ static int ep_eventpoll_release(struct inode *inode, 
struct file *file)
 
 static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
 {
+   unsigned int rv = DEFAULT_POLLMASK;
pt->_key = epi->event.events;
 
-   return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
+   if (likely(start_using(epi->ffd.file))) {
+   rv = epi->ffd.file->f_op->poll(epi->ffd.file, pt)
+   & epi->event.events;
+   stop_using(epi->ffd.file);
+   }
+
+   return rv;
 }
 
 static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
diff --git a/fs/file_table.c b/fs/file_table.c
index 08e719b..b3d55e0 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -244,14 +245,20 @@ static void __fput(struct file *file)
file->f_op->fasync(-1, file, 0);
}
ima_file_free(file);
-   if (file->f_op && file->f_op->release)
-   file->f_op->release(inode, file);
+   if (file->f_op && file->f_op->release) {
+   if (likely(!(file->f_revoke)))
+   file->f_op->release(inode, file);
+   else
+   release_revoke(file->f_revoke);
+   }
security_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
 !(file->f_mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
-   fops_put(file->f_op);
+   if (likely(!(file->f_revoke)))
+   fops_put(file->f_op);
+
put_pid(file->f_owner.pid);
if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..7610377 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -40,7 +41,11 @@ static long vfs_ioctl(struct file *filp, unsigned int cmd,
if (!filp->f_op || !filp->f_op->unlocked_ioctl)
goto out;
 
-   error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
+   if (likely(start_using(filp))) {
+   error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
+   stop_using(filp);
+   }
+
if (error == -ENOIOCTLCMD)
error = -ENOTTY;
  out:
diff --git a/fs

[RFC PATCH 0/2] partial revoke implementation for procfs

2013-07-11 Thread Li Zhong

Hi Al Viro, 

I tried to implement what you proposed in:
https://lkml.org/lkml/2013/4/5/15

When you have time, would you please help to take a look at it, and give your 
comments? 
I sent the draft early, just want to make sure I didn't misunderstand anything
in your proposal, and what I did and next step plan are not heading the wrong
direction.

This is only an initial draft to do only what procfs is needed

1. didn't do anything about kick, mmap, fasync, ... and something you mentioned
in following mails

2. I only wrapped f_ops in vfs which are used by procfs ( but maybe I still
missed something). 

3. It seems all f_ops are const, so I couldn't easliy clear the pointer of
->owner, maybe that needs every calling site of proc_create(_data) to make sure
the proc_fops doesn't have ->owner set? 

Currently, I added an ugly check in __fput, so if if ->f_revoke is set in file,
we don't call fput_ops; and in proc_reg_open(), restore the old ->f_op if
make_revokable() fails.

patch 1: adding the implementation proposed in your mail
patch 2: convert procfs to use this implementation

If there aren't any big issues, I plan to look for another file system 
(with backing device) to try other things that's not implemented this time.

Thanks, Zhong

Li Zhong (2):
  vfs: partial revoke implementation suggested by Al Viro
  proc: covert procfs to use the general revoke implementation

 fs/Makefile|2 +-
 fs/compat_ioctl.c  |8 +-
 fs/eventpoll.c |   10 ++-
 fs/file_table.c|   13 ++-
 fs/ioctl.c |7 +-
 fs/proc/generic.c  |   12 +--
 fs/proc/inode.c|  229 
 fs/proc/internal.h |9 +-
 fs/read_write.c|   30 +--
 fs/revoke.c|  133 
 fs/select.c|   11 ++-
 include/linux/fs.h |2 +
 include/linux/revoke.h |   50 +++
 mm/mmap.c  |8 +-
 mm/nommu.c |   16 +++-
 15 files changed, 297 insertions(+), 243 deletions(-)
 create mode 100644 fs/revoke.c
 create mode 100644 include/linux/revoke.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH nohz] return NOTIFY_BAD in cpu down call back to stop offlining the cpu

2013-05-17 Thread Li Zhong

In tick_nohz_cpu_down_callback() if the cpu is the one handling
timekeeping , it seems that we should return something that could stop
notify CPU_DOWN_PREPARE, and then start notify CPU_DOWN_FAILED on the
already called notifier call backs.

-EINVAL will be converted to 0 by notifier_to_errno(), then the cpu
would be taken down with part of the DOWN_PREPARE notifier callbacks
called, and something bad could happen after that.

Signed-off-by: Li Zhong 
---
 kernel/time/tick-sched.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index bc67d42..17b8155 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -306,7 +306,7 @@ static int __cpuinit tick_nohz_cpu_down_callback(struct 
notifier_block *nfb,
 * we can't safely shutdown that CPU.
 */
if (have_nohz_full_mask && tick_do_timer_cpu == cpu)
-   return -EINVAL;
+   return NOTIFY_BAD;
break;
}
return NOTIFY_OK;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S

2013-05-14 Thread Li Zhong

On Tue, 2013-05-14 at 16:13 +0200, Frederic Weisbecker wrote:
> On Fri, May 10, 2013 at 05:12:26PM -0400, Steven Rostedt wrote:
> > +/*
> > + * This is a entry point to the scheduler() just before going
> > + * back to user space. This is called with irqs disabled
> > + * which prevents races with the CONTEXT_TRACKING updates.
> > + */
> > +asmlinkage void __sched schedule_preempt_user(void)
> > +{
> > +   enum ctx_state prev_state;
> > +
> > +   prev_state = exception_enter();
> > +
> > +   local_irq_enable();
> > +   __schedule();
> > +   local_irq_disable();
> > +
> > +   exception_exit(prev_state);
> > +}
> 
> Ok I just had a look at how ARM and PPC64 are handling user preemption and it 
> seems
> that irqs are disabled around the call to schedule() on these archs too. 
> Although
> do_work_pending() in ARM surprisingly doesn't re-enable irqs before calling 
> schedule?
> 
> Anyway having irqs disabled around user preemption seem to be a requirement 
> to make
> sure the TIF_NEED_RESCHED check is not racy against irqs and return to 
> userspace.
> So I guess we can keep the above function as it is.
> 
> But perhaps we should queue this for 3.11 given that it's a bit of a 
> sensitive change
> in the x86 user return path.
> 
> Look, I'm just going to make a seperate pull request with this patch based on 
> 3.10-rc1
> and let Ingo choose the target.

Could the deletion of schedule_user() be separated to another patch. So
for other archs, such as ppc64, we could have both
schedule_preempt_user() and schedule_user() there for the conversion?

Or are there some better ways to avoid the conflict with arch trees? 

Thanks, Zhong

> 
> (Meanwhile I still think it would be a good idea to keep LOCKDEP_SYS_EXIT in 
> the loop :-)
> 
> Thanks!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S

2013-05-14 Thread Li Zhong

On Mon, 2013-05-13 at 11:03 -0400, Steven Rostedt wrote:
> On Mon, 2013-05-13 at 17:56 +0800, Li Zhong wrote:
> 
> > > All this before schedule_user() was able to call user_exit() and take us
> > > out of dynamic tick user context.
> > 
> > Maybe we could just disable function trace for schedule_user()? 
> > 
> > It seems that function trace might use RCU, at least in __schedule()
> > when calling preempt_enable(), and if preemption is really enabled. 
> 
> There is a small use of rcu scheduling (preempt disable/enable) in
> function tracing. One is for perf and dynamic call traces and some hash
> code (multiple users of the function tracer). But those are the unique
> cases, and I really do not want to remove this function just for the non
> common case. Function tracing has proven extremely useful in debugging
> RCU and context tracking. By adding notrace, it's a lost cause.

OK, so it seems a bad idea to disable trace here. 

However, as schedule is traced, so we might get something like 
.schedule <-.schedule_user to know that schedule_user is called?

> 
> Not to mention, adding trace_printk() there will break it too, and that
> does not use any RCU but still disables preemption because the writing
> to the ring buffer requires staying on the same CPU.
> 
I guess you mean the preemption enable after buffer written might cause
a task migration, and break the pair of user_enter/user_exit?

Then how about adding trace_printk() after user_exit() there?

> > 
> > user_exit() is used to allow RCU usage after that (with rcu_user_exit).
> > RCU usage is not allowed before that because cpu is in user eqs. And if
> > function trace needs use RCU, then it seems user_exit() itself or its
> > callers couldn't be function traced.
> 
> And it can't be debugged either.
> 
> I can probably black list those functions manually, such that only the
> main function tracer can trace it, all others will not. In otherwords, I
> can have it such that function tracer will not trace those functions for
> perf, dynamic function tracers, or anything that requires changing the
> function list.

I'm totally lost. Need some more time to read the trace code (please
forgive me for the silly questions in this mail...)

By "main function tracer", do you mean the function_trace_call()? And I
don't understand how to black list some functions manually?

If function_trace_call() is the main function tracer you mentioned,
there is preemption disable/enable in it. schedule might happen where
preemption is enabled. Or maybe you mean we should make all those
functions called with preemption disabled(or irqs disabled)?  

Then syscall_trace_enter(), syscall_trace_leave(), do_notify_resume(),
need be converted the similar way as schedule_preempt_user() did? 

Thanks, Zhong

> I could probably also add a heavier weight synchronize_sched() that even
> synchronizes cpus in userspace.
> 
> > 
> > If __schedule() in preempt_enable() is the only place function trace
> > uses RCU, then after converting to schedule_preempt_user(), it is safe
> > as irqs are disabled for __schedule() to happen. But user_exit() itself
> > and some other callers might still need function trace be disabled.
> 
> The above makes no sense to me. function tracing didn't break, but the
> user_exit() did because of a preemption in the wrong place, as there was
> no protection there to prevent an unnecessary preemption.
> 
> 
> -- Steve
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-05-13 Thread Li Zhong

This is the syscall slow path hooks for context tracking subsystem,
corresponding to
[PATCH] x86: Syscall hooks for userspace RCU extended QS
  commit bf5a3c13b939813d28ce26c01425054c740d6731

TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
better for it to be in the same 16 bits with others in the group, so in the
asm code, andi. with this group could work.

Signed-off-by: Li Zhong 
Acked-by: Frederic Weisbecker 
---
 arch/powerpc/include/asm/thread_info.h |7 +--
 arch/powerpc/kernel/ptrace.c   |5 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 8ceea14..ba7b197 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
-#define TIF_MEMDIE 9   /* is terminating due to OOM killer */
+#define TIF_NOHZ   9   /* in adaptive nohz mode */
 #define TIF_SECCOMP10  /* secure computing */
 #define TIF_RESTOREALL 11  /* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR12  /* Force successful syscall return */
@@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT 15  /* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE16  /* Is an instruction emulation
for stack store? */
+#define TIF_MEMDIE 17  /* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1<
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1788,6 +1789,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 {
long ret = 0;
 
+   user_exit();
+
secure_computing_strict(regs->gpr[0]);
 
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
@@ -1832,4 +1835,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, step);
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 0/5] powerpc: Support context tracking for Power pSeries

2013-05-13 Thread Li Zhong

These patches try to support context tracking for Power arch, beginning with
64-bit pSeries. The codes are ported from that of the x86_64, and in each
patch, I listed the corresponding patch for x86.

v4:
fixed some cosmetic issues suggested by Ben.

Li Zhong (5):
  powerpc: Syscall hooks for context tracking subsystem
  powerpc: Exception hooks for context tracking subsystem
  powerpc: Exit user context on notify resume
  powerpc: Use the new schedule_user API on userspace preemption
  powerpc: select HAVE_CONTEXT_TRACKING for pSeries

 arch/powerpc/include/asm/context_tracking.h |   10 
 arch/powerpc/include/asm/thread_info.h  |7 ++-
 arch/powerpc/kernel/entry_64.S  |3 +-
 arch/powerpc/kernel/ptrace.c|5 ++
 arch/powerpc/kernel/signal.c|5 ++
 arch/powerpc/kernel/traps.c |   80 +++
 arch/powerpc/mm/fault.c |   41 +-
 arch/powerpc/mm/hash_utils_64.c |   36 +---
 arch/powerpc/platforms/pseries/Kconfig  |1 +
 9 files changed, 140 insertions(+), 48 deletions(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 5/5] powerpc: select HAVE_CONTEXT_TRACKING for pSeries

2013-05-13 Thread Li Zhong

Start context tracking support from pSeries.

Signed-off-by: Li Zhong 
---
 arch/powerpc/platforms/pseries/Kconfig |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 9a0941b..023b288 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -18,6 +18,7 @@ config PPC_PSERIES
select PPC_PCI_CHOICE if EXPERT
select ZLIB_DEFLATE
select PPC_DOORBELL
+   select HAVE_CONTEXT_TRACKING
default y
 
 config PPC_SPLPAR
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 2/5] powerpc: Exception hooks for context tracking subsystem

2013-05-13 Thread Li Zhong

This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.

This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
  commit 6ba3c97a38803883c2eee489505796cb0a727122

But after the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7c8734f055da2a6bac553caf4a468eb
  context_tracking: Move exception handling to generic code
6c1e0256fad84a843d915414e4b5973b7443d48d
  context_tracking: Restore correct previous context state on exception exit

it is able for exception hooks to use the generic code above instead of a
redundant arch implementation.

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/traps.c |   80 ---
 arch/powerpc/mm/fault.c |   41 +---
 arch/powerpc/mm/hash_utils_64.c |   36 +-
 3 files changed, 112 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 83efa2f..a7a648f 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -667,6 +668,7 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state = exception_enter();
int recover = 0;
 
__get_cpu_var(irq_stat).mce_exceptions++;
@@ -683,7 +685,7 @@ void machine_check_exception(struct pt_regs *regs)
recover = cur_cpu_spec->machine_check(regs);
 
if (recover > 0)
-   return;
+   goto bail;
 
 #if defined(CONFIG_8xx) && defined(CONFIG_PCI)
/* the qspan pci read routines can cause machine checks -- Cort
@@ -693,20 +695,23 @@ void machine_check_exception(struct pt_regs *regs)
 * -- BenH
 */
bad_page_fault(regs, regs->dar, SIGBUS);
-   return;
+   goto bail;
 #endif
 
if (debugger_fault_handler(regs))
-   return;
+   goto bail;
 
if (check_io_access(regs))
-   return;
+   goto bail;
 
die("Machine check", regs, SIGBUS);
 
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
panic("Unrecoverable Machine check");
+
+bail:
+   exception_exit(prev_state);
 }
 
 void SMIException(struct pt_regs *regs)
@@ -716,20 +721,29 @@ void SMIException(struct pt_regs *regs)
 
 void unknown_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state = exception_enter();
+
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
_exception(SIGTRAP, regs, 0, 0);
+
+   exception_exit(prev_state);
 }
 
 void instruction_breakpoint_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state = exception_enter();
+
if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5,
5, SIGTRAP) == NOTIFY_STOP)
-   return;
+   goto bail;
if (debugger_iabr_match(regs))
-   return;
+   goto bail;
_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
+
+bail:
+   exception_exit(prev_state);
 }
 
 void RunModeException(struct pt_regs *regs)
@@ -739,15 +753,20 @@ void RunModeException(struct pt_regs *regs)
 
 void __kprobes single_step_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state = exception_enter();
+
clear_single_step(regs);
 
if (notify_die(DIE_SSTEP, "single_step", regs, 5,
5, SIGTRAP) == NOTIFY_STOP)
-   return;
+   goto bail;
if (debugger_sstep(regs))
-   return;
+   goto bail;
 
_exception(SIGTRAP, regs, TRAP_TRACE, regs->nip);
+
+bail:
+   exception_exit(prev_state);
 }
 
 /*
@@ -1005,6 +1024,7 @@ int is_valid_bugaddr(unsigned long addr)
 
 void __kprobes program_check_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state = exception_enter();
unsigned int reason = get_reason(regs);
extern int do_mathemu(struct pt_regs *regs);
 
@@ -1014,26 +1034,26 @@ void __kprobes program_check_exception(struct pt_regs 
*regs)
if (reason & REASON_FP) {
/* IEEE FP exception */
parse_fpe(regs);
-   return;
+   goto bail;
}
if (reason & REASON_TRAP) {
/* Debugger is first in line to stop recursive faults in
 * rcu_lock, notify_die, or atomic_notifier_call_chain */
if (debugger_bpt(regs))
-   ret

[RFC PATCH v4 4/5] powerpc: Use the new schedule_user API on userspace preemption

2013-05-13 Thread Li Zhong

This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
  commit 0430499ce9d78691f3985962021b16bf8f8a8048

Signed-off-by: Li Zhong 
---
 arch/powerpc/include/asm/context_tracking.h |   10 ++
 arch/powerpc/kernel/entry_64.S  |3 ++-
 2 files changed, 12 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

diff --git a/arch/powerpc/include/asm/context_tracking.h 
b/arch/powerpc/include/asm/context_tracking.h
new file mode 100644
index 000..b6f5a33
--- /dev/null
+++ b/arch/powerpc/include/asm/context_tracking.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H
+#define _ASM_POWERPC_CONTEXT_TRACKING_H
+
+#ifdef CONFIG_CONTEXT_TRACKING
+#define SCHEDULE_USER bl   .schedule_user
+#else
+#define SCHEDULE_USER bl   .schedule
+#endif
+
+#endif
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 915fbb4..d418977 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * System calls.
@@ -634,7 +635,7 @@ _GLOBAL(ret_from_except_lite)
andi.   r0,r4,_TIF_NEED_RESCHED
beq 1f
bl  .restore_interrupts
-   bl  .schedule
+   SCHEDULE_USER
b   .ret_from_except_lite
 
 1: bl  .save_nvgprs
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 3/5] powerpc: Exit user context on notify resume

2013-05-13 Thread Li Zhong

This patch allows RCU usage in do_notify_resume, e.g. signal handling.
It corresponds to
[PATCH] x86: Exit RCU extended QS on notify resume
  commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/signal.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index cf12eae..d63b502 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs)
 
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 {
+   user_exit();
+
if (thread_info_flags & _TIF_UPROBE)
uprobe_notify_resume(regs);
 
@@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
}
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S

2013-05-13 Thread Li Zhong

On Fri, 2013-05-10 at 17:12 -0400, Steven Rostedt wrote:
> plain text document attachment (fix-user-exit-preempt.patch)
> I started testing the new NOHZ_FULL in the kernel and had some issues,
> so I started function tracing and this bug report came out:
> 
> 
> [23446.458073] [ cut here ]
> [23446.461028] WARNING:
> at /home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388
> rcu_eqs_enter+0x4b/0x89()
> [23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge
> stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables ipv6 uinput snd_hda_codec_id
> t snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm
> kvm_intel snd_timer snd kvm soundcore shpchp snd_page_alloc microcode
> i2c_i801 pata_acpi firewire_ohci firewi
> re_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit
> i2c_core video
> [23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11
> [23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
> [23446.466096]  814879f0 81004fa6 810337af
> 88007d40
> [23446.466096]    88007d40d900
> 81a01f00
> [23446.466096]  81acba00 88007d6495c0 810a1f36
> 0a28
> [23446.466096] Call Trace:
> [23446.466096]  [] ? dump_stack+0xd/0x17
> [23446.466096]  [] ? show_stack+0x5/0x3e
> [23446.466096]  [] ? warn_slowpath_common+0x64/0x7c
> [23446.466096]  [] ? rcu_eqs_enter+0x4b/0x89
> [23446.466096]  [] ? rcu_idle_enter+0x1a/0x30
> [23446.466096]  [] ? ftrace_graph_caller+0x85/0x85
> [23446.466096]  [] cpu_startup_entry+0xb3/0x11c
> [23446.466096]  [] ? start_kernel+0x41a/0x425
> [23446.466096]  [] ? repair_env_string+0x54/0x54
> [23446.466096] ---[ end trace ddbb69ae2a0f6687 ]---
> [23446.466096] [ cut here ]
> 
> This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit().
> Adding more tracing, I also discovered that there seemed to be an
> imbalance in user_exit() and user_enter(). Specifically, I found that
> user_enter() was called, and then a migration happened and user_exit()
> was never called before the schedule. Then I had noticed this in my
> trace:
> 
> run-isol-11546   1.N.. 37363.637641: function: schedule_user <-- 
> retint_careful
> run-isol-11546   1.N.. 37363.637641: function: __schedule <-- 
> preempt_schedule
> 
> The funny part here is that schedule_user does not call
> preempt_schedule. But then I also noticed the flags of the call to
> schedule_user(): "1.N..". This shows that it was running on cpu 1 with
> NEED_RESCHED set (N), but both interrupts and preemption are enabled. If
> an interrupt came in here we can schedule out again, but that would call
> preempt_schedule_irq, which has code to check if its in "user context
> (according to dynamic_ticks)" and would do the user_exit() too. But
> something didn't seem right here.
> 
> Then I realized that it was actually the function tracer itself, as for
> every function it traces, it calls preempt_disable() and
> preempt_enable() to record the data on a per_cpu buffer. That
> preempt_enable() noticed the NEED_RESCHED set and since preemption is
> enabled, it kindly called preempt_schedule() for us!
> 
> All this before schedule_user() was able to call user_exit() and take us
> out of dynamic tick user context.

Maybe we could just disable function trace for schedule_user()? 

It seems that function trace might use RCU, at least in __schedule()
when calling preempt_enable(), and if preemption is really enabled. 

user_exit() is used to allow RCU usage after that (with rcu_user_exit).
RCU usage is not allowed before that because cpu is in user eqs. And if
function trace needs use RCU, then it seems user_exit() itself or its
callers couldn't be function traced.

If __schedule() in preempt_enable() is the only place function trace
uses RCU, then after converting to schedule_preempt_user(), it is safe
as irqs are disabled for __schedule() to happen. But user_exit() itself
and some other callers might still need function trace be disabled.

Thanks, Zhong 

> 
> I even verified this by adding a:
> 
>   preempt_disable();
>   preempt_enable();
> 
> just before the user_exit() and ran a test without doing any tracing and
> was able to easily hit this bug.
> 
> I then looked at the code in entry_64.S, and noticed that the calls to
> SCHEDULE_USER were all encompassed with ENABLE_INTERRUPTS and
> DISABLE_INTERRUPTS before calling schedule. And the SCHEDULE_USER macro
> is different if we have CONTEXT_TRACKING enabled or not.
> 
> I created a new schedule_preempt_user() which is similar to
> schedule_preempt_disable()  and works like preempt_schedule_irq(), where
> it includes the exception_enter() and exit() routines that take care of
> the accounting of dynamic ticks when interrupting user mode or not. It
> also ta

Re: [RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem

2013-05-13 Thread Li Zhong

On Mon, 2013-05-13 at 19:06 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-05-13 at 16:44 +0800, Li Zhong wrote:
> > Yes, the above and hash_page() are two C functions for a same exception.
> > And the exception hooks enable RCU usage in those C codes. But for asm
> > codes, I think we could assume that there would be no RCU usage there,
> > so we don't need wrap them in the hooks.
> 
> hash_page() won't start a new RCU, at least not in its current incarnation,
> the only thing I can see it ever doing would be to take some RCU read locks 
> one
> day (it doesn't today).

Seems I added the hooks because of the trace point of hcall
entry/exit ...

Thanks, Zhong

> 
> low_hash_fault() is a different beast. It will typically kill things, thus
> involving sending signals etc... RCU might well be involved.
> 
> Cheers,
> Ben.
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries

2013-05-13 Thread Li Zhong

On Mon, 2013-05-13 at 18:59 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-05-13 at 16:03 +0800, Li Zhong wrote:
> > 
> > To my understanding, it is used to enable RCU user extended quiescent
> > state, so RCU on that cpu doesn't need scheduler ticks. And together
> > with some other code(already in 3.10), we are able to remove the ticks
> > in some cases (e.g. only 1 task running on the cpu, with some other
> > limitations).
> 
> Ok, sounds interesting. Once you fix the little cosmetic issue, I don't
> see any reason not to merge them as it's basically wiring up an existing
> feature (in that regard the patches are pretty straightforward) and I
> assume the overhead is only there when you enable it.

Thanks for your support, Ben.

Yes, there should be no overhead if not enabled. 

Thanks, Zhong

> 
> Cheers,
> Ben.
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem

2013-05-13 Thread Li Zhong

On Mon, 2013-05-13 at 15:57 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-05-13 at 13:21 +0800, Li Zhong wrote:
> > int recover = 0;
> > +   enum ctx_state prev_state;
> > +
> > +   prev_state = exception_enter();
> 
> Please make it nicer:
> 
>   enum ctx_state prev_state = exception_enter();
>   int recover = 0;

OK, I'll fix it, and all the others.

> > __get_cpu_var(irq_stat).mce_exceptions++;
> >  
> > @@ -683,7 +687,7 @@ void machine_check_exception(struct pt_regs *regs)
> > recover = cur_cpu_spec->machine_check(regs);
> >  
> > if (recover > 0)
> > -   return;
> > +   goto exit;
> 
> I'm no fan of "exit" as a label as it's otherwise a libc function (I
> know there is no relationship here but I just find that annoying).
> 
> I usually use "bail"

OK, I'll fix it, and all the others.

> 
> >  #if defined(CONFIG_8xx) && defined(CONFIG_PCI)
> > /* the qspan pci read routines can cause machine checks -- Cort
> > @@ -693,20 +697,23 @@ void machine_check_exception(struct pt_regs *regs)
> >  * -- BenH
> >  */
> > bad_page_fault(regs, regs->dar, SIGBUS);
> > -   return;
> > +   goto exit;
> >  #endif
> >  
> > if (debugger_fault_handler(regs))
> > -   return;
> > +   goto exit;
> >  
> > if (check_io_access(regs))
> > -   return;
> > +   goto exit;
> >  
> > die("Machine check", regs, SIGBUS);
> >  
> > /* Must die if the interrupt is not recoverable */
> > if (!(regs->msr & MSR_RI))
> > panic("Unrecoverable Machine check");
> > +
> > +exit:
> > +   exception_exit(prev_state);
> >  }
> >  
> >  void SMIException(struct pt_regs *regs)
> > @@ -716,20 +723,31 @@ void SMIException(struct pt_regs *regs)
> >  
> >  void unknown_exception(struct pt_regs *regs)
> >  {
> > +   enum ctx_state prev_state;
> > +   prev_state = exception_enter();
> > +
> 
> Same cosmetic comment for all other cases
> 
>  /...
> 
> >  #include 
> >  #include 
> > @@ -193,8 +194,8 @@ static int mm_fault_error(struct pt_regs *regs, 
> > unsigned long addr, int fault)
> >   * The return value is 0 if the fault was handled, or the signal
> >   * number if this is a kernel fault that can't be handled here.
> >   */
> > -int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
> > -   unsigned long error_code)
> > +static int __kprobes __do_page_fault(struct pt_regs *regs,
> > +   unsigned long address, unsigned long error_code)
> >  {
> > struct vm_area_struct * vma;
> > struct mm_struct *mm = current->mm;
> > @@ -475,6 +476,17 @@ bad_area_nosemaphore:
> >  
> >  }
> >  
> > +int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
> > +   unsigned long error_code)
> > +{
> > +   int ret;
> > +   enum ctx_state prev_state;
> > +   prev_state = exception_enter();
> > +   ret = __do_page_fault(regs, address, error_code);
> > +   exception_exit(prev_state);
> > +   return ret;
> > +}
> 
> Any reason why you didn't use the same wrapping trick in traps.c ? I'd
> rather we stay consistent in the way we do things between the two files.

OK, I'll make it consistent with those in traps.c

> 
> >  /*
> >   * bad_page_fault is called when we have a bad access from the kernel.
> >   * It is called from the DSI and ISI handlers in head.S and from some
> > diff --git a/arch/powerpc/mm/hash_utils_64.c 
> > b/arch/powerpc/mm/hash_utils_64.c
> > index 88ac0ee..d92fb26 100644
> > --- a/arch/powerpc/mm/hash_utils_64.c
> > +++ b/arch/powerpc/mm/hash_utils_64.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -962,6 +963,9 @@ int hash_page(unsigned long ea, unsigned long access, 
> > unsigned long trap)
> > const struct cpumask *tmp;
> > int rc, user_region = 0, local = 0;
> > int psize, ssize;
> > +   enum ctx_state prev_state;
> > +
> > +   prev_state = exception_enter();
> >  
> > DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
> > ea, access, trap);
> > @@ -973,7 +977,8 @@ int hash_page(unsigned long ea, unsigned long a

Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries

2013-05-13 Thread Li Zhong

On Mon, 2013-05-13 at 15:51 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-05-13 at 13:21 +0800, Li Zhong wrote:
> > These patches try to support context tracking for Power arch, beginning with
> > 64-bit pSeries. The codes are ported from that of the x86_64, and in each
> > patch, I listed the corresponding patch for x86.
> 
> So that's yet another pile of bloat on all syscall entry/exit and
> exception entry/exit. What is it used for ? (I haven't followed on
> x86_64 side).

To my understanding, it is used to enable RCU user extended quiescent
state, so RCU on that cpu doesn't need scheduler ticks. And together
with some other code(already in 3.10), we are able to remove the ticks
in some cases (e.g. only 1 task running on the cpu, with some other
limitations).

Maybe Paul, or Frederic could give some better descriptions. 

Thanks, Zhong

> 
> Cheers,
> Ben.
> 
> > v3:
> > 
> > This version is mainly a rebasing, against 3.10-rc1, also as the common 
> > code 
> > to handle the exception are pulled into 3.10, so there is no dependency on 
> > tip tree. So patch #2 and #6 in previous version_2 is merged together. 
> > 
> > Li Zhong (5):
> >   powerpc: Syscall hooks for context tracking subsystem
> >   powerpc: Exception hooks for context tracking subsystem
> >   powerpc: Exit user context on notify resume
> >   powerpc: Use the new schedule_user API on userspace preemption
> >   powerpc: select HAVE_CONTEXT_TRACKING for pSeries
> > 
> >  arch/powerpc/include/asm/context_tracking.h |   10 +++
> >  arch/powerpc/include/asm/thread_info.h  |7 ++-
> >  arch/powerpc/kernel/entry_64.S  |3 +-
> >  arch/powerpc/kernel/ptrace.c|5 ++
> >  arch/powerpc/kernel/signal.c|5 ++
> >  arch/powerpc/kernel/traps.c |   91 
> > ---
> >  arch/powerpc/mm/fault.c |   16 -
> >  arch/powerpc/mm/hash_utils_64.c |   38 ---
> >  arch/powerpc/platforms/pseries/Kconfig  |1 +
> >  9 files changed, 140 insertions(+), 36 deletions(-)
> >  create mode 100644 arch/powerpc/include/asm/context_tracking.h
> > 
> 
> 




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-05-12 Thread Li Zhong

This is the syscall slow path hooks for context tracking subsystem,
corresponding to
[PATCH] x86: Syscall hooks for userspace RCU extended QS
  commit bf5a3c13b939813d28ce26c01425054c740d6731

TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
better for it to be in the same 16 bits with others in the group, so in the
asm code, andi. with this group could work.

Signed-off-by: Li Zhong 
Acked-by: Frederic Weisbecker 
---
 arch/powerpc/include/asm/thread_info.h |7 +--
 arch/powerpc/kernel/ptrace.c   |5 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 8ceea14..ba7b197 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
-#define TIF_MEMDIE 9   /* is terminating due to OOM killer */
+#define TIF_NOHZ   9   /* in adaptive nohz mode */
 #define TIF_SECCOMP10  /* secure computing */
 #define TIF_RESTOREALL 11  /* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR12  /* Force successful syscall return */
@@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT 15  /* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE16  /* Is an instruction emulation
for stack store? */
+#define TIF_MEMDIE 17  /* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1<
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1788,6 +1789,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 {
long ret = 0;
 
+   user_exit();
+
secure_computing_strict(regs->gpr[0]);
 
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
@@ -1832,4 +1835,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, step);
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3 3/5] powerpc: Exit user context on notify resume

2013-05-12 Thread Li Zhong

This patch allows RCU usage in do_notify_resume, e.g. signal handling.
It corresponds to
[PATCH] x86: Exit RCU extended QS on notify resume
  commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/signal.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index cf12eae..d63b502 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs)
 
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 {
+   user_exit();
+
if (thread_info_flags & _TIF_UPROBE)
uprobe_notify_resume(regs);
 
@@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
}
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3 4/5] powerpc: Use the new schedule_user API on userspace preemption

2013-05-12 Thread Li Zhong

This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
  commit 0430499ce9d78691f3985962021b16bf8f8a8048

Signed-off-by: Li Zhong 
---
 arch/powerpc/include/asm/context_tracking.h |   10 ++
 arch/powerpc/kernel/entry_64.S  |3 ++-
 2 files changed, 12 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

diff --git a/arch/powerpc/include/asm/context_tracking.h 
b/arch/powerpc/include/asm/context_tracking.h
new file mode 100644
index 000..b6f5a33
--- /dev/null
+++ b/arch/powerpc/include/asm/context_tracking.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H
+#define _ASM_POWERPC_CONTEXT_TRACKING_H
+
+#ifdef CONFIG_CONTEXT_TRACKING
+#define SCHEDULE_USER bl   .schedule_user
+#else
+#define SCHEDULE_USER bl   .schedule
+#endif
+
+#endif
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 915fbb4..d418977 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * System calls.
@@ -634,7 +635,7 @@ _GLOBAL(ret_from_except_lite)
andi.   r0,r4,_TIF_NEED_RESCHED
beq 1f
bl  .restore_interrupts
-   bl  .schedule
+   SCHEDULE_USER
b   .ret_from_except_lite
 
 1: bl  .save_nvgprs
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3 5/5] powerpc: select HAVE_CONTEXT_TRACKING for pSeries

2013-05-12 Thread Li Zhong

Start context tracking support from pSeries.

Signed-off-by: Li Zhong 
---
 arch/powerpc/platforms/pseries/Kconfig |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 9a0941b..023b288 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -18,6 +18,7 @@ config PPC_PSERIES
select PPC_PCI_CHOICE if EXPERT
select ZLIB_DEFLATE
select PPC_DOORBELL
+   select HAVE_CONTEXT_TRACKING
default y
 
 config PPC_SPLPAR
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem

2013-05-12 Thread Li Zhong

This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.

This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
  commit 6ba3c97a38803883c2eee489505796cb0a727122

But after the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7c8734f055da2a6bac553caf4a468eb
  context_tracking: Move exception handling to generic code
6c1e0256fad84a843d915414e4b5973b7443d48d
  context_tracking: Restore correct previous context state on exception exit

it is able for exception hooks to use the generic code above instead of a
redundant arch implementation.

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/traps.c |   91 +--
 arch/powerpc/mm/fault.c |   16 ++-
 arch/powerpc/mm/hash_utils_64.c |   38 
 3 files changed, 112 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 83efa2f..9d3c000 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -668,6 +669,9 @@ int machine_check_generic(struct pt_regs *regs)
 void machine_check_exception(struct pt_regs *regs)
 {
int recover = 0;
+   enum ctx_state prev_state;
+
+   prev_state = exception_enter();
 
__get_cpu_var(irq_stat).mce_exceptions++;
 
@@ -683,7 +687,7 @@ void machine_check_exception(struct pt_regs *regs)
recover = cur_cpu_spec->machine_check(regs);
 
if (recover > 0)
-   return;
+   goto exit;
 
 #if defined(CONFIG_8xx) && defined(CONFIG_PCI)
/* the qspan pci read routines can cause machine checks -- Cort
@@ -693,20 +697,23 @@ void machine_check_exception(struct pt_regs *regs)
 * -- BenH
 */
bad_page_fault(regs, regs->dar, SIGBUS);
-   return;
+   goto exit;
 #endif
 
if (debugger_fault_handler(regs))
-   return;
+   goto exit;
 
if (check_io_access(regs))
-   return;
+   goto exit;
 
die("Machine check", regs, SIGBUS);
 
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
panic("Unrecoverable Machine check");
+
+exit:
+   exception_exit(prev_state);
 }
 
 void SMIException(struct pt_regs *regs)
@@ -716,20 +723,31 @@ void SMIException(struct pt_regs *regs)
 
 void unknown_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state;
+   prev_state = exception_enter();
+
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
_exception(SIGTRAP, regs, 0, 0);
+
+   exception_exit(prev_state);
 }
 
 void instruction_breakpoint_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state;
+   prev_state = exception_enter();
+
if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5,
5, SIGTRAP) == NOTIFY_STOP)
-   return;
+   goto exit;
if (debugger_iabr_match(regs))
-   return;
+   goto exit;
_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
+
+exit:
+   exception_exit(prev_state);
 }
 
 void RunModeException(struct pt_regs *regs)
@@ -739,15 +757,21 @@ void RunModeException(struct pt_regs *regs)
 
 void __kprobes single_step_exception(struct pt_regs *regs)
 {
+   enum ctx_state prev_state;
+   prev_state = exception_enter();
+
clear_single_step(regs);
 
if (notify_die(DIE_SSTEP, "single_step", regs, 5,
5, SIGTRAP) == NOTIFY_STOP)
-   return;
+   goto exit;
if (debugger_sstep(regs))
-   return;
+   goto exit;
 
_exception(SIGTRAP, regs, TRAP_TRACE, regs->nip);
+
+exit:
+   exception_exit(prev_state);
 }
 
 /*
@@ -1006,34 +1030,37 @@ int is_valid_bugaddr(unsigned long addr)
 void __kprobes program_check_exception(struct pt_regs *regs)
 {
unsigned int reason = get_reason(regs);
+   enum ctx_state prev_state;
extern int do_mathemu(struct pt_regs *regs);
 
+   prev_state = exception_enter();
+
/* We can now get here via a FP Unavailable exception if the core
 * has no FPU, in that case the reason flags will be 0 */
 
if (reason & REASON_FP) {
/* IEEE FP exception */
parse_fpe(regs);
-   return;
+   goto exit;
}
if (reason & REASON_TRAP) {
/* Debugger is first in line to stop recursive fault

[RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries

2013-05-12 Thread Li Zhong

These patches try to support context tracking for Power arch, beginning with
64-bit pSeries. The codes are ported from that of the x86_64, and in each
patch, I listed the corresponding patch for x86.

v3:

This version is mainly a rebasing, against 3.10-rc1, also as the common code 
to handle the exception are pulled into 3.10, so there is no dependency on 
tip tree. So patch #2 and #6 in previous version_2 is merged together. 

Li Zhong (5):
  powerpc: Syscall hooks for context tracking subsystem
  powerpc: Exception hooks for context tracking subsystem
  powerpc: Exit user context on notify resume
  powerpc: Use the new schedule_user API on userspace preemption
  powerpc: select HAVE_CONTEXT_TRACKING for pSeries

 arch/powerpc/include/asm/context_tracking.h |   10 +++
 arch/powerpc/include/asm/thread_info.h  |7 ++-
 arch/powerpc/kernel/entry_64.S  |3 +-
 arch/powerpc/kernel/ptrace.c|5 ++
 arch/powerpc/kernel/signal.c|5 ++
 arch/powerpc/kernel/traps.c |   91 ---
 arch/powerpc/mm/fault.c |   16 -
 arch/powerpc/mm/hash_utils_64.c |   38 ---
 arch/powerpc/platforms/pseries/Kconfig  |1 +
 9 files changed, 140 insertions(+), 36 deletions(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[tip:timers/nohz] nohz: Protect smp_processor_id() in tick_nohz_task_switch()

2013-04-29 Thread tip-bot for Li Zhong

Commit-ID:  6296ace467c8640317414ba589b124323806f7ce
Gitweb: http://git.kernel.org/tip/6296ace467c8640317414ba589b124323806f7ce
Author: Li Zhong 
AuthorDate: Sun, 28 Apr 2013 11:25:58 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 29 Apr 2013 13:17:33 +0200

nohz: Protect smp_processor_id() in tick_nohz_task_switch()

I saw following error when testing the latest nohz code on
Power:

[   85.295384] BUG: using smp_processor_id() in preemptible [] code: 
rsyslogd/3493
[   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
[   85.295402] Call Trace:
[   85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c 
(unreliable)
[   85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30
[   85.295430] [c001fababc10] [c044eb74] 
.debug_smp_processor_id+0xf4/0x124
[   85.295438] [c001fababca0] [c00d7594] 
.tick_nohz_task_switch+0x1c/0xb8
[   85.295447] [c001fababd20] [c00b9748] 
.finish_task_switch+0x13c/0x160
[   85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124
[   85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54

The code below moves the test into local_irq_save/restore
section to avoid the above complaint.

Signed-off-by: Li Zhong 
Acked-by: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Paul McKenney 
Link: 
http://lkml.kernel.org/r/1367119558.6391.34.ca...@thinkpad-t5421.cn.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/time/tick-sched.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da53c8f..1c9f53b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -251,14 +251,15 @@ void tick_nohz_task_switch(struct task_struct *tsk)
 {
unsigned long flags;
 
-   if (!tick_nohz_full_cpu(smp_processor_id()))
-   return;
-
local_irq_save(flags);
 
+   if (!tick_nohz_full_cpu(smp_processor_id()))
+   goto out;
+
if (tick_nohz_tick_stopped() && !can_stop_full_tick())
tick_nohz_full_kick();
 
+out:
local_irq_restore(flags);
 }
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2]nohz: protect smp_processor_id() in tick_nohz_task_switch()

2013-04-27 Thread Li Zhong

I saw following error when testing the latest nohz code on Power:

[   85.295384] BUG: using smp_processor_id() in preemptible [] code: 
rsyslogd/3493
[   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
[   85.295402] Call Trace:
[   85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c 
(unreliable)
[   85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30
[   85.295430] [c001fababc10] [c044eb74] 
.debug_smp_processor_id+0xf4/0x124
[   85.295438] [c001fababca0] [c00d7594] 
.tick_nohz_task_switch+0x1c/0xb8
[   85.295447] [c001fababd20] [c00b9748] 
.finish_task_switch+0x13c/0x160
[   85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124
[   85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54

The code below moves the test into local_irq_save/restore section to
avoid the above complaint.

Signed-off-by: Li Zhong 
---
 kernel/time/tick-sched.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da53c8f..1c9f53b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -251,14 +251,15 @@ void tick_nohz_task_switch(struct task_struct *tsk)
 {
unsigned long flags;
 
-   if (!tick_nohz_full_cpu(smp_processor_id()))
-   return;
-
local_irq_save(flags);
 
+   if (!tick_nohz_full_cpu(smp_processor_id()))
+   goto out;
+
if (tick_nohz_tick_stopped() && !can_stop_full_tick())
tick_nohz_full_kick();
 
+out:
local_irq_restore(flags);
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] cpuset: fix compile warning when CONFIG_SMP=n

2013-04-27 Thread Li Zhong

On Sun, 2013-04-28 at 09:46 +0800, Li Zefan wrote:
> Reported by Fengguang's kbuild test robot:

Thank you for the quick fix, Zefan. 

Strong robot :)

Thanks, Zhong

> 
> kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used
> 
> Introduced by commit e0e80a02e5701c8790bd348ab59edb154fbda60b
> ("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
> which removed generate_sched_domains() from cpuset_hotplug_workfn().
> 
> Reported-by: Fengguang Wu 
> Signed-off-by: Li Zefan 
> ---
>  kernel/cpuset.c | 7 ---
>  1 file changed, 7 deletions(-)
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index ef05901..2422c5b 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -789,13 +789,6 @@ out:
>  static void rebuild_sched_domains_locked(void)
>  {
>  }
> -
> -static int generate_sched_domains(cpumask_var_t **domains,
> - struct sched_domain_attr **attributes)
> -{
> - *domains = NULL;
> - return 1;
> -}
>  #endif /* CONFIG_SMP */
> 
>  void rebuild_sched_domains(void)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH]nohz: Use raw_smp_processor_id() in tick_nohz_task_switch()

2013-04-27 Thread Li Zhong

On Sat, 2013-04-27 at 15:40 +0200, Frederic Weisbecker wrote:
> 2013/4/27 Li Zhong :
> > I saw following error when testing the latest nohz code on Power:
> >
> > [   85.295384] BUG: using smp_processor_id() in preemptible [] 
> > code: rsyslogd/3493
> > [   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
> > [   85.295402] Call Trace:
> > [   85.295408] [c001fababab0] [c0012dc4] 
> > .show_stack+0x110/0x25c (unreliable)
> > [   85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30
> > [   85.295430] [c001fababc10] [c044eb74] 
> > .debug_smp_processor_id+0xf4/0x124
> > [   85.295438] [c001fababca0] [c00d7594] 
> > .tick_nohz_task_switch+0x1c/0xb8
> > [   85.295447] [c001fababd20] [c00b9748] 
> > .finish_task_switch+0x13c/0x160
> > [   85.295455] [c001fababdb0] [c00bbe50] 
> > .schedule_tail+0x50/0x124
> > [   85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54
> >
> > It seems to me that we could just use raw_smp_processor_id() here. Even
> > if the tick_nohz_full_cpu() check is done on a !nohz_full cpu, then the
> > task is moved to another nohz_full cpu, it seems the context switching
> > because of the task moving would call tick_nohz_task_switch() again to
> > evaluate the need for tick.
> >
> > I don't know whether I missed something here.
> 
> You're right it looks safe to do so. But I suggest we rather move the
> test inside local_irq_save()/restore section to avoid confusion on
> reviewers minds.

OK, I'll send an updated version, using local_irq_save() to protect it. 
I tried using raw_* because seems it could avoid some unnecessary irq
disabling...

Thanks, Zhong

> Thanks!
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH]nohz: Use raw_smp_processor_id() in tick_nohz_task_switch()

2013-04-27 Thread Li Zhong

I saw following error when testing the latest nohz code on Power:

[   85.295384] BUG: using smp_processor_id() in preemptible [] code: 
rsyslogd/3493
[   85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
[   85.295402] Call Trace:
[   85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c 
(unreliable)
[   85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30
[   85.295430] [c001fababc10] [c044eb74] 
.debug_smp_processor_id+0xf4/0x124
[   85.295438] [c001fababca0] [c00d7594] 
.tick_nohz_task_switch+0x1c/0xb8
[   85.295447] [c001fababd20] [c00b9748] 
.finish_task_switch+0x13c/0x160
[   85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124
[   85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54

It seems to me that we could just use raw_smp_processor_id() here. Even
if the tick_nohz_full_cpu() check is done on a !nohz_full cpu, then the
task is moved to another nohz_full cpu, it seems the context switching
because of the task moving would call tick_nohz_task_switch() again to
evaluate the need for tick.

I don't know whether I missed something here.

Signed-off-by: Li Zhong 
---
 kernel/time/tick-sched.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da53c8f..0aa575b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -251,7 +251,7 @@ void tick_nohz_task_switch(struct task_struct *tsk)
 {
unsigned long flags;
 
-   if (!tick_nohz_full_cpu(smp_processor_id()))
+   if (!tick_nohz_full_cpu(raw_smp_processor_id()))
return;
 
local_irq_save(flags);
-- 
1.7.1



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()

2013-04-12 Thread Li Zhong

On Thu, 2013-04-11 at 16:57 +0800, Li Zefan wrote:
> On 2013/4/9 17:59, Li Zhong wrote:
> > Hi Zefan, 
> > 
> > I did some test today, enabling cpuset and online/offline the cpus. It
> > caused NULL address dereferencing in get_group(). After adding some
> > debugging code, it seems that it is caused by using some offlined cpus
> > as the parameter to partition_sched_domains(). More details below:
> > 
> > ===
> > following is printed in build_sched_domain(): 
> > @@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct
> > sched_domain_topology_level *tl,
> > return child;
> > 
> > cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
> > +   if (cpumask_empty(sched_domain_span(sd)))
> > +   printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
> > +   char cpulist[128];
> > +   cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
> > +   printk("lz cpu_map %s\n", cpulist);
> > +   cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
> > +   printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
> > +
> > if (child) {
> > sd->level = child->level + 1;
> > sched_domain_level_max = max(sched_domain_level_max,
> > sd->level);
> > 
> > [  232.714502] lz empty domain c001f73c, mask cpu_smt_mask
> > +0x0/0x10
> > [  232.714515] lz cpu_map 1-2,7,13-15
> > [  232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
> > [  232.714529] lz cpu_map 1-2,7,13-15
> > [  232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
> > [  232.714544] lz cpu_map 1-2,7,13-15
> > [  232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
> > [  232.714558] lz cpu_map 1-2,7,13-15
> > [  232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
> > ===
> > 
> > It seems that one cpu (#1 of the above) could be taken offline in
> > cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
> > propagated work function finished. But generate_sched_domains(), which
> > uses the cpuset information, still considers this cpu (#1) valid, and
> > passes it down in the cpumask to partition_sched_domains(). Then we have
> > an empty sd as the above. 
> > 
> 
> Thanks for the test and analysis!
> 
> So the bug was caused when the cpuset async cpuset handling hasn't been
> finished, and at this time user triggered another hotplug operation, right?

Seems so, the summary is very clear :)

> Then I think we should wait cpuset to finish its work before starting
> a new hotplug operation.

en.

> This should fix the bug you observed, and it should also fix another bug,
> that the 2nd schedule_work(&cpuset_hotplug_work) returns true, and so
> the root cpuset.cpus won't be updated.

OK, I just realized that the work can only have one instance... 

As schedule_work returns false if work already in the queue, so I guess
we should check against false.

And it seems to me that we also need the logic in cpu_up path, for
similar reasons, or WARN_ON might be triggered, and cpuset cpu info
won't get updated.

> Could you test this patch?

Seems it works, with the above two changes.

For your convenience, the final code I used for testing is attached
below. 

Thanks, Zhong

---
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 8c8a60d..d3ba31d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -119,6 +119,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
 }
 
+extern void cpuset_flush_hotplug_work(void);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init(void) { return 0; }
@@ -233,6 +235,10 @@ static inline bool put_mems_allowed(unsigned int seq)
return true;
 }
 
+static inline void cpuset_flush_hotplug_work(void)
+{
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b5e4ab2..6eb914b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "smpboot.h"
 
@@ -278,6 +279,13 @@ static int __ref _cpu_down(unsigned int cpu, int 
tasks_frozen)
if (!cpu_online(cpu))
return -EINVAL;
 
+   /*
+* cpuset handles cpu/memory hotplug asynchronously, so it's possible
+* that cpuset hasn't finished it's hotplug work for the previous
+* hotplug operation.
+*/
+   cpuset_flush_hotplug_work();
+
cpu_hotplug_begin();
 
err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_cal

Re: [RFC PATCH v2 6/6] powerpc: Use generic code for exception handling

2013-04-09 Thread Li Zhong

On Wed, 2013-04-10 at 13:32 +0800, Li Zhong wrote:
> On Wed, 2013-04-10 at 14:56 +1000, Michael Ellerman wrote:
> > On Fri, Mar 29, 2013 at 06:00:21PM +0800, Li Zhong wrote:
> > > After the exception handling moved to generic code, and some changes in
> > ...
> > > diff --git a/arch/powerpc/mm/hash_utils_64.c 
> > > b/arch/powerpc/mm/hash_utils_64.c
> > > index 360fba8..eeab30f 100644
> > > --- a/arch/powerpc/mm/hash_utils_64.c
> > > +++ b/arch/powerpc/mm/hash_utils_64.c
> > > @@ -33,6 +33,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -56,7 +57,6 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > >  
> > >  #ifdef DEBUG
> > >  #define DBG(fmt...) udbg_printf(fmt)
> > > @@ -919,13 +919,17 @@ int hash_page(unsigned long ea, unsigned long 
> > > access, unsigned long trap)
> > >   const struct cpumask *tmp;
> > >   int rc, user_region = 0, local = 0;
> > >   int psize, ssize;
> > > + enum ctx_state prev_state;
> > > +
> > > + prev_state = exception_enter();
> > >  
> > >   DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
> > >   ea, access, trap);
> > >  
> > >   if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) {
> > >   DBG_LOW(" out of pgtable range !\n");
> > > - return 1;
> > > + rc = 1;
> > > + goto exit;
> > >   }
> > >  
> > >   /* Get region & vsid */
> > 
> > This no longer applies on mainline, please send an updated version.
> 
> Yes, for current mainline (powerpc tree), only previous five patches
> could be applied. The dependency of this patch is current in tip tree,
> and seems would be in for 3.10.
> 
> There are some more details in the cover letter (#0):
> 
> "I assume these patches would get in through powerpc tree, so I didn't
> combine the new patch (#6) with the original one (#2). So that if
> powerpc tree picks these, it could pick the first five patches, and
> apply patch #6 later when the dependency enters into powerpc tree (maybe
> on some 3.10-rcs)."

And I will send an updated version of this one when I see the dependency
commits in mainline. 

Thanks, Zhong

> Thanks, Zhong
> 
> > cheers
> > 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 6/6] powerpc: Use generic code for exception handling

2013-04-09 Thread Li Zhong

On Wed, 2013-04-10 at 14:56 +1000, Michael Ellerman wrote:
> On Fri, Mar 29, 2013 at 06:00:21PM +0800, Li Zhong wrote:
> > After the exception handling moved to generic code, and some changes in
> ...
> > diff --git a/arch/powerpc/mm/hash_utils_64.c 
> > b/arch/powerpc/mm/hash_utils_64.c
> > index 360fba8..eeab30f 100644
> > --- a/arch/powerpc/mm/hash_utils_64.c
> > +++ b/arch/powerpc/mm/hash_utils_64.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -56,7 +57,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  
> >  #ifdef DEBUG
> >  #define DBG(fmt...) udbg_printf(fmt)
> > @@ -919,13 +919,17 @@ int hash_page(unsigned long ea, unsigned long access, 
> > unsigned long trap)
> > const struct cpumask *tmp;
> > int rc, user_region = 0, local = 0;
> > int psize, ssize;
> > +   enum ctx_state prev_state;
> > +
> > +   prev_state = exception_enter();
> >  
> > DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
> > ea, access, trap);
> >  
> > if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) {
> > DBG_LOW(" out of pgtable range !\n");
> > -   return 1;
> > +   rc = 1;
> > +   goto exit;
> > }
> >  
> > /* Get region & vsid */
> 
> This no longer applies on mainline, please send an updated version.

Yes, for current mainline (powerpc tree), only previous five patches
could be applied. The dependency of this patch is current in tip tree,
and seems would be in for 3.10.

There are some more details in the cover letter (#0):

"I assume these patches would get in through powerpc tree, so I didn't
combine the new patch (#6) with the original one (#2). So that if
powerpc tree picks these, it could pick the first five patches, and
apply patch #6 later when the dependency enters into powerpc tree (maybe
on some 3.10-rcs)."

Thanks, Zhong

> cheers
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()

2013-04-09 Thread Li Zhong

Hi Zefan, 

I did some test today, enabling cpuset and online/offline the cpus. It
caused NULL address dereferencing in get_group(). After adding some
debugging code, it seems that it is caused by using some offlined cpus
as the parameter to partition_sched_domains(). More details below:

===
following is printed in build_sched_domain(): 
@@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct
sched_domain_topology_level *tl,
return child;

cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+   if (cpumask_empty(sched_domain_span(sd)))
+   printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
+   char cpulist[128];
+   cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
+   printk("lz cpu_map %s\n", cpulist);
+   cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
+   printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
+
if (child) {
sd->level = child->level + 1;
sched_domain_level_max = max(sched_domain_level_max,
sd->level);

[  232.714502] lz empty domain c001f73c, mask cpu_smt_mask
+0x0/0x10
[  232.714515] lz cpu_map 1-2,7,13-15
[  232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
[  232.714529] lz cpu_map 1-2,7,13-15
[  232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
[  232.714544] lz cpu_map 1-2,7,13-15
[  232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
[  232.714558] lz cpu_map 1-2,7,13-15
[  232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
===

It seems that one cpu (#1 of the above) could be taken offline in
cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
propagated work function finished. But generate_sched_domains(), which
uses the cpuset information, still considers this cpu (#1) valid, and
passes it down in the cpumask to partition_sched_domains(). Then we have
an empty sd as the above. 

With the empty domain, in get_group(),
if (child)
cpu = cpumask_first(sched_domain_span(child));

this might cause the cpu to be returned a value >= nr_cpu_ids, then
cause bad dereference with the wrong cpu value in the later code. 

This following code tries to fix the issue by anding the cpu_active_mask
at the end of generate_sched_domains() for each partition. This might
result the partiton (set of domains) not the best one, but we know
another rebuild (caused by the cpu offline in the middle of
cpuset_hotplug_workfn()) is on the way. 

Also it seems that generate_sched_domains() needs be called together
with partition_sched_domains(), under the hotplug lock. Or similarly,
one cpu might be taken offline while doing generate_sched_domains(), and
cause the above issue. 

Or maybe we could put the whole cpuset_hotplug_workfn() into the hotplug
lock? 

Thanks, Zhong

---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..aed8436 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -754,6 +754,12 @@ done:
 */
if (doms == NULL)
ndoms = 1;
+   else {
+   for (nslot = 0; nslot < ndoms; nslot++) {
+   struct cpumask *dp = doms[nslot];
+   cpumask_and(dp, dp, cpu_active_mask);
+   }
+   }
 
*domains= doms;
*attributes = dattr;
@@ -,17 +2228,8 @@ static void cpuset_hotplug_workfn(struct work_struct 
*work)
flush_workqueue(cpuset_propagate_hotplug_wq);
 
/* rebuild sched domains if cpus_allowed has changed */
-   if (cpus_updated) {
-   struct sched_domain_attr *attr;
-   cpumask_var_t *doms;
-   int ndoms;
-
-   mutex_lock(&cpuset_mutex);
-   ndoms = generate_sched_domains(&doms, &attr);
-   mutex_unlock(&cpuset_mutex);
-
-   partition_sched_domains(ndoms, doms, attr);
-   }
+   if (cpus_updated)
+   rebuild_sched_domains();
 }
 
 void cpuset_update_active_cpus(bool cpu_online)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()

2013-04-08 Thread Li Zhong

On Wed, 2013-04-03 at 17:32 +0800, Li Zefan wrote:
> On 2013/4/2 15:16, Li Zhong wrote:
> > In cpuset_hotplug_workfn(), partition_sched_domains() is called without
> > hotplug lock held, which is actually needed (stated in the function
> > header of partition_sched_domains()). 
> > 
> > This patch tries to use rebuild_sched_domains() to solve the above
> > issue, and makes the code looks a little simpler.
> > 
Hi Zefan, 

Thanks for your quick review.

> I guess you found this just by code inspection, right?

Ah, yes :)

> The change looks fine and safe at a first glance, but we don't have
> time to review the patch for now.
> 
> However I don't know why partition_sched_domains() has to be called
> with hotplug lock held.
> 
> After a quick scan, seems the only place in partition_sched_domains()
> that might need hotplug lock is arch_update_cpu_topology().

and cpu_attach_domain(), which is called by partition_sched_domains()
also states that it needs hotplug lock be held.

It seems that the code is moved out of hotplug critical section after it
is converted to work functions. 

Thanks, Zhong

> 
> It would be better if you had CCed the scheduler guys...They may
> know the answer.
> 
> > Signed-off-by: Li Zhong 
> > ---
> >  kernel/cpuset.c |   13 ++---
> >  1 files changed, 2 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > index 4f9dfe4..515a713 100644
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -,17 +,8 @@ static void cpuset_hotplug_workfn(struct work_struct 
> > *work)
> > flush_workqueue(cpuset_propagate_hotplug_wq);
> >  
> > /* rebuild sched domains if cpus_allowed has changed */
> > -   if (cpus_updated) {
> > -   struct sched_domain_attr *attr;
> > -   cpumask_var_t *doms;
> > -   int ndoms;
> > -
> > -   mutex_lock(&cpuset_mutex);
> > -   ndoms = generate_sched_domains(&doms, &attr);
> > -   mutex_unlock(&cpuset_mutex);
> > -
> > -   partition_sched_domains(ndoms, doms, attr);
> > -   }
> > +   if (cpus_updated)
> > +   rebuild_sched_domains();
> >  }
> >  
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2 2/6] powerpc: Exception hooks for context tracking subsystem

2013-04-08 Thread Li Zhong

On Fri, 2013-04-05 at 13:50 +1100, Paul Mackerras wrote:
> On Fri, Mar 29, 2013 at 06:00:17PM +0800, Li Zhong wrote:
> > This is the exception hooks for context tracking subsystem, including
> > data access, program check, single step, instruction breakpoint, machine 
> > check,
> > alignment, fp unavailable, altivec assist, unknown exception, whose handlers
> > might use RCU.
> > 
> > This patch corresponds to
> > [PATCH] x86: Exception hooks for userspace RCU extended QS
> >   commit 6ba3c97a38803883c2eee489505796cb0a727122
> > 
> > Signed-off-by: Li Zhong 

Hi Paul, 

Thanks for your review! Please check my answers below, and correct me if
any errors.

> Is there a reason why you didn't put the exception_exit() call in
> ret_from_except_lite in entry_64.S, and the exception_entry() call in
> EXCEPTION_PROLOG_COMMON?  That would seem to catch all these cases in
> a more centralized place.

It seems to me that ret_from_except_lite and EXCEPTION_PROLOG_COMMON are
also used by interrupts, where I think we don't need the hooks. So using
this way could help to avoid adding overhead to these code path
(interrupts, and some exit path of syscall). 

And I think adding the hook on higher level code seems a little easier
for reading and checking. It seems that some exceptions don't use
EXCEPTION_PROLOG_COMMON, and some don't go ret_from_except_lite exit
path (like fp unavailable might go directly to fast_exception_return ).
Maybe fast_exception_return is a centralized place for us to return to
user space? But it still adds some overheads which is not necessarily
needed. 

And I think it also makes the implementation here consistent with the
style that x86 uses. 

> Also, I notice that with the exception_exit calls where they are, we
> can still deliver signals (thus possibly taking a page fault) or call
> schedule() for preemption after the exception_exit() call.  Is that
> OK, or is it a potential problem?

If I understand correctly, I guess you are talking about the cases where
we might return to user space without context state correctly being set
as in user?

There is user_enter() called in do_notify_resume() in patch #3, so after
handling the signals we always call user_enter(). 

There are also some changes of the context_tracking code from Frederic,
which might be related: ( they are now in tip tree, and url of the
patches for your convenience https://lkml.org/lkml/2013/3/1/266 )

6c1e0256fad84a843d915414e4b5973b7443d48d
context_tracking: Restore correct previous context state on exception
exit. 

With this patch, if a later exception happened after user_enter(),
before the CPU actually returns to user space, the correct context
state(in user) is saved and restored when handling the later exception. 

Patch #6 converts the code to use these new APIs, which is currently not
available in powerpc tree. 

b22366cd54c6fe05db426f20adb10f461c19ec06
context_tracking: Restore preempted context state after
preempt_schedule_irq

With this patch, the user context state could be correctly restored
after schedule returns. 

Thanks, Zhong

> Paul.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()

2013-04-02 Thread Li Zhong

In cpuset_hotplug_workfn(), partition_sched_domains() is called without
hotplug lock held, which is actually needed (stated in the function
header of partition_sched_domains()). 

This patch tries to use rebuild_sched_domains() to solve the above
issue, and makes the code looks a little simpler.

Signed-off-by: Li Zhong 
---
 kernel/cpuset.c |   13 ++---
 1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..515a713 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -,17 +,8 @@ static void cpuset_hotplug_workfn(struct work_struct 
*work)
flush_workqueue(cpuset_propagate_hotplug_wq);
 
/* rebuild sched domains if cpus_allowed has changed */
-   if (cpus_updated) {
-   struct sched_domain_attr *attr;
-   cpumask_var_t *doms;
-   int ndoms;
-
-   mutex_lock(&cpuset_mutex);
-   ndoms = generate_sched_domains(&doms, &attr);
-   mutex_unlock(&cpuset_mutex);
-
-   partition_sched_domains(ndoms, doms, attr);
-   }
+   if (cpus_updated)
+   rebuild_sched_domains();
 }
 
 void cpuset_update_active_cpus(bool cpu_online)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 4/6] powerpc: Use the new schedule_user API on userspace preemption

2013-03-29 Thread Li Zhong

This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
  commit 0430499ce9d78691f3985962021b16bf8f8a8048

Signed-off-by: Li Zhong 
---
 arch/powerpc/include/asm/context_tracking.h |   11 +++
 arch/powerpc/kernel/entry_64.S  |3 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/context_tracking.h 
b/arch/powerpc/include/asm/context_tracking.h
index 377146e..4da287e 100644
--- a/arch/powerpc/include/asm/context_tracking.h
+++ b/arch/powerpc/include/asm/context_tracking.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_POWERPC_CONTEXT_TRACKING_H
 #define _ASM_POWERPC_CONTEXT_TRACKING_H
 
+#ifndef __ASSEMBLY__
 #include 
 #include 
 
@@ -25,4 +26,14 @@ static inline void __exception_exit(struct pt_regs *regs)
 #endif
 }
 
+#else /* __ASSEMBLY__ */
+
+#ifdef CONFIG_CONTEXT_TRACKING
+#define SCHEDULE_USER bl   .schedule_user
+#else
+#define SCHEDULE_USER bl   .schedule
+#endif
+
+#endif /* !__ASSEMBLY__ */
+
 #endif
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 256c5bf..f7e4622 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * System calls.
@@ -618,7 +619,7 @@ _GLOBAL(ret_from_except_lite)
andi.   r0,r4,_TIF_NEED_RESCHED
beq 1f
bl  .restore_interrupts
-   bl  .schedule
+   SCHEDULE_USER
b   .ret_from_except_lite
 
 1: bl  .save_nvgprs
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 0/6] powerpc: Support context tracking for Power pSeries

2013-03-29 Thread Li Zhong

These patches try to support context tracking for Power arch, beginning with
64-bit pSeries. The codes are ported from that of the x86_64, and in each
patch, I listed the corresponding patch for x86.

Would you please help review and give your comments?

v2:

I rebased the patches against 3.9-rcs, and also added a patch to replace
the exception handling with the generic code in tip timers/nohz.

I assume these patches would get in through powerpc tree, so I didn't combine
the new patch (#6) with the original one (#2). So that if powerpc tree picks
these, it could pick the first five patches, and apply patch #6 later
when the dependency enters into powerpc tree (maybe on some 3.10-rcs).

I'm also wondering whether it is possible for these to go through
tip timers/nohz, so for now, patches #6 and #2 could be combined into one, and
no need to  worry about the issues caused by arch/common code merging. And it
might also make future changes easier.

Thanks, Zhong

Li Zhong (6):
  powerpc: Syscall hooks for context tracking subsystem
  powerpc: Exception hooks for context tracking subsystem
  powerpc: Exit user context on notify resume
  powerpc: Use the new schedule_user API on userspace preemption
  powerpc: select HAVE_CONTEXT_TRACKING for pSeries
  powerpc: Use generic code for exception handling

 arch/powerpc/include/asm/context_tracking.h |   10 +++
 arch/powerpc/include/asm/thread_info.h  |7 ++-
 arch/powerpc/kernel/entry_64.S  |3 +-
 arch/powerpc/kernel/ptrace.c|5 ++
 arch/powerpc/kernel/signal.c|5 ++
 arch/powerpc/kernel/traps.c |   91 ---
 arch/powerpc/mm/fault.c |   16 -
 arch/powerpc/mm/hash_utils_64.c |   38 ---
 arch/powerpc/platforms/pseries/Kconfig  |1 +
 9 files changed, 140 insertions(+), 36 deletions(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 5/6] powerpc: select HAVE_CONTEXT_TRACKING for pSeries

2013-03-29 Thread Li Zhong

Start context tracking support from pSeries.

Signed-off-by: Li Zhong 
---
 arch/powerpc/platforms/pseries/Kconfig |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 9a0941b..023b288 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -18,6 +18,7 @@ config PPC_PSERIES
select PPC_PCI_CHOICE if EXPERT
select ZLIB_DEFLATE
select PPC_DOORBELL
+   select HAVE_CONTEXT_TRACKING
default y
 
 config PPC_SPLPAR
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 6/6] powerpc: Use generic code for exception handling

2013-03-29 Thread Li Zhong

After the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7c8734f055da2a6bac553caf4a468eb
  context_tracking: Move exception handling to generic code
6c1e0256fad84a843d915414e4b5973b7443d48d
  context_tracking: Restore correct previous context state on exception exit

it is able for this patch to replace the implementation in arch code
with the generic code in above commits.

Signed-off-by: Li Zhong 
---
 arch/powerpc/include/asm/context_tracking.h |   29 ---
 arch/powerpc/kernel/exceptions-64s.S|4 +--
 arch/powerpc/kernel/traps.c |   42 +-
 arch/powerpc/mm/fault.c |7 ++--
 arch/powerpc/mm/hash_utils_64.c |   51 ++-
 5 files changed, 57 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/include/asm/context_tracking.h 
b/arch/powerpc/include/asm/context_tracking.h
index 4da287e..b6f5a33 100644
--- a/arch/powerpc/include/asm/context_tracking.h
+++ b/arch/powerpc/include/asm/context_tracking.h
@@ -1,39 +1,10 @@
 #ifndef _ASM_POWERPC_CONTEXT_TRACKING_H
 #define _ASM_POWERPC_CONTEXT_TRACKING_H
 
-#ifndef __ASSEMBLY__
-#include 
-#include 
-
-/*
- * temporarily defined to avoid potential conflicts with the common
- * implementation, these will be removed by a later patch after the common
- * code enters powerpc tree
- */
-#define exception_enter __exception_enter
-#define exception_exit __exception_exit
-
-static inline void __exception_enter(struct pt_regs *regs)
-{
-   user_exit();
-}
-
-static inline void __exception_exit(struct pt_regs *regs)
-{
-#ifdef CONFIG_CONTEXT_TRACKING
-   if (user_mode(regs))
-   user_enter();
-#endif
-}
-
-#else /* __ASSEMBLY__ */
-
 #ifdef CONFIG_CONTEXT_TRACKING
 #define SCHEDULE_USER bl   .schedule_user
 #else
 #define SCHEDULE_USER bl   .schedule
 #endif
 
-#endif /* !__ASSEMBLY__ */
-
 #endif
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 6d82f4f..a8a5361 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1368,17 +1368,15 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB)
rlwimi  r4,r0,32-13,30,30   /* becomes _PAGE_USER access bit */
ori r4,r4,1 /* add _PAGE_PRESENT */
rlwimi  r4,r5,22+2,31-2,31-2/* Set _PAGE_EXEC if trap is 0x400 */
-   addir6,r1,STACK_FRAME_OVERHEAD
 
/*
 * r3 contains the faulting address
 * r4 contains the required access permissions
 * r5 contains the trap number
-* r6 contains the address of pt_regs
 *
 * at return r3 = 0 for success, 1 for page fault, negative for error
 */
-   bl  .hash_page_ct   /* build HPTE if possible */
+   bl  .hash_page  /* build HPTE if possible */
cmpdi   r3,0/* see if hash_page succeeded */
 
/* Success */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 6228b6b..1b46c2d9 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -60,7 +61,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -669,8 +669,9 @@ int machine_check_generic(struct pt_regs *regs)
 void machine_check_exception(struct pt_regs *regs)
 {
int recover = 0;
+   enum ctx_state prev_state;
 
-   exception_enter(regs);
+   prev_state = exception_enter();
 
__get_cpu_var(irq_stat).mce_exceptions++;
 
@@ -712,7 +713,7 @@ void machine_check_exception(struct pt_regs *regs)
panic("Unrecoverable Machine check");
 
 exit:
-   exception_exit(regs);
+   exception_exit(prev_state);
 }
 
 void SMIException(struct pt_regs *regs)
@@ -722,19 +723,21 @@ void SMIException(struct pt_regs *regs)
 
 void unknown_exception(struct pt_regs *regs)
 {
-   exception_enter(regs);
+   enum ctx_state prev_state;
+   prev_state = exception_enter();
 
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
_exception(SIGTRAP, regs, 0, 0);
 
-   exception_exit(regs);
+   exception_exit(prev_state);
 }
 
 void instruction_breakpoint_exception(struct pt_regs *regs)
 {
-   exception_enter(regs);
+   enum ctx_state prev_state;
+   prev_state = exception_enter();
 
if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5,
5, SIGTRAP) == NOTIFY_STOP)
@@ -744,7 +747,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs)
_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
 
 exit:
-   exception_exit(regs);
+   exception_exi

[RFC PATCH v2 1/6] powerpc: Syscall hooks for context tracking subsystem

2013-03-29 Thread Li Zhong

This is the syscall slow path hooks for context tracking subsystem,
corresponding to
[PATCH] x86: Syscall hooks for userspace RCU extended QS
  commit bf5a3c13b939813d28ce26c01425054c740d6731

TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
better for it to be in the same 16 bits with others in the group, so in the
asm code, andi. with this group could work.

Signed-off-by: Li Zhong 
Acked-by: Frederic Weisbecker 
---
 arch/powerpc/include/asm/thread_info.h |7 +--
 arch/powerpc/kernel/ptrace.c   |5 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 406b7b9..414a261 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
-#define TIF_MEMDIE 9   /* is terminating due to OOM killer */
+#define TIF_NOHZ   9   /* in adaptive nohz mode */
 #define TIF_SECCOMP10  /* secure computing */
 #define TIF_RESTOREALL 11  /* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR12  /* Force successful syscall return */
@@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT 15  /* syscall tracepoint instrumentation */
 #define TIF_EMULATE_STACK_STORE16  /* Is an instruction emulation
for stack store? */
+#define TIF_MEMDIE 17  /* is terminating due to OOM killer */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1<
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1778,6 +1779,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 {
long ret = 0;
 
+   user_exit();
+
secure_computing_strict(regs->gpr[0]);
 
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
@@ -1822,4 +1825,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, step);
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 3/6] powerpc: Exit user context on notify resume

2013-03-29 Thread Li Zhong

This patch allows RCU usage in do_notify_resume, e.g. signal handling.
It corresponds to
[PATCH] x86: Exit RCU extended QS on notify resume
  commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6

Signed-off-by: Li Zhong 
---
 arch/powerpc/kernel/signal.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index cf12eae..d63b502 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs)
 
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 {
+   user_exit();
+
if (thread_info_flags & _TIF_UPROBE)
uprobe_notify_resume(regs);
 
@@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
}
+
+   user_enter();
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v2 2/6] powerpc: Exception hooks for context tracking subsystem

2013-03-29 Thread Li Zhong

This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.

This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
  commit 6ba3c97a38803883c2eee489505796cb0a727122

Signed-off-by: Li Zhong 
---
 arch/powerpc/include/asm/context_tracking.h |   28 +
 arch/powerpc/kernel/exceptions-64s.S|4 +-
 arch/powerpc/kernel/traps.c |   83 ---
 arch/powerpc/mm/fault.c |   15 -
 arch/powerpc/mm/hash_utils_64.c |   17 ++
 5 files changed, 122 insertions(+), 25 deletions(-)
 create mode 100644 arch/powerpc/include/asm/context_tracking.h

diff --git a/arch/powerpc/include/asm/context_tracking.h 
b/arch/powerpc/include/asm/context_tracking.h
new file mode 100644
index 000..377146e
--- /dev/null
+++ b/arch/powerpc/include/asm/context_tracking.h
@@ -0,0 +1,28 @@
+#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H
+#define _ASM_POWERPC_CONTEXT_TRACKING_H
+
+#include 
+#include 
+
+/*
+ * temporarily defined to avoid potential conflicts with the common
+ * implementation, these will be removed by a later patch after the common
+ * code enters powerpc tree
+ */
+#define exception_enter __exception_enter
+#define exception_exit __exception_exit
+
+static inline void __exception_enter(struct pt_regs *regs)
+{
+   user_exit();
+}
+
+static inline void __exception_exit(struct pt_regs *regs)
+{
+#ifdef CONFIG_CONTEXT_TRACKING
+   if (user_mode(regs))
+   user_enter();
+#endif
+}
+
+#endif
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index a8a5361..6d82f4f 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1368,15 +1368,17 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB)
rlwimi  r4,r0,32-13,30,30   /* becomes _PAGE_USER access bit */
ori r4,r4,1 /* add _PAGE_PRESENT */
rlwimi  r4,r5,22+2,31-2,31-2/* Set _PAGE_EXEC if trap is 0x400 */
+   addir6,r1,STACK_FRAME_OVERHEAD
 
/*
 * r3 contains the faulting address
 * r4 contains the required access permissions
 * r5 contains the trap number
+* r6 contains the address of pt_regs
 *
 * at return r3 = 0 for success, 1 for page fault, negative for error
 */
-   bl  .hash_page  /* build HPTE if possible */
+   bl  .hash_page_ct   /* build HPTE if possible */
cmpdi   r3,0/* see if hash_page succeeded */
 
/* Success */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 37cc40e..6228b6b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -669,6 +670,8 @@ void machine_check_exception(struct pt_regs *regs)
 {
int recover = 0;
 
+   exception_enter(regs);
+
__get_cpu_var(irq_stat).mce_exceptions++;
 
/* See if any machine dependent calls. In theory, we would want
@@ -683,7 +686,7 @@ void machine_check_exception(struct pt_regs *regs)
recover = cur_cpu_spec->machine_check(regs);
 
if (recover > 0)
-   return;
+   goto exit;
 
 #if defined(CONFIG_8xx) && defined(CONFIG_PCI)
/* the qspan pci read routines can cause machine checks -- Cort
@@ -693,20 +696,23 @@ void machine_check_exception(struct pt_regs *regs)
 * -- BenH
 */
bad_page_fault(regs, regs->dar, SIGBUS);
-   return;
+   goto exit;
 #endif
 
if (debugger_fault_handler(regs))
-   return;
+   goto exit;
 
if (check_io_access(regs))
-   return;
+   goto exit;
 
die("Machine check", regs, SIGBUS);
 
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
panic("Unrecoverable Machine check");
+
+exit:
+   exception_exit(regs);
 }
 
 void SMIException(struct pt_regs *regs)
@@ -716,20 +722,29 @@ void SMIException(struct pt_regs *regs)
 
 void unknown_exception(struct pt_regs *regs)
 {
+   exception_enter(regs);
+
printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n",
   regs->nip, regs->msr, regs->trap);
 
_exception(SIGTRAP, regs, 0, 0);
+
+   exception_exit(regs);
 }
 
 void instruction_breakpoint_exception(struct pt_regs *regs)
 {
+   exception_enter(regs);
+
if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5,
5, SIGTRAP) == N

Re: [RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes

2013-03-04 Thread Li Zhong

This updated version adds notrace to these functions directly:

I saw some RCU illegal usage from idle complaints when function tracer
is enabled with forced context tracking.

It seems that __schedule() might be called in function_trace_call() when
it re-enables preemption(if preemption and irqs are both enabled). 

So at the places where we call rcu_user_exit(), we need make sure that
no function tracer hooks could be called in preemption/irqs both enabled
environment before rcu_user_exit() finishes, or we might cause illegal
RCU usage in __schedule(). 

This patch tries to add notrace attribute to a couple of functions where
the above could happen, including user_exit(), and a few callers of
user_exit(). 

Signed-off-by: Li Zhong 
--
 arch/x86/kernel/ptrace.c  | 4 ++--
 arch/x86/kernel/signal.c  | 2 +-
 kernel/context_tracking.c | 2 +-
 kernel/sched/core.c   | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 29a8120..ce9f12e 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs 
*regs,
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
  */
-long syscall_trace_enter(struct pt_regs *regs)
+long notrace syscall_trace_enter(struct pt_regs *regs)
 {
long ret = 0;
 
@@ -1538,7 +1538,7 @@ out:
return ret ?: regs->orig_ax;
 }
 
-void syscall_trace_leave(struct pt_regs *regs)
+void notrace syscall_trace_leave(struct pt_regs *regs)
 {
bool step;
 
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6956299..7b60210 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs)
  * notification of userspace execution resumption
  * - triggered by the TIF_WORK_MASK flags
  */
-void
+void notrace
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
user_exit();
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 65349f0..f598f61 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -83,7 +83,7 @@ void user_enter(void)
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void user_exit(void)
+void notrace user_exit(void)
 {
unsigned long flags;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..178a6af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2971,7 +2971,7 @@ asmlinkage void __sched schedule(void)
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_CONTEXT_TRACKING
-asmlinkage void __sched schedule_user(void)
+asmlinkage void __sched notrace schedule_user(void)
 {
/*
 * If we come here after a random call to set_need_resched(),


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes

2013-02-28 Thread Li Zhong

On Thu, 2013-02-28 at 08:38 -0800, Paul E. McKenney wrote:
> On Thu, Feb 28, 2013 at 04:26:10PM +0800, Li Zhong wrote:
> > I saw some RCU illegal usage from idle complaints when function tracer
> > is enabled with forced context tracking.
> > 
> > It seems that __schedule() might be called in function_trace_call() when
> > it re-enables preemption(if preemption and irqs are both enabled). 
> > 
> > So at the places where we call rcu_user_exit(), we need make sure that
> > no function tracer hooks could be called in preemption/irqs both enabled
> > environment before rcu_user_exit() finishes, or we might cause illegal
> > RCU usage in __schedule(). 
> > 
> > This patch tries to add notrace attribute to a couple of functions where
> > the above could happen, including user_exit(), and a few callers of
> > user_exit(). 
> 
> And I have to ask...
> 
> Do we really need the rcu_user_notrace?  Why not just label the functions
> as notrace?

Hmm, because it seems to me the problems are related only to rcu user
qs, and tracing without rcu user eqs doesn't have the above issues. But
maybe it's better to do it simpler by just marking them notrace, as they
are all low-level functions? 

Thanks, Zhong

> 
>   Thanx, Paul
> 
> > Signed-off-by: Li Zhong 
> > --
> >  arch/x86/kernel/ptrace.c  | 4 ++--
> >  arch/x86/kernel/signal.c  | 2 +-
> >  include/linux/rcupdate.h  | 2 ++
> >  kernel/context_tracking.c | 2 +-
> >  kernel/sched/core.c   | 2 +-
> >  5 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > index 29a8120..04659c2 100644
> > --- a/arch/x86/kernel/ptrace.c
> > +++ b/arch/x86/kernel/ptrace.c
> > @@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct 
> > pt_regs *regs,
> >   * We must return the syscall number to actually look up in the table.
> >   * This can be -1L to skip running any syscall at all.
> >   */
> > -long syscall_trace_enter(struct pt_regs *regs)
> > +long rcu_user_notrace syscall_trace_enter(struct pt_regs *regs)
> >  {
> > long ret = 0;
> > 
> > @@ -1538,7 +1538,7 @@ out:
> > return ret ?: regs->orig_ax;
> >  }
> > 
> > -void syscall_trace_leave(struct pt_regs *regs)
> > +void rcu_user_notrace syscall_trace_leave(struct pt_regs *regs)
> >  {
> > bool step;
> > 
> > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> > index 6956299..10a9e5a 100644
> > --- a/arch/x86/kernel/signal.c
> > +++ b/arch/x86/kernel/signal.c
> > @@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs)
> >   * notification of userspace execution resumption
> >   * - triggered by the TIF_WORK_MASK flags
> >   */
> > -void
> > +void rcu_user_notrace
> >  do_notify_resume(struct pt_regs *regs, void *unused, __u32 
> > thread_info_flags)
> >  {
> > user_exit();
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index b758ce1..ecf9872 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -229,6 +229,7 @@ extern void rcu_user_enter(void);
> >  extern void rcu_user_exit(void);
> >  extern void rcu_user_enter_after_irq(void);
> >  extern void rcu_user_exit_after_irq(void);
> > +#define rcu_user_notrace notrace
> >  #else
> >  static inline void rcu_user_enter(void) { }
> >  static inline void rcu_user_exit(void) { }
> > @@ -236,6 +237,7 @@ static inline void rcu_user_enter_after_irq(void) { }
> >  static inline void rcu_user_exit_after_irq(void) { }
> >  static inline void rcu_user_hooks_switch(struct task_struct *prev,
> >  struct task_struct *next) { }
> > +#define rcu_user_notrace
> >  #endif /* CONFIG_RCU_USER_QS */
> > 
> >  extern void exit_rcu(void);
> > diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> > index 65349f0..d3fc35f 100644
> > --- a/kernel/context_tracking.c
> > +++ b/kernel/context_tracking.c
> > @@ -83,7 +83,7 @@ void user_enter(void)
> >   * This call supports re-entrancy. This way it can be called from any 
> > exception
> >   * handler without needing to know if we came from userspace or not.
> >   */
> > -void user_exit(void)
> > +void rcu_user_notrace user_exit(void)
> >  {
> > unsigned long flags;
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b52431..c970e92 100644
> >

[RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes

2013-02-28 Thread Li Zhong

I saw some RCU illegal usage from idle complaints when function tracer
is enabled with forced context tracking.

It seems that __schedule() might be called in function_trace_call() when
it re-enables preemption(if preemption and irqs are both enabled). 

So at the places where we call rcu_user_exit(), we need make sure that
no function tracer hooks could be called in preemption/irqs both enabled
environment before rcu_user_exit() finishes, or we might cause illegal
RCU usage in __schedule(). 

This patch tries to add notrace attribute to a couple of functions where
the above could happen, including user_exit(), and a few callers of
user_exit(). 

Signed-off-by: Li Zhong 
--
 arch/x86/kernel/ptrace.c  | 4 ++--
 arch/x86/kernel/signal.c  | 2 +-
 include/linux/rcupdate.h  | 2 ++
 kernel/context_tracking.c | 2 +-
 kernel/sched/core.c   | 2 +-
 5 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 29a8120..04659c2 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs 
*regs,
  * We must return the syscall number to actually look up in the table.
  * This can be -1L to skip running any syscall at all.
  */
-long syscall_trace_enter(struct pt_regs *regs)
+long rcu_user_notrace syscall_trace_enter(struct pt_regs *regs)
 {
long ret = 0;
 
@@ -1538,7 +1538,7 @@ out:
return ret ?: regs->orig_ax;
 }
 
-void syscall_trace_leave(struct pt_regs *regs)
+void rcu_user_notrace syscall_trace_leave(struct pt_regs *regs)
 {
bool step;
 
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 6956299..10a9e5a 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs)
  * notification of userspace execution resumption
  * - triggered by the TIF_WORK_MASK flags
  */
-void
+void rcu_user_notrace
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
user_exit();
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index b758ce1..ecf9872 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -229,6 +229,7 @@ extern void rcu_user_enter(void);
 extern void rcu_user_exit(void);
 extern void rcu_user_enter_after_irq(void);
 extern void rcu_user_exit_after_irq(void);
+#define rcu_user_notrace notrace
 #else
 static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
@@ -236,6 +237,7 @@ static inline void rcu_user_enter_after_irq(void) { }
 static inline void rcu_user_exit_after_irq(void) { }
 static inline void rcu_user_hooks_switch(struct task_struct *prev,
 struct task_struct *next) { }
+#define rcu_user_notrace
 #endif /* CONFIG_RCU_USER_QS */
 
 extern void exit_rcu(void);
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 65349f0..d3fc35f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -83,7 +83,7 @@ void user_enter(void)
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void user_exit(void)
+void rcu_user_notrace user_exit(void)
 {
unsigned long flags;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b52431..c970e92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2973,7 +2973,7 @@ asmlinkage void __sched schedule(void)
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_CONTEXT_TRACKING
-asmlinkage void __sched schedule_user(void)
+asmlinkage void __sched rcu_user_notrace schedule_user(void)
 {
/*
 * If we come here after a random call to set_need_resched(),


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[tip:sched/urgent] cputime: Constify timeval_to_cputime(timeval) argument

2013-02-24 Thread tip-bot for Li Zhong

Commit-ID:  c78a4bcd1a879b39fb7646c887b0c195f1018909
Gitweb: http://git.kernel.org/tip/c78a4bcd1a879b39fb7646c887b0c195f1018909
Author: Li Zhong 
AuthorDate: Sat, 23 Feb 2013 17:28:44 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 24 Feb 2013 12:57:15 +0100

cputime: Constify timeval_to_cputime(timeval) argument

Saw the following compiler warning on the linux-next tree:

  kernel/itimer.c: In function 'set_cpu_itimer':
  kernel/itimer.c:152:2: warning: passing argument 1 of 'timeval_to_cputime' 
discards 'const' qualifier from pointer target type [enabled by default]
  ...

timeval_to_cputime() is always passed a constant timeval in
argument, we need to teach the nsecs based cputime
implementation about that.

Signed-off-by: Li Zhong 
Signed-off-by: Frederic Weisbecker 
Cc: Steven Rostedt 
Cc: Kevin Hilman 
Link: 
http://lkml.kernel.org/r/1361636925-22288-2-git-send-email-fweis...@gmail.com
Signed-off-by: Ingo Molnar 
Cc: Steven Rostedt 
Cc: Kevin Hilman 
---
 include/asm-generic/cputime_nsecs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/asm-generic/cputime_nsecs.h 
b/include/asm-generic/cputime_nsecs.h
index b6485ca..a8ece9a 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -76,7 +76,7 @@ static inline void cputime_to_timespec(const cputime_t ct, 
struct timespec *val)
 /*
  * Convert cputime <-> timeval (msec)
  */
-static inline cputime_t timeval_to_cputime(struct timeval *val)
+static inline cputime_t timeval_to_cputime(const struct timeval *val)
 {
u64 ret = val->tv_sec * NSEC_PER_SEC + val->tv_usec * NSEC_PER_USEC;
return (__force cputime_t) ret;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 2/5] powerpc: Exception hooks for context tracking subsystem

2013-02-16 Thread Li Zhong

On Sun, 2013-02-10 at 15:10 +0100, Frederic Weisbecker wrote:
> 2013/2/1 Li Zhong :
> > This is the exception hooks for context tracking subsystem, including
> > data access, program check, single step, instruction breakpoint, machine 
> > check,
> > alignment, fp unavailable, altivec assist, unknown exception, whose handlers
> > might use RCU.
> >
> > This patch corresponds to
> > [PATCH] x86: Exception hooks for userspace RCU extended QS
> >   commit 6ba3c97a38803883c2eee489505796cb0a727122
> >
> > Signed-off-by: Li Zhong 
> 
> Looks good!
> 
> I guess we should move exception_enter/exit definition to the generic
> code. They should be the same for all archs after all. 

Indeed.

> Also we are
> relying on user_mode(regs) but this may be buggy with some corner
> cases. For example if an exception happen after a call to user_exit()

I guess you mean user_enter() here, or am I confused?

> (on syscall exit) but before we actually resume in userspace, the
> exception will exit in kernel mode from the context tracking POV.
> 
> So instead on relying on the regs, which are not sync with the context
> tracking state, we should use something like:
> 
> prev_state = exception_enter();
> ...
> exception_exit(prev_state);
> 
> Also preempt_schedule_irq() is concerned as well by this problem. So I
> should convert it to that scheme as well. I'm going to prepare some
> patches.
> 
> Feel free to merge this patch in the powerpc tree, I'll do the
> conversion along the way.

Or if your patches gets merged earlier than these, I can update my code
according to yours.

Thanks, Zhong

> 
> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem

2013-02-07 Thread Li Zhong

On Thu, 2013-02-07 at 01:29 +0100, Frederic Weisbecker wrote:
> 2013/2/1 Li Zhong :
> > This is the syscall slow path hooks for context tracking subsystem,
> > corresponding to
> > [PATCH] x86: Syscall hooks for userspace RCU extended QS
> >   commit bf5a3c13b939813d28ce26c01425054c740d6731
> >
> > TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
> > is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
> > better for it to be in the same 16 bits with others in the group, so in the
> > asm code, andi. with this group could work.
> >
> > Signed-off-by: Li Zhong 
> > ---
> >  arch/powerpc/include/asm/thread_info.h |7 +--
> >  arch/powerpc/kernel/ptrace.c   |5 +
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/thread_info.h 
> > b/arch/powerpc/include/asm/thread_info.h
> > index 406b7b9..414a261 100644
> > --- a/arch/powerpc/include/asm/thread_info.h
> > +++ b/arch/powerpc/include/asm/thread_info.h
> > @@ -97,7 +97,7 @@ static inline struct thread_info 
> > *current_thread_info(void)
> >  #define TIF_PERFMON_CTXSW  6   /* perfmon needs ctxsw calls */
> >  #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
> >  #define TIF_SINGLESTEP 8   /* singlestepping active */
> > -#define TIF_MEMDIE 9   /* is terminating due to OOM killer 
> > */
> > +#define TIF_NOHZ   9   /* in adaptive nohz mode */
> >  #define TIF_SECCOMP10  /* secure computing */
> >  #define TIF_RESTOREALL 11  /* Restore all regs (implies 
> > NOERROR) */
> >  #define TIF_NOERROR12  /* Force successful syscall return 
> > */
> > @@ -106,6 +106,7 @@ static inline struct thread_info 
> > *current_thread_info(void)
> >  #define TIF_SYSCALL_TRACEPOINT 15  /* syscall tracepoint 
> > instrumentation */
> >  #define TIF_EMULATE_STACK_STORE16  /* Is an instruction 
> > emulation
> > for stack store? */
> > +#define TIF_MEMDIE 17  /* is terminating due to OOM killer 
> > */
> >
> >  /* as above, but as bit values */
> >  #define _TIF_SYSCALL_TRACE (1< > @@ -124,8 +125,10 @@ static inline struct thread_info 
> > *current_thread_info(void)
> >  #define _TIF_UPROBE(1< >  #define _TIF_SYSCALL_TRACEPOINT(1< >  #define _TIF_EMULATE_STACK_STORE   (1< > +#define _TIF_NOHZ  (1< >  #define _TIF_SYSCALL_T_OR_A(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
> > -_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
> > +_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
> > +_TIF_NOHZ)
> >
> >  #define _TIF_USER_WORK_MASK(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
> >  _TIF_NOTIFY_RESUME | _TIF_UPROBE)
> > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> > index c497000..62238dd 100644
> > --- a/arch/powerpc/kernel/ptrace.c
> > +++ b/arch/powerpc/kernel/ptrace.c
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -1745,6 +1746,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
> >  {
> > long ret = 0;
> >
> > +   user_exit();
> > +
> > secure_computing_strict(regs->gpr[0]);
> >
> > if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> > @@ -1789,4 +1792,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
> > step = test_thread_flag(TIF_SINGLESTEP);
> > if (step || test_thread_flag(TIF_SYSCALL_TRACE))
> > tracehook_report_syscall_exit(regs, step);
> > +
> > +   user_enter();
> 
> In x86-64, schedule_user() and do_notify_resume() can be called before
> syscall_trace_leave(). As a result we may be entering
> syscall_trace_leave() in user mode (from a context tracking POV). To
> fix this I added a call to user_exit() on the very beginning of that
> function.
> 
> You can find the details in 2c5594df344cd1ff0cc9bf007dea3235582b3acf
> ("rcu: Fix unrecovered RCU user mode in syscall_trace_leave()").

Hi Frederic, 

Thank you very much for the reminding. 

> 
> Could this problem happen in ppc as well?

I checked the code(64 bit) today, it seems to me that it won't happen.
But fortunately, we are in the ppc mailing list, please correct me if my
understanding is wrong.

By the way, I enabled CONTEXT_TRACKING_FORCE and PROVE_RCU, so if it
could happen, I think there should be some illegal RCU usage complaints
reported.  

Thanks, Zhong

> Thanks.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages

2013-02-06 Thread Li Zhong

On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote:
> On 02/04/2013 02:17 AM, Michel Lespinasse wrote:
> > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
> > at a time. When munlocking THP pages (or the huge zero page), this resulted
> > in taking the mm->page_table_lock 512 times in a row.
> > 
> > We can do better by making use of the page_mask returned by follow_page_mask
> > (for the huge zero page case), or the size of the page munlock_vma_page()
> > operated on (for the true THP page case).
> > 
> > Note - I am sending this as RFC only for now as I can't currently put
> > my finger on what if anything prevents split_huge_page() from operating
> > concurrently on the same page as munlock_vma_page(), which would mess
> > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
> > point I missed here ?
> > 
> > Signed-off-by: Michel Lespinasse 
> 
> Hi Michel,
> 
> Fuzzing with trinity inside a KVM tools guest produces a steady stream of:
> 
> 
> [   51.823275] [ cut here ]
> [   51.823302] kernel BUG at include/linux/page-flags.h:421!
> [   51.823307] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   51.823307] Dumping ftrace buffer:
> [   51.823314](ftrace buffer empty)
> [   51.823314] Modules linked in:
> [   51.823314] CPU 2
> [   51.823314] Pid: 7116, comm: trinity Tainted: GW
> 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273
> [   51.823316] RIP: 0010:[]  [] 
> munlock_vma_page+0x12/0xf0
> [   51.823317] RSP: 0018:880009641bb8  EFLAGS: 00010282
> [   51.823319] RAX: 011ffc008001 RBX: ea410040 RCX: 
> 
> [   51.823320] RDX:  RSI:  RDI: 
> ea410040
> [   51.823321] RBP: 880009641bc8 R08:  R09: 
> 
> [   51.823322] R10:  R11:  R12: 
> 880009633958
> [   51.823324] R13: 01252000 R14: ea410040 R15: 
> 00ff
> [   51.823326] FS:  7fe7a9046700() GS:88000ba0() 
> knlGS:
> [   51.823327] CS:  0010 DS:  ES:  CR0: 80050033
> [   51.823328] CR2: 7fc583b90fcb CR3: 09bc8000 CR4: 
> 000406e0
> [   51.823334] DR0:  DR1:  DR2: 
> 
> [   51.823338] DR3:  DR6: 0ff0 DR7: 
> 0400
> [   51.823340] Process trinity (pid: 7116, threadinfo 88000964, task 
> 880009638000)
> [   51.823341] Stack:
> [   51.823344]  00a01000 880009633958 880009641c08 
> 812429bd
> [   51.823373]  880009638000 01ff09638000 880009ade000 
> 880009633958
> [   51.823373]  880009638810 880009ade098 880009641cb8 
> 81246d81
> [   51.823373] Call Trace:
> [   51.823373]  [] munlock_vma_pages_range+0x8d/0xf0
> [   51.823373]  [] exit_mmap+0x51/0x170
> [   51.823373]  [] ? __khugepaged_exit+0x8a/0xf0
> [   51.823373]  [] ? kmem_cache_free+0x22f/0x3b0
> [   51.823373]  [] ? __khugepaged_exit+0x8a/0xf0
> [   51.823373]  [] mmput+0x77/0xe0
> [   51.823377]  [] exit_mm+0x113/0x120
> [   51.823381]  [] ? _raw_spin_unlock_irq+0x51/0x80
> [   51.823384]  [] do_exit+0x24a/0x590
> [   51.823387]  [] do_group_exit+0x8a/0xc0
> [   51.823390]  [] get_signal_to_deliver+0x501/0x5b0
> [   51.823394]  [] do_signal+0x42/0x110
> [   51.823399]  [] ? rcu_eqs_exit_common+0x64/0x340
> [   51.823404]  [] ? trace_hardirqs_on+0xd/0x10
> [   51.823407]  [] ? trace_hardirqs_on_caller+0x128/0x160
> [   51.823409]  [] ? trace_hardirqs_on+0xd/0x10
> [   51.823412]  [] do_notify_resume+0x48/0xa0
> [   51.823415]  [] retint_signal+0x4d/0x92
> [   51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 
> c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48
> 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 
> 01 48 8b
> [   51.823449] RIP  [] munlock_vma_page+0x12/0xf0
> [   51.823450]  RSP 
> [   51.826846] ---[ end trace a7919e7f17c0a72a ]---
> 

The similar warning prevents my system from booting. And it seems to me
that in munlock_vma_pages_range(), the page_mask needs be the page
number returned from munlock_vma_page() minus 1. And the following fix
solved my problem. Would you please have a try? 

Thanks, Zhong



diff --git a/mm/mlock.c b/mm/mlock.c
index af1d115..1e3d794 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
unlock_page(page);
put_page(page);
}
-   page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+   page_increm = 1 + (~(start >> PAGE_SHIFT) & (page_mask-1));
start += page_increm * PAGE_SIZE;
cond_resched();
}


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message

1 2 >

1 - 100 of 143 matches

Mail list logo