[RFC PATCH v6 2/2] Implement lock_device_hotplug_sysfs() by breaking active protection
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, three more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not); kn_out is used to record the found kernfs_node for use in unlock. unlock_device_hotplug_sysfs() is created to help cleanup the operations(get reference, break active, etc) done in lock_device_hotplug_sysfs(), as well as unlock the lock. As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we can check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. lockdep assertion is added in try_offline_node() to make sure device_hotplug_lock is grabbed while removing cpu or memory. As devcie_hotplug_lock is used to protect hotplug operations with multiple subsystems. So there might be devices that won't be removed together with devices of other subsystems, and thus device_hotplug_lock is not needed. In this case, their specific online/offline callbacks needs to check whether the device has been removed. Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Greg Kroah-Hartman Cc: Toshi Kani Signed-off-by: Li Zhong --- drivers/base/core.c| 93 +- drivers/base/memory.c | 5 +-- include/linux/device.h | 7 +++- mm/memory_hotplug.c| 2 ++ 4 files changed, 95 insertions(+), 12 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 20da3ad..329f3b4 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,87 @@ void unlock_device_hotplug(void) mutex_unlock(&device_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +void device_hotplug_assert_held(void) { - if (mutex_trylock(&device_hotplug_lock)) - return 0; + lockdep_assert_held(&device_hotplug_lock); +} + +/* + * device_hotplug_lock is used to protect hotplugs which may involve multiple + * subsystems. This _sysfs() version is created to solve the deadlock with + * s_active of the device attribute file, which could be used to online/offline + * the device. + * + * Returns 0 on success. -ENODEV if the @dev is removed after we break the + * active protection. (We should always be able to find the kernfs_node + * related to the attribute, WARN if not, and also return -ENODEV). + * + * When success(return 0), the found kernfs_node is passed out in @kn_out, + * else pass NULL to @kn_out. + */ +int lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr, + struct kernfs_node **kn_out) +{ + struct kernfs_node *kn; + + *kn_out = kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* For devices which need device_hoplug_lock to protect the +* online/offline operation, i.e. it might be removed together with +* devices in some other subsystems, we can check here whether the +* device has been removed, so we don't call device_{on|off}line +* against removed device. lockdep assertion is added in +* try_offline_node() to make sure the lock is held to remove cpu +* or memory. +* +* If some devices won&
[RFC PATCH v6 1/2 ] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release themselves won't be removed. lock_device_hotplug_sysfs() is used to solve the deadlocks of s_active of the "online" file and the hotplug related locks. So if one process trying to remove the device, the sysfs files for this device, including "online" will be removed after device_hotplug_lock is grabbed. So for another process which might concurrently write to the "online" file, after it grabs s_active of the "online", it couldn't directly lock device_hotplug_lock, instead, in *_sysfs() version, try lock is used, and if it fails, the whole syscall is restarted. But here the s_active of "probe/release" are not involved in the above process, as they are not to be removed. So we could use lock_device_hotplug() here. Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Greg Kroah-Hartman Cc: Toshi Kani Signed-off-by: Li Zhong --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
viOn Tue, 2014-04-22 at 11:34 +0800, Li Zhong wrote: > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > Hello, > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > Proper /** function comment would be nice. > > Ok, will try to write some in next version. > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > +struct device_attribute *attr) > > > > I can see why you did this but let's please not require the user of > > this function to see how the thing is working internally. Let's > > return int and keep track of (or look up again) the kernfs_node > > internally. > > Ok, it also makes the prototype of lock and unlock look more consistent > and comfortable. When trying to do an new version of the patch, I find that if the device is really removed, then we couldn't look up using the parent, and attribute name again in unlock. So I guess maybe I could add one more argument, e.g. kn_out,:q to track this kernfs node. Code will be posted soon for your review. Thanks, Zhong > > > > > > { > > ... > > > + /* > > > + * We assume device_hotplug_lock must be acquired before removing > > > > Is this assumption true? If so, can we add lockdep assertions in > > places to verify and enforce this? If not, aren't we just feeling > > good when the reality is broken? > > It seems not true ... I think there are devices that don't have the > online/offline concept, we just need to add it, remove it, like ethernet > cards. > > Maybe we could change the comments above, like: > /* We assume device_hotplug_lock must be acquired before >* removing devices, which have online/offline sysfs knob, >* and some locks are needed to serialize the online/offline >* callbacks and device removing. ... > ? > > And we could add lockdep assertions in cpu and memory related code? e.g. > remove_memory(), unregister_cpu() > > Currently, remove_memory() has comments for the function: > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > * and online/offline operations before this call, as required by > * try_offline_node(). > */ > > maybe it could be removed with the lockdep assertion. > > > ... > > > > Function comment please. > > OK. > > Thanks, Zhong > > > > +void unlock_device_hotplug_sysfs(struct device *dev, > > > + struct kernfs_node *kn) > > > +{ > > > + unlock_device_hotplug(); > > > + kernfs_unbreak_active_protection(kn); > > > + put_device(dev); > > > + kernfs_put(kn); > > > } > > > > Thanks. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:59 -0400, Alan Stern wrote: > On Fri, 25 Apr 2014, Li Zhong wrote: > > > > No, this isn't self removal. The driver-attribute (not device-attribute) > > > store operation simply grabs a lock that is also held while the driver > > > is being deregistered at module unload. Taking a reference to the module > > > in this case will prevent deregistration while store is running. > > > > > > But it seems like this can be solved for usb-serial by simply not > > > holding the lock while deregistering. > > > > I didn't look carefully about this lock. > > > > But I'm not sure whether there are such requirements for driver > > attributes: > > > > some lock needs be grabbed in the driver attributes store callbacks, and > > the same lock also needs to be grabbed during driver unregister. > > In this case, the lock does _not_ need to be grabbed during driver > unregister. The driver grabs the lock, but it doesn't need to. OK. > > > If we have such requirements currently or in the future, I think they > > could all be solved by breaking active protection after get the module > > reference. > > No! That would be very bad. > > Unloading modules is quite different from unbinding drivers. After the > driver is unbound, its attribute callback routines can continue to run. > But after a driver module has been unloaded, its attribute callback > routines _cannot_ run because they aren't present in memory any more. > > If we allowed a module to be unloaded while one of its callbacks was > running (because active protection was broken), imagine what would > happen... I don't think the module could be unloaded after we increased the module reference counter. Thanks, Zhong > > Alan Stern > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 09:54 -0400, Alan Stern wrote: > On Fri, 25 Apr 2014, Li Zhong wrote: > > > > I don't get why try_module_get() matters here. We can't call into > > > ->store if the object at hand is already destroyed and the underlying > > > module can't go away if the target device is still alive. > > > try_module_get() doesn't actually protect the object. Why does that > > > matter? This is self removal, right? Can you please take a look at > > > kernfs_remove_self()? > > > > This is about one process writing something to driver attributes, and > > one process trying to unload this driver. > > > > I think try_module_get() could detect whether the driver is being > > unloaded, and if not, prevent it from being unloaded, so it could > > protect the object here by not allow the driver to be unloaded. > > That isn't how try_module_get() works. If the module is being > unloaded, try_module_get() simply fails. It does not prevent the > module from being unloaded -- that's why its name begins with "try". Yes, I know that. What I said above is for the case when try_module_get() detects the driver is NOT being unloaded, and increases the reference counter. Thanks, Zhong > > Alan Stern > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Fri, 2014-04-25 at 14:47 +0200, Rafael J. Wysocki wrote: > On 4/25/2014 3:46 AM, Li Zhong wrote: > > On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: > >> On 4/24/2014 10:59 AM, Li Zhong wrote: > >>> On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > >>>> On 4/23/2014 4:23 PM, Tejun Heo wrote: > >>>>> Hello, Rafael. > >>>> Hi, > >>>> > >>>>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >>>>>> Can you please elaborate a bit? > >>>>> Because it can get involved in larger locking dependency issues by > >>>>> joining dependency graphs of two otherwise largely disjoint > >>>>> subsystems. It has potential to create possible deadlocks which don't > >>>>> need to exist. > >>>> Well, I do my best not to add unnecessary locks if that's what you mean. > >>>> > >>>>>> It is there to protect hotplug operations involving multiple devices > >>>>>> (in different subsystems) from racing with each other. Why exactly > >>>>>> is it bad? > >>>>> But why would different subsystems, say cpu and memory, use the same > >>>>> lock? Wouldn't those subsystems already have proper locking inside > >>>>> their own subsystems? > >>>> That locking is not sufficient. > >>>> > >>>>> Why add this additional global lock across multiple subsystems? > >>>> That basically is because of how eject works when it is triggered via > >>>> ACPI. > >>>> > >>>> It is signaled for a device at the top of a subtree. It may be a > >>>> container of some sort and the eject involves everything below that > >>>> device in the ACPI namespace. That may involve multiple subsystem > >>>> (CPUs, memory, PCI host bridge, etc.). > >>>> > >>>> We do that in two steps, offline (which can fail) and eject proper > >>>> (which can't fail and makes all of the involved devices go away). All > >>>> that has to be done in one go with respect to the sysfs-triggered > >>>> offline/online and that's why the lock is there. > >>> Thank you for the education. It's more clear to me now why we need this > >>> lock. > >>> > >>> I still have some small questions about when this lock is needed: > >>> > >>> I could understand sysfs-triggered online is not acceptable when > >>> removing devices in multiple subsystems. But maybe concurrent offline > >>> and remove(with proper per subsystem locks) seems not harmful? > >>> > >>> And if we just want to remove some devices in a specific subsystem, e.g. > >>> like writing /cpu/release, if it just wants to offline and remove some > >>> cpus, then maybe we don't require the device_hotplug_lock to be taken? > >> No and no. > >> > >> If the offline phase fails for any device in the subtree, we roll back > >> the operation > >> and online devices that have already been offlined by it. Also the ACPI > >> hot-addition > >> needs to acquire device_hotplug_lock so that it doesn't race with ejects > >> and so > >> that lock needs to be taken by sysfs-triggered offline too. > > I can understand that hot-addition needs the device_hotplug lock, but > > still not very clear about the offline. > > > > I guess your are describing following scenario: > > > > user A: (trying remove cpu 1 and memory 1-10) > > > > lock_device_hotplug > > offline cpu with cpu locks -- successful > > offline memories with memory locks -- failed, e.g. for memory8 > > online cpu and memory with their locks > > unlock_device_hotplug > > What about if all is successful and CPU1 is gone before > device_hotplug_lock is released? You mean user B will try to offline an already removed cpu1? But I think the cpu subsys locks should be able to handle such situation? > > > user B: (trying offline cpu 1) > > > > offline cpu with cpu locks > > > > But I don't see any problem for user B not taking the device_hotplug > > lock. The result may be different for user B to take or not take the > > lock. But I think it could be seen as concurrent online/offline for cpu1 > > under cpu hotplug locks, which just depends on which is executed last? > > > > Or
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Fri, 2014-04-25 at 08:28 -0400, Tejun Heo wrote: > On Fri, Apr 25, 2014 at 09:56:10AM +0800, Li Zhong wrote: > > > A nests cpu subsys mutex under s_active of the online node. B nests > > > s_active of the online node under the cpu subsys mutex. What am I > > > missing? > > > > From the above two chain, I think the problem is cpu_subsys_mutex and > > s_active(online), which is the deadlock we are trying to solve in patch > > Yeap, that one was what I was thinking about. > > > #2. I seems to me s_active(release) here doesn't have lock issues? > > And your patch doesn't change the situation for online. I'm a little confused here... To clarify for this patch(#1): Currently(before patch#2), the s_active(online) and cpu subsys mutex deadlock is solved by lock_device_hotplug_sysfs(). (lockdep warnings are still there, so we want to change the implementation of lock_device_hotplug_sysfs() in patch #2 to take out of s_active_online, instead of device_hotplug_lock, which is grabbed before cpu subsys mutex). But here, for the probe/release, I think we really don't need to use this _sysfs() version. So it is replaced with lock_device_hotplug() directly to avoid two unnecessary conversions of lock_device_hotplug_sysfs(), that otherwise need to be done in patch #2. Thanks, Zhong > Alright, > thanks a lot for the explanation. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Fri, 2014-04-25 at 12:15 +0200, Johan Hovold wrote: > On Fri, Apr 25, 2014 at 10:16:57AM +0800, Li Zhong wrote: > > On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: > > > > No, this isn't self removal. The driver-attribute (not device-attribute) > > > store operation simply grabs a lock that is also held while the driver > > > is being deregistered at module unload. Taking a reference to the module > > > in this case will prevent deregistration while store is running. > > > > > > But it seems like this can be solved for usb-serial by simply not > > > holding the lock while deregistering. > > > > I didn't look carefully about this lock. > > > > But I'm not sure whether there are such requirements for driver > > attributes: > > > > some lock needs be grabbed in the driver attributes store callbacks, and > > the same lock also needs to be grabbed during driver unregister. > > > > If we have such requirements currently or in the future, I think they > > could all be solved by breaking active protection after get the module > > reference. > > Yes, if we find any such cases, but I don't think it should be done > generally (as in one of your earlier posts). OK, maybe we only need to reconsider this only if we see some more similar lockdep warnings in the future. > > Active protection is usually exactly what prevents the driver from being > deregistered, and would only need to be broken to avoid any ABBA > deadlocks from being reported by lockdep (the module reference would be > enough to prevent the actual deadlock). Yes, maybe try to get the module reference is not bad before writing to driver attributes, as it doesn't make much sense to really call the callbacks for the driver attribute if the driver is being unload. And after we get the reference, it is safe for us to break the active. But if we don't have such real cases(lockdep warnings), we actually don't need to break it. Thanks, Zhong > > Thanks, > Johan > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 16:52 +0200, Johan Hovold wrote: > On Thu, Apr 24, 2014 at 10:35:17AM -0400, Tejun Heo wrote: > > On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: > > > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > > > > cc'ing Li Zhong who's working on a simliar issue in the following > > > > thread and quoting whole body. > > > > > > > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > > > > > > > Li, this is another variation of the same problem. Maybe this can be > > > > covered by your work too? > > > > > > It seems to me that it is about write something to driver attribute, and > > > driver unloading. If so, maybe it's not easy to reuse the help functions > > > created for device attribute, and device removing. > > > > > > But I guess the idea to break the active protection could still be > > > applied here: > > > > > > Maybe we could try_module_get() here (like the other option suggested by > > > Johan?), and break active protection if we could get the module, > > > something like below? > > > > I don't get why try_module_get() matters here. We can't call into > > ->store if the object at hand is already destroyed and the underlying > > module can't go away if the target device is still alive. > > try_module_get() doesn't actually protect the object. Why does that > > matter? This is self removal, right? Can you please take a look at > > kernfs_remove_self()? > > No, this isn't self removal. The driver-attribute (not device-attribute) > store operation simply grabs a lock that is also held while the driver > is being deregistered at module unload. Taking a reference to the module > in this case will prevent deregistration while store is running. > > But it seems like this can be solved for usb-serial by simply not > holding the lock while deregistering. I didn't look carefully about this lock. But I'm not sure whether there are such requirements for driver attributes: some lock needs be grabbed in the driver attributes store callbacks, and the same lock also needs to be grabbed during driver unregister. If we have such requirements currently or in the future, I think they could all be solved by breaking active protection after get the module reference. Thanks, Zhong > > Johan > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Thu, 2014-04-24 at 10:35 -0400, Tejun Heo wrote: > On Thu, Apr 24, 2014 at 04:29:15PM +0800, Li Zhong wrote: > > On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > > > cc'ing Li Zhong who's working on a simliar issue in the following > > > thread and quoting whole body. > > > > > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > > > > > Li, this is another variation of the same problem. Maybe this can be > > > covered by your work too? > > > > It seems to me that it is about write something to driver attribute, and > > driver unloading. If so, maybe it's not easy to reuse the help functions > > created for device attribute, and device removing. > > > > But I guess the idea to break the active protection could still be > > applied here: > > > > Maybe we could try_module_get() here (like the other option suggested by > > Johan?), and break active protection if we could get the module, > > something like below? > > I don't get why try_module_get() matters here. We can't call into > ->store if the object at hand is already destroyed and the underlying > module can't go away if the target device is still alive. > try_module_get() doesn't actually protect the object. Why does that > matter? This is self removal, right? Can you please take a look at > kernfs_remove_self()? This is about one process writing something to driver attributes, and one process trying to unload this driver. I think try_module_get() could detect whether the driver is being unloaded, and if not, prevent it from being unloaded, so it could protect the object here by not allow the driver to be unloaded. And if the driver is being unloaded, we could abort the write operation directly. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Thu, 2014-04-24 at 10:32 -0400, Tejun Heo wrote: > Hello, > > On Thu, Apr 24, 2014 at 04:37:23PM +0800, Li Zhong wrote: > > On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: > > After thinking it harder, I still couldn't see ABBA here ... > > > > the active protection taken here is for "probe/release" which will not > > be waited for removing something like "online" under cpu#? Or my > > assumption that s_active for different files are different locks are > > completely wrong? Or I missed something else? > > I'm probably confused about the locking. I was thinking a scenario > like the following. > > A. CPU on/offline > >grabs s_active protection of online node >grabs cpu subsys mutex >perform on/offline >releases cpu subsys mutex >releases s_active protection of online node so the chain here is s_active(online) -> cpu_subsys_mutex > > B. CPU release > >grabs s_active protection of release node >grabs cpu subsys mutex >performs removal of the CPU >removes the online node >releases cpu subsys mutex >releases s_active protection of release node and the chain here is s_active(release) -> cpu_subsys_mutex -> s_active(online) > > A nests cpu subsys mutex under s_active of the online node. B nests > s_active of the online node under the cpu subsys mutex. What am I > missing? >From the above two chain, I think the problem is cpu_subsys_mutex and s_active(online), which is the deadlock we are trying to solve in patch #2. I seems to me s_active(release) here doesn't have lock issues? Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-24 at 12:02 +0200, Rafael J. Wysocki wrote: > On 4/24/2014 10:59 AM, Li Zhong wrote: > > On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > >> On 4/23/2014 4:23 PM, Tejun Heo wrote: > >>> Hello, Rafael. > >> Hi, > >> > >>> On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >>>> Can you please elaborate a bit? > >>> Because it can get involved in larger locking dependency issues by > >>> joining dependency graphs of two otherwise largely disjoint > >>> subsystems. It has potential to create possible deadlocks which don't > >>> need to exist. > >> Well, I do my best not to add unnecessary locks if that's what you mean. > >> > >>>> It is there to protect hotplug operations involving multiple devices > >>>> (in different subsystems) from racing with each other. Why exactly > >>>> is it bad? > >>> But why would different subsystems, say cpu and memory, use the same > >>> lock? Wouldn't those subsystems already have proper locking inside > >>> their own subsystems? > >> That locking is not sufficient. > >> > >>> Why add this additional global lock across multiple subsystems? > >> That basically is because of how eject works when it is triggered via ACPI. > >> > >> It is signaled for a device at the top of a subtree. It may be a > >> container of some sort and the eject involves everything below that > >> device in the ACPI namespace. That may involve multiple subsystem > >> (CPUs, memory, PCI host bridge, etc.). > >> > >> We do that in two steps, offline (which can fail) and eject proper > >> (which can't fail and makes all of the involved devices go away). All > >> that has to be done in one go with respect to the sysfs-triggered > >> offline/online and that's why the lock is there. > > Thank you for the education. It's more clear to me now why we need this > > lock. > > > > I still have some small questions about when this lock is needed: > > > > I could understand sysfs-triggered online is not acceptable when > > removing devices in multiple subsystems. But maybe concurrent offline > > and remove(with proper per subsystem locks) seems not harmful? > > > > And if we just want to remove some devices in a specific subsystem, e.g. > > like writing /cpu/release, if it just wants to offline and remove some > > cpus, then maybe we don't require the device_hotplug_lock to be taken? > > No and no. > > If the offline phase fails for any device in the subtree, we roll back > the operation > and online devices that have already been offlined by it. Also the ACPI > hot-addition > needs to acquire device_hotplug_lock so that it doesn't race with ejects > and so > that lock needs to be taken by sysfs-triggered offline too. I can understand that hot-addition needs the device_hotplug lock, but still not very clear about the offline. I guess your are describing following scenario: user A: (trying remove cpu 1 and memory 1-10) lock_device_hotplug offline cpu with cpu locks -- successful offline memories with memory locks -- failed, e.g. for memory8 online cpu and memory with their locks unlock_device_hotplug user B: (trying offline cpu 1) offline cpu with cpu locks But I don't see any problem for user B not taking the device_hotplug lock. The result may be different for user B to take or not take the lock. But I think it could be seen as concurrent online/offline for cpu1 under cpu hotplug locks, which just depends on which is executed last? Or did I miss something here? Thanks, Zhong > Thanks, > Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 18:12 +0200, Rafael J. Wysocki wrote: > On 4/23/2014 4:23 PM, Tejun Heo wrote: > > Hello, Rafael. > > Hi, > > > On Wed, Apr 23, 2014 at 12:21:33AM +0200, Rafael J. Wysocki wrote: > >> Can you please elaborate a bit? > > Because it can get involved in larger locking dependency issues by > > joining dependency graphs of two otherwise largely disjoint > > subsystems. It has potential to create possible deadlocks which don't > > need to exist. > > Well, I do my best not to add unnecessary locks if that's what you mean. > > >> It is there to protect hotplug operations involving multiple devices > >> (in different subsystems) from racing with each other. Why exactly > >> is it bad? > > But why would different subsystems, say cpu and memory, use the same > > lock? Wouldn't those subsystems already have proper locking inside > > their own subsystems? > > That locking is not sufficient. > > > Why add this additional global lock across multiple subsystems? > > That basically is because of how eject works when it is triggered via ACPI. > > It is signaled for a device at the top of a subtree. It may be a > container of some sort and the eject involves everything below that > device in the ACPI namespace. That may involve multiple subsystem > (CPUs, memory, PCI host bridge, etc.). > > We do that in two steps, offline (which can fail) and eject proper > (which can't fail and makes all of the involved devices go away). All > that has to be done in one go with respect to the sysfs-triggered > offline/online and that's why the lock is there. Thank you for the education. It's more clear to me now why we need this lock. I still have some small questions about when this lock is needed: I could understand sysfs-triggered online is not acceptable when removing devices in multiple subsystems. But maybe concurrent offline and remove(with proper per subsystem locks) seems not harmful? And if we just want to remove some devices in a specific subsystem, e.g. like writing /cpu/release, if it just wants to offline and remove some cpus, then maybe we don't require the device_hotplug_lock to be taken? Thanks, Zhong > > Thanks, > Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Wed, 2014-04-23 at 10:39 -0400, Tejun Heo wrote: > On Wed, Apr 23, 2014 at 10:00:26AM +0800, Li Zhong wrote: > > If you remove cpu0, then the cpu0 directory will be removed, together > > with the "online" file in the directory, while some other process might > > be writing 0 or 1 to it. > > > > But here, for the probe/release, take "release" for example, if user > > writes something that stands for cpu0 to it, the cpu0 will be removed, > > and the cpu0 directory and the files under it will be removed. But > > "release" itself is not removed. > > > > They are attributes of cpu_subsys, not of some specific cpus. > > OIC, so the file itself which triggers removal doesn't get removed. > Hmmm... but can't you still fall into deadlock? If on/offline takes > the same lock under active protection which is also taken while > removing the cpu files, it doesn't really matter whether the release > file itself is removed in the process or not. You can still have ABBA > deadlock. What am I missing here? After thinking it harder, I still couldn't see ABBA here ... the active protection taken here is for "probe/release" which will not be waited for removing something like "online" under cpu#? Or my assumption that s_active for different files are different locks are completely wrong? Or I missed something else? Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] USB: serial: fix sysfs-attribute removal deadlock
On Wed, 2014-04-23 at 10:19 -0400, Tejun Heo wrote: > cc'ing Li Zhong who's working on a simliar issue in the following > thread and quoting whole body. > > http://thread.gmane.org/gmane.linux.kernel/1680706 > > Li, this is another variation of the same problem. Maybe this can be > covered by your work too? It seems to me that it is about write something to driver attribute, and driver unloading. If so, maybe it's not easy to reuse the help functions created for device attribute, and device removing. But I guess the idea to break the active protection could still be applied here: Maybe we could try_module_get() here (like the other option suggested by Johan?), and break active protection if we could get the module, something like below? diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 83e910a..6ce27e0 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -69,9 +69,24 @@ static ssize_t drv_attr_store(struct kobject *kobj, struct attribute *attr, struct driver_attribute *drv_attr = to_drv_attr(attr); struct driver_private *drv_priv = to_driver(kobj); ssize_t ret = -EIO; + struct kernfs_node *kn; + + if (!try_module_get(drv_priv->driver->owner)) + return ret; + + kn = kernfs_find_and_get(kobj->sd, attr->name); + if (WARN_ON_ONCE(!kn)) + return ret; + + kernfs_break_active_protection(kn); if (drv_attr->store) ret = drv_attr->store(drv_priv->driver, buf, count); + + kernfs_unbreak_active_protection(kn); + kernfs_put(kn); + + module_put(drv_priv->driver->owner); return ret; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:58 +0200, Rafael J. Wysocki wrote: > On Wednesday, April 23, 2014 01:03:42 PM Li Zhong wrote: > > On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > > > places to verify and enforce this? If not, aren't we just feeling > > > > > good when the reality is broken? > > > > > > > > It seems not true ... I think there are devices that don't have the > > > > online/offline concept, we just need to add it, remove it, like ethernet > > > > cards. > > > > > > > > Maybe we could change the comments above, like: > > > > /* We assume device_hotplug_lock must be acquired before > > > > * removing devices, which have online/offline sysfs knob, > > > > * and some locks are needed to serialize the online/offline > > > > * callbacks and device removing. ... > > > > ? > > > > > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > > > remove_memory(), unregister_cpu() > > > > > > > > Currently, remove_memory() has comments for the function: > > > > > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > > > * and online/offline operations before this call, as required by > > > > * try_offline_node(). > > > > */ > > > > > > > > maybe it could be removed with the lockdep assertion. > > > > > > I'm confused about the overall locking scheme. What's the role of > > > device_hotplug_lock? Is that solely to prevent the sysfs deadlock > > > issue? Or does it serve other synchronization purposes depending on > > > the specific subsystem? If the former, the lock no longer needs to > > > exist. The only thing necessary would be synchronization between > > > device_del() deleting the sysfs file and the unbreak helper invoking > > > device-specific callback. If the latter, we probably should change > > > that. Sharing hotplug lock across multiple subsystems through driver > > > core sounds like a pretty bad idea. > > > > I think it's the latter. > > Actually, no, this is not the case if I understand you correctly. Oh, Sorry, I didn't read carefully. Yes, it's not specific subsystem. After seeing your reply, I understand it is for protecting device hot remove involving multiple subsystems. > > > I think device_{on|off}line is better to be > > done in some sort of lock which prevents the device from being removed, > > including some preparation work that needs be done before device_del(). > > Quite frankly, you should be confident that you understand the code you're > trying to modify or please don't touch it. > > I'll have a deeper look at this issue later today or tomorrow and will get > back to you then. Ok, thank you, Zhong > > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-23 at 12:54 +0200, Rafael J. Wysocki wrote: > On Wednesday, April 23, 2014 09:50:32 AM Li Zhong wrote: > > On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: > > > On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: > > > > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > > > > > > > Proper /** function comment would be nice. > > > > > > > > Ok, will try to write some in next version. > > > > > > > > > > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > > > > + struct device_attribute > > > > > > *attr) > > > > > > > > > > I can see why you did this but let's please not require the user of > > > > > this function to see how the thing is working internally. Let's > > > > > return int and keep track of (or look up again) the kernfs_node > > > > > internally. > > > > > > > > Ok, it also makes the prototype of lock and unlock look more consistent > > > > and comfortable. > > > > > > > > > > > > > > > { > > > > > ... > > > > > > + /* > > > > > > +* We assume device_hotplug_lock must be acquired before > > > > > > removing > > > > > > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > > > places to verify and enforce this? If not, aren't we just feeling > > > > > good when the reality is broken? > > > > > > > > It seems not true ... I think there are devices that don't have the > > > > online/offline concept, we just need to add it, remove it, like ethernet > > > > cards. > > > > > > Well, I haven't been following this closely (I was travelling, sorry), but > > > there certainly are devices without online/offline. That currently is > > > only > > > present for CPUs, memory blocks and ACPI containers (if I remember > > > correctly). > > > > > > > > > > > Maybe we could change the comments above, like: > > > > /* We assume device_hotplug_lock must be acquired before > > > > * removing devices, which have online/offline sysfs knob, > > > > * and some locks are needed to serialize the online/offline > > > > * callbacks and device removing. ... > > > > ? > > > > > > Lockdep assertions would be better than this in my opinion. > > > > This is talking about the lock required in the other process, the device > > removing process, e.g. that in remove_memory() below. So I guess no > > lockdep assertions needed here. Or I misunderstand your point? > > I mean if you assume certain lock to be held somewhere, it is better to use > lockdep annotations to express that assumption, because that will cause users > to *see* the problem when it happens. OK, I see, I think you were suggesting the same thing as Tejun, just I misunderstood it. Thanks, Zhong > > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 16:44 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 22, 2014 at 11:34:39AM +0800, Li Zhong wrote: > > > Is this assumption true? If so, can we add lockdep assertions in > > > places to verify and enforce this? If not, aren't we just feeling > > > good when the reality is broken? > > > > It seems not true ... I think there are devices that don't have the > > online/offline concept, we just need to add it, remove it, like ethernet > > cards. > > > > Maybe we could change the comments above, like: > > /* We assume device_hotplug_lock must be acquired before > > * removing devices, which have online/offline sysfs knob, > > * and some locks are needed to serialize the online/offline > > * callbacks and device removing. ... > > ? > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > remove_memory(), unregister_cpu() > > > > Currently, remove_memory() has comments for the function: > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > * and online/offline operations before this call, as required by > > * try_offline_node(). > > */ > > > > maybe it could be removed with the lockdep assertion. > > I'm confused about the overall locking scheme. What's the role of > device_hotplug_lock? Is that solely to prevent the sysfs deadlock > issue? Or does it serve other synchronization purposes depending on > the specific subsystem? If the former, the lock no longer needs to > exist. The only thing necessary would be synchronization between > device_del() deleting the sysfs file and the unbreak helper invoking > device-specific callback. If the latter, we probably should change > that. Sharing hotplug lock across multiple subsystems through driver > core sounds like a pretty bad idea. I think it's the latter. I think device_{on|off}line is better to be done in some sort of lock which prevents the device from being removed, including some preparation work that needs be done before device_del(). Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Tue, 2014-04-22 at 16:40 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 22, 2014 at 10:29:37AM +0800, Li Zhong wrote: > > The probe/release files are attribute files for cpu subsys, looks like > > following in sysfs dirs > > > > $ cd /sys/devices/system/cpu/ > > $ ls -l > > total 0 > > drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 > > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 > > drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 > > .. > > drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq > > drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle > > -rw---. 1 root root 65536 Apr 21 00:28 dscr_default > > -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max > > -r--r--r--. 1 root root 65536 Apr 21 00:28 offline > > -r--r--r--. 1 root root 65536 Sep 4 2014 online > > -r--r--r--. 1 root root 65536 Apr 21 00:28 possible > > drwxr-xr-x. 2 root root 0 Apr 20 08:00 power > > -r--r--r--. 1 root root 65536 Apr 17 20:46 present > > --w---. 1 root root 65536 Apr 21 00:28 probe<- > > --w---. 1 root root 65536 Apr 21 00:28 release <- > > -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core > > -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent > > > > From the code, it seems cpu subsys won't be unregistered, and it doesn't > > make sense to remove all the cpus in the system. > > I don't think I'm following you. Are you saying that no files which > are protected under the hotplug lock that cpu subsys uses are removed > during offlining? Sorry, Maybe I didn't say it clearly. There are files under cpu#, e.g. $ cd /sys/devices/system/cpu/cpu0 $ ls cache dscr node1pmc1 pmc5 smt_snooze_delay uevent cpuidle mmcr0 online pmc2 pmc6 spurr crash_notes mmcr1 physical_id pmc3 power subsystem crash_notes_size mmcra pir pmc4 purr topology If you remove cpu0, then the cpu0 directory will be removed, together with the "online" file in the directory, while some other process might be writing 0 or 1 to it. But here, for the probe/release, take "release" for example, if user writes something that stands for cpu0 to it, the cpu0 will be removed, and the cpu0 directory and the files under it will be removed. But "release" itself is not removed. They are attributes of cpu_subsys, not of some specific cpus. Hopes the above makes things a bit clearer. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-22 at 12:11 +0200, Rafael J. Wysocki wrote: > On Tuesday, April 22, 2014 11:34:39 AM Li Zhong wrote: > > On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > > > > > Proper /** function comment would be nice. > > > > Ok, will try to write some in next version. > > > > > > > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > > > + struct device_attribute > > > > *attr) > > > > > > I can see why you did this but let's please not require the user of > > > this function to see how the thing is working internally. Let's > > > return int and keep track of (or look up again) the kernfs_node > > > internally. > > > > Ok, it also makes the prototype of lock and unlock look more consistent > > and comfortable. > > > > > > > > > { > > > ... > > > > + /* > > > > +* We assume device_hotplug_lock must be acquired before > > > > removing > > > > > > Is this assumption true? If so, can we add lockdep assertions in > > > places to verify and enforce this? If not, aren't we just feeling > > > good when the reality is broken? > > > > It seems not true ... I think there are devices that don't have the > > online/offline concept, we just need to add it, remove it, like ethernet > > cards. > > Well, I haven't been following this closely (I was travelling, sorry), but > there certainly are devices without online/offline. That currently is only > present for CPUs, memory blocks and ACPI containers (if I remember correctly). > > > > > Maybe we could change the comments above, like: > > /* We assume device_hotplug_lock must be acquired before > > * removing devices, which have online/offline sysfs knob, > > * and some locks are needed to serialize the online/offline > > * callbacks and device removing. ... > > ? > > Lockdep assertions would be better than this in my opinion. This is talking about the lock required in the other process, the device removing process, e.g. that in remove_memory() below. So I guess no lockdep assertions needed here. Or I misunderstand your point? > > > > And we could add lockdep assertions in cpu and memory related code? e.g. > > remove_memory(), unregister_cpu() > > > > Currently, remove_memory() has comments for the function: > > > > * NOTE: The caller must call lock_device_hotplug() to serialize hotplug > > * and online/offline operations before this call, as required by > > * try_offline_node(). > > */ > > > > maybe it could be removed with the lockdep assertion. > > No, please don't remove it. It is there to explain where the locking > requirement > comes from. OK, I see. I think I'll just add lockdep assertions, and keep the comments there. Thanks, Zhong > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-21 at 18:46 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 21, 2014 at 05:23:50PM +0800, Li Zhong wrote: > > Proper /** function comment would be nice. Ok, will try to write some in next version. > > > +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, > > + struct device_attribute *attr) > > I can see why you did this but let's please not require the user of > this function to see how the thing is working internally. Let's > return int and keep track of (or look up again) the kernfs_node > internally. Ok, it also makes the prototype of lock and unlock look more consistent and comfortable. > > > { > ... > > + /* > > +* We assume device_hotplug_lock must be acquired before removing > > Is this assumption true? If so, can we add lockdep assertions in > places to verify and enforce this? If not, aren't we just feeling > good when the reality is broken? It seems not true ... I think there are devices that don't have the online/offline concept, we just need to add it, remove it, like ethernet cards. Maybe we could change the comments above, like: /* We assume device_hotplug_lock must be acquired before * removing devices, which have online/offline sysfs knob, * and some locks are needed to serialize the online/offline * callbacks and device removing. ... ? And we could add lockdep assertions in cpu and memory related code? e.g. remove_memory(), unregister_cpu() Currently, remove_memory() has comments for the function: * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ maybe it could be removed with the lockdep assertion. > ... > > Function comment please. OK. Thanks, Zhong > > +void unlock_device_hotplug_sysfs(struct device *dev, > > +struct kernfs_node *kn) > > +{ > > + unlock_device_hotplug(); > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + kernfs_put(kn); > > } > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
On Mon, 2014-04-21 at 18:38 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 21, 2014 at 05:20:59PM +0800, Li Zhong wrote: > > While auditing the usage of lock_device_hotplug_sysfs() for implementing > > it in another way in following patch, it seems to me that the code here > > is to add/remove device, and the files probe/release for cpu bus > > themselves won't be removed. > > > > So it seems to me there is no s_active related deadlock here, and we > > could just use lock_device_hotplug(). > > It may still cause issue if offlining ends up removing sysfs files or > gets involved with the same lock used during cpu hot[un]plug > operations (e.g. offlining racing against cpu hotunplug) and offlining > a cpu does remove files. Has this change been tested? The probe/release files are attribute files for cpu subsys, looks like following in sysfs dirs $ cd /sys/devices/system/cpu/ $ ls -l total 0 drwxr-xr-x. 7 root root 0 Apr 17 19:00 cpu0 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu1 drwxr-xr-x. 4 root root 0 Apr 17 19:00 cpu10 .. drwxr-xr-x. 3 root root 0 Apr 20 08:00 cpufreq drwxr-xr-x. 2 root root 0 Apr 20 08:00 cpuidle -rw---. 1 root root 65536 Apr 21 00:28 dscr_default -r--r--r--. 1 root root 65536 Apr 21 00:28 kernel_max -r--r--r--. 1 root root 65536 Apr 21 00:28 offline -r--r--r--. 1 root root 65536 Sep 4 2014 online -r--r--r--. 1 root root 65536 Apr 21 00:28 possible drwxr-xr-x. 2 root root 0 Apr 20 08:00 power -r--r--r--. 1 root root 65536 Apr 17 20:46 present --w---. 1 root root 65536 Apr 21 00:28 probe<- --w---. 1 root root 65536 Apr 21 00:28 release <- -rw---. 1 root root 65536 Apr 21 00:28 subcores_per_core -rw-r--r--. 1 root root 65536 Apr 21 00:28 uevent >From the code, it seems cpu subsys won't be unregistered, and it doesn't make sense to remove all the cpus in the system. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v5 2/2] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The active protection is there to keep the file alive by blocking deletion while operations are on-going in the file. This blocking creates a dependency loop when an operation running off a sysfs knob ends up grabbing a lock which may be held while removing the said sysfs knob. So the problem here is the order of s_active, and the series of hotplug related locks. commit 5e33bc41 solves it by taking out the first of the series of hoplug related locks, device_hotplug_lock, with a try lock. And if that try lock fails, it returns a restart syscall error, and drops s_active temporarily to allow the other process to remove the sysfs knob. This doesn't help with lockdep warnings reported against s_active and other hotplug related locks in the series. This patch instead tries to take s_active out of the lock dependency graph using the active protection breaking mechanism. lock_device_hotplug_sysfs() function name is kept here, two more arguments are added, dev, attr, to help find the kernfs_node corresponding to the sysfs knob (which should always be able to be found, WARN if not). The found kernfs_node is recorded as the return value, to be released in unlock_device_hotplug_sysfs(). As we break the active protection here, the device removing process might remove the sysfs entries after that point. So after we grab the device_hotplug_lock, we need check whether that really happens. If so, we should abort and not invoke the online/offline callbacks anymore. In this case, NULL is returned to the caller, so it could return with ENODEV. Signed-off-by: Li Zhong --- drivers/base/core.c| 63 ++ drivers/base/memory.c | 9 include/linux/device.h | 5 +++- 3 files changed, 62 insertions(+), 15 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..4d37a2b 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -61,14 +61,55 @@ void unlock_device_hotplug(void) mutex_unlock(&device_hotplug_lock); } -int lock_device_hotplug_sysfs(void) +struct kernfs_node *lock_device_hotplug_sysfs(struct device *dev, + struct device_attribute *attr) { - if (mutex_trylock(&device_hotplug_lock)) - return 0; + struct kernfs_node *kn; + + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + + if (WARN_ON_ONCE(!kn)) + return NULL; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* We assume device_hotplug_lock must be acquired before removing +* device, we can check here whether the device has been removed, so +* we don't call device_{on|off}line against removed device. +*/ - /* Avoid busy looping (5 ms of sleep should do). */ - msleep(5); - return restart_syscall(); + if (!dev->kobj.sd) { + /* device_del() already called on @dev, we should also abort */ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); + return NULL; + } + + return kn; +} + +void unlock_device_hotplug_sysfs(struct device *dev, +struct kernfs_node *kn) +{ + unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + kernfs_put(kn); } #ifdef CONFIG_BLOCK @@ -439,17 +480,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, &val); if (ret < 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = lock_device_hotplug_sysfs(dev, attr); + if (!kn) + return -ENODEV; ret = val ? device_online(dev) : device_offline(dev); - unlock_device_hotplug(); + unlock_device_hotplug_sysfs(dev, kn); + return ret < 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..c2b66d4 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,11 @@ store_mem_state(struct device *dev, {
[RFC PATCH v5 1/2] Use lock_device_hotplug() in cpu_probe_store() and cpu_release_store()
While auditing the usage of lock_device_hotplug_sysfs() for implementing it in another way in following patch, it seems to me that the code here is to add/remove device, and the files probe/release for cpu bus themselves won't be removed. So it seems to me there is no s_active related deadlock here, and we could just use lock_device_hotplug(). Signed-off-by: Li Zhong --- drivers/base/cpu.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 006b1bc..9483225 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -88,9 +88,7 @@ static ssize_t cpu_probe_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_probe(buf, count); @@ -106,9 +104,7 @@ static ssize_t cpu_release_store(struct device *dev, ssize_t cnt; int ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + lock_device_hotplug(); cnt = arch_cpu_release(buf, count); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
On Thu, 2014-04-17 at 11:17 -0400, Tejun Heo wrote: > Hello, > > On Thu, Apr 17, 2014 at 02:50:44PM +0800, Li Zhong wrote: > > This patch tries to solve the device hot remove locking issues in a > > different way from commit 5e33bc41, as kernfs already has a mechanism > > to break active protection. > > > > The problem here is the order of s_active, and series of hotplug related > > lock. > > It prolly deservse more detailed explanation of the deadlock along > with how 5e33bc41 ("$SUBJ") tried to solve it. The active protetion > is there to keep the file alive by blocking deletion while operations > are on-going in the file. This blocking creates a dependency loop > when an operation running off a sysfs knob ends up grabbing a lock > which may be held while removing the said sysfs knob. OK, I'll try to add these and something more in next version. > > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + return -ENODEV; > > + > > + /* > > +* Break active protection here to avoid deadlocks with device > > +* removing process, which tries to remove sysfs entries including this > > +* "online" attribute while holding some hotplug related locks. > > +* > > +* @dev needs to be protected here, or it could go away any time after > > +* dropping active protection. But it is still unreasonable/unsafe to > > +* online/offline a device after it being removed. Fortunately, there > > I think this is something driver layer proper should provide > synchronization for. It shouldn't be difficult to synchronize this > function against device_del(), right? And, please note that @dev is > guaranteed to have not been removed (at least hasn't gone through attr > removal) upto this point. Ok, I think what we need here is the check below, after getting device_hotplug_lock, and abort this function if device already removed. We should allow device_del() to remove the device in the other process, which is why we are breaking the active protection. > > > +* are some checks in online/offline knobs. Like cpu, it checks cpu > > +* present/online mask before doing the real work. > > +*/ > > + > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > + > > + lock_device_hotplug(); > > + > > + /* > > +* If we assume device_hotplug_lock must be acquired before removing > > +* device, we may try to find a way to check whether the device has > > +* been removed here, so we don't call device_{on|off}line against > > +* removed device. > > +*/ > > Yeah, let's please fix this. OK, I guess we can check whether dev->kobj.sd becomes NULL. If so, it means the device has already been deleted by device_del(). > > > ret = val ? device_online(dev) : device_offline(dev); > > unlock_device_hotplug(); > > + > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > + > > return ret < 0 ? ret : count; > > } > > static DEVICE_ATTR_RW(online); > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..0d2f3a5 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, > > { > > struct memory_block *mem = to_memory_block(dev); > > int ret, online_type; > > + struct kernfs_node *kn; > > > > - ret = lock_device_hotplug_sysfs(); > > - if (ret) > > - return ret; > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + return -ENODEV; > > + > > + /* refer to comments in online_store() for more information */ > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > + > > + lock_device_hotplug(); > > > > if (!strncmp(buf, "online_kernel", min_t(int, count, 13))) > > online_type = ONLINE_KERNEL; > > @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, > > err: > > unlock_device_hotplug(); > > > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > There are other users of lock_device_hotplug_sysfs(). We probably > want to audit them and convert them too, preferably with helper > routines so that they don't end up duplicating the complexity? I see, I guess I could keep lock_device_hotplug_sysfs(), just replace it with the implementation here; and provide a new unlock_device_hotplug_sysfs(), which will do the unlock, unbreak, and puts. I'll try to get the code ready sometime next week for your review. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4] Use kernfs_break_active_protection() for device online store callbacks
This patch tries to solve the device hot remove locking issues in a different way from commit 5e33bc41, as kernfs already has a mechanism to break active protection. The problem here is the order of s_active, and series of hotplug related lock. This patch takes s_active out of the lock dependency graph, so it won't dead lock with any hotplug realted locks. Signed-off-by: Li Zhong --- drivers/base/core.c | 37 ++--- drivers/base/memory.c | 18 +++--- 2 files changed, 49 insertions(+), 6 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..1b96cb9 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,17 +439,48 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, &val); if (ret < 0) return ret; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* +* Break active protection here to avoid deadlocks with device +* removing process, which tries to remove sysfs entries including this +* "online" attribute while holding some hotplug related locks. +* +* @dev needs to be protected here, or it could go away any time after +* dropping active protection. But it is still unreasonable/unsafe to +* online/offline a device after it being removed. Fortunately, there +* are some checks in online/offline knobs. Like cpu, it checks cpu +* present/online mask before doing the real work. +*/ + + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); + + /* +* If we assume device_hotplug_lock must be acquired before removing +* device, we may try to find a way to check whether the device has +* been removed here, so we don't call device_{on|off}line against +* removed device. +*/ ret = val ? device_online(dev) : device_offline(dev); unlock_device_hotplug(); + + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + return ret < 0 ? ret : count; } static DEVICE_ATTR_RW(online); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..0d2f3a5 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -320,10 +320,17 @@ store_mem_state(struct device *dev, { struct memory_block *mem = to_memory_block(dev); int ret, online_type; + struct kernfs_node *kn; - ret = lock_device_hotplug_sysfs(); - if (ret) - return ret; + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); + if (WARN_ON_ONCE(!kn)) + return -ENODEV; + + /* refer to comments in online_store() for more information */ + get_device(dev); + kernfs_break_active_protection(kn); + + lock_device_hotplug(); if (!strncmp(buf, "online_kernel", min_t(int, count, 13))) online_type = ONLINE_KERNEL; @@ -362,6 +369,11 @@ store_mem_state(struct device *dev, err: unlock_device_hotplug(); + kernfs_unbreak_active_protection(kn); + put_device(dev); + + kernfs_put(kn); + if (ret) return ret; return count; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Wed, 2014-04-16 at 11:17 -0400, Tejun Heo wrote: > Hello, > > On Wed, Apr 16, 2014 at 09:41:40AM +0800, Li Zhong wrote: > > > If so, that is > > > an actually possible deadlock, no? > > > > Yes, but it seems to me that it is solved in commit 5e33bc41, which uses > > lock_device_hotplug_sysfs() to return a restart syscall error if not > > able to try lock the device_hotplug_lock. That also requires the device > > removing code path to take the device_hotplug_lock. > > But that patch only takes out device_hotplug_lock out of the > dependency graph and does nothing for cpu_add_remove_lock. It seems > to be that there still is a deadlock condition involving s_active and > cpu_add_remove_lock. Am I missing something here? It seems to me cpu_add_remove_lock is always taken after device_hotplug_lock. So if cpu_add_remove_lock has been acquired by device removing process, then it means the other online/offline process couldn't successfully try lock device_hotplug_lock, and will release s_active with a restart syscall error; if cpu_add_remove_lock has been acquired by online/offline process, then it should already hold device_hotlug_lock, and keeps the device removing process waiting at device_hotplug_lock. So online/offline process could release the lock, and finally release s_active soon. But after some further thinking, I seem to understand your point. s_active has lock order problem with the other series of hotplug related locks, so it's better to take s_active out of the dependency chain, rather than the first of the other series of locks? like you suggested below. > > Now that kernfs has a proper mechanism to deal with it, wouldn't it > make more sense to replace 5e33bc41 with prper s_active protection > breaking? I'll try this way and send you the code for review. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong Reviewed-by: Zhang Yanfei --- v3: memort->memory, pointed out by Yanfei Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a "section". The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is "one section". (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called "sections". The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", @@ -185,6
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Tue, 2014-04-15 at 10:50 -0400, Tejun Heo wrote: > Hello, > > On Tue, Apr 15, 2014 at 10:44:37AM +0800, Li Zhong wrote: > > / * > > * This process might deadlock with another process trying to > > * remove this device: > > * This process holding the s_active of "online" attribute, and tries > > * to online/offline the device with some locks protecting hotplug. > > * Device removing process holding some locks protecting hotplug, and > > * tries to remove the "online" attribute, waiting for the s_active to > > * be released. > > * > > * The deadlock described above should be solved with > > * lock_device_hotplug_sysfs(). We temporarily drop the active > > * protection here to avoid some lockdep warnings. > > * > > * If device_hotplug_lock is forgotten to be used when removing > > * device(possibly some very simple device even don't need this lock?), > > * @dev could go away any time after dropping the active protection. > > * So increase its ref count before dropping active protection. > > * Though invoking device_{on|off}line() on a removed device seems > > * unreasonable, it should be less disastrous than playing with freed > > * @dev. Also, we might be able to have some mechanism abort > > * device_{on|off}line() if @dev already removed. > > */ > > Hmmm... I'm not sure I fully understand the problem. Does the code > ever try to remove "online" while holding cpu_add_remove_lock and, > when written 0, online knob grabs cpu_add_remove_lock? Yes. In acpi_processor_remove(), cpu_maps_update_begin() is called to hold cpu_add_remove_lock, and then arch_unregister_cpu calls unregister_cpu(), which will try to remove dir cpu1 including "online". while written 0 to online, cpu_down() will also try to grab cpu_add_remove_lock with cpu_maps_update_begin(). > If so, that is > an actually possible deadlock, no? Yes, but it seems to me that it is solved in commit 5e33bc41, which uses lock_device_hotplug_sysfs() to return a restart syscall error if not able to try lock the device_hotplug_lock. That also requires the device removing code path to take the device_hotplug_lock. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
On Mon, 2014-04-14 at 17:13 +0800, Zhang Yanfei wrote: > On 04/14/2014 04:43 PM, Li Zhong wrote: > > Seems we all agree that information about SECTION, e.g. section size, > > sections per memory block should be kept as kernel internals, and not > > exposed to userspace. > > > > This patch updates Documentation/memory-hotplug.txt to refer to memory > > blocks instead of memory sections where appropriate and added a > > paragraph to explain that memory blocks are made of memory sections. > > The documentation update is mostly provided by Nathan. > > > > Also, as end_phys_index in code is actually not the end section id, but > > the end memory block id, which should always be the same as phys_index. > > So it is removed here. > > > > Signed-off-by: Li Zhong > > Reviewed-by: Zhang Yanfei > > Still the nitpick there. Ao.. Will fix it in next version. Thanks, Zhong > > > --- > > Documentation/memory-hotplug.txt | 125 > > +++--- > > drivers/base/memory.c| 12 > > 2 files changed, 61 insertions(+), 76 deletions(-) > > > > diff --git a/Documentation/memory-hotplug.txt > > b/Documentation/memory-hotplug.txt > > index 58340d5..1aa239f 100644 > > --- a/Documentation/memory-hotplug.txt > > +++ b/Documentation/memory-hotplug.txt > > @@ -88,16 +88,21 @@ phase by hand. > > > > 1.3. Unit of Memory online/offline operation > > > > -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole > > memory > > -into chunks of the same size. The chunk is called a "section". The size of > > -a section is architecture dependent. For example, power uses 16MiB, ia64 > > uses > > -1GiB. The unit of online/offline operation is "one section". (see Section > > 3.) > > +Memory hotplug uses SPARSEMEM memory model which allows memory to be > > divided > > +into chunks of the same size. These chunks are called "sections". The size > > of > > +a memory section is architecture dependent. For example, power uses 16MiB, > > ia64 > > +uses 1GiB. > > > > -To determine the size of sections, please read this file: > > +Memory sections are combined into chunks referred to as "memory blocks". > > The > > +size of a memory block is architecture dependent and represents the logical > > +unit upon which memory online/offline operations are to be performed. The > > +default size of a memory block is the same as memory section size unless an > > +architecture specifies otherwise. (see Section 3.) > > + > > +To determine the size (in bytes) of a memory block please read this file: > > > > /sys/devices/system/memory/block_size_bytes > > > > -This file shows the size of sections in byte. > > > > --- > > 2. Kernel Configuration > > @@ -123,42 +128,35 @@ config options. > > (CONFIG_ACPI_CONTAINER). > > This option can be kernel module too. > > > > + > > > > -4 sysfs files for memory hotplug > > +3 sysfs files for memory hotplug > > > > -All sections have their device information in sysfs. Each section is part > > of > > -a memory block under /sys/devices/system/memory as > > +All memory blocks have their device information in sysfs. Each memory > > block > > +is described under /sys/devices/system/memory as > > > > /sys/devices/system/memory/memoryXXX > > -(XXX is the section id.) > > +(XXX is the memory block id.) > > > > -Now, XXX is defined as (start_address_of_section / section_size) of the > > first > > -section contained in the memory block. The files 'phys_index' and > > -'end_phys_index' under each directory report the beginning and end section > > id's > > -for the memory block covered by the sysfs directory. It is expected that > > all > > +For the memory block covered by the sysfs directory. It is expected that > > all > > memory sections in this range are present and no memory holes exist in the > > range. Currently there is no way to determine if there is a memory hole, > > but > > the existence of one should not affect the hotplug capabilities of the > > memory > > block. > > > > -For example, assume 1GiB section size. A device for a memory starting at > > +For example, assume 1GiB memory block size. A device for a memory starting > > a
Re: [RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
On Mon, 2014-04-14 at 16:13 -0400, Tejun Heo wrote: > Hello, > > On Mon, Apr 14, 2014 at 03:47:29PM +0800, Li Zhong wrote: > > @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > { > > bool val; > > int ret; > > + struct kernfs_node *kn; > > > > ret = strtobool(buf, &val); > > if (ret < 0) > > @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct > > device_attribute *attr, > > if (ret) > > return ret; > > > > + kn = kernfs_find_and_get(dev->kobj.sd, attr->attr.name); > > + if (WARN_ON_ONCE(!kn)) > > + goto out; > > + > > + get_device(dev); > > + kernfs_break_active_protection(kn); > > ret = val ? device_online(dev) : device_offline(dev); > > + kernfs_unbreak_active_protection(kn); > > + put_device(dev); > > + > > + kernfs_put(kn); > > + > > +out: > > unlock_device_hotplug(); > > return ret < 0 ? ret : count; > > } > > Can you please add comment explainin why this is being down? As it > currently stands, it's quite a bit of complexity without any > indication what's going on why. Also, if device_hotplug is locked, is > it really necessary to get @dev? Can it go away inbetween? The code > snippet looks weird because getting @dev indicates that the device > might go away without doing it but then it proceeds to invoke > device_{on|off}line() which probably isn't safe to invoke on a removed > device. Hi Tejun, I tried to write some draft words here... (I'm not good at writing it...) Could you please help to have a review and comment? thanks. / * * This process might deadlock with another process trying to * remove this device: * This process holding the s_active of "online" attribute, and tries * to online/offline the device with some locks protecting hotplug. * Device removing process holding some locks protecting hotplug, and * tries to remove the "online" attribute, waiting for the s_active to * be released. * * The deadlock described above should be solved with * lock_device_hotplug_sysfs(). We temporarily drop the active * protection here to avoid some lockdep warnings. * * If device_hotplug_lock is forgotten to be used when removing * device(possibly some very simple device even don't need this lock?), * @dev could go away any time after dropping the active protection. * So increase its ref count before dropping active protection. * Though invoking device_{on|off}line() on a removed device seems * unreasonable, it should be less disastrous than playing with freed * @dev. Also, we might be able to have some mechanism abort * device_{on|off}line() if @dev already removed. */ Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index
Seems we all agree that information about SECTION, e.g. section size, sections per memory block should be kept as kernel internals, and not exposed to userspace. This patch updates Documentation/memory-hotplug.txt to refer to memory blocks instead of memory sections where appropriate and added a paragraph to explain that memory blocks are made of memory sections. The documentation update is mostly provided by Nathan. Also, as end_phys_index in code is actually not the end section id, but the end memory block id, which should always be the same as phys_index. So it is removed here. Signed-off-by: Li Zhong --- Documentation/memory-hotplug.txt | 125 +++--- drivers/base/memory.c| 12 2 files changed, 61 insertions(+), 76 deletions(-) diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 58340d5..1aa239f 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -88,16 +88,21 @@ phase by hand. 1.3. Unit of Memory online/offline operation -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory -into chunks of the same size. The chunk is called a "section". The size of -a section is architecture dependent. For example, power uses 16MiB, ia64 uses -1GiB. The unit of online/offline operation is "one section". (see Section 3.) +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided +into chunks of the same size. These chunks are called "sections". The size of +a memory section is architecture dependent. For example, power uses 16MiB, ia64 +uses 1GiB. -To determine the size of sections, please read this file: +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and represents the logical +unit upon which memory online/offline operations are to be performed. The +default size of a memory block is the same as memory section size unless an +architecture specifies otherwise. (see Section 3.) + +To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes -This file shows the size of sections in byte. --- 2. Kernel Configuration @@ -123,42 +128,35 @@ config options. (CONFIG_ACPI_CONTAINER). This option can be kernel module too. + -4 sysfs files for memory hotplug +3 sysfs files for memory hotplug -All sections have their device information in sysfs. Each section is part of -a memory block under /sys/devices/system/memory as +All memory blocks have their device information in sysfs. Each memory block +is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is the section id.) +(XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory block. -For example, assume 1GiB section size. A device for a memory starting at +For example, assume 1GiB memory block size. A device for a memory starting at 0x1 is /sys/device/system/memory/memory4 (0x1 / 1Gib = 4) This device covers address range [0x1 ... 0x14000) -Under each section, you can see 4 or 5 files, the end_phys_index file being -a recent addition and not present on older kernels. +Under each memory block, you can see 4 files: -/sys/devices/system/memory/memoryXXX/start_phys_index -/sys/devices/system/memory/memoryXXX/end_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", @@ -185,6 +183,7 @@ For example: A backlink will also be created: /sys/devices/syst
[RFC PATCH v3] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][] sysfs_remove_file_ns +0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove +0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin +0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong --- drivers/base/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..b313ad7 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, &val); if (ret < 0) @@ -448,7 +449,19 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (re
[RFC PATCH v2] Use kernfs_break_active_protection() for device online store callbacks
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns +0x50/0xc0 [84154.213203][] sysfs_remove_file_ns +0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove +0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug +0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn +0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable +0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin +0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. As Tejun suggested, to avoid this lockdep warning, kernfs_break_active_protection() is used before online/offline callbacks to take the s_active lock out of the dependency chain during online/offline operations. Signed-off-by: Li Zhong --- drivers/base/core.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..2b9f68e 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -439,6 +439,7 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, { bool val; int ret; + struct kernfs_node *kn; ret = strtobool(buf, &val); if (ret < 0) @@ -448,7 +449,15 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, if (ret)
[RFC PATCH] Suppress a device hot remove related lockdep warning
I noticed following lockdep warning when trying acpi hot-remove cpus: [84154.204080] == [84154.204080] [ INFO: possible circular locking dependency detected ] [84154.204080] 3.14.0-next-20140408+ #24 Tainted: GW [84154.204080] --- [84154.204080] bash/777 is trying to acquire lock: [84154.204080] (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20 [84154.213203] [84154.213203] but task is already holding lock: [84154.213203] (s_active#79){.+}, at: [] kernfs_fop_write+0xe4/0x190 [84154.213203] [84154.213203] which lock already depends on the new lock. [84154.213203] [84154.213203] [84154.213203] the existing dependency chain (in reverse order) is: [84154.213203] -> #3 (s_active#79){.+}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] __kernfs_remove+0x250/0x310 [84154.213203][] kernfs_remove_by_name_ns+0x50/0xc0 [84154.213203][] sysfs_remove_file_ns+0x15/0x20 [84154.213203][] device_remove_file+0x19/0x20 [84154.213203][] device_remove_attrs+0x33/0x80 [84154.213203][] device_del+0x127/0x1c0 [84154.213203][] device_unregister+0x22/0x60 [84154.213203][] unregister_cpu+0x1e/0x40 [84154.213203][] arch_unregister_cpu+0x23/0x30 [84154.213203][] acpi_processor_remove+0x8d/0xb2 [84154.213203][] acpi_bus_trim+0x5a/0x8d [84154.213203][] acpi_device_hotplug+0x1a8/0x3ec [84154.213203][] acpi_hotplug_work_fn+0x1f/0x2b [84154.213203][] process_one_work+0x1eb/0x6b0 [84154.213203][] worker_thread+0x11b/0x370 [84154.213203][] kthread+0xe4/0x100 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #2 (cpu_hotplug.lock#2){+.+.+.}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_hotplug_begin+0x4f/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable+0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #1 (cpu_hotplug.lock){++}: [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] cpu_hotplug_begin+0x41/0x80 [84154.213203][] _cpu_up+0x3f/0x160 [84154.213203][] cpu_up+0x69/0x80 [84154.213203][] smp_init+0x60/0x8c [84154.213203][] kernel_init_freeable+0x126/0x23b [84154.213203][] kernel_init+0xe/0xf0 [84154.213203][] ret_from_fork+0x7c/0xb0 [84154.213203] -> #0 (cpu_add_remove_lock){+.+.+.}: [84154.213203][] __lock_acquire+0x1f2a/0x1f60 [84154.213203][] lock_acquire+0x9b/0x1d0 [84154.213203][] mutex_lock_nested+0x50/0x3c0 [84154.213203][] cpu_maps_update_begin+0x17/0x20 [84154.213203][] cpu_down+0x1d/0x50 [84154.213203][] cpu_subsys_offline+0x14/0x20 [84154.213203][] device_offline+0xad/0xd0 [84154.213203][] online_store+0x42/0x80 [84154.213203][] dev_attr_store+0x18/0x30 [84154.213203][] sysfs_kf_write+0x49/0x60 [84154.213203][] kernfs_fop_write+0x109/0x190 [84154.213203][] vfs_write+0xbe/0x1c0 [84154.213203][] SyS_write+0x52/0xb0 [84154.213203][] tracesys+0xd0/0xd5 [84154.213203] [84154.213203] other info that might help us debug this: [84154.213203] [84154.213203] Chain exists of: cpu_add_remove_lock --> cpu_hotplug.lock#2 --> s_active#79 [84154.213203] Possible unsafe locking scenario: [84154.213203] [84154.213203]CPU0CPU1 [84154.213203] [84154.213203] lock(s_active#79); [84154.213203]lock(cpu_hotplug.lock#2); [84154.213203]lock(s_active#79); [84154.213203] lock(cpu_add_remove_lock); [84154.213203] [84154.213203] *** DEADLOCK *** . The deadlock itself seems already fixed in commit 5e33bc41. This patch uses DEVICE_ATTR_IGNORE_LOCKDEP for "online" attr to suppress this lockdep warning. But I'm a little afraid it might also hide (future) potential real dead lock scenarios? Signed-off-by: Li Zhong --- drivers/base/core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 0dd6528..98e3e09 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -452,7 +452,9 @@ static ssize_t online_store(struct device *dev, struct device_attribute *attr, unlock_device_hotplug(); return ret < 0 ? ret : count; } -static DEVICE_ATTR_RW(online); + +static DEVICE_ATTR_IGNORE_LOCKDEP(online, (S_IWUSR | S_IRUGO), + online_show, online_store); int device_add_gro
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 12:39 -0500, Nathan Fontenot wrote: > On 04/08/2014 02:47 PM, Dave Hansen wrote: > > > > That document really needs to be updated to stop referring to sections > > (at least in the descriptions of the user interface). We can not change > > the units of phys_index/end_phys_index without also changing > > block_size_bytes. > > > > Here is a first pass at updating the documentation. > > I have tried to update the documentation to refer to memory blocks instead > of memory sections where appropriate and added a paragraph to explain > that memory blocks are mode of memory sections. > > Thoughts? If we all agree to hide the information about sections, then I think we also need to update the section id's used for phys_index/end_phys_index, something like following on top of yours? -- diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 92d15e2..9fbb025 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -138,10 +138,7 @@ is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX (XXX is the memory block id.) -Now, XXX is defined as (start_address_of_section / section_size) of the first -section contained in the memory block. The files 'phys_index' and -'end_phys_index' under each directory report the beginning and end section id's -for the memory block covered by the sysfs directory. It is expected that all +For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the range. Currently there is no way to determine if there is a memory hole, but the existence of one should not affect the hotplug capabilities of the memory @@ -155,16 +152,14 @@ This device covers address range [0x1 ... 0x14000) Under each memory block, you can see 4 or 5 files, the end_phys_index file being a recent addition and not present on older kernels. -/sys/devices/system/memory/memoryXXX/start_phys_index +/sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/end_phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id of the first section - in the memory block, same as XXX. -'end_phys_index' : read-only and contains section id of the last section - in the memory block. +'phys_index' : read-only and contains memory block id, same as XXX. +'end_phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write at read: contains online/offline state of memory. at write: user can specify "online_kernel", -- Not sure whether it is proper to remove end_phys_index, too? Thanks, Zhong > > -Nathan > --- > Documentation/memory-hotplug.txt | 113 > --- > 1 file changed, 59 insertions(+), 54 deletions(-) > > Index: linux/Documentation/memory-hotplug.txt > === > --- linux.orig/Documentation/memory-hotplug.txt > +++ linux/Documentation/memory-hotplug.txt > @@ -88,16 +88,21 @@ phase by hand. > > 1.3. Unit of Memory online/offline operation > > -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole > memory > -into chunks of the same size. The chunk is called a "section". The size of > -a section is architecture dependent. For example, power uses 16MiB, ia64 uses > -1GiB. The unit of online/offline operation is "one section". (see Section 3.) > +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided > +into chunks of the same size. These chunks are called "sections". The size of > +a memory section is architecture dependent. For example, power uses 16MiB, > ia64 > +uses 1GiB. > + > +Memory sections are combined into chunks referred to as "memory blocks". The > +size of a memory block is architecture dependent and represents the logical > +unit upon which memory online/offline operations are to be performed. The > +default size of a memory block is the same as memory section size unless an > +architecture specifies otherwise. (see Section 3.) > > -To determine the size of sections, please read this file: > +To determine the size (in bytes) of a memory block please read this file: > > /sys/devices/system/memory/block_size_bytes > > -This file shows the size of sections in byte. > > --- > 2. Kernel Configuration > @@ -123,14 +128,15 @@ config options. > (CONFIG_ACPI_CONTAINER). > This option can be kernel module too. > > + > > -4 sysfs files for memory hotplug > +3 sysfs files for memory hotplug > > -All sections have their device information in sys
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-09 at 08:49 -0700, Dave Hansen wrote: > On 04/09/2014 02:20 AM, Li Zhong wrote: > > Or do you mean we don't need to expose any information related to > > SECTION to userspace? > > Right, we don't need to expose sections themselves to userspace. Do we? > OK, I agree with that. Yanfei, I recall you once expressed your preference for section numbers? Thanks, Zhong -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Tue, 2014-04-08 at 12:47 -0700, Dave Hansen wrote: > On 04/08/2014 11:23 AM, Nathan Fontenot wrote: > > On 04/08/2014 11:13 AM, Dave Hansen wrote: > >> On 04/08/2014 01:27 AM, Li Zhong wrote: > >>> If Dave and others don't have further objections, it seems this small > >>> userspace incompatibility could be accepted by most of us, and I don't > >>> need to make a version 2. > >> > >> Let me ask another question then. What are the units of > >> phys_index/end_phys_index? How do we expose those units to userspace? > >> > > > > The documentation for these files just states that the files contain > > the first and last section id of memory in the memory block for > > phys_index and end_phys_index respectively. > > > > I'm not sure the values have ever been units of anything, at least not > > that I remember. > > > > There are two units. SECTION_SIZE, which is completely internal to the > kernel, and block_size_bytes which used to be the same as SECTION_SIZE, > but is not now. Which one of those two is phys_index/end_phys_index in, > and if it is in terms of SECTION_SIZE like this patch proposes, how do > we tell userspace how large SECTION_SIZE is? With this patch, I think we could still tell how large SECTION_SIZE is. block_size_bytes tells us the size of the block, and end_phys_index-phys_index+1, tells us the numbers of sections in the block, and then we could know the SECTION_SIZE by a division. Not that obvious, but if needed, we could easily add a file telling us section_size or sections_per_block. > > block_size_bytes is supposed to tell you how large the sections are. In > the case where we lumped a bunch of sections together, we also bumped up > block_size_bytes. That's why we currently divide the *ACTUAL* section > number in phys_index/end_phys_index by block_size_bytes. > > That document really needs to be updated to stop referring to sections > (at least in the descriptions of the user interface). We can not change > the units of phys_index/end_phys_index without also changing > block_size_bytes. How about adding a new section_size_bytes together with this patch? Or do you mean we don't need to expose any information related to SECTION to userspace? Thanks, Zhong -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Fri, 2014-04-04 at 10:29 +0900, Yasuaki Ishimatsu wrote: > (2014/04/02 17:56), Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > > > > This patch tries to modify that so the two files could show the start/end > > section number of the memory block. > > > > After this change, output of the above example looks like: > > > > # cat phys_index end_phys_index > > 0320 > > 0327 > > > > Signed-off-by: Li Zhong > > --- > > Good catch! This is a bug. So I think the bug should be fixed. > > Reviewed-by: Yasuaki Ishimatsu Thank you all for the review. If Dave and others don't have further objections, it seems this small userspace incompatibility could be accepted by most of us, and I don't need to make a version 2. Thanks, Zhong > Thanks, > Yasuaki Ishimatsu > > > drivers/base/memory.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..b10f2fa 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->start_section_nr / sections_per_block; > > + phys_index = mem->start_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->end_section_nr / sections_per_block; > > + phys_index = mem->end_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 11:06 +0800, Zhang Yanfei wrote: > On 04/03/2014 10:37 AM, Li Zhong wrote: > > On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: > >> Add ccing > >> > >> On 04/02/2014 04:56 PM, Li Zhong wrote: > >>> I noticed the phys_index and end_phys_index under > >>> /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > >>> (for the test machine, one memory block has 8 sections, that is > >>> sections_per_block equals 8) > >>> > >>> # cd /sys/devices/system/memory/memory100/ > >>> # cat phys_index end_phys_index > >>> 0064 > >>> 0064 > >>> > >>> Seems they should reflect the start/end section number respectively, > >>> which > >>> also matches what is said in Documentation/memory-hotplug.txt > >> > > Hi Yanfei, > > > > Thanks for the review. > > > >> Indeed. I've noticed this before. The value in 'end_phys_index' doesn't > >> match what it really means. But, the name itself is vague, it looks like > >> it is the index of some page frame. (we keep this name for compatibility?) > > > > I guess so, Dave just reminded me that the RFC would also break > > userspace.. > > > > And now my plan is: > > leave the code unchanged > > update the document, state the end_phys_index/phys_index are the same, > > and means the memory block index > > Ah. I doubt whether there is userspace tool which is using the two sysfiles? > for example, the memory100 directory itself can tell us which block it is. > So why there is the two files under it give the same meaning. > > If there is userspace tool using the two files, does it use 'end_phys_index' > in the correct way? That said, if a userspace tool knows what the > 'end_phys_index' > really mean, does it still need it since we have 'phys_index' with the same > value? For the end_phys_index, I totally agree with you. If somebody tries to use it, say as the end section number, he should finally be able to find that the value is not what he expects. But who knows ... For phys_index, I guess it is also reasonable for some userspace tool to just loop for each memoryXXX directory under /sys/devices/system/memory, and use the phys_index as the memory block index for the directory, instead of extracting the XXX from the directory name memoryXXX Don't know whether there are some generic rules for handling such kind of compatibility issues... > > [optional] create two new files start_sec_nr, end_sec_nr if needed > > These two are the really meaningful sysfiles for userspace, IMO. OK, I see. > > > > > Do you have any other suggestions? > > No. I think we should wait for other guys to comment more. OK, let's wait. Thanks, Zhong > > Thanks. > > > > > Thanks, Zhong > > > >> > >> The corresponding member in struct memory_block is: > >> > >> struct memory_block { > >> unsigned long start_section_nr; > >> unsigned long end_section_nr; > >> ... > >> > >> The two members seem to have the right name, and have the right value in > >> kernel. > >> > >> > >>> > >>> This patch tries to modify that so the two files could show the start/end > >>> section number of the memory block. > >>> > >>> After this change, output of the above example looks like: > >>> > >>> # cat phys_index end_phys_index > >>> 0320 > >>> 0327 > >>> > >>> Signed-off-by: Li Zhong > >>> --- > >>> drivers/base/memory.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c > >>> index bece691..b10f2fa 100644 > >>> --- a/drivers/base/memory.c > >>> +++ b/drivers/base/memory.c > >>> @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct > >>> device *dev, > >>> struct memory_block *mem = to_memory_block(dev); > >>> unsigned long phys_index; > >>> > >>> - phys_index = mem->start_section_nr / sections_per_block; > >>> + phys_index = mem->start_section_nr; > >>> return sprintf(buf, "%08lx\n", phys_index); > >>> } > >>> > >>> @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > >>> *dev, > >>> struct memory_block *mem = to_memory_block(dev); > >>> unsigned long phys_index; > >>> > >>> - phys_index = mem->end_section_nr / sections_per_block; > >>> + phys_index = mem->end_section_nr; > >>> return sprintf(buf, "%08lx\n", phys_index); > >>> } > >>> > >>> > >>> > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>> the body of a message to majord...@vger.kernel.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> Please read the FAQ at http://www.tux.org/lkml/ > >>> > >> > >> > > > > > > . > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote: > Add ccing > > On 04/02/2014 04:56 PM, Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > Hi Yanfei, Thanks for the review. > Indeed. I've noticed this before. The value in 'end_phys_index' doesn't > match what it really means. But, the name itself is vague, it looks like > it is the index of some page frame. (we keep this name for compatibility?) I guess so, Dave just reminded me that the RFC would also break userspace.. And now my plan is: leave the code unchanged update the document, state the end_phys_index/phys_index are the same, and means the memory block index [optional] create two new files start_sec_nr, end_sec_nr if needed Do you have any other suggestions? Thanks, Zhong > > The corresponding member in struct memory_block is: > > struct memory_block { > unsigned long start_section_nr; > unsigned long end_section_nr; > ... > > The two members seem to have the right name, and have the right value in > kernel. > > > > > > This patch tries to modify that so the two files could show the start/end > > section number of the memory block. > > > > After this change, output of the above example looks like: > > > > # cat phys_index end_phys_index > > 0320 > > 0327 > > > > Signed-off-by: Li Zhong > > --- > > drivers/base/memory.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index bece691..b10f2fa 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->start_section_nr / sections_per_block; > > + phys_index = mem->start_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device > > *dev, > > struct memory_block *mem = to_memory_block(dev); > > unsigned long phys_index; > > > > - phys_index = mem->end_section_nr / sections_per_block; > > + phys_index = mem->end_section_nr; > > return sprintf(buf, "%08lx\n", phys_index); > > } > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
On Wed, 2014-04-02 at 09:09 -0700, Dave Hansen wrote: > On 04/02/2014 01:56 AM, Li Zhong wrote: > > I noticed the phys_index and end_phys_index under > > /sys/devices/system/memory/memoryXXX/ have the same value, e.g. > > (for the test machine, one memory block has 8 sections, that is > > sections_per_block equals 8) > > > > # cd /sys/devices/system/memory/memory100/ > > # cat phys_index end_phys_index > > 0064 > > 0064 > > > > Seems they should reflect the start/end section number respectively, which > > also matches what is said in Documentation/memory-hotplug.txt > > This changes a user-visible interface. Won't this break userspace? Hi Dave, Hmm, I think so... Thank you for the reminder Do you have some better ideas to fix this? Maybe we should leave the code unchanged, just change the corresponding document (just it seems that the end_phys_index will always be the same as phys_index)? And if the section numbers are really needed, then we could create two new files later, e.g. start_section_nr, end_section_nr? Thanks, Zhong > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number
I noticed the phys_index and end_phys_index under /sys/devices/system/memory/memoryXXX/ have the same value, e.g. (for the test machine, one memory block has 8 sections, that is sections_per_block equals 8) # cd /sys/devices/system/memory/memory100/ # cat phys_index end_phys_index 0064 0064 Seems they should reflect the start/end section number respectively, which also matches what is said in Documentation/memory-hotplug.txt This patch tries to modify that so the two files could show the start/end section number of the memory block. After this change, output of the above example looks like: # cat phys_index end_phys_index 0320 0327 Signed-off-by: Li Zhong --- drivers/base/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bece691..b10f2fa 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem->start_section_nr / sections_per_block; + phys_index = mem->start_section_nr; return sprintf(buf, "%08lx\n", phys_index); } @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev, struct memory_block *mem = to_memory_block(dev); unsigned long phys_index; - phys_index = mem->end_section_nr / sections_per_block; + phys_index = mem->end_section_nr; return sprintf(buf, "%08lx\n", phys_index); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH ] workqueue: add args to workqueue lockdep name
Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in process_one_work(). Maybe #fmt plus #args could be used as the lock_name to give some more information for some fmt string like the above. __builtin_constant_p() check is removed (as there seems no good way to check all the variables in args list). However, by removing the check, it only adds two additional "s for those constants. Some lockdep name examples printed out after the change: lockdep namewq->name "events_long" events_long "%s"("khelper") khelper "xfs-data/%s"mp->m_fsname xfs-data/dm-3 Signed-off-by: Li Zhong --- include/linux/workqueue.h |5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ &__key, __lock_name, ##args); \ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: lockdep: strange %s#5 lock name
On Tue, 2014-02-11 at 10:27 -0500, Tejun Heo wrote: > On Tue, Feb 11, 2014 at 12:00:36PM +0100, Peter Zijlstra wrote: > > > Looks good to me. Can you please post the patch with SOB? > > > > --- > > Subject: workqueue: Fix workqueue lockdep name > > > > Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in > > process_one_work(). It turns out that commit b196be89cdc14 forgot to > > change the lockdep_init_map() when it changed the @lock_name argument > > from a string to a format. > > > > Fixes: b196be89cdc14 ("workqueue: make alloc_workqueue() take printf fmt > > and args for name") > > Reported-by: Tommi Rantala > > Signed-off-by: Peter Zijlstra > > Applied to wq/for-3.14-fixes. Hi Tejun, Peter, I found following lockdep warning caused by this commit in next-tree: [5.251993] [ cut here ] [5.252019] WARNING: CPU: 0 PID: 221 at kernel/locking/lockdep.c:710 __lock_acquire+0x1761/0x1f60() [5.252019] Modules linked in: e1000 [5.252019] CPU: 0 PID: 221 Comm: lvm Not tainted 3.14.0-rc2-next-20140212 #1 [5.252019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [5.252019] 0009 880118e91938 8155fe12 880118e91978 [5.252019] 8105c195 880118e91958 81eb33d0 0002 [5.252019] 880118dd2318 880118e91988 [5.252019] Call Trace: [5.252019] [] dump_stack+0x19/0x1b [5.252019] [] warn_slowpath_common+0x85/0xb0 [5.252019] [] warn_slowpath_null+0x1a/0x20 [5.252019] [] __lock_acquire+0x1761/0x1f60 [5.252019] [] ? mark_held_locks+0xae/0x120 [5.252019] [] ? debug_check_no_locks_freed+0x8e/0x160 [5.252019] [] ? lockdep_init_map+0xac/0x600 [5.252019] [] lock_acquire+0x9a/0x120 [5.252019] [] ? flush_workqueue+0x5/0x750 [5.252019] [] flush_workqueue+0x109/0x750 [5.252019] [] ? flush_workqueue+0x5/0x750 [5.252019] [] ? _raw_spin_unlock_irq+0x30/0x40 [5.252019] [] ? srcu_reschedule+0xe0/0xf0 [5.252019] [] dm_suspend+0xe9/0x1e0 [5.252019] [] dev_suspend+0x1e3/0x270 [5.252019] [] ? table_load+0x350/0x350 [5.252019] [] ctl_ioctl+0x26c/0x510 [5.252019] [] ? __lock_acquire+0x41c/0x1f60 [5.252019] [] ? vtime_account_user+0x98/0xb0 [5.252019] [] dm_ctl_ioctl+0x13/0x20 [5.252019] [] do_vfs_ioctl+0x88/0x570 [5.252019] [] ? __fget_light+0x129/0x150 [5.252019] [] SyS_ioctl+0x91/0xb0 [5.252019] [] tracesys+0xcf/0xd4 [5.252019] ---[ end trace ff1fa506f34be3bc ]--- It seems to me that when the second time alloc_workqueue() is called from the same code path, it would have two locks with the same key, but not the same &wq->name, which doesn't meet lockdep's assumption. Maybe we could use the #fmt plus #args as the lock_name to avoid the above issue? It could also give some more information for some string like %s#5. Draft code attached below. I removed __builtin_constant_p() check (I don't know how to check all the args). However, by removing the check, it only adds two additional two "s for those constants. Some lockdep name examples I printed out after the change: lockdep namewq->name "events_long" events_long "%s"("khelper") khelper "xfs-data/%s"mp->m_fsname xfs-data/dm-3 ... A little bit ugly, not sure whether we could generate some better names here? Thanks, Zhong --- diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..704f4f6 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -419,10 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active, static struct lock_class_key __key; \ const char *__lock_name;\ \ - if (__builtin_constant_p(fmt)) \ - __lock_name = (fmt);\ - else\ - __lock_name = #fmt; \ + __lock_name = #fmt#args;\ \ __alloc_workqueue_key((fmt), (flags), (max_active), \ &__key, __lock_name, ##args); \ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 861d8dd..82ef9f3 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4202,7 +4202,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, INIT_LIST_HEAD(&wq->flusher_overflow); INIT_LIST_HEAD(&wq->maydays); - lockdep_init_map(&wq->lockdep_map, wq->name, key, 0); + lockdep_init_map(&wq->lockdep_map, lock_name, key, 0); INIT_LIST_HEA
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Thu, 2013-08-22 at 16:30 +0930, Rusty Russell wrote: > Greg KH writes: > > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > > struct kobj_type module_ktype = { > > > + .release =module_kobj_release, > > > .sysfs_ops =&module_sysfs_ops, > > > }; > > > > Wait, as there is no release function here for the kobject (a different > > problem), why is the deferred release function causing any problems? > > There is no release function to call, so what is causing the oops? > > Because DEBUG_KOBJECT_RELEASE does the kobject_put() sometime later, > which is what causes the oops. > > Since kobjects don't have an owner field, AFAICT someone *could* grab > one in a module which is unloading, then put it after unload. So this > fixes a real bug, albeit not one seen in the real world. > > Applied, Oh, thank you, Rusty. I just sent out another version... which fix it in another way as Greg suggested, could you please also help to take a look at it? Thanks, Zhong > Rusty. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 next]module: Fix mod->mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(a01600d0 below) is freed together with mod itself by module_free(mod, mod->module_core) in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child('holders' below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. v2: Greg suggested to make the kobject as a pointer. But it seems a little hard to make kobj(kobject) embedded in mkobj(module_kobject) a pointer, as that seems to cause getting the mkobj from the kobj impossible -- to_module_kobject() is used in several places to derive mkobj from its member kobj. So in this version, I made the whole mkobj(module_kobject) as a pointer in mod(module). The mkobj is then allocated in mod_sysfs_init(), and freed in its member kobj's release function -- it seems that there is no mkobj usage after its kobj is released? [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[] [] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [] kobject_del+0x2e/0x40 [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [] process_one_work+0x1e5/0x670 [ 1845.185026] [] ? process_one_work+0x183/0x670 [ 1845.185026] [] worker_thread+0x113/0x370 [ 1845.185026] [] ? rescuer_thread+0x290/0x290 [ 1845.185026] [] kthread+0xda/0xe0 [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [] ret_from_fork+0x7a/0xb0 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [] kobject_put+0x11/0x60 [ 1845.185026] RSP [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong --- drivers/base/core.c| 2 +- drivers/base/module.c | 4 ++-- include/linux/module.h | 2 +- kernel/module.c| 37 + kernel/params.c| 19 +-- 5 files changed, 38 insertions(+), 26 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 09a99d6..b8a1fc6 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1629,7 +1629,7 @@ struct device *__root_device_register(const char *name, struct module *owner) #ifdef CONFIG_MODULES /* gotta find a "cleaner" way to do this */ if (owner) { - struct module_kobject *mk = &owner->mkobj; + struct module_kobject *mk = owner->mkobj; err = sysfs_create_link(&root->dev.kobj, &mk->kobj, "module"); if (err) { diff --git a/driv
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 21:03 -0700, Greg KH wrote: > On Thu, Aug 22, 2013 at 10:34:06AM +0800, Li Zhong wrote: > > On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: > > > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > > > DEBUG_KOBJECT_RELEASE helps to find the issue attached below. > > > > > > > > After some investigation, it seems the reason is: > > > > The mod->mkobj.kobj(a01600d0 below) is freed together with mod > > > > itself in free_module(). However, its children still hold references to > > > > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the > > > > child(holders below) tries to decrease the reference count to its parent > > > > in kobject_del(), BUG happens as it tries to access already freed > > > > memory. > > > > > > Ah, thanks for tracking this down. I had seen this in my local testing, > > > but wasn't able to figure out the offending code. > > > > > > > This patch tries to fix it by waiting for the mod->mkobj.kobj to be > > > > really released in the module removing process (and some error code > > > > paths). > > > > > > Nasty, we should just be freeing the structure in the release function, > > > why doesn't that work? > > > > Hi Greg, > > > > It seems I didn't describe it clearly in the previous mail. I'm trying > > to do it better below: > > > > mod->mkobj.kobj is embedded in module_kobject(not a pointer): > > struct module_kobject { > > struct kobject kobj; > > ... > > > > and allocated with the module memory. So we could see the parent below > > a01600d0 is between MODULES_VADDR (a000) and > > MODULES_END(ff00). > > > > It seem to me that the mkobj.kobj is freed by module_free(mod, > > mod->module_core). > > Ick, you are right. If a kobject is being embedded in an object, it > should control the lifespan of the object, not somewhere else like is > happening here. > > The best solution for this is to make the kobject a pointer, not > embedded in the structure, that will fix this issue, right? Yes, I think so. I'll try to write a fix using this way, thanks for your suggestion. Thanks, Zhong > > thanks, > > greg k-h > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
On Wed, 2013-08-21 at 09:18 -0700, Greg KH wrote: > On Wed, Aug 21, 2013 at 05:49:58PM +0800, Li Zhong wrote: > > DEBUG_KOBJECT_RELEASE helps to find the issue attached below. > > > > After some investigation, it seems the reason is: > > The mod->mkobj.kobj(a01600d0 below) is freed together with mod > > itself in free_module(). However, its children still hold references to > > it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the > > child(holders below) tries to decrease the reference count to its parent > > in kobject_del(), BUG happens as it tries to access already freed memory. > > Ah, thanks for tracking this down. I had seen this in my local testing, > but wasn't able to figure out the offending code. > > > This patch tries to fix it by waiting for the mod->mkobj.kobj to be > > really released in the module removing process (and some error code > > paths). > > Nasty, we should just be freeing the structure in the release function, > why doesn't that work? Hi Greg, It seems I didn't describe it clearly in the previous mail. I'm trying to do it better below: mod->mkobj.kobj is embedded in module_kobject(not a pointer): struct module_kobject { struct kobject kobj; ... and allocated with the module memory. So we could see the parent below a01600d0 is between MODULES_VADDR (a000) and MODULES_END(ff00). It seem to me that the mkobj.kobj is freed by module_free(mod, mod->module_core). So in free_module(), we need delay the call of module_free(), until mkobj.kobj finishes its cleanup. > > [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, > > parent a01600d0 (delayed) > > [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent > > a01600d0 (delayed) > > [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, > > parent a01600d0 > > [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup > > kobject_del > > [ 1845.184120] BUG: unable to handle kernel paging request at > > a01601d0 > > [ 1845.185026] IP: [] kobject_put+0x11/0x60 > > [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 > > [ 1845.185026] Oops: [#1] PREEMPT > > [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: > > kprobe_example] > > [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O > > 3.11.0-rc6-next-20130819+ #1 > > [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 > > [ 1845.185026] Workqueue: events kobject_delayed_cleanup > > [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: > > 88007ca5c000 > > [ 1845.185026] RIP: 0010:[] [] > > kobject_put+0x11/0x60 > > [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 > > [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: > > 8177d638 > > [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: > > a01600d0 > > [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: > > > > [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: > > 81a95040 > > [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: > > > > [ 1845.185026] FS: () GS:81a23000() > > knlGS: > > [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b > > [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: > > 06b0 > > [ 1845.185026] Stack: > > [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 > > 812cdb7e > > [ 1845.185026] 88007c1f1640 88007ca5dd68 > > 812cdbfe > > [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 > > > > [ 1845.185026] Call Trace: > > [ 1845.185026] [] kobject_del+0x2e/0x40 > > [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 > > [ 1845.185026] [] process_one_work+0x1e5/0x670 > > [ 1845.185026] [] ? process_one_work+0x183/0x670 > > [ 1845.185026] [] worker_thread+0x113/0x370 > > [ 1845.185026] [] ? rescuer_thread+0x290/0x290 > > [ 1845.185026] [] kthread+0xda/0xe0 > > [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 > > [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 > > [ 1845.185026] [] ret_from_fork+0x7a/0xb0 > > [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 > > [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9
[RFC PATCH next]module: Fix mod->mkobj.kobj potentially freed too early
DEBUG_KOBJECT_RELEASE helps to find the issue attached below. After some investigation, it seems the reason is: The mod->mkobj.kobj(a01600d0 below) is freed together with mod itself in free_module(). However, its children still hold references to it, as the delay caused by DEBUG_KOBJECT_RELEASE. So when the child(holders below) tries to decrease the reference count to its parent in kobject_del(), BUG happens as it tries to access already freed memory. This patch tries to fix it by waiting for the mod->mkobj.kobj to be really released in the module removing process (and some error code paths). [ 1844.175287] kobject: 'holders' (88007c1f1600): kobject_release, parent a01600d0 (delayed) [ 1844.178991] kobject: 'notes' (8800370b2a00): kobject_release, parent a01600d0 (delayed) [ 1845.180118] kobject: 'holders' (88007c1f1600): kobject_cleanup, parent a01600d0 [ 1845.182130] kobject: 'holders' (88007c1f1600): auto cleanup kobject_del [ 1845.184120] BUG: unable to handle kernel paging request at a01601d0 [ 1845.185026] IP: [] kobject_put+0x11/0x60 [ 1845.185026] PGD 1a13067 PUD 1a14063 PMD 7bd30067 PTE 0 [ 1845.185026] Oops: [#1] PREEMPT [ 1845.185026] Modules linked in: xfs libcrc32c [last unloaded: kprobe_example] [ 1845.185026] CPU: 0 PID: 18 Comm: kworker/0:1 Tainted: G O 3.11.0-rc6-next-20130819+ #1 [ 1845.185026] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 1845.185026] Workqueue: events kobject_delayed_cleanup [ 1845.185026] task: 88007ca51f00 ti: 88007ca5c000 task.ti: 88007ca5c000 [ 1845.185026] RIP: 0010:[] [] kobject_put+0x11/0x60 [ 1845.185026] RSP: 0018:88007ca5dd08 EFLAGS: 00010282 [ 1845.185026] RAX: 2000 RBX: a01600d0 RCX: 8177d638 [ 1845.185026] RDX: 88007ca5dc18 RSI: RDI: a01600d0 [ 1845.185026] RBP: 88007ca5dd18 R08: 824e9810 R09: [ 1845.185026] R10: 8800 R11: dead4ead0001 R12: 81a95040 [ 1845.185026] R13: 88007b27a960 R14: 88007c1f1600 R15: [ 1845.185026] FS: () GS:81a23000() knlGS: [ 1845.185026] CS: 0010 DS: ES: CR0: 8005003b [ 1845.185026] CR2: a01601d0 CR3: 37207000 CR4: 06b0 [ 1845.185026] Stack: [ 1845.185026] 88007c1f1600 88007c1f1600 88007ca5dd38 812cdb7e [ 1845.185026] 88007c1f1640 88007ca5dd68 812cdbfe [ 1845.185026] 88007c974800 88007c1f1640 88007ff61a00 [ 1845.185026] Call Trace: [ 1845.185026] [] kobject_del+0x2e/0x40 [ 1845.185026] [] kobject_delayed_cleanup+0x6e/0x1d0 [ 1845.185026] [] process_one_work+0x1e5/0x670 [ 1845.185026] [] ? process_one_work+0x183/0x670 [ 1845.185026] [] worker_thread+0x113/0x370 [ 1845.185026] [] ? rescuer_thread+0x290/0x290 [ 1845.185026] [] kthread+0xda/0xe0 [ 1845.185026] [] ? _raw_spin_unlock_irq+0x30/0x60 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] [] ret_from_fork+0x7a/0xb0 [ 1845.185026] [] ? kthread_create_on_node+0x130/0x130 [ 1845.185026] Code: 81 48 c7 c7 28 95 ad 81 31 c0 e8 9b da 01 00 e9 4f ff ff ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 85 ff 74 1d 87 00 01 00 00 01 74 1e 48 8d 7b 38 83 6b 38 01 0f 94 c0 84 [ 1845.185026] RIP [] kobject_put+0x11/0x60 [ 1845.185026] RSP [ 1845.185026] CR2: a01601d0 [ 1845.185026] ---[ end trace 49a70afd109f5653 ]--- Signed-off-by: Li Zhong --- include/linux/module.h | 1 + kernel/module.c| 14 +++--- kernel/params.c| 7 +++ 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 504035f..05f2447 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -42,6 +42,7 @@ struct module_kobject { struct module *mod; struct kobject *drivers_dir; struct module_param_attrs *mp; + struct completion *kobj_completion; }; struct module_attribute { diff --git a/kernel/module.c b/kernel/module.c index 2069158..b5e2baa 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1611,6 +1611,14 @@ static void module_remove_modinfo_attrs(struct module *mod) kfree(mod->modinfo_attrs); } +static void mod_kobject_put(struct module *mod) +{ + DECLARE_COMPLETION_ONSTACK(c); + mod->mkobj.kobj_completion = &c; + kobject_put(&mod->mkobj.kobj); + wait_for_completion(&c); +} + static int mod_sysfs_init(struct module *mod) { int err; @@ -1638,7 +1646,7 @@ static int mod_sysfs_init(struct module *mod) err = kobject_init_and_add(&mod->mkobj.kobj, &module_ktype, NULL, "%s", mod->name); if (err) - kobject_put(&mod
[PATCH] nohz: fix compile warning in tick_nohz_init()
cpu is not used after commit 5b8621a68fdcd2baf1d3b413726f913a5254d46a Signed-off-by: Li Zhong --- kernel/time/tick-sched.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index e80183f..8145860 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -343,8 +343,6 @@ static int tick_nohz_init_all(void) void __init tick_nohz_init(void) { - int cpu; - if (!have_nohz_full_mask) { if (tick_nohz_init_all() < 0) return; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] vfs: partial revoke implementation suggested by Al Viro
This patch tries to partially implement what Al Viro suggested in https://lkml.org/lkml/2013/4/5/15 Code also mostly copied from the above link Signed-off-by: Li Zhong --- fs/Makefile|2 +- fs/revoke.c| 133 include/linux/fs.h |2 + include/linux/revoke.h | 50 ++ 4 files changed, 186 insertions(+), 1 deletion(-) create mode 100644 fs/revoke.c create mode 100644 include/linux/revoke.h diff --git a/fs/Makefile b/fs/Makefile index 4fe6df3..af0a622 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o splice.o sync.o utimes.o \ - stack.o fs_struct.o statfs.o + stack.o fs_struct.o statfs.o revoke.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/fs/revoke.c b/fs/revoke.c new file mode 100644 index 000..bcda9ba --- /dev/null +++ b/fs/revoke.c @@ -0,0 +1,133 @@ +#include +#include +#include + +bool __start_using(struct revoke *revoke) +{ + struct revokable *r; + rcu_read_lock(); + r = rcu_dereference(revoke->revokable); + if (unlikely(!r)) { + rcu_read_unlock(); + return false; /* revoked */ + } + + if (likely(atomic_inc_unless_negative(&r->in_use))) { + rcu_read_unlock(); + return true;/* we are using it now */ + } + + rcu_read_unlock(); + return false; /* it's being revoked right now */ +} + +#define BIAS (-1U<<31) + +void __stop_using(struct revoke *revoke) +{ + struct revokable *r; + r = rcu_dereference_protected(revoke->revokable, 1); + BUG_ON(!r); + if (atomic_dec_return(&r->in_use) == BIAS) + complete(r->c); +} + +/* called with r->lock held by caller, unlocks it */ +static void __release_revoke(struct revokable *r, struct revoke *revoke) +{ + if (revoke->closing) { + DECLARE_COMPLETION_ONSTACK(c); + revoke->c = &c; + spin_unlock(&r->lock); + wait_for_completion(&c); + } else { + struct file *file; + revoke->closing = 1; + spin_unlock(&r->lock); + file = revoke->file; + if (file->f_op->release) + file->f_op->release(file_inode(file), file); + spin_lock(&r->lock); + hlist_del_init(&revoke->node); + rcu_assign_pointer(revoke->revokable, NULL); + rcu_read_lock();/* prevent freeing of r */ + if (revoke->c) + complete(revoke->c); + spin_unlock(&r->lock); + rcu_read_unlock(); + } +} + +void release_revoke(struct revoke *revoke) +{ + struct revokable *r; + rcu_read_lock(); + r = rcu_dereference(revoke->revokable); + if (!r) { + /* already closed by revokation */ + rcu_read_unlock(); + goto out; + } + + spin_lock(&r->lock); + if (unlikely(hlist_unhashed(&revoke->node))) { + /* just been revoked */ + spin_unlock(&r->lock); + rcu_read_unlock(); + goto out; + } + + /* +* Ok, revoke_it() couldn't have been finished yet +* it'll have to get r->lock before it's through, so +* we can drop rcu_read_lock +*/ + rcu_read_unlock(); + __release_revoke(r, revoke); +out: + kfree(revoke); +} + +void revoke_it(struct revokable *r) +{ + DECLARE_COMPLETION_ONSTACK(c); + r->c = &c; + if (atomic_add_return(BIAS, &r->in_use) != BIAS) { + if (r->kick) + r->kick(r); + wait_for_completion(&c); + } + + while (1) { + struct hlist_node *p; + spin_lock(&r->lock); + p = r->list.first; + if (!p) + break; + __release_revoke(r, hlist_entry(p, struct revoke, node)); + } + spin_unlock(&r->lock); +} + +int make_revokable(struct file *f, struct revokable *r) +{ + struct revoke *revoke = kzalloc(sizeof(struct revoke), GFP_KERNEL); + if (!revoke) + return -ENOMEM; + + if (!atomic_inc_unless_negative(&r->in_use)) { + kfree(revoke); + return -ENOENT; + } + + revoke->file = f; + revoke->revokable = r; + f->f_re
[PATCH 2/2] proc: covert procfs to use the general revoke implementation
This patch tries to replace the current revoke logic in procfs with the implementation suggested by Al Viro in https://lkml.org/lkml/2013/4/5/15 Below is the replacement guideline copied from that mail: procfs would have struct revokable embedded into proc_dir_entry, with freeing of those guys RCUd. It would set ->f_op to ->proc_fops and call make_revokable(file, &pde->revokable) in proc_reg_open(); no wrappers for other methods needed anymore. All file_operations instances fed to proc_create() et.al. would lose ->owner - it's already not needed for those, actually. remove_proc_entry()/remove_proc_subtree() would call revoke_it() on everything we are removing. Signed-off-by: Li Zhong --- fs/compat_ioctl.c |8 +- fs/eventpoll.c | 10 ++- fs/file_table.c| 13 ++- fs/ioctl.c |7 +- fs/proc/generic.c | 12 +-- fs/proc/inode.c| 229 +--- fs/proc/internal.h |9 +-- fs/read_write.c| 30 +-- fs/select.c| 11 ++- mm/mmap.c |8 +- mm/nommu.c | 16 +++- 11 files changed, 111 insertions(+), 242 deletions(-) diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c index 5d19acf..48da3bf 100644 --- a/fs/compat_ioctl.c +++ b/fs/compat_ioctl.c @@ -110,6 +110,7 @@ #include #include +#include #ifdef CONFIG_SPARC #include @@ -1584,7 +1585,12 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, default: if (f.file->f_op && f.file->f_op->compat_ioctl) { - error = f.file->f_op->compat_ioctl(f.file, cmd, arg); + if (likely(start_using(f.file))) { + error = f.file->f_op->compat_ioctl(f.file, + cmd, arg); + stop_using(f.file); + } else + error = -ENOTTY; if (error != -ENOIOCTLCMD) goto out_fput; } diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 9ad17b15..c73e8af 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -42,6 +42,7 @@ #include #include #include +#include /* * LOCKING: @@ -776,9 +777,16 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file) static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt) { + unsigned int rv = DEFAULT_POLLMASK; pt->_key = epi->event.events; - return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events; + if (likely(start_using(epi->ffd.file))) { + rv = epi->ffd.file->f_op->poll(epi->ffd.file, pt) + & epi->event.events; + stop_using(epi->ffd.file); + } + + return rv; } static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, diff --git a/fs/file_table.c b/fs/file_table.c index 08e719b..b3d55e0 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -26,6 +26,7 @@ #include #include #include +#include #include @@ -244,14 +245,20 @@ static void __fput(struct file *file) file->f_op->fasync(-1, file, 0); } ima_file_free(file); - if (file->f_op && file->f_op->release) - file->f_op->release(inode, file); + if (file->f_op && file->f_op->release) { + if (likely(!(file->f_revoke))) + file->f_op->release(inode, file); + else + release_revoke(file->f_revoke); + } security_file_free(file); if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL && !(file->f_mode & FMODE_PATH))) { cdev_put(inode->i_cdev); } - fops_put(file->f_op); + if (likely(!(file->f_revoke))) + fops_put(file->f_op); + put_pid(file->f_owner.pid); if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) i_readcount_dec(inode); diff --git a/fs/ioctl.c b/fs/ioctl.c index fd507fb..7610377 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -15,6 +15,7 @@ #include #include #include +#include #include @@ -40,7 +41,11 @@ static long vfs_ioctl(struct file *filp, unsigned int cmd, if (!filp->f_op || !filp->f_op->unlocked_ioctl) goto out; - error = filp->f_op->unlocked_ioctl(filp, cmd, arg); + if (likely(start_using(filp))) { + error = filp->f_op->unlocked_ioctl(filp, cmd, arg); + stop_using(filp); + } + if (error == -ENOIOCTLCMD) error = -ENOTTY; out: diff --git a/fs
[RFC PATCH 0/2] partial revoke implementation for procfs
Hi Al Viro, I tried to implement what you proposed in: https://lkml.org/lkml/2013/4/5/15 When you have time, would you please help to take a look at it, and give your comments? I sent the draft early, just want to make sure I didn't misunderstand anything in your proposal, and what I did and next step plan are not heading the wrong direction. This is only an initial draft to do only what procfs is needed 1. didn't do anything about kick, mmap, fasync, ... and something you mentioned in following mails 2. I only wrapped f_ops in vfs which are used by procfs ( but maybe I still missed something). 3. It seems all f_ops are const, so I couldn't easliy clear the pointer of ->owner, maybe that needs every calling site of proc_create(_data) to make sure the proc_fops doesn't have ->owner set? Currently, I added an ugly check in __fput, so if if ->f_revoke is set in file, we don't call fput_ops; and in proc_reg_open(), restore the old ->f_op if make_revokable() fails. patch 1: adding the implementation proposed in your mail patch 2: convert procfs to use this implementation If there aren't any big issues, I plan to look for another file system (with backing device) to try other things that's not implemented this time. Thanks, Zhong Li Zhong (2): vfs: partial revoke implementation suggested by Al Viro proc: covert procfs to use the general revoke implementation fs/Makefile|2 +- fs/compat_ioctl.c |8 +- fs/eventpoll.c | 10 ++- fs/file_table.c| 13 ++- fs/ioctl.c |7 +- fs/proc/generic.c | 12 +-- fs/proc/inode.c| 229 fs/proc/internal.h |9 +- fs/read_write.c| 30 +-- fs/revoke.c| 133 fs/select.c| 11 ++- include/linux/fs.h |2 + include/linux/revoke.h | 50 +++ mm/mmap.c |8 +- mm/nommu.c | 16 +++- 15 files changed, 297 insertions(+), 243 deletions(-) create mode 100644 fs/revoke.c create mode 100644 include/linux/revoke.h -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH nohz] return NOTIFY_BAD in cpu down call back to stop offlining the cpu
In tick_nohz_cpu_down_callback() if the cpu is the one handling timekeeping , it seems that we should return something that could stop notify CPU_DOWN_PREPARE, and then start notify CPU_DOWN_FAILED on the already called notifier call backs. -EINVAL will be converted to 0 by notifier_to_errno(), then the cpu would be taken down with part of the DOWN_PREPARE notifier callbacks called, and something bad could happen after that. Signed-off-by: Li Zhong --- kernel/time/tick-sched.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index bc67d42..17b8155 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -306,7 +306,7 @@ static int __cpuinit tick_nohz_cpu_down_callback(struct notifier_block *nfb, * we can't safely shutdown that CPU. */ if (have_nohz_full_mask && tick_do_timer_cpu == cpu) - return -EINVAL; + return NOTIFY_BAD; break; } return NOTIFY_OK; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
On Tue, 2013-05-14 at 16:13 +0200, Frederic Weisbecker wrote: > On Fri, May 10, 2013 at 05:12:26PM -0400, Steven Rostedt wrote: > > +/* > > + * This is a entry point to the scheduler() just before going > > + * back to user space. This is called with irqs disabled > > + * which prevents races with the CONTEXT_TRACKING updates. > > + */ > > +asmlinkage void __sched schedule_preempt_user(void) > > +{ > > + enum ctx_state prev_state; > > + > > + prev_state = exception_enter(); > > + > > + local_irq_enable(); > > + __schedule(); > > + local_irq_disable(); > > + > > + exception_exit(prev_state); > > +} > > Ok I just had a look at how ARM and PPC64 are handling user preemption and it > seems > that irqs are disabled around the call to schedule() on these archs too. > Although > do_work_pending() in ARM surprisingly doesn't re-enable irqs before calling > schedule? > > Anyway having irqs disabled around user preemption seem to be a requirement > to make > sure the TIF_NEED_RESCHED check is not racy against irqs and return to > userspace. > So I guess we can keep the above function as it is. > > But perhaps we should queue this for 3.11 given that it's a bit of a > sensitive change > in the x86 user return path. > > Look, I'm just going to make a seperate pull request with this patch based on > 3.10-rc1 > and let Ingo choose the target. Could the deletion of schedule_user() be separated to another patch. So for other archs, such as ppc64, we could have both schedule_preempt_user() and schedule_user() there for the conversion? Or are there some better ways to avoid the conflict with arch trees? Thanks, Zhong > > (Meanwhile I still think it would be a good idea to keep LOCKDEP_SYS_EXIT in > the loop :-) > > Thanks! > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
On Mon, 2013-05-13 at 11:03 -0400, Steven Rostedt wrote: > On Mon, 2013-05-13 at 17:56 +0800, Li Zhong wrote: > > > > All this before schedule_user() was able to call user_exit() and take us > > > out of dynamic tick user context. > > > > Maybe we could just disable function trace for schedule_user()? > > > > It seems that function trace might use RCU, at least in __schedule() > > when calling preempt_enable(), and if preemption is really enabled. > > There is a small use of rcu scheduling (preempt disable/enable) in > function tracing. One is for perf and dynamic call traces and some hash > code (multiple users of the function tracer). But those are the unique > cases, and I really do not want to remove this function just for the non > common case. Function tracing has proven extremely useful in debugging > RCU and context tracking. By adding notrace, it's a lost cause. OK, so it seems a bad idea to disable trace here. However, as schedule is traced, so we might get something like .schedule <-.schedule_user to know that schedule_user is called? > > Not to mention, adding trace_printk() there will break it too, and that > does not use any RCU but still disables preemption because the writing > to the ring buffer requires staying on the same CPU. > I guess you mean the preemption enable after buffer written might cause a task migration, and break the pair of user_enter/user_exit? Then how about adding trace_printk() after user_exit() there? > > > > user_exit() is used to allow RCU usage after that (with rcu_user_exit). > > RCU usage is not allowed before that because cpu is in user eqs. And if > > function trace needs use RCU, then it seems user_exit() itself or its > > callers couldn't be function traced. > > And it can't be debugged either. > > I can probably black list those functions manually, such that only the > main function tracer can trace it, all others will not. In otherwords, I > can have it such that function tracer will not trace those functions for > perf, dynamic function tracers, or anything that requires changing the > function list. I'm totally lost. Need some more time to read the trace code (please forgive me for the silly questions in this mail...) By "main function tracer", do you mean the function_trace_call()? And I don't understand how to black list some functions manually? If function_trace_call() is the main function tracer you mentioned, there is preemption disable/enable in it. schedule might happen where preemption is enabled. Or maybe you mean we should make all those functions called with preemption disabled(or irqs disabled)? Then syscall_trace_enter(), syscall_trace_leave(), do_notify_resume(), need be converted the similar way as schedule_preempt_user() did? Thanks, Zhong > I could probably also add a heavier weight synchronize_sched() that even > synchronizes cpus in userspace. > > > > > If __schedule() in preempt_enable() is the only place function trace > > uses RCU, then after converting to schedule_preempt_user(), it is safe > > as irqs are disabled for __schedule() to happen. But user_exit() itself > > and some other callers might still need function trace be disabled. > > The above makes no sense to me. function tracing didn't break, but the > user_exit() did because of a preemption in the wrong place, as there was > no protection there to prevent an unnecessary preemption. > > > -- Steve > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4 1/5] powerpc: Syscall hooks for context tracking subsystem
This is the syscall slow path hooks for context tracking subsystem, corresponding to [PATCH] x86: Syscall hooks for userspace RCU extended QS commit bf5a3c13b939813d28ce26c01425054c740d6731 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is better for it to be in the same 16 bits with others in the group, so in the asm code, andi. with this group could work. Signed-off-by: Li Zhong Acked-by: Frederic Weisbecker --- arch/powerpc/include/asm/thread_info.h |7 +-- arch/powerpc/kernel/ptrace.c |5 + 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index 8ceea14..ba7b197 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_PERFMON_CTXSW 6 /* perfmon needs ctxsw calls */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SINGLESTEP 8 /* singlestepping active */ -#define TIF_MEMDIE 9 /* is terminating due to OOM killer */ +#define TIF_NOHZ 9 /* in adaptive nohz mode */ #define TIF_SECCOMP10 /* secure computing */ #define TIF_RESTOREALL 11 /* Restore all regs (implies NOERROR) */ #define TIF_NOERROR12 /* Force successful syscall return */ @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_SYSCALL_TRACEPOINT 15 /* syscall tracepoint instrumentation */ #define TIF_EMULATE_STACK_STORE16 /* Is an instruction emulation for stack store? */ +#define TIF_MEMDIE 17 /* is terminating due to OOM killer */ /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1< #include #include +#include #include #include @@ -1788,6 +1789,8 @@ long do_syscall_trace_enter(struct pt_regs *regs) { long ret = 0; + user_exit(); + secure_computing_strict(regs->gpr[0]); if (test_thread_flag(TIF_SYSCALL_TRACE) && @@ -1832,4 +1835,6 @@ void do_syscall_trace_leave(struct pt_regs *regs) step = test_thread_flag(TIF_SINGLESTEP); if (step || test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall_exit(regs, step); + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4 0/5] powerpc: Support context tracking for Power pSeries
These patches try to support context tracking for Power arch, beginning with 64-bit pSeries. The codes are ported from that of the x86_64, and in each patch, I listed the corresponding patch for x86. v4: fixed some cosmetic issues suggested by Ben. Li Zhong (5): powerpc: Syscall hooks for context tracking subsystem powerpc: Exception hooks for context tracking subsystem powerpc: Exit user context on notify resume powerpc: Use the new schedule_user API on userspace preemption powerpc: select HAVE_CONTEXT_TRACKING for pSeries arch/powerpc/include/asm/context_tracking.h | 10 arch/powerpc/include/asm/thread_info.h |7 ++- arch/powerpc/kernel/entry_64.S |3 +- arch/powerpc/kernel/ptrace.c|5 ++ arch/powerpc/kernel/signal.c|5 ++ arch/powerpc/kernel/traps.c | 80 +++ arch/powerpc/mm/fault.c | 41 +- arch/powerpc/mm/hash_utils_64.c | 36 +--- arch/powerpc/platforms/pseries/Kconfig |1 + 9 files changed, 140 insertions(+), 48 deletions(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4 5/5] powerpc: select HAVE_CONTEXT_TRACKING for pSeries
Start context tracking support from pSeries. Signed-off-by: Li Zhong --- arch/powerpc/platforms/pseries/Kconfig |1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9a0941b..023b288 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -18,6 +18,7 @@ config PPC_PSERIES select PPC_PCI_CHOICE if EXPERT select ZLIB_DEFLATE select PPC_DOORBELL + select HAVE_CONTEXT_TRACKING default y config PPC_SPLPAR -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4 2/5] powerpc: Exception hooks for context tracking subsystem
This is the exception hooks for context tracking subsystem, including data access, program check, single step, instruction breakpoint, machine check, alignment, fp unavailable, altivec assist, unknown exception, whose handlers might use RCU. This patch corresponds to [PATCH] x86: Exception hooks for userspace RCU extended QS commit 6ba3c97a38803883c2eee489505796cb0a727122 But after the exception handling moved to generic code, and some changes in following two commits: 56dd9470d7c8734f055da2a6bac553caf4a468eb context_tracking: Move exception handling to generic code 6c1e0256fad84a843d915414e4b5973b7443d48d context_tracking: Restore correct previous context state on exception exit it is able for exception hooks to use the generic code above instead of a redundant arch implementation. Signed-off-by: Li Zhong --- arch/powerpc/kernel/traps.c | 80 --- arch/powerpc/mm/fault.c | 41 +--- arch/powerpc/mm/hash_utils_64.c | 36 +- 3 files changed, 112 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 83efa2f..a7a648f 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -667,6 +668,7 @@ int machine_check_generic(struct pt_regs *regs) void machine_check_exception(struct pt_regs *regs) { + enum ctx_state prev_state = exception_enter(); int recover = 0; __get_cpu_var(irq_stat).mce_exceptions++; @@ -683,7 +685,7 @@ void machine_check_exception(struct pt_regs *regs) recover = cur_cpu_spec->machine_check(regs); if (recover > 0) - return; + goto bail; #if defined(CONFIG_8xx) && defined(CONFIG_PCI) /* the qspan pci read routines can cause machine checks -- Cort @@ -693,20 +695,23 @@ void machine_check_exception(struct pt_regs *regs) * -- BenH */ bad_page_fault(regs, regs->dar, SIGBUS); - return; + goto bail; #endif if (debugger_fault_handler(regs)) - return; + goto bail; if (check_io_access(regs)) - return; + goto bail; die("Machine check", regs, SIGBUS); /* Must die if the interrupt is not recoverable */ if (!(regs->msr & MSR_RI)) panic("Unrecoverable Machine check"); + +bail: + exception_exit(prev_state); } void SMIException(struct pt_regs *regs) @@ -716,20 +721,29 @@ void SMIException(struct pt_regs *regs) void unknown_exception(struct pt_regs *regs) { + enum ctx_state prev_state = exception_enter(); + printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); _exception(SIGTRAP, regs, 0, 0); + + exception_exit(prev_state); } void instruction_breakpoint_exception(struct pt_regs *regs) { + enum ctx_state prev_state = exception_enter(); + if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) - return; + goto bail; if (debugger_iabr_match(regs)) - return; + goto bail; _exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip); + +bail: + exception_exit(prev_state); } void RunModeException(struct pt_regs *regs) @@ -739,15 +753,20 @@ void RunModeException(struct pt_regs *regs) void __kprobes single_step_exception(struct pt_regs *regs) { + enum ctx_state prev_state = exception_enter(); + clear_single_step(regs); if (notify_die(DIE_SSTEP, "single_step", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) - return; + goto bail; if (debugger_sstep(regs)) - return; + goto bail; _exception(SIGTRAP, regs, TRAP_TRACE, regs->nip); + +bail: + exception_exit(prev_state); } /* @@ -1005,6 +1024,7 @@ int is_valid_bugaddr(unsigned long addr) void __kprobes program_check_exception(struct pt_regs *regs) { + enum ctx_state prev_state = exception_enter(); unsigned int reason = get_reason(regs); extern int do_mathemu(struct pt_regs *regs); @@ -1014,26 +1034,26 @@ void __kprobes program_check_exception(struct pt_regs *regs) if (reason & REASON_FP) { /* IEEE FP exception */ parse_fpe(regs); - return; + goto bail; } if (reason & REASON_TRAP) { /* Debugger is first in line to stop recursive faults in * rcu_lock, notify_die, or atomic_notifier_call_chain */ if (debugger_bpt(regs)) - ret
[RFC PATCH v4 4/5] powerpc: Use the new schedule_user API on userspace preemption
This patch corresponds to [PATCH] x86: Use the new schedule_user API on userspace preemption commit 0430499ce9d78691f3985962021b16bf8f8a8048 Signed-off-by: Li Zhong --- arch/powerpc/include/asm/context_tracking.h | 10 ++ arch/powerpc/kernel/entry_64.S |3 ++- 2 files changed, 12 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h diff --git a/arch/powerpc/include/asm/context_tracking.h b/arch/powerpc/include/asm/context_tracking.h new file mode 100644 index 000..b6f5a33 --- /dev/null +++ b/arch/powerpc/include/asm/context_tracking.h @@ -0,0 +1,10 @@ +#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H +#define _ASM_POWERPC_CONTEXT_TRACKING_H + +#ifdef CONFIG_CONTEXT_TRACKING +#define SCHEDULE_USER bl .schedule_user +#else +#define SCHEDULE_USER bl .schedule +#endif + +#endif diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 915fbb4..d418977 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -33,6 +33,7 @@ #include #include #include +#include /* * System calls. @@ -634,7 +635,7 @@ _GLOBAL(ret_from_except_lite) andi. r0,r4,_TIF_NEED_RESCHED beq 1f bl .restore_interrupts - bl .schedule + SCHEDULE_USER b .ret_from_except_lite 1: bl .save_nvgprs -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v4 3/5] powerpc: Exit user context on notify resume
This patch allows RCU usage in do_notify_resume, e.g. signal handling. It corresponds to [PATCH] x86: Exit RCU extended QS on notify resume commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6 Signed-off-by: Li Zhong --- arch/powerpc/kernel/signal.c |5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index cf12eae..d63b502 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs) void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) { + user_exit(); + if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); } + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
On Fri, 2013-05-10 at 17:12 -0400, Steven Rostedt wrote: > plain text document attachment (fix-user-exit-preempt.patch) > I started testing the new NOHZ_FULL in the kernel and had some issues, > so I started function tracing and this bug report came out: > > > [23446.458073] [ cut here ] > [23446.461028] WARNING: > at /home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388 > rcu_eqs_enter+0x4b/0x89() > [23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge > stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > ip6_tables ipv6 uinput snd_hda_codec_id > t snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm > kvm_intel snd_timer snd kvm soundcore shpchp snd_page_alloc microcode > i2c_i801 pata_acpi firewire_ohci firewi > re_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit > i2c_core video > [23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11 > [23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By > O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 > [23446.466096] 814879f0 81004fa6 810337af > 88007d40 > [23446.466096] 88007d40d900 > 81a01f00 > [23446.466096] 81acba00 88007d6495c0 810a1f36 > 0a28 > [23446.466096] Call Trace: > [23446.466096] [] ? dump_stack+0xd/0x17 > [23446.466096] [] ? show_stack+0x5/0x3e > [23446.466096] [] ? warn_slowpath_common+0x64/0x7c > [23446.466096] [] ? rcu_eqs_enter+0x4b/0x89 > [23446.466096] [] ? rcu_idle_enter+0x1a/0x30 > [23446.466096] [] ? ftrace_graph_caller+0x85/0x85 > [23446.466096] [] cpu_startup_entry+0xb3/0x11c > [23446.466096] [] ? start_kernel+0x41a/0x425 > [23446.466096] [] ? repair_env_string+0x54/0x54 > [23446.466096] ---[ end trace ddbb69ae2a0f6687 ]--- > [23446.466096] [ cut here ] > > This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit(). > Adding more tracing, I also discovered that there seemed to be an > imbalance in user_exit() and user_enter(). Specifically, I found that > user_enter() was called, and then a migration happened and user_exit() > was never called before the schedule. Then I had noticed this in my > trace: > > run-isol-11546 1.N.. 37363.637641: function: schedule_user <-- > retint_careful > run-isol-11546 1.N.. 37363.637641: function: __schedule <-- > preempt_schedule > > The funny part here is that schedule_user does not call > preempt_schedule. But then I also noticed the flags of the call to > schedule_user(): "1.N..". This shows that it was running on cpu 1 with > NEED_RESCHED set (N), but both interrupts and preemption are enabled. If > an interrupt came in here we can schedule out again, but that would call > preempt_schedule_irq, which has code to check if its in "user context > (according to dynamic_ticks)" and would do the user_exit() too. But > something didn't seem right here. > > Then I realized that it was actually the function tracer itself, as for > every function it traces, it calls preempt_disable() and > preempt_enable() to record the data on a per_cpu buffer. That > preempt_enable() noticed the NEED_RESCHED set and since preemption is > enabled, it kindly called preempt_schedule() for us! > > All this before schedule_user() was able to call user_exit() and take us > out of dynamic tick user context. Maybe we could just disable function trace for schedule_user()? It seems that function trace might use RCU, at least in __schedule() when calling preempt_enable(), and if preemption is really enabled. user_exit() is used to allow RCU usage after that (with rcu_user_exit). RCU usage is not allowed before that because cpu is in user eqs. And if function trace needs use RCU, then it seems user_exit() itself or its callers couldn't be function traced. If __schedule() in preempt_enable() is the only place function trace uses RCU, then after converting to schedule_preempt_user(), it is safe as irqs are disabled for __schedule() to happen. But user_exit() itself and some other callers might still need function trace be disabled. Thanks, Zhong > > I even verified this by adding a: > > preempt_disable(); > preempt_enable(); > > just before the user_exit() and ran a test without doing any tracing and > was able to easily hit this bug. > > I then looked at the code in entry_64.S, and noticed that the calls to > SCHEDULE_USER were all encompassed with ENABLE_INTERRUPTS and > DISABLE_INTERRUPTS before calling schedule. And the SCHEDULE_USER macro > is different if we have CONTEXT_TRACKING enabled or not. > > I created a new schedule_preempt_user() which is similar to > schedule_preempt_disable() and works like preempt_schedule_irq(), where > it includes the exception_enter() and exit() routines that take care of > the accounting of dynamic ticks when interrupting user mode or not. It > also ta
Re: [RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem
On Mon, 2013-05-13 at 19:06 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2013-05-13 at 16:44 +0800, Li Zhong wrote: > > Yes, the above and hash_page() are two C functions for a same exception. > > And the exception hooks enable RCU usage in those C codes. But for asm > > codes, I think we could assume that there would be no RCU usage there, > > so we don't need wrap them in the hooks. > > hash_page() won't start a new RCU, at least not in its current incarnation, > the only thing I can see it ever doing would be to take some RCU read locks > one > day (it doesn't today). Seems I added the hooks because of the trace point of hcall entry/exit ... Thanks, Zhong > > low_hash_fault() is a different beast. It will typically kill things, thus > involving sending signals etc... RCU might well be involved. > > Cheers, > Ben. > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries
On Mon, 2013-05-13 at 18:59 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2013-05-13 at 16:03 +0800, Li Zhong wrote: > > > > To my understanding, it is used to enable RCU user extended quiescent > > state, so RCU on that cpu doesn't need scheduler ticks. And together > > with some other code(already in 3.10), we are able to remove the ticks > > in some cases (e.g. only 1 task running on the cpu, with some other > > limitations). > > Ok, sounds interesting. Once you fix the little cosmetic issue, I don't > see any reason not to merge them as it's basically wiring up an existing > feature (in that regard the patches are pretty straightforward) and I > assume the overhead is only there when you enable it. Thanks for your support, Ben. Yes, there should be no overhead if not enabled. Thanks, Zhong > > Cheers, > Ben. > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem
On Mon, 2013-05-13 at 15:57 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2013-05-13 at 13:21 +0800, Li Zhong wrote: > > int recover = 0; > > + enum ctx_state prev_state; > > + > > + prev_state = exception_enter(); > > Please make it nicer: > > enum ctx_state prev_state = exception_enter(); > int recover = 0; OK, I'll fix it, and all the others. > > __get_cpu_var(irq_stat).mce_exceptions++; > > > > @@ -683,7 +687,7 @@ void machine_check_exception(struct pt_regs *regs) > > recover = cur_cpu_spec->machine_check(regs); > > > > if (recover > 0) > > - return; > > + goto exit; > > I'm no fan of "exit" as a label as it's otherwise a libc function (I > know there is no relationship here but I just find that annoying). > > I usually use "bail" OK, I'll fix it, and all the others. > > > #if defined(CONFIG_8xx) && defined(CONFIG_PCI) > > /* the qspan pci read routines can cause machine checks -- Cort > > @@ -693,20 +697,23 @@ void machine_check_exception(struct pt_regs *regs) > > * -- BenH > > */ > > bad_page_fault(regs, regs->dar, SIGBUS); > > - return; > > + goto exit; > > #endif > > > > if (debugger_fault_handler(regs)) > > - return; > > + goto exit; > > > > if (check_io_access(regs)) > > - return; > > + goto exit; > > > > die("Machine check", regs, SIGBUS); > > > > /* Must die if the interrupt is not recoverable */ > > if (!(regs->msr & MSR_RI)) > > panic("Unrecoverable Machine check"); > > + > > +exit: > > + exception_exit(prev_state); > > } > > > > void SMIException(struct pt_regs *regs) > > @@ -716,20 +723,31 @@ void SMIException(struct pt_regs *regs) > > > > void unknown_exception(struct pt_regs *regs) > > { > > + enum ctx_state prev_state; > > + prev_state = exception_enter(); > > + > > Same cosmetic comment for all other cases > > /... > > > #include > > #include > > @@ -193,8 +194,8 @@ static int mm_fault_error(struct pt_regs *regs, > > unsigned long addr, int fault) > > * The return value is 0 if the fault was handled, or the signal > > * number if this is a kernel fault that can't be handled here. > > */ > > -int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, > > - unsigned long error_code) > > +static int __kprobes __do_page_fault(struct pt_regs *regs, > > + unsigned long address, unsigned long error_code) > > { > > struct vm_area_struct * vma; > > struct mm_struct *mm = current->mm; > > @@ -475,6 +476,17 @@ bad_area_nosemaphore: > > > > } > > > > +int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address, > > + unsigned long error_code) > > +{ > > + int ret; > > + enum ctx_state prev_state; > > + prev_state = exception_enter(); > > + ret = __do_page_fault(regs, address, error_code); > > + exception_exit(prev_state); > > + return ret; > > +} > > Any reason why you didn't use the same wrapping trick in traps.c ? I'd > rather we stay consistent in the way we do things between the two files. OK, I'll make it consistent with those in traps.c > > > /* > > * bad_page_fault is called when we have a bad access from the kernel. > > * It is called from the DSI and ISI handlers in head.S and from some > > diff --git a/arch/powerpc/mm/hash_utils_64.c > > b/arch/powerpc/mm/hash_utils_64.c > > index 88ac0ee..d92fb26 100644 > > --- a/arch/powerpc/mm/hash_utils_64.c > > +++ b/arch/powerpc/mm/hash_utils_64.c > > @@ -33,6 +33,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -962,6 +963,9 @@ int hash_page(unsigned long ea, unsigned long access, > > unsigned long trap) > > const struct cpumask *tmp; > > int rc, user_region = 0, local = 0; > > int psize, ssize; > > + enum ctx_state prev_state; > > + > > + prev_state = exception_enter(); > > > > DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n", > > ea, access, trap); > > @@ -973,7 +977,8 @@ int hash_page(unsigned long ea, unsigned long a
Re: [RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries
On Mon, 2013-05-13 at 15:51 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2013-05-13 at 13:21 +0800, Li Zhong wrote: > > These patches try to support context tracking for Power arch, beginning with > > 64-bit pSeries. The codes are ported from that of the x86_64, and in each > > patch, I listed the corresponding patch for x86. > > So that's yet another pile of bloat on all syscall entry/exit and > exception entry/exit. What is it used for ? (I haven't followed on > x86_64 side). To my understanding, it is used to enable RCU user extended quiescent state, so RCU on that cpu doesn't need scheduler ticks. And together with some other code(already in 3.10), we are able to remove the ticks in some cases (e.g. only 1 task running on the cpu, with some other limitations). Maybe Paul, or Frederic could give some better descriptions. Thanks, Zhong > > Cheers, > Ben. > > > v3: > > > > This version is mainly a rebasing, against 3.10-rc1, also as the common > > code > > to handle the exception are pulled into 3.10, so there is no dependency on > > tip tree. So patch #2 and #6 in previous version_2 is merged together. > > > > Li Zhong (5): > > powerpc: Syscall hooks for context tracking subsystem > > powerpc: Exception hooks for context tracking subsystem > > powerpc: Exit user context on notify resume > > powerpc: Use the new schedule_user API on userspace preemption > > powerpc: select HAVE_CONTEXT_TRACKING for pSeries > > > > arch/powerpc/include/asm/context_tracking.h | 10 +++ > > arch/powerpc/include/asm/thread_info.h |7 ++- > > arch/powerpc/kernel/entry_64.S |3 +- > > arch/powerpc/kernel/ptrace.c|5 ++ > > arch/powerpc/kernel/signal.c|5 ++ > > arch/powerpc/kernel/traps.c | 91 > > --- > > arch/powerpc/mm/fault.c | 16 - > > arch/powerpc/mm/hash_utils_64.c | 38 --- > > arch/powerpc/platforms/pseries/Kconfig |1 + > > 9 files changed, 140 insertions(+), 36 deletions(-) > > create mode 100644 arch/powerpc/include/asm/context_tracking.h > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3 1/5] powerpc: Syscall hooks for context tracking subsystem
This is the syscall slow path hooks for context tracking subsystem, corresponding to [PATCH] x86: Syscall hooks for userspace RCU extended QS commit bf5a3c13b939813d28ce26c01425054c740d6731 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is better for it to be in the same 16 bits with others in the group, so in the asm code, andi. with this group could work. Signed-off-by: Li Zhong Acked-by: Frederic Weisbecker --- arch/powerpc/include/asm/thread_info.h |7 +-- arch/powerpc/kernel/ptrace.c |5 + 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index 8ceea14..ba7b197 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_PERFMON_CTXSW 6 /* perfmon needs ctxsw calls */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SINGLESTEP 8 /* singlestepping active */ -#define TIF_MEMDIE 9 /* is terminating due to OOM killer */ +#define TIF_NOHZ 9 /* in adaptive nohz mode */ #define TIF_SECCOMP10 /* secure computing */ #define TIF_RESTOREALL 11 /* Restore all regs (implies NOERROR) */ #define TIF_NOERROR12 /* Force successful syscall return */ @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_SYSCALL_TRACEPOINT 15 /* syscall tracepoint instrumentation */ #define TIF_EMULATE_STACK_STORE16 /* Is an instruction emulation for stack store? */ +#define TIF_MEMDIE 17 /* is terminating due to OOM killer */ /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1< #include #include +#include #include #include @@ -1788,6 +1789,8 @@ long do_syscall_trace_enter(struct pt_regs *regs) { long ret = 0; + user_exit(); + secure_computing_strict(regs->gpr[0]); if (test_thread_flag(TIF_SYSCALL_TRACE) && @@ -1832,4 +1835,6 @@ void do_syscall_trace_leave(struct pt_regs *regs) step = test_thread_flag(TIF_SINGLESTEP); if (step || test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall_exit(regs, step); + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3 3/5] powerpc: Exit user context on notify resume
This patch allows RCU usage in do_notify_resume, e.g. signal handling. It corresponds to [PATCH] x86: Exit RCU extended QS on notify resume commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6 Signed-off-by: Li Zhong --- arch/powerpc/kernel/signal.c |5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index cf12eae..d63b502 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs) void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) { + user_exit(); + if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); } + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3 4/5] powerpc: Use the new schedule_user API on userspace preemption
This patch corresponds to [PATCH] x86: Use the new schedule_user API on userspace preemption commit 0430499ce9d78691f3985962021b16bf8f8a8048 Signed-off-by: Li Zhong --- arch/powerpc/include/asm/context_tracking.h | 10 ++ arch/powerpc/kernel/entry_64.S |3 ++- 2 files changed, 12 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h diff --git a/arch/powerpc/include/asm/context_tracking.h b/arch/powerpc/include/asm/context_tracking.h new file mode 100644 index 000..b6f5a33 --- /dev/null +++ b/arch/powerpc/include/asm/context_tracking.h @@ -0,0 +1,10 @@ +#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H +#define _ASM_POWERPC_CONTEXT_TRACKING_H + +#ifdef CONFIG_CONTEXT_TRACKING +#define SCHEDULE_USER bl .schedule_user +#else +#define SCHEDULE_USER bl .schedule +#endif + +#endif diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 915fbb4..d418977 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -33,6 +33,7 @@ #include #include #include +#include /* * System calls. @@ -634,7 +635,7 @@ _GLOBAL(ret_from_except_lite) andi. r0,r4,_TIF_NEED_RESCHED beq 1f bl .restore_interrupts - bl .schedule + SCHEDULE_USER b .ret_from_except_lite 1: bl .save_nvgprs -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3 5/5] powerpc: select HAVE_CONTEXT_TRACKING for pSeries
Start context tracking support from pSeries. Signed-off-by: Li Zhong --- arch/powerpc/platforms/pseries/Kconfig |1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9a0941b..023b288 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -18,6 +18,7 @@ config PPC_PSERIES select PPC_PCI_CHOICE if EXPERT select ZLIB_DEFLATE select PPC_DOORBELL + select HAVE_CONTEXT_TRACKING default y config PPC_SPLPAR -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v3 2/5] powerpc: Exception hooks for context tracking subsystem
This is the exception hooks for context tracking subsystem, including data access, program check, single step, instruction breakpoint, machine check, alignment, fp unavailable, altivec assist, unknown exception, whose handlers might use RCU. This patch corresponds to [PATCH] x86: Exception hooks for userspace RCU extended QS commit 6ba3c97a38803883c2eee489505796cb0a727122 But after the exception handling moved to generic code, and some changes in following two commits: 56dd9470d7c8734f055da2a6bac553caf4a468eb context_tracking: Move exception handling to generic code 6c1e0256fad84a843d915414e4b5973b7443d48d context_tracking: Restore correct previous context state on exception exit it is able for exception hooks to use the generic code above instead of a redundant arch implementation. Signed-off-by: Li Zhong --- arch/powerpc/kernel/traps.c | 91 +-- arch/powerpc/mm/fault.c | 16 ++- arch/powerpc/mm/hash_utils_64.c | 38 3 files changed, 112 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 83efa2f..9d3c000 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -668,6 +669,9 @@ int machine_check_generic(struct pt_regs *regs) void machine_check_exception(struct pt_regs *regs) { int recover = 0; + enum ctx_state prev_state; + + prev_state = exception_enter(); __get_cpu_var(irq_stat).mce_exceptions++; @@ -683,7 +687,7 @@ void machine_check_exception(struct pt_regs *regs) recover = cur_cpu_spec->machine_check(regs); if (recover > 0) - return; + goto exit; #if defined(CONFIG_8xx) && defined(CONFIG_PCI) /* the qspan pci read routines can cause machine checks -- Cort @@ -693,20 +697,23 @@ void machine_check_exception(struct pt_regs *regs) * -- BenH */ bad_page_fault(regs, regs->dar, SIGBUS); - return; + goto exit; #endif if (debugger_fault_handler(regs)) - return; + goto exit; if (check_io_access(regs)) - return; + goto exit; die("Machine check", regs, SIGBUS); /* Must die if the interrupt is not recoverable */ if (!(regs->msr & MSR_RI)) panic("Unrecoverable Machine check"); + +exit: + exception_exit(prev_state); } void SMIException(struct pt_regs *regs) @@ -716,20 +723,31 @@ void SMIException(struct pt_regs *regs) void unknown_exception(struct pt_regs *regs) { + enum ctx_state prev_state; + prev_state = exception_enter(); + printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); _exception(SIGTRAP, regs, 0, 0); + + exception_exit(prev_state); } void instruction_breakpoint_exception(struct pt_regs *regs) { + enum ctx_state prev_state; + prev_state = exception_enter(); + if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) - return; + goto exit; if (debugger_iabr_match(regs)) - return; + goto exit; _exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip); + +exit: + exception_exit(prev_state); } void RunModeException(struct pt_regs *regs) @@ -739,15 +757,21 @@ void RunModeException(struct pt_regs *regs) void __kprobes single_step_exception(struct pt_regs *regs) { + enum ctx_state prev_state; + prev_state = exception_enter(); + clear_single_step(regs); if (notify_die(DIE_SSTEP, "single_step", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) - return; + goto exit; if (debugger_sstep(regs)) - return; + goto exit; _exception(SIGTRAP, regs, TRAP_TRACE, regs->nip); + +exit: + exception_exit(prev_state); } /* @@ -1006,34 +1030,37 @@ int is_valid_bugaddr(unsigned long addr) void __kprobes program_check_exception(struct pt_regs *regs) { unsigned int reason = get_reason(regs); + enum ctx_state prev_state; extern int do_mathemu(struct pt_regs *regs); + prev_state = exception_enter(); + /* We can now get here via a FP Unavailable exception if the core * has no FPU, in that case the reason flags will be 0 */ if (reason & REASON_FP) { /* IEEE FP exception */ parse_fpe(regs); - return; + goto exit; } if (reason & REASON_TRAP) { /* Debugger is first in line to stop recursive fault
[RFC PATCH v3 0/5] powerpc: Support context tracking for Power pSeries
These patches try to support context tracking for Power arch, beginning with 64-bit pSeries. The codes are ported from that of the x86_64, and in each patch, I listed the corresponding patch for x86. v3: This version is mainly a rebasing, against 3.10-rc1, also as the common code to handle the exception are pulled into 3.10, so there is no dependency on tip tree. So patch #2 and #6 in previous version_2 is merged together. Li Zhong (5): powerpc: Syscall hooks for context tracking subsystem powerpc: Exception hooks for context tracking subsystem powerpc: Exit user context on notify resume powerpc: Use the new schedule_user API on userspace preemption powerpc: select HAVE_CONTEXT_TRACKING for pSeries arch/powerpc/include/asm/context_tracking.h | 10 +++ arch/powerpc/include/asm/thread_info.h |7 ++- arch/powerpc/kernel/entry_64.S |3 +- arch/powerpc/kernel/ptrace.c|5 ++ arch/powerpc/kernel/signal.c|5 ++ arch/powerpc/kernel/traps.c | 91 --- arch/powerpc/mm/fault.c | 16 - arch/powerpc/mm/hash_utils_64.c | 38 --- arch/powerpc/platforms/pseries/Kconfig |1 + 9 files changed, 140 insertions(+), 36 deletions(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:timers/nohz] nohz: Protect smp_processor_id() in tick_nohz_task_switch()
Commit-ID: 6296ace467c8640317414ba589b124323806f7ce Gitweb: http://git.kernel.org/tip/6296ace467c8640317414ba589b124323806f7ce Author: Li Zhong AuthorDate: Sun, 28 Apr 2013 11:25:58 +0800 Committer: Ingo Molnar CommitDate: Mon, 29 Apr 2013 13:17:33 +0200 nohz: Protect smp_processor_id() in tick_nohz_task_switch() I saw following error when testing the latest nohz code on Power: [ 85.295384] BUG: using smp_processor_id() in preemptible [] code: rsyslogd/3493 [ 85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8 [ 85.295402] Call Trace: [ 85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c (unreliable) [ 85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30 [ 85.295430] [c001fababc10] [c044eb74] .debug_smp_processor_id+0xf4/0x124 [ 85.295438] [c001fababca0] [c00d7594] .tick_nohz_task_switch+0x1c/0xb8 [ 85.295447] [c001fababd20] [c00b9748] .finish_task_switch+0x13c/0x160 [ 85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124 [ 85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54 The code below moves the test into local_irq_save/restore section to avoid the above complaint. Signed-off-by: Li Zhong Acked-by: Frederic Weisbecker Cc: Peter Zijlstra Cc: Paul McKenney Link: http://lkml.kernel.org/r/1367119558.6391.34.ca...@thinkpad-t5421.cn.ibm.com Signed-off-by: Ingo Molnar --- kernel/time/tick-sched.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index da53c8f..1c9f53b 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -251,14 +251,15 @@ void tick_nohz_task_switch(struct task_struct *tsk) { unsigned long flags; - if (!tick_nohz_full_cpu(smp_processor_id())) - return; - local_irq_save(flags); + if (!tick_nohz_full_cpu(smp_processor_id())) + goto out; + if (tick_nohz_tick_stopped() && !can_stop_full_tick()) tick_nohz_full_kick(); +out: local_irq_restore(flags); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2]nohz: protect smp_processor_id() in tick_nohz_task_switch()
I saw following error when testing the latest nohz code on Power: [ 85.295384] BUG: using smp_processor_id() in preemptible [] code: rsyslogd/3493 [ 85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8 [ 85.295402] Call Trace: [ 85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c (unreliable) [ 85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30 [ 85.295430] [c001fababc10] [c044eb74] .debug_smp_processor_id+0xf4/0x124 [ 85.295438] [c001fababca0] [c00d7594] .tick_nohz_task_switch+0x1c/0xb8 [ 85.295447] [c001fababd20] [c00b9748] .finish_task_switch+0x13c/0x160 [ 85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124 [ 85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54 The code below moves the test into local_irq_save/restore section to avoid the above complaint. Signed-off-by: Li Zhong --- kernel/time/tick-sched.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index da53c8f..1c9f53b 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -251,14 +251,15 @@ void tick_nohz_task_switch(struct task_struct *tsk) { unsigned long flags; - if (!tick_nohz_full_cpu(smp_processor_id())) - return; - local_irq_save(flags); + if (!tick_nohz_full_cpu(smp_processor_id())) + goto out; + if (tick_nohz_tick_stopped() && !can_stop_full_tick()) tick_nohz_full_kick(); +out: local_irq_restore(flags); } -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuset: fix compile warning when CONFIG_SMP=n
On Sun, 2013-04-28 at 09:46 +0800, Li Zefan wrote: > Reported by Fengguang's kbuild test robot: Thank you for the quick fix, Zefan. Strong robot :) Thanks, Zhong > > kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used > > Introduced by commit e0e80a02e5701c8790bd348ab59edb154fbda60b > ("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()), > which removed generate_sched_domains() from cpuset_hotplug_workfn(). > > Reported-by: Fengguang Wu > Signed-off-by: Li Zefan > --- > kernel/cpuset.c | 7 --- > 1 file changed, 7 deletions(-) > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c > index ef05901..2422c5b 100644 > --- a/kernel/cpuset.c > +++ b/kernel/cpuset.c > @@ -789,13 +789,6 @@ out: > static void rebuild_sched_domains_locked(void) > { > } > - > -static int generate_sched_domains(cpumask_var_t **domains, > - struct sched_domain_attr **attributes) > -{ > - *domains = NULL; > - return 1; > -} > #endif /* CONFIG_SMP */ > > void rebuild_sched_domains(void) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH]nohz: Use raw_smp_processor_id() in tick_nohz_task_switch()
On Sat, 2013-04-27 at 15:40 +0200, Frederic Weisbecker wrote: > 2013/4/27 Li Zhong : > > I saw following error when testing the latest nohz code on Power: > > > > [ 85.295384] BUG: using smp_processor_id() in preemptible [] > > code: rsyslogd/3493 > > [ 85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8 > > [ 85.295402] Call Trace: > > [ 85.295408] [c001fababab0] [c0012dc4] > > .show_stack+0x110/0x25c (unreliable) > > [ 85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30 > > [ 85.295430] [c001fababc10] [c044eb74] > > .debug_smp_processor_id+0xf4/0x124 > > [ 85.295438] [c001fababca0] [c00d7594] > > .tick_nohz_task_switch+0x1c/0xb8 > > [ 85.295447] [c001fababd20] [c00b9748] > > .finish_task_switch+0x13c/0x160 > > [ 85.295455] [c001fababdb0] [c00bbe50] > > .schedule_tail+0x50/0x124 > > [ 85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54 > > > > It seems to me that we could just use raw_smp_processor_id() here. Even > > if the tick_nohz_full_cpu() check is done on a !nohz_full cpu, then the > > task is moved to another nohz_full cpu, it seems the context switching > > because of the task moving would call tick_nohz_task_switch() again to > > evaluate the need for tick. > > > > I don't know whether I missed something here. > > You're right it looks safe to do so. But I suggest we rather move the > test inside local_irq_save()/restore section to avoid confusion on > reviewers minds. OK, I'll send an updated version, using local_irq_save() to protect it. I tried using raw_* because seems it could avoid some unnecessary irq disabling... Thanks, Zhong > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH]nohz: Use raw_smp_processor_id() in tick_nohz_task_switch()
I saw following error when testing the latest nohz code on Power: [ 85.295384] BUG: using smp_processor_id() in preemptible [] code: rsyslogd/3493 [ 85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8 [ 85.295402] Call Trace: [ 85.295408] [c001fababab0] [c0012dc4] .show_stack+0x110/0x25c (unreliable) [ 85.295420] [c001fababba0] [c07c4b54] .dump_stack+0x20/0x30 [ 85.295430] [c001fababc10] [c044eb74] .debug_smp_processor_id+0xf4/0x124 [ 85.295438] [c001fababca0] [c00d7594] .tick_nohz_task_switch+0x1c/0xb8 [ 85.295447] [c001fababd20] [c00b9748] .finish_task_switch+0x13c/0x160 [ 85.295455] [c001fababdb0] [c00bbe50] .schedule_tail+0x50/0x124 [ 85.295463] [c001fababe30] [c0009dc8] .ret_from_fork+0x4/0x54 It seems to me that we could just use raw_smp_processor_id() here. Even if the tick_nohz_full_cpu() check is done on a !nohz_full cpu, then the task is moved to another nohz_full cpu, it seems the context switching because of the task moving would call tick_nohz_task_switch() again to evaluate the need for tick. I don't know whether I missed something here. Signed-off-by: Li Zhong --- kernel/time/tick-sched.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index da53c8f..0aa575b 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -251,7 +251,7 @@ void tick_nohz_task_switch(struct task_struct *tsk) { unsigned long flags; - if (!tick_nohz_full_cpu(smp_processor_id())) + if (!tick_nohz_full_cpu(raw_smp_processor_id())) return; local_irq_save(flags); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()
On Thu, 2013-04-11 at 16:57 +0800, Li Zefan wrote: > On 2013/4/9 17:59, Li Zhong wrote: > > Hi Zefan, > > > > I did some test today, enabling cpuset and online/offline the cpus. It > > caused NULL address dereferencing in get_group(). After adding some > > debugging code, it seems that it is caused by using some offlined cpus > > as the parameter to partition_sched_domains(). More details below: > > > > === > > following is printed in build_sched_domain(): > > @@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct > > sched_domain_topology_level *tl, > > return child; > > > > cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); > > + if (cpumask_empty(sched_domain_span(sd))) > > + printk("lz empty domain %p, mask %pS\n", sd, tl->mask); > > + char cpulist[128]; > > + cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map); > > + printk("lz cpu_map %s\n", cpulist); > > + cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu)); > > + printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist); > > + > > if (child) { > > sd->level = child->level + 1; > > sched_domain_level_max = max(sched_domain_level_max, > > sd->level); > > > > [ 232.714502] lz empty domain c001f73c, mask cpu_smt_mask > > +0x0/0x10 > > [ 232.714515] lz cpu_map 1-2,7,13-15 > > [ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10 > > [ 232.714529] lz cpu_map 1-2,7,13-15 > > [ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15 > > [ 232.714544] lz cpu_map 1-2,7,13-15 > > [ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15 > > [ 232.714558] lz cpu_map 1-2,7,13-15 > > [ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2 > > === > > > > It seems that one cpu (#1 of the above) could be taken offline in > > cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the > > propagated work function finished. But generate_sched_domains(), which > > uses the cpuset information, still considers this cpu (#1) valid, and > > passes it down in the cpumask to partition_sched_domains(). Then we have > > an empty sd as the above. > > > > Thanks for the test and analysis! > > So the bug was caused when the cpuset async cpuset handling hasn't been > finished, and at this time user triggered another hotplug operation, right? Seems so, the summary is very clear :) > Then I think we should wait cpuset to finish its work before starting > a new hotplug operation. en. > This should fix the bug you observed, and it should also fix another bug, > that the 2nd schedule_work(&cpuset_hotplug_work) returns true, and so > the root cpuset.cpus won't be updated. OK, I just realized that the work can only have one instance... As schedule_work returns false if work already in the queue, so I guess we should check against false. And it seems to me that we also need the logic in cpu_up path, for similar reasons, or WARN_ON might be triggered, and cpuset cpu info won't get updated. > Could you test this patch? Seems it works, with the above two changes. For your convenience, the final code I used for testing is attached below. Thanks, Zhong --- diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 8c8a60d..d3ba31d 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -119,6 +119,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) task_unlock(current); } +extern void cpuset_flush_hotplug_work(void); + #else /* !CONFIG_CPUSETS */ static inline int cpuset_init(void) { return 0; } @@ -233,6 +235,10 @@ static inline bool put_mems_allowed(unsigned int seq) return true; } +static inline void cpuset_flush_hotplug_work(void) +{ +} + #endif /* !CONFIG_CPUSETS */ #endif /* _LINUX_CPUSET_H */ diff --git a/kernel/cpu.c b/kernel/cpu.c index b5e4ab2..6eb914b 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "smpboot.h" @@ -278,6 +279,13 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen) if (!cpu_online(cpu)) return -EINVAL; + /* +* cpuset handles cpu/memory hotplug asynchronously, so it's possible +* that cpuset hasn't finished it's hotplug work for the previous +* hotplug operation. +*/ + cpuset_flush_hotplug_work(); + cpu_hotplug_begin(); err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_cal
Re: [RFC PATCH v2 6/6] powerpc: Use generic code for exception handling
On Wed, 2013-04-10 at 13:32 +0800, Li Zhong wrote: > On Wed, 2013-04-10 at 14:56 +1000, Michael Ellerman wrote: > > On Fri, Mar 29, 2013 at 06:00:21PM +0800, Li Zhong wrote: > > > After the exception handling moved to generic code, and some changes in > > ... > > > diff --git a/arch/powerpc/mm/hash_utils_64.c > > > b/arch/powerpc/mm/hash_utils_64.c > > > index 360fba8..eeab30f 100644 > > > --- a/arch/powerpc/mm/hash_utils_64.c > > > +++ b/arch/powerpc/mm/hash_utils_64.c > > > @@ -33,6 +33,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > > > > #include > > > #include > > > @@ -56,7 +57,6 @@ > > > #include > > > #include > > > #include > > > -#include > > > > > > #ifdef DEBUG > > > #define DBG(fmt...) udbg_printf(fmt) > > > @@ -919,13 +919,17 @@ int hash_page(unsigned long ea, unsigned long > > > access, unsigned long trap) > > > const struct cpumask *tmp; > > > int rc, user_region = 0, local = 0; > > > int psize, ssize; > > > + enum ctx_state prev_state; > > > + > > > + prev_state = exception_enter(); > > > > > > DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n", > > > ea, access, trap); > > > > > > if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) { > > > DBG_LOW(" out of pgtable range !\n"); > > > - return 1; > > > + rc = 1; > > > + goto exit; > > > } > > > > > > /* Get region & vsid */ > > > > This no longer applies on mainline, please send an updated version. > > Yes, for current mainline (powerpc tree), only previous five patches > could be applied. The dependency of this patch is current in tip tree, > and seems would be in for 3.10. > > There are some more details in the cover letter (#0): > > "I assume these patches would get in through powerpc tree, so I didn't > combine the new patch (#6) with the original one (#2). So that if > powerpc tree picks these, it could pick the first five patches, and > apply patch #6 later when the dependency enters into powerpc tree (maybe > on some 3.10-rcs)." And I will send an updated version of this one when I see the dependency commits in mainline. Thanks, Zhong > Thanks, Zhong > > > cheers > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2 6/6] powerpc: Use generic code for exception handling
On Wed, 2013-04-10 at 14:56 +1000, Michael Ellerman wrote: > On Fri, Mar 29, 2013 at 06:00:21PM +0800, Li Zhong wrote: > > After the exception handling moved to generic code, and some changes in > ... > > diff --git a/arch/powerpc/mm/hash_utils_64.c > > b/arch/powerpc/mm/hash_utils_64.c > > index 360fba8..eeab30f 100644 > > --- a/arch/powerpc/mm/hash_utils_64.c > > +++ b/arch/powerpc/mm/hash_utils_64.c > > @@ -33,6 +33,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -56,7 +57,6 @@ > > #include > > #include > > #include > > -#include > > > > #ifdef DEBUG > > #define DBG(fmt...) udbg_printf(fmt) > > @@ -919,13 +919,17 @@ int hash_page(unsigned long ea, unsigned long access, > > unsigned long trap) > > const struct cpumask *tmp; > > int rc, user_region = 0, local = 0; > > int psize, ssize; > > + enum ctx_state prev_state; > > + > > + prev_state = exception_enter(); > > > > DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n", > > ea, access, trap); > > > > if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) { > > DBG_LOW(" out of pgtable range !\n"); > > - return 1; > > + rc = 1; > > + goto exit; > > } > > > > /* Get region & vsid */ > > This no longer applies on mainline, please send an updated version. Yes, for current mainline (powerpc tree), only previous five patches could be applied. The dependency of this patch is current in tip tree, and seems would be in for 3.10. There are some more details in the cover letter (#0): "I assume these patches would get in through powerpc tree, so I didn't combine the new patch (#6) with the original one (#2). So that if powerpc tree picks these, it could pick the first five patches, and apply patch #6 later when the dependency enters into powerpc tree (maybe on some 3.10-rcs)." Thanks, Zhong > cheers > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 cpuset] Don't pass offlined cpus to partition_sched_domains()
Hi Zefan, I did some test today, enabling cpuset and online/offline the cpus. It caused NULL address dereferencing in get_group(). After adding some debugging code, it seems that it is caused by using some offlined cpus as the parameter to partition_sched_domains(). More details below: === following is printed in build_sched_domain(): @@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, return child; cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); + if (cpumask_empty(sched_domain_span(sd))) + printk("lz empty domain %p, mask %pS\n", sd, tl->mask); + char cpulist[128]; + cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map); + printk("lz cpu_map %s\n", cpulist); + cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu)); + printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist); + if (child) { sd->level = child->level + 1; sched_domain_level_max = max(sched_domain_level_max, sd->level); [ 232.714502] lz empty domain c001f73c, mask cpu_smt_mask +0x0/0x10 [ 232.714515] lz cpu_map 1-2,7,13-15 [ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10 [ 232.714529] lz cpu_map 1-2,7,13-15 [ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15 [ 232.714544] lz cpu_map 1-2,7,13-15 [ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15 [ 232.714558] lz cpu_map 1-2,7,13-15 [ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2 === It seems that one cpu (#1 of the above) could be taken offline in cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the propagated work function finished. But generate_sched_domains(), which uses the cpuset information, still considers this cpu (#1) valid, and passes it down in the cpumask to partition_sched_domains(). Then we have an empty sd as the above. With the empty domain, in get_group(), if (child) cpu = cpumask_first(sched_domain_span(child)); this might cause the cpu to be returned a value >= nr_cpu_ids, then cause bad dereference with the wrong cpu value in the later code. This following code tries to fix the issue by anding the cpu_active_mask at the end of generate_sched_domains() for each partition. This might result the partiton (set of domains) not the best one, but we know another rebuild (caused by the cpu offline in the middle of cpuset_hotplug_workfn()) is on the way. Also it seems that generate_sched_domains() needs be called together with partition_sched_domains(), under the hotplug lock. Or similarly, one cpu might be taken offline while doing generate_sched_domains(), and cause the above issue. Or maybe we could put the whole cpuset_hotplug_workfn() into the hotplug lock? Thanks, Zhong --- diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 4f9dfe4..aed8436 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -754,6 +754,12 @@ done: */ if (doms == NULL) ndoms = 1; + else { + for (nslot = 0; nslot < ndoms; nslot++) { + struct cpumask *dp = doms[nslot]; + cpumask_and(dp, dp, cpu_active_mask); + } + } *domains= doms; *attributes = dattr; @@ -,17 +2228,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work) flush_workqueue(cpuset_propagate_hotplug_wq); /* rebuild sched domains if cpus_allowed has changed */ - if (cpus_updated) { - struct sched_domain_attr *attr; - cpumask_var_t *doms; - int ndoms; - - mutex_lock(&cpuset_mutex); - ndoms = generate_sched_domains(&doms, &attr); - mutex_unlock(&cpuset_mutex); - - partition_sched_domains(ndoms, doms, attr); - } + if (cpus_updated) + rebuild_sched_domains(); } void cpuset_update_active_cpus(bool cpu_online) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()
On Wed, 2013-04-03 at 17:32 +0800, Li Zefan wrote: > On 2013/4/2 15:16, Li Zhong wrote: > > In cpuset_hotplug_workfn(), partition_sched_domains() is called without > > hotplug lock held, which is actually needed (stated in the function > > header of partition_sched_domains()). > > > > This patch tries to use rebuild_sched_domains() to solve the above > > issue, and makes the code looks a little simpler. > > Hi Zefan, Thanks for your quick review. > I guess you found this just by code inspection, right? Ah, yes :) > The change looks fine and safe at a first glance, but we don't have > time to review the patch for now. > > However I don't know why partition_sched_domains() has to be called > with hotplug lock held. > > After a quick scan, seems the only place in partition_sched_domains() > that might need hotplug lock is arch_update_cpu_topology(). and cpu_attach_domain(), which is called by partition_sched_domains() also states that it needs hotplug lock be held. It seems that the code is moved out of hotplug critical section after it is converted to work functions. Thanks, Zhong > > It would be better if you had CCed the scheduler guys...They may > know the answer. > > > Signed-off-by: Li Zhong > > --- > > kernel/cpuset.c | 13 ++--- > > 1 files changed, 2 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c > > index 4f9dfe4..515a713 100644 > > --- a/kernel/cpuset.c > > +++ b/kernel/cpuset.c > > @@ -,17 +,8 @@ static void cpuset_hotplug_workfn(struct work_struct > > *work) > > flush_workqueue(cpuset_propagate_hotplug_wq); > > > > /* rebuild sched domains if cpus_allowed has changed */ > > - if (cpus_updated) { > > - struct sched_domain_attr *attr; > > - cpumask_var_t *doms; > > - int ndoms; > > - > > - mutex_lock(&cpuset_mutex); > > - ndoms = generate_sched_domains(&doms, &attr); > > - mutex_unlock(&cpuset_mutex); > > - > > - partition_sched_domains(ndoms, doms, attr); > > - } > > + if (cpus_updated) > > + rebuild_sched_domains(); > > } > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v2 2/6] powerpc: Exception hooks for context tracking subsystem
On Fri, 2013-04-05 at 13:50 +1100, Paul Mackerras wrote: > On Fri, Mar 29, 2013 at 06:00:17PM +0800, Li Zhong wrote: > > This is the exception hooks for context tracking subsystem, including > > data access, program check, single step, instruction breakpoint, machine > > check, > > alignment, fp unavailable, altivec assist, unknown exception, whose handlers > > might use RCU. > > > > This patch corresponds to > > [PATCH] x86: Exception hooks for userspace RCU extended QS > > commit 6ba3c97a38803883c2eee489505796cb0a727122 > > > > Signed-off-by: Li Zhong Hi Paul, Thanks for your review! Please check my answers below, and correct me if any errors. > Is there a reason why you didn't put the exception_exit() call in > ret_from_except_lite in entry_64.S, and the exception_entry() call in > EXCEPTION_PROLOG_COMMON? That would seem to catch all these cases in > a more centralized place. It seems to me that ret_from_except_lite and EXCEPTION_PROLOG_COMMON are also used by interrupts, where I think we don't need the hooks. So using this way could help to avoid adding overhead to these code path (interrupts, and some exit path of syscall). And I think adding the hook on higher level code seems a little easier for reading and checking. It seems that some exceptions don't use EXCEPTION_PROLOG_COMMON, and some don't go ret_from_except_lite exit path (like fp unavailable might go directly to fast_exception_return ). Maybe fast_exception_return is a centralized place for us to return to user space? But it still adds some overheads which is not necessarily needed. And I think it also makes the implementation here consistent with the style that x86 uses. > Also, I notice that with the exception_exit calls where they are, we > can still deliver signals (thus possibly taking a page fault) or call > schedule() for preemption after the exception_exit() call. Is that > OK, or is it a potential problem? If I understand correctly, I guess you are talking about the cases where we might return to user space without context state correctly being set as in user? There is user_enter() called in do_notify_resume() in patch #3, so after handling the signals we always call user_enter(). There are also some changes of the context_tracking code from Frederic, which might be related: ( they are now in tip tree, and url of the patches for your convenience https://lkml.org/lkml/2013/3/1/266 ) 6c1e0256fad84a843d915414e4b5973b7443d48d context_tracking: Restore correct previous context state on exception exit. With this patch, if a later exception happened after user_enter(), before the CPU actually returns to user space, the correct context state(in user) is saved and restored when handling the later exception. Patch #6 converts the code to use these new APIs, which is currently not available in powerpc tree. b22366cd54c6fe05db426f20adb10f461c19ec06 context_tracking: Restore preempted context state after preempt_schedule_irq With this patch, the user context state could be correctly restored after schedule returns. Thanks, Zhong > Paul. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH cpuset] Use rebuild_sched_domains() in cpuset_hotplug_workfn()
In cpuset_hotplug_workfn(), partition_sched_domains() is called without hotplug lock held, which is actually needed (stated in the function header of partition_sched_domains()). This patch tries to use rebuild_sched_domains() to solve the above issue, and makes the code looks a little simpler. Signed-off-by: Li Zhong --- kernel/cpuset.c | 13 ++--- 1 files changed, 2 insertions(+), 11 deletions(-) diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 4f9dfe4..515a713 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -,17 +,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work) flush_workqueue(cpuset_propagate_hotplug_wq); /* rebuild sched domains if cpus_allowed has changed */ - if (cpus_updated) { - struct sched_domain_attr *attr; - cpumask_var_t *doms; - int ndoms; - - mutex_lock(&cpuset_mutex); - ndoms = generate_sched_domains(&doms, &attr); - mutex_unlock(&cpuset_mutex); - - partition_sched_domains(ndoms, doms, attr); - } + if (cpus_updated) + rebuild_sched_domains(); } void cpuset_update_active_cpus(bool cpu_online) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 4/6] powerpc: Use the new schedule_user API on userspace preemption
This patch corresponds to [PATCH] x86: Use the new schedule_user API on userspace preemption commit 0430499ce9d78691f3985962021b16bf8f8a8048 Signed-off-by: Li Zhong --- arch/powerpc/include/asm/context_tracking.h | 11 +++ arch/powerpc/kernel/entry_64.S |3 ++- 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/context_tracking.h b/arch/powerpc/include/asm/context_tracking.h index 377146e..4da287e 100644 --- a/arch/powerpc/include/asm/context_tracking.h +++ b/arch/powerpc/include/asm/context_tracking.h @@ -1,6 +1,7 @@ #ifndef _ASM_POWERPC_CONTEXT_TRACKING_H #define _ASM_POWERPC_CONTEXT_TRACKING_H +#ifndef __ASSEMBLY__ #include #include @@ -25,4 +26,14 @@ static inline void __exception_exit(struct pt_regs *regs) #endif } +#else /* __ASSEMBLY__ */ + +#ifdef CONFIG_CONTEXT_TRACKING +#define SCHEDULE_USER bl .schedule_user +#else +#define SCHEDULE_USER bl .schedule +#endif + +#endif /* !__ASSEMBLY__ */ + #endif diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 256c5bf..f7e4622 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -33,6 +33,7 @@ #include #include #include +#include /* * System calls. @@ -618,7 +619,7 @@ _GLOBAL(ret_from_except_lite) andi. r0,r4,_TIF_NEED_RESCHED beq 1f bl .restore_interrupts - bl .schedule + SCHEDULE_USER b .ret_from_except_lite 1: bl .save_nvgprs -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 0/6] powerpc: Support context tracking for Power pSeries
These patches try to support context tracking for Power arch, beginning with 64-bit pSeries. The codes are ported from that of the x86_64, and in each patch, I listed the corresponding patch for x86. Would you please help review and give your comments? v2: I rebased the patches against 3.9-rcs, and also added a patch to replace the exception handling with the generic code in tip timers/nohz. I assume these patches would get in through powerpc tree, so I didn't combine the new patch (#6) with the original one (#2). So that if powerpc tree picks these, it could pick the first five patches, and apply patch #6 later when the dependency enters into powerpc tree (maybe on some 3.10-rcs). I'm also wondering whether it is possible for these to go through tip timers/nohz, so for now, patches #6 and #2 could be combined into one, and no need to worry about the issues caused by arch/common code merging. And it might also make future changes easier. Thanks, Zhong Li Zhong (6): powerpc: Syscall hooks for context tracking subsystem powerpc: Exception hooks for context tracking subsystem powerpc: Exit user context on notify resume powerpc: Use the new schedule_user API on userspace preemption powerpc: select HAVE_CONTEXT_TRACKING for pSeries powerpc: Use generic code for exception handling arch/powerpc/include/asm/context_tracking.h | 10 +++ arch/powerpc/include/asm/thread_info.h |7 ++- arch/powerpc/kernel/entry_64.S |3 +- arch/powerpc/kernel/ptrace.c|5 ++ arch/powerpc/kernel/signal.c|5 ++ arch/powerpc/kernel/traps.c | 91 --- arch/powerpc/mm/fault.c | 16 - arch/powerpc/mm/hash_utils_64.c | 38 --- arch/powerpc/platforms/pseries/Kconfig |1 + 9 files changed, 140 insertions(+), 36 deletions(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 5/6] powerpc: select HAVE_CONTEXT_TRACKING for pSeries
Start context tracking support from pSeries. Signed-off-by: Li Zhong --- arch/powerpc/platforms/pseries/Kconfig |1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9a0941b..023b288 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -18,6 +18,7 @@ config PPC_PSERIES select PPC_PCI_CHOICE if EXPERT select ZLIB_DEFLATE select PPC_DOORBELL + select HAVE_CONTEXT_TRACKING default y config PPC_SPLPAR -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 6/6] powerpc: Use generic code for exception handling
After the exception handling moved to generic code, and some changes in following two commits: 56dd9470d7c8734f055da2a6bac553caf4a468eb context_tracking: Move exception handling to generic code 6c1e0256fad84a843d915414e4b5973b7443d48d context_tracking: Restore correct previous context state on exception exit it is able for this patch to replace the implementation in arch code with the generic code in above commits. Signed-off-by: Li Zhong --- arch/powerpc/include/asm/context_tracking.h | 29 --- arch/powerpc/kernel/exceptions-64s.S|4 +-- arch/powerpc/kernel/traps.c | 42 +- arch/powerpc/mm/fault.c |7 ++-- arch/powerpc/mm/hash_utils_64.c | 51 ++- 5 files changed, 57 insertions(+), 76 deletions(-) diff --git a/arch/powerpc/include/asm/context_tracking.h b/arch/powerpc/include/asm/context_tracking.h index 4da287e..b6f5a33 100644 --- a/arch/powerpc/include/asm/context_tracking.h +++ b/arch/powerpc/include/asm/context_tracking.h @@ -1,39 +1,10 @@ #ifndef _ASM_POWERPC_CONTEXT_TRACKING_H #define _ASM_POWERPC_CONTEXT_TRACKING_H -#ifndef __ASSEMBLY__ -#include -#include - -/* - * temporarily defined to avoid potential conflicts with the common - * implementation, these will be removed by a later patch after the common - * code enters powerpc tree - */ -#define exception_enter __exception_enter -#define exception_exit __exception_exit - -static inline void __exception_enter(struct pt_regs *regs) -{ - user_exit(); -} - -static inline void __exception_exit(struct pt_regs *regs) -{ -#ifdef CONFIG_CONTEXT_TRACKING - if (user_mode(regs)) - user_enter(); -#endif -} - -#else /* __ASSEMBLY__ */ - #ifdef CONFIG_CONTEXT_TRACKING #define SCHEDULE_USER bl .schedule_user #else #define SCHEDULE_USER bl .schedule #endif -#endif /* !__ASSEMBLY__ */ - #endif diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 6d82f4f..a8a5361 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1368,17 +1368,15 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB) rlwimi r4,r0,32-13,30,30 /* becomes _PAGE_USER access bit */ ori r4,r4,1 /* add _PAGE_PRESENT */ rlwimi r4,r5,22+2,31-2,31-2/* Set _PAGE_EXEC if trap is 0x400 */ - addir6,r1,STACK_FRAME_OVERHEAD /* * r3 contains the faulting address * r4 contains the required access permissions * r5 contains the trap number -* r6 contains the address of pt_regs * * at return r3 = 0 for success, 1 for page fault, negative for error */ - bl .hash_page_ct /* build HPTE if possible */ + bl .hash_page /* build HPTE if possible */ cmpdi r3,0/* see if hash_page succeeded */ /* Success */ diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 6228b6b..1b46c2d9 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -60,7 +61,6 @@ #include #include #include -#include #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) int (*__debugger)(struct pt_regs *regs) __read_mostly; @@ -669,8 +669,9 @@ int machine_check_generic(struct pt_regs *regs) void machine_check_exception(struct pt_regs *regs) { int recover = 0; + enum ctx_state prev_state; - exception_enter(regs); + prev_state = exception_enter(); __get_cpu_var(irq_stat).mce_exceptions++; @@ -712,7 +713,7 @@ void machine_check_exception(struct pt_regs *regs) panic("Unrecoverable Machine check"); exit: - exception_exit(regs); + exception_exit(prev_state); } void SMIException(struct pt_regs *regs) @@ -722,19 +723,21 @@ void SMIException(struct pt_regs *regs) void unknown_exception(struct pt_regs *regs) { - exception_enter(regs); + enum ctx_state prev_state; + prev_state = exception_enter(); printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); _exception(SIGTRAP, regs, 0, 0); - exception_exit(regs); + exception_exit(prev_state); } void instruction_breakpoint_exception(struct pt_regs *regs) { - exception_enter(regs); + enum ctx_state prev_state; + prev_state = exception_enter(); if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) @@ -744,7 +747,7 @@ void instruction_breakpoint_exception(struct pt_regs *regs) _exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip); exit: - exception_exit(regs); + exception_exi
[RFC PATCH v2 1/6] powerpc: Syscall hooks for context tracking subsystem
This is the syscall slow path hooks for context tracking subsystem, corresponding to [PATCH] x86: Syscall hooks for userspace RCU extended QS commit bf5a3c13b939813d28ce26c01425054c740d6731 TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is better for it to be in the same 16 bits with others in the group, so in the asm code, andi. with this group could work. Signed-off-by: Li Zhong Acked-by: Frederic Weisbecker --- arch/powerpc/include/asm/thread_info.h |7 +-- arch/powerpc/kernel/ptrace.c |5 + 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index 406b7b9..414a261 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_PERFMON_CTXSW 6 /* perfmon needs ctxsw calls */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SINGLESTEP 8 /* singlestepping active */ -#define TIF_MEMDIE 9 /* is terminating due to OOM killer */ +#define TIF_NOHZ 9 /* in adaptive nohz mode */ #define TIF_SECCOMP10 /* secure computing */ #define TIF_RESTOREALL 11 /* Restore all regs (implies NOERROR) */ #define TIF_NOERROR12 /* Force successful syscall return */ @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void) #define TIF_SYSCALL_TRACEPOINT 15 /* syscall tracepoint instrumentation */ #define TIF_EMULATE_STACK_STORE16 /* Is an instruction emulation for stack store? */ +#define TIF_MEMDIE 17 /* is terminating due to OOM killer */ /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1< #include #include +#include #include #include @@ -1778,6 +1779,8 @@ long do_syscall_trace_enter(struct pt_regs *regs) { long ret = 0; + user_exit(); + secure_computing_strict(regs->gpr[0]); if (test_thread_flag(TIF_SYSCALL_TRACE) && @@ -1822,4 +1825,6 @@ void do_syscall_trace_leave(struct pt_regs *regs) step = test_thread_flag(TIF_SINGLESTEP); if (step || test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall_exit(regs, step); + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 3/6] powerpc: Exit user context on notify resume
This patch allows RCU usage in do_notify_resume, e.g. signal handling. It corresponds to [PATCH] x86: Exit RCU extended QS on notify resume commit edf55fda35c7dc7f2d9241c3abaddaf759b457c6 Signed-off-by: Li Zhong --- arch/powerpc/kernel/signal.c |5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index cf12eae..d63b502 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -159,6 +160,8 @@ static int do_signal(struct pt_regs *regs) void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) { + user_exit(); + if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -169,4 +172,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); } + + user_enter(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH v2 2/6] powerpc: Exception hooks for context tracking subsystem
This is the exception hooks for context tracking subsystem, including data access, program check, single step, instruction breakpoint, machine check, alignment, fp unavailable, altivec assist, unknown exception, whose handlers might use RCU. This patch corresponds to [PATCH] x86: Exception hooks for userspace RCU extended QS commit 6ba3c97a38803883c2eee489505796cb0a727122 Signed-off-by: Li Zhong --- arch/powerpc/include/asm/context_tracking.h | 28 + arch/powerpc/kernel/exceptions-64s.S|4 +- arch/powerpc/kernel/traps.c | 83 --- arch/powerpc/mm/fault.c | 15 - arch/powerpc/mm/hash_utils_64.c | 17 ++ 5 files changed, 122 insertions(+), 25 deletions(-) create mode 100644 arch/powerpc/include/asm/context_tracking.h diff --git a/arch/powerpc/include/asm/context_tracking.h b/arch/powerpc/include/asm/context_tracking.h new file mode 100644 index 000..377146e --- /dev/null +++ b/arch/powerpc/include/asm/context_tracking.h @@ -0,0 +1,28 @@ +#ifndef _ASM_POWERPC_CONTEXT_TRACKING_H +#define _ASM_POWERPC_CONTEXT_TRACKING_H + +#include +#include + +/* + * temporarily defined to avoid potential conflicts with the common + * implementation, these will be removed by a later patch after the common + * code enters powerpc tree + */ +#define exception_enter __exception_enter +#define exception_exit __exception_exit + +static inline void __exception_enter(struct pt_regs *regs) +{ + user_exit(); +} + +static inline void __exception_exit(struct pt_regs *regs) +{ +#ifdef CONFIG_CONTEXT_TRACKING + if (user_mode(regs)) + user_enter(); +#endif +} + +#endif diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a8a5361..6d82f4f 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1368,15 +1368,17 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB) rlwimi r4,r0,32-13,30,30 /* becomes _PAGE_USER access bit */ ori r4,r4,1 /* add _PAGE_PRESENT */ rlwimi r4,r5,22+2,31-2,31-2/* Set _PAGE_EXEC if trap is 0x400 */ + addir6,r1,STACK_FRAME_OVERHEAD /* * r3 contains the faulting address * r4 contains the required access permissions * r5 contains the trap number +* r6 contains the address of pt_regs * * at return r3 = 0 for success, 1 for page fault, negative for error */ - bl .hash_page /* build HPTE if possible */ + bl .hash_page_ct /* build HPTE if possible */ cmpdi r3,0/* see if hash_page succeeded */ /* Success */ diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 37cc40e..6228b6b 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -60,6 +60,7 @@ #include #include #include +#include #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) int (*__debugger)(struct pt_regs *regs) __read_mostly; @@ -669,6 +670,8 @@ void machine_check_exception(struct pt_regs *regs) { int recover = 0; + exception_enter(regs); + __get_cpu_var(irq_stat).mce_exceptions++; /* See if any machine dependent calls. In theory, we would want @@ -683,7 +686,7 @@ void machine_check_exception(struct pt_regs *regs) recover = cur_cpu_spec->machine_check(regs); if (recover > 0) - return; + goto exit; #if defined(CONFIG_8xx) && defined(CONFIG_PCI) /* the qspan pci read routines can cause machine checks -- Cort @@ -693,20 +696,23 @@ void machine_check_exception(struct pt_regs *regs) * -- BenH */ bad_page_fault(regs, regs->dar, SIGBUS); - return; + goto exit; #endif if (debugger_fault_handler(regs)) - return; + goto exit; if (check_io_access(regs)) - return; + goto exit; die("Machine check", regs, SIGBUS); /* Must die if the interrupt is not recoverable */ if (!(regs->msr & MSR_RI)) panic("Unrecoverable Machine check"); + +exit: + exception_exit(regs); } void SMIException(struct pt_regs *regs) @@ -716,20 +722,29 @@ void SMIException(struct pt_regs *regs) void unknown_exception(struct pt_regs *regs) { + exception_enter(regs); + printk("Bad trap at PC: %lx, SR: %lx, vector=%lx\n", regs->nip, regs->msr, regs->trap); _exception(SIGTRAP, regs, 0, 0); + + exception_exit(regs); } void instruction_breakpoint_exception(struct pt_regs *regs) { + exception_enter(regs); + if (notify_die(DIE_IABR_MATCH, "iabr_match", regs, 5, 5, SIGTRAP) == N
Re: [RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes
This updated version adds notrace to these functions directly: I saw some RCU illegal usage from idle complaints when function tracer is enabled with forced context tracking. It seems that __schedule() might be called in function_trace_call() when it re-enables preemption(if preemption and irqs are both enabled). So at the places where we call rcu_user_exit(), we need make sure that no function tracer hooks could be called in preemption/irqs both enabled environment before rcu_user_exit() finishes, or we might cause illegal RCU usage in __schedule(). This patch tries to add notrace attribute to a couple of functions where the above could happen, including user_exit(), and a few callers of user_exit(). Signed-off-by: Li Zhong -- arch/x86/kernel/ptrace.c | 4 ++-- arch/x86/kernel/signal.c | 2 +- kernel/context_tracking.c | 2 +- kernel/sched/core.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 29a8120..ce9f12e 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, * We must return the syscall number to actually look up in the table. * This can be -1L to skip running any syscall at all. */ -long syscall_trace_enter(struct pt_regs *regs) +long notrace syscall_trace_enter(struct pt_regs *regs) { long ret = 0; @@ -1538,7 +1538,7 @@ out: return ret ?: regs->orig_ax; } -void syscall_trace_leave(struct pt_regs *regs) +void notrace syscall_trace_leave(struct pt_regs *regs) { bool step; diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 6956299..7b60210 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs) * notification of userspace execution resumption * - triggered by the TIF_WORK_MASK flags */ -void +void notrace do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags) { user_exit(); diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 65349f0..f598f61 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -83,7 +83,7 @@ void user_enter(void) * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void user_exit(void) +void notrace user_exit(void) { unsigned long flags; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7f12624..178a6af 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2971,7 +2971,7 @@ asmlinkage void __sched schedule(void) EXPORT_SYMBOL(schedule); #ifdef CONFIG_CONTEXT_TRACKING -asmlinkage void __sched schedule_user(void) +asmlinkage void __sched notrace schedule_user(void) { /* * If we come here after a random call to set_need_resched(), -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes
On Thu, 2013-02-28 at 08:38 -0800, Paul E. McKenney wrote: > On Thu, Feb 28, 2013 at 04:26:10PM +0800, Li Zhong wrote: > > I saw some RCU illegal usage from idle complaints when function tracer > > is enabled with forced context tracking. > > > > It seems that __schedule() might be called in function_trace_call() when > > it re-enables preemption(if preemption and irqs are both enabled). > > > > So at the places where we call rcu_user_exit(), we need make sure that > > no function tracer hooks could be called in preemption/irqs both enabled > > environment before rcu_user_exit() finishes, or we might cause illegal > > RCU usage in __schedule(). > > > > This patch tries to add notrace attribute to a couple of functions where > > the above could happen, including user_exit(), and a few callers of > > user_exit(). > > And I have to ask... > > Do we really need the rcu_user_notrace? Why not just label the functions > as notrace? Hmm, because it seems to me the problems are related only to rcu user qs, and tracing without rcu user eqs doesn't have the above issues. But maybe it's better to do it simpler by just marking them notrace, as they are all low-level functions? Thanks, Zhong > > Thanx, Paul > > > Signed-off-by: Li Zhong > > -- > > arch/x86/kernel/ptrace.c | 4 ++-- > > arch/x86/kernel/signal.c | 2 +- > > include/linux/rcupdate.h | 2 ++ > > kernel/context_tracking.c | 2 +- > > kernel/sched/core.c | 2 +- > > 5 files changed, 7 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c > > index 29a8120..04659c2 100644 > > --- a/arch/x86/kernel/ptrace.c > > +++ b/arch/x86/kernel/ptrace.c > > @@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct > > pt_regs *regs, > > * We must return the syscall number to actually look up in the table. > > * This can be -1L to skip running any syscall at all. > > */ > > -long syscall_trace_enter(struct pt_regs *regs) > > +long rcu_user_notrace syscall_trace_enter(struct pt_regs *regs) > > { > > long ret = 0; > > > > @@ -1538,7 +1538,7 @@ out: > > return ret ?: regs->orig_ax; > > } > > > > -void syscall_trace_leave(struct pt_regs *regs) > > +void rcu_user_notrace syscall_trace_leave(struct pt_regs *regs) > > { > > bool step; > > > > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c > > index 6956299..10a9e5a 100644 > > --- a/arch/x86/kernel/signal.c > > +++ b/arch/x86/kernel/signal.c > > @@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs) > > * notification of userspace execution resumption > > * - triggered by the TIF_WORK_MASK flags > > */ > > -void > > +void rcu_user_notrace > > do_notify_resume(struct pt_regs *regs, void *unused, __u32 > > thread_info_flags) > > { > > user_exit(); > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index b758ce1..ecf9872 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -229,6 +229,7 @@ extern void rcu_user_enter(void); > > extern void rcu_user_exit(void); > > extern void rcu_user_enter_after_irq(void); > > extern void rcu_user_exit_after_irq(void); > > +#define rcu_user_notrace notrace > > #else > > static inline void rcu_user_enter(void) { } > > static inline void rcu_user_exit(void) { } > > @@ -236,6 +237,7 @@ static inline void rcu_user_enter_after_irq(void) { } > > static inline void rcu_user_exit_after_irq(void) { } > > static inline void rcu_user_hooks_switch(struct task_struct *prev, > > struct task_struct *next) { } > > +#define rcu_user_notrace > > #endif /* CONFIG_RCU_USER_QS */ > > > > extern void exit_rcu(void); > > diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c > > index 65349f0..d3fc35f 100644 > > --- a/kernel/context_tracking.c > > +++ b/kernel/context_tracking.c > > @@ -83,7 +83,7 @@ void user_enter(void) > > * This call supports re-entrancy. This way it can be called from any > > exception > > * handler without needing to know if we came from userspace or not. > > */ > > -void user_exit(void) > > +void rcu_user_notrace user_exit(void) > > { > > unsigned long flags; > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 2b52431..c970e92 100644 > >
[RFC PATCH] context_tracking/rcu: don't function trace before rcu_user_exit() finishes
I saw some RCU illegal usage from idle complaints when function tracer is enabled with forced context tracking. It seems that __schedule() might be called in function_trace_call() when it re-enables preemption(if preemption and irqs are both enabled). So at the places where we call rcu_user_exit(), we need make sure that no function tracer hooks could be called in preemption/irqs both enabled environment before rcu_user_exit() finishes, or we might cause illegal RCU usage in __schedule(). This patch tries to add notrace attribute to a couple of functions where the above could happen, including user_exit(), and a few callers of user_exit(). Signed-off-by: Li Zhong -- arch/x86/kernel/ptrace.c | 4 ++-- arch/x86/kernel/signal.c | 2 +- include/linux/rcupdate.h | 2 ++ kernel/context_tracking.c | 2 +- kernel/sched/core.c | 2 +- 5 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 29a8120..04659c2 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1488,7 +1488,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, * We must return the syscall number to actually look up in the table. * This can be -1L to skip running any syscall at all. */ -long syscall_trace_enter(struct pt_regs *regs) +long rcu_user_notrace syscall_trace_enter(struct pt_regs *regs) { long ret = 0; @@ -1538,7 +1538,7 @@ out: return ret ?: regs->orig_ax; } -void syscall_trace_leave(struct pt_regs *regs) +void rcu_user_notrace syscall_trace_leave(struct pt_regs *regs) { bool step; diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 6956299..10a9e5a 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -732,7 +732,7 @@ static void do_signal(struct pt_regs *regs) * notification of userspace execution resumption * - triggered by the TIF_WORK_MASK flags */ -void +void rcu_user_notrace do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags) { user_exit(); diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index b758ce1..ecf9872 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -229,6 +229,7 @@ extern void rcu_user_enter(void); extern void rcu_user_exit(void); extern void rcu_user_enter_after_irq(void); extern void rcu_user_exit_after_irq(void); +#define rcu_user_notrace notrace #else static inline void rcu_user_enter(void) { } static inline void rcu_user_exit(void) { } @@ -236,6 +237,7 @@ static inline void rcu_user_enter_after_irq(void) { } static inline void rcu_user_exit_after_irq(void) { } static inline void rcu_user_hooks_switch(struct task_struct *prev, struct task_struct *next) { } +#define rcu_user_notrace #endif /* CONFIG_RCU_USER_QS */ extern void exit_rcu(void); diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 65349f0..d3fc35f 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -83,7 +83,7 @@ void user_enter(void) * This call supports re-entrancy. This way it can be called from any exception * handler without needing to know if we came from userspace or not. */ -void user_exit(void) +void rcu_user_notrace user_exit(void) { unsigned long flags; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2b52431..c970e92 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2973,7 +2973,7 @@ asmlinkage void __sched schedule(void) EXPORT_SYMBOL(schedule); #ifdef CONFIG_CONTEXT_TRACKING -asmlinkage void __sched schedule_user(void) +asmlinkage void __sched rcu_user_notrace schedule_user(void) { /* * If we come here after a random call to set_need_resched(), -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[tip:sched/urgent] cputime: Constify timeval_to_cputime(timeval) argument
Commit-ID: c78a4bcd1a879b39fb7646c887b0c195f1018909 Gitweb: http://git.kernel.org/tip/c78a4bcd1a879b39fb7646c887b0c195f1018909 Author: Li Zhong AuthorDate: Sat, 23 Feb 2013 17:28:44 +0100 Committer: Ingo Molnar CommitDate: Sun, 24 Feb 2013 12:57:15 +0100 cputime: Constify timeval_to_cputime(timeval) argument Saw the following compiler warning on the linux-next tree: kernel/itimer.c: In function 'set_cpu_itimer': kernel/itimer.c:152:2: warning: passing argument 1 of 'timeval_to_cputime' discards 'const' qualifier from pointer target type [enabled by default] ... timeval_to_cputime() is always passed a constant timeval in argument, we need to teach the nsecs based cputime implementation about that. Signed-off-by: Li Zhong Signed-off-by: Frederic Weisbecker Cc: Steven Rostedt Cc: Kevin Hilman Link: http://lkml.kernel.org/r/1361636925-22288-2-git-send-email-fweis...@gmail.com Signed-off-by: Ingo Molnar Cc: Steven Rostedt Cc: Kevin Hilman --- include/asm-generic/cputime_nsecs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h index b6485ca..a8ece9a 100644 --- a/include/asm-generic/cputime_nsecs.h +++ b/include/asm-generic/cputime_nsecs.h @@ -76,7 +76,7 @@ static inline void cputime_to_timespec(const cputime_t ct, struct timespec *val) /* * Convert cputime <-> timeval (msec) */ -static inline cputime_t timeval_to_cputime(struct timeval *val) +static inline cputime_t timeval_to_cputime(const struct timeval *val) { u64 ret = val->tv_sec * NSEC_PER_SEC + val->tv_usec * NSEC_PER_USEC; return (__force cputime_t) ret; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 2/5] powerpc: Exception hooks for context tracking subsystem
On Sun, 2013-02-10 at 15:10 +0100, Frederic Weisbecker wrote: > 2013/2/1 Li Zhong : > > This is the exception hooks for context tracking subsystem, including > > data access, program check, single step, instruction breakpoint, machine > > check, > > alignment, fp unavailable, altivec assist, unknown exception, whose handlers > > might use RCU. > > > > This patch corresponds to > > [PATCH] x86: Exception hooks for userspace RCU extended QS > > commit 6ba3c97a38803883c2eee489505796cb0a727122 > > > > Signed-off-by: Li Zhong > > Looks good! > > I guess we should move exception_enter/exit definition to the generic > code. They should be the same for all archs after all. Indeed. > Also we are > relying on user_mode(regs) but this may be buggy with some corner > cases. For example if an exception happen after a call to user_exit() I guess you mean user_enter() here, or am I confused? > (on syscall exit) but before we actually resume in userspace, the > exception will exit in kernel mode from the context tracking POV. > > So instead on relying on the regs, which are not sync with the context > tracking state, we should use something like: > > prev_state = exception_enter(); > ... > exception_exit(prev_state); > > Also preempt_schedule_irq() is concerned as well by this problem. So I > should convert it to that scheme as well. I'm going to prepare some > patches. > > Feel free to merge this patch in the powerpc tree, I'll do the > conversion along the way. Or if your patches gets merged earlier than these, I can update my code according to yours. Thanks, Zhong > > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem
On Thu, 2013-02-07 at 01:29 +0100, Frederic Weisbecker wrote: > 2013/2/1 Li Zhong : > > This is the syscall slow path hooks for context tracking subsystem, > > corresponding to > > [PATCH] x86: Syscall hooks for userspace RCU extended QS > > commit bf5a3c13b939813d28ce26c01425054c740d6731 > > > > TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there > > is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is > > better for it to be in the same 16 bits with others in the group, so in the > > asm code, andi. with this group could work. > > > > Signed-off-by: Li Zhong > > --- > > arch/powerpc/include/asm/thread_info.h |7 +-- > > arch/powerpc/kernel/ptrace.c |5 + > > 2 files changed, 10 insertions(+), 2 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/thread_info.h > > b/arch/powerpc/include/asm/thread_info.h > > index 406b7b9..414a261 100644 > > --- a/arch/powerpc/include/asm/thread_info.h > > +++ b/arch/powerpc/include/asm/thread_info.h > > @@ -97,7 +97,7 @@ static inline struct thread_info > > *current_thread_info(void) > > #define TIF_PERFMON_CTXSW 6 /* perfmon needs ctxsw calls */ > > #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ > > #define TIF_SINGLESTEP 8 /* singlestepping active */ > > -#define TIF_MEMDIE 9 /* is terminating due to OOM killer > > */ > > +#define TIF_NOHZ 9 /* in adaptive nohz mode */ > > #define TIF_SECCOMP10 /* secure computing */ > > #define TIF_RESTOREALL 11 /* Restore all regs (implies > > NOERROR) */ > > #define TIF_NOERROR12 /* Force successful syscall return > > */ > > @@ -106,6 +106,7 @@ static inline struct thread_info > > *current_thread_info(void) > > #define TIF_SYSCALL_TRACEPOINT 15 /* syscall tracepoint > > instrumentation */ > > #define TIF_EMULATE_STACK_STORE16 /* Is an instruction > > emulation > > for stack store? */ > > +#define TIF_MEMDIE 17 /* is terminating due to OOM killer > > */ > > > > /* as above, but as bit values */ > > #define _TIF_SYSCALL_TRACE (1< > @@ -124,8 +125,10 @@ static inline struct thread_info > > *current_thread_info(void) > > #define _TIF_UPROBE(1< > #define _TIF_SYSCALL_TRACEPOINT(1< > #define _TIF_EMULATE_STACK_STORE (1< > +#define _TIF_NOHZ (1< > #define _TIF_SYSCALL_T_OR_A(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ > > -_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT) > > +_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \ > > +_TIF_NOHZ) > > > > #define _TIF_USER_WORK_MASK(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \ > > _TIF_NOTIFY_RESUME | _TIF_UPROBE) > > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > > index c497000..62238dd 100644 > > --- a/arch/powerpc/kernel/ptrace.c > > +++ b/arch/powerpc/kernel/ptrace.c > > @@ -32,6 +32,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -1745,6 +1746,8 @@ long do_syscall_trace_enter(struct pt_regs *regs) > > { > > long ret = 0; > > > > + user_exit(); > > + > > secure_computing_strict(regs->gpr[0]); > > > > if (test_thread_flag(TIF_SYSCALL_TRACE) && > > @@ -1789,4 +1792,6 @@ void do_syscall_trace_leave(struct pt_regs *regs) > > step = test_thread_flag(TIF_SINGLESTEP); > > if (step || test_thread_flag(TIF_SYSCALL_TRACE)) > > tracehook_report_syscall_exit(regs, step); > > + > > + user_enter(); > > In x86-64, schedule_user() and do_notify_resume() can be called before > syscall_trace_leave(). As a result we may be entering > syscall_trace_leave() in user mode (from a context tracking POV). To > fix this I added a call to user_exit() on the very beginning of that > function. > > You can find the details in 2c5594df344cd1ff0cc9bf007dea3235582b3acf > ("rcu: Fix unrecovered RCU user mode in syscall_trace_leave()"). Hi Frederic, Thank you very much for the reminding. > > Could this problem happen in ppc as well? I checked the code(64 bit) today, it seems to me that it won't happen. But fortunately, we are in the ppc mailing list, please correct me if my understanding is wrong. By the way, I enabled CONTEXT_TRACKING_FORCE and PROVE_RCU, so if it could happen, I think there should be some illegal RCU usage complaints reported. Thanks, Zhong > Thanks. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote: > On 02/04/2013 02:17 AM, Michel Lespinasse wrote: > > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > > at a time. When munlocking THP pages (or the huge zero page), this resulted > > in taking the mm->page_table_lock 512 times in a row. > > > > We can do better by making use of the page_mask returned by follow_page_mask > > (for the huge zero page case), or the size of the page munlock_vma_page() > > operated on (for the true THP page case). > > > > Note - I am sending this as RFC only for now as I can't currently put > > my finger on what if anything prevents split_huge_page() from operating > > concurrently on the same page as munlock_vma_page(), which would mess > > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > > point I missed here ? > > > > Signed-off-by: Michel Lespinasse > > Hi Michel, > > Fuzzing with trinity inside a KVM tools guest produces a steady stream of: > > > [ 51.823275] [ cut here ] > [ 51.823302] kernel BUG at include/linux/page-flags.h:421! > [ 51.823307] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 51.823307] Dumping ftrace buffer: > [ 51.823314](ftrace buffer empty) > [ 51.823314] Modules linked in: > [ 51.823314] CPU 2 > [ 51.823314] Pid: 7116, comm: trinity Tainted: GW > 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 > [ 51.823316] RIP: 0010:[] [] > munlock_vma_page+0x12/0xf0 > [ 51.823317] RSP: 0018:880009641bb8 EFLAGS: 00010282 > [ 51.823319] RAX: 011ffc008001 RBX: ea410040 RCX: > > [ 51.823320] RDX: RSI: RDI: > ea410040 > [ 51.823321] RBP: 880009641bc8 R08: R09: > > [ 51.823322] R10: R11: R12: > 880009633958 > [ 51.823324] R13: 01252000 R14: ea410040 R15: > 00ff > [ 51.823326] FS: 7fe7a9046700() GS:88000ba0() > knlGS: > [ 51.823327] CS: 0010 DS: ES: CR0: 80050033 > [ 51.823328] CR2: 7fc583b90fcb CR3: 09bc8000 CR4: > 000406e0 > [ 51.823334] DR0: DR1: DR2: > > [ 51.823338] DR3: DR6: 0ff0 DR7: > 0400 > [ 51.823340] Process trinity (pid: 7116, threadinfo 88000964, task > 880009638000) > [ 51.823341] Stack: > [ 51.823344] 00a01000 880009633958 880009641c08 > 812429bd > [ 51.823373] 880009638000 01ff09638000 880009ade000 > 880009633958 > [ 51.823373] 880009638810 880009ade098 880009641cb8 > 81246d81 > [ 51.823373] Call Trace: > [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 > [ 51.823373] [] exit_mmap+0x51/0x170 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] mmput+0x77/0xe0 > [ 51.823377] [] exit_mm+0x113/0x120 > [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 > [ 51.823384] [] do_exit+0x24a/0x590 > [ 51.823387] [] do_group_exit+0x8a/0xc0 > [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 > [ 51.823394] [] do_signal+0x42/0x110 > [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 > [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 > [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823412] [] do_notify_resume+0x48/0xa0 > [ 51.823415] [] retint_signal+0x4d/0x92 > [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 > c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 > 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 > 01 48 8b > [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 > [ 51.823450] RSP > [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- > The similar warning prevents my system from booting. And it seems to me that in munlock_vma_pages_range(), the page_mask needs be the page number returned from munlock_vma_page() minus 1. And the following fix solved my problem. Would you please have a try? Thanks, Zhong diff --git a/mm/mlock.c b/mm/mlock.c index af1d115..1e3d794 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -255,7 +255,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, unlock_page(page); put_page(page); } - page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + page_increm = 1 + (~(start >> PAGE_SHIFT) & (page_mask-1)); start += page_increm * PAGE_SIZE; cond_resched(); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message