Re: [PATCH 0/2] Make squashfs fragments' cache size more configurable
On Thu, Oct 19, 2017 at 12:50 AM, Qixuan Wu wrote:
> Hi All,
>
> Currently, squashfs fragments' cache size is only determined by
> config option CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE. Users have
> no way to change the value when they get the binary kernel.

Thank-you for the patches, but they're both pointless and dangerous.

Let's be clear here: you're trying to change an "expert only" kernel
configuration option into a user changeable option. This is stupid because
it is not meant for non-experts to change, for good reason.

The fragment cache size isn't some small tweak to the operation of Squashfs,
it fundamentally affects both the performance and memory overhead of
Squashfs. As such, right from its introduction in 2003, it has been an
"expert only" configuration option at build time. Even then it is made clear
that the default has been carefully chosen, and it should only be changed in
exceptional circumstances.

This basically means don't change the default unless you really know what
you're doing, and this means tracing of Squashfs against your use-case to
determine caching behaviour. There is absolutely no other reason why you'd
want to change the default.

This also means it should be restricted to kernel configuration time only.
Let's be clear again, very few people should ever want to change the default,
and the "experts" that do want to do so can do so when configuring the
kernel. If you're not in a position to change it at kernel configuration time
then by definition you're not an expert, and you shouldn't be able to change
it anyway, and certainly not as a user.

There is absolutely no use-case here to make this a user changeable option.
I can see no upsides in doing this, only downsides.

Frankly, if you need to change this value at module insert time then there is
something wrong with your system or build process.

If you want this because you want to build the kernel/modules once, and then
post-facto configure them for various products, then it is your build process
that is broken.

If you want this because you want to dynamically change Squashfs memory
usage/caching behaviour post kernel configuration time, it suggests you're
trying to adapt Squashfs's footprint based on available memory. This is an
abuse of the option, as it's only meant to be used after detailed
tracing/analysis and certainly not used to accommodate unforeseen dynamic low
memory situations, and if that's the reason for needing this option, you
should be looking to solve it elsewhere.

Ultimately this has been an "expert" kernel configuration only option since
its introduction in 2003, and I have never been asked to change it, and I
believe this is because people recognise it as such. I suspect you're trying
to change this for fundamentally bogus reasons.

Moreover, Squashfs is used in many different use-cases and distributions, and
I'm not going to make this a user-changeable option allowing users to insert
the Squashfs module in such a way that will break its performance.

So NACK.

Phillip Lougher (Squashfs maintainer)

> Now make it be configured when booting or inserting module.
> Actually, it's better that a config option in a number format
> in .config file can be reconfigured during booting or inserting
> module.
>
> Thanks
> Qixuan
>
> Qixuan Wu (2):
>   Squashfs: Let the number of fragments cached configurable
>   Documentation/kernel-parameters.txt: Add kernel parameter of squashfs
>     fragments' cache size
>
>  Documentation/admin-guide/kernel-parameters.txt |  7
>  fs/squashfs/super.c                             | 43
>  2 files changed, 49 insertions(+), 1 deletion(-)
>
> --
> 2.7.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
[...]

> In this regard, as we consider genpd being a trivial PM domain, those
> examples you bring up above are to me also examples of trivial PM
> domains. Especially because they don't deal with wakeups, as that is
> taken care of by the drivers, right!?

Not directly. For example, the omap device framework has a noirq callback
implemented which forcibly disables all devices which are not PM runtime
suspended. While doing this it calls the driver's PM .runtime_suspend(),
which may return a non-zero value, and in this case the device will be left
enabled (powered) at suspend for wake up purposes (see _od_suspend_noirq()).

>>>
>>> Yeah, I had that feeling that omap has some trickiness going on. :-)
>>>
>>> I'm sure that can be fixed in the omap PM domain, although
>>
>> ...slipped with my fingers.. here is the rest of the reply...
>>
>> ..of course that requires us to use another way for drivers to signal
>> to the omap PM domain that it needs to stay powered as to deal with
>> wakeup.
>>
>> I can have a look at that more closely, to see if it makes sense to change.
>>
>
> Also, an additional note here. Some IPs are reused between
> OMAP/Davinci/Keystone. The OMAP PM domain has some code running at noirq
> time to deal with devices left in PM runtime enabled state (OMAP PM runtime
> centric), while Davinci/Keystone haven't (clock_ops.c),
> so the pm_runtime_force_* API is actually a possibility now to make the
> same driver work on all these platforms.

That sounds great!

Also, in the end it would be nice to also convert the OMAP PM domain to
genpd. I think most of the needed infrastructure is already there to do
that.

Kind regards
Uffe
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 20 October 2017 at 03:19, Rafael J. Wysocki wrote: > On Thursday, October 19, 2017 2:21:07 PM CEST Ulf Hansson wrote: >> On 19 October 2017 at 00:12, Rafael J. Wysocki wrote: >> > On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote: >> >> [...] >> >> >> >> >> >> >> >> The reason why pm_runtime_force_* needs to respects the hierarchy of >> >> >> the RPM callbacks, is because otherwise it can't safely update the >> >> >> runtime PM status of the device. >> >> > >> >> > I'm not sure I follow this requirement. Why is that so? >> >> >> >> If the PM domain controls some resources for the device in its RPM >> >> callbacks and the driver controls some other resources in its RPM >> >> callbacks - then these resources needs to be managed together. >> > >> > Right, but that doesn't automatically make it necessary to use runtime PM >> > callbacks in the middle layer. Its system-wide PM callbacks may be >> > suitable for that just fine. >> > >> > That is, at least in some cases, you can combine ->runtime_suspend from a >> > driver and ->suspend_late from a middle layer with no problems, for >> > example. >> > >> > That's why some middle layers allow drivers to point ->suspend_late and >> > ->runtime_suspend to the same routine if they want to reuse that code. >> > >> >> This follows the behavior of when a regular call to >> >> pm_runtime_get|put(), triggers the RPM callbacks to be invoked. >> > >> > But (a) it doesn't have to follow it and (b) in some cases it should not >> > follow it. >> >> Of course you don't explicitly *have to* respect the hierarchy of the >> RPM callbacks in pm_runtime_force_*. >> >> However, changing that would require some additional information >> exchange between the driver and the middle-layer/PM domain, as to >> instruct the middle-layer/PM domain of what to do during system-wide >> PM. Especially in cases when the driver deals with wakeup, as in those >> cases the instructions may change dynamically. 
> > Well, if wakeup matters, drivers can't simply point their PM callbacks > to pm_runtime_force_* anyway. > > If the driver itself deals with wakeups, it clearly needs different callback > routines for system-wide PM and for runtime PM, so it can't reuse its runtime > PM callbacks at all then. It can still re-use its runtime PM callbacks, simply by calling pm_runtime_force_ from its system sleep callbacks. Drivers already do that today, not only to deal with wakeups, but generally when they need to deal with some additional operations. > > If the middle layer deals with wakeups, different callbacks are needed at > that level and so pm_runtime_force_* are unsuitable too. > > Really, invoking runtime PM callbacks from the middle layer in > pm_runtime_force_* is a not a idea at all and there's no general requirement > for it whatever. > >> [...] >> >> >> > In general, not if the wakeup settings are adjusted by the middle layer. >> >> >> >> Correct! >> >> >> >> To use pm_runtime_force* for these cases, one would need some >> >> additional information exchange between the driver and the >> >> middle-layer. >> > >> > Which pretty much defeats the purpose of the wrappers, doesn't it? >> >> Well, no matter if the wrappers are used or not, we need some kind of >> information exchange between the driver and the middle-layers/PM >> domains. > > Right. > > But if that information is exchanged, then why use wrappers *in* *addition* > to that? If we can find a different method that both avoids both open coding and offers the optimize system-wide PM path at resume, I am open to that. > >> Anyway, me personally think it's too early to conclude that using the >> wrappers may not be useful going forward. At this point, they clearly >> helps trivial cases to remain being trivial. > > I'm not sure about that really. So far I've seen more complexity resulting > from using them than being avoided by using them, but I guess the beauty is > in the eye of the beholder. 
:-) Hehe, yeah you may be right. :-) > >> > >> >> > >> >> >> Regarding hibernation, honestly that's not really my area of >> >> >> expertise. Although, I assume the middle-layer and driver can treat >> >> >> that as a separate case, so if it's not suitable to use >> >> >> pm_runtime_force* for that case, then they shouldn't do it. >> >> > >> >> > Well, agreed. >> >> > >> >> > In some simple cases, though, driver callbacks can be reused for >> >> > hibernation >> >> > too, so it would be good to have a common way to do that too, IMO. >> >> >> >> Okay, that makes sense! >> >> >> >> > >> >> >> > >> >> >> > Also, quite so often other middle layers interact with PCI directly >> >> >> > or >> >> >> > indirectly (eg. a platform device may be a child or a consumer of a >> >> >> > PCI >> >> >> > device) and some optimizations need to take that into account (eg. >> >> >> > parents >> >> >> > generally need to be accessible when their childres are resumed and >> >> >> > so on). >> >> >> >> >> >> A device's parent becomes informed
Re: [RFC PATCH] kbuild: Allow specifying some base host CFLAGS
Hi, On Wed, Oct 18, 2017 at 9:45 AM, Masahiro Yamada wrote: > 2017-10-14 3:02 GMT+09:00 Douglas Anderson : >> Right now there is a way to add some CFLAGS that affect target builds, >> but no way to add CFLAGS that affect host builds. Let's add a way. >> We'll document two environment variables: CFLAGS_HOST and >> CXXFLAGS_HOST. >> >> We'll document that these variables get appended to by the kernel to >> make the final CFLAGS. That means that, though the environment can >> specify some flags, if there is a conflict the kernel can override and >> win. This works differently than KCFLAGS which is appended (and thus >> can override) the kernel specified CFLAGS. >> >> Why would I make KCFLAGS and CFLAGS_HOST work differently in this way? >> My argument is that it's about expected usage. Typically the build >> system invoking the kernel has some idea about some basic CFLAGS that >> it wants to use to build things for the host and things for the >> target. In general the build system would expect that its flags can >> be overridden if necessary (perhaps we need to turn off a warning when >> compiling a certain file, for instance). So, all other things being >> equal, the way I'm making CFLAGS_HOST is the way I'd expect things to >> work. >> >> So, if it's expected that the build system can pass in a base set of >> flags, why didn't we make KCFLAGS work that way? The short answer is: >> when building for the target the kernel is just "special". The build >> system's "target" CFLAGS are likely intended for userspace programs >> and likely make very little sense to use as a basis. This was talked >> about in the seminal commit 69ee0b352242 ("kbuild: do not pick up >> CFLAGS from the environment"). Basically: if the build system REALLY >> knows what it's doing then it can pass in flags that the kernel will >> use, but otherwise it should butt out. 
>> Presumably this build system that really knows what it's doing knows
>> better than the kernel so KCFLAGS comes after the kernel's normal flags.
>>
>> One last note: I chose to add new variables rather than just having
>> the build system try to pass HOSTCFLAGS in somehow (either through the
>> environment or the command line) to avoid weird interactions with
>> recursive invocations of make.
>>
>> Signed-off-by: Douglas Anderson
>> ---
>
> I'd like to know for-instance cases where this is useful.

I'm not sure I have any exact use cases. I know vapier@ (CCed) was
pushing for making sure that these flags get passed from the portage
ebuild into the kernel build, so maybe he has some cases?

Right now we have the "-pipe" flag that ought to be passed in to the
host compiler but we're dropping it on the floor, but that doesn't
seem terribly critical.

...but in general the Linux kernel doesn't have all the details about
the host system. That means it can't necessarily build the tools
quite as optimally (it can't pass "-mtune", right?). I could also
imagine that there could be ABI flags that need to be specified? Like
if we had floating point math in a host tool it would be important
that the build system could tell the kernel what to use for
"-mfloat-abi".

...so basically: it's all theoretical at this point in time from my
point of view, but I can definitely understand how it could be
necessary in the right environment.

-Doug
[RFC PATCH 1/5] gpio: gpiolib: Add core support for maintaining GPIO values on reset
GPIO state reset tolerance is implemented in gpiolib through the addition of a new pinconf parameter. With that, some renaming of helpers is done to clarify the scope of the already existing gpiochip_line_is_persistent(), as it's now ambiguous as to whether that means on suspend, reset or both. This in-turn impacts gpio-arizona, but not in any complicated way. This change lays the groundwork for implementing reset tolerance support in all of the external interfaces that can influence GPIOs. Signed-off-by: Andrew Jeffery --- drivers/gpio/gpio-arizona.c | 4 +-- drivers/gpio/gpiolib.c | 55 +++-- drivers/gpio/gpiolib.h | 1 + include/linux/gpio/consumer.h | 9 ++ include/linux/gpio/driver.h | 5 ++- include/linux/gpio/machine.h| 2 ++ include/linux/pinctrl/pinconf-generic.h | 2 ++ 7 files changed, 73 insertions(+), 5 deletions(-) diff --git a/drivers/gpio/gpio-arizona.c b/drivers/gpio/gpio-arizona.c index d4e6ba0301bc..d3fe23569811 100644 --- a/drivers/gpio/gpio-arizona.c +++ b/drivers/gpio/gpio-arizona.c @@ -33,7 +33,7 @@ static int arizona_gpio_direction_in(struct gpio_chip *chip, unsigned offset) { struct arizona_gpio *arizona_gpio = gpiochip_get_data(chip); struct arizona *arizona = arizona_gpio->arizona; - bool persistent = gpiochip_line_is_persistent(chip, offset); + bool persistent = gpiochip_line_is_persistent_suspend(chip, offset); bool change; int ret; @@ -99,7 +99,7 @@ static int arizona_gpio_direction_out(struct gpio_chip *chip, { struct arizona_gpio *arizona_gpio = gpiochip_get_data(chip); struct arizona *arizona = arizona_gpio->arizona; - bool persistent = gpiochip_line_is_persistent(chip, offset); + bool persistent = gpiochip_line_is_persistent_suspend(chip, offset); unsigned int val; int ret; diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index a56b29fd8bb1..d9dc7e588699 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -2414,6 +2414,40 @@ int gpiod_set_debounce(struct gpio_desc *desc, unsigned debounce) 
 EXPORT_SYMBOL_GPL(gpiod_set_debounce);
 
 /**
+ * gpiod_set_reset_tolerant - Hold state across controller reset
+ * @desc: descriptor of the GPIO for which to set reset tolerance
+ * @tolerant: True to hold state across a controller reset, false otherwise
+ *
+ * Returns:
+ * 0 on success, %-ENOTSUPP if the controller doesn't support setting the
+ * reset tolerance or less than zero on other failures.
+ */
+int gpiod_set_reset_tolerant(struct gpio_desc *desc, bool tolerant)
+{
+	struct gpio_chip *chip;
+	unsigned long packed;
+	int rc;
+
+	chip = desc->gdev->chip;
+	if (!chip->set_config)
+		return -ENOTSUPP;
+
+	packed = pinconf_to_config_packed(PIN_CONFIG_RESET_TOLERANT, tolerant);
+
+	rc = chip->set_config(chip, gpio_chip_hwgpio(desc), packed);
+	if (rc < 0)
+		return rc;
+
+	if (tolerant)
+		set_bit(FLAG_RESET_TOLERANT, &desc->flags);
+	else
+		clear_bit(FLAG_RESET_TOLERANT, &desc->flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(gpiod_set_reset_tolerant);
+
+/**
  * gpiod_is_active_low - test whether a GPIO is active-low or not
  * @desc: the gpio descriptor to test
  *
@@ -2885,7 +2919,8 @@ bool gpiochip_line_is_open_source(struct gpio_chip *chip, unsigned int offset)
 }
 EXPORT_SYMBOL_GPL(gpiochip_line_is_open_source);
 
-bool gpiochip_line_is_persistent(struct gpio_chip *chip, unsigned int offset)
+bool gpiochip_line_is_persistent_suspend(struct gpio_chip *chip,
+					 unsigned int offset)
 {
 	if (offset >= chip->ngpio)
 		return false;
@@ -2893,7 +2928,18 @@ bool gpiochip_line_is_persistent(struct gpio_chip *chip, unsigned int offset)
 	return !test_bit(FLAG_SLEEP_MAY_LOSE_VALUE,
 			 &chip->gpiodev->descs[offset].flags);
 }
-EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent);
+EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent_suspend);
+
+bool gpiochip_line_is_persistent_reset(struct gpio_chip *chip,
+				       unsigned int offset)
+{
+	if (offset >= chip->ngpio)
+		return false;
+
+	return test_bit(FLAG_RESET_TOLERANT,
+			&chip->gpiodev->descs[offset].flags);
+}
+EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent_reset); /** * gpiod_get_raw_value_cansleep() - return a gpio's raw value @@ -3271,6 +3317,11 @@ int gpiod_configure_flags(struct gpio_desc *desc, const char *con_id, if (lflags & GPIO_SLEEP_MAY_LOSE_VALUE) set_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &desc->flags); + status = gpiod_set_reset_tolerant(desc, + !!(lflags & GPIO_RESET_TOLERANT)); + if (status < 0) + return status; + /*
[RFC PATCH 2/5] gpio: gpiolib: Add OF support for maintaining GPIO values on reset
Add flags and the associated flag mappings between interfaces to enable GPIO reset tolerance to be specified via devicetree. Signed-off-by: Andrew Jeffery --- drivers/gpio/gpiolib-of.c | 2 ++ drivers/gpio/gpiolib.c | 5 + include/dt-bindings/gpio/gpio.h | 4 include/linux/of_gpio.h | 1 + 4 files changed, 12 insertions(+) diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c index e0d59e61b52f..4a268ba52998 100644 --- a/drivers/gpio/gpiolib-of.c +++ b/drivers/gpio/gpiolib-of.c @@ -155,6 +155,8 @@ struct gpio_desc *of_find_gpio(struct device *dev, const char *con_id, if (of_flags & OF_GPIO_SLEEP_MAY_LOSE_VALUE) *flags |= GPIO_SLEEP_MAY_LOSE_VALUE; + if (of_flags & OF_GPIO_RESET_TOLERANT) + *flags |= GPIO_RESET_TOLERANT; return desc; } diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index d9dc7e588699..6b4c5df10e84 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -3434,6 +3434,7 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode, bool active_low = false; bool single_ended = false; bool open_drain = false; + bool reset_tolerant = false; int ret; if (!fwnode) @@ -3448,6 +3449,7 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode, active_low = flags & OF_GPIO_ACTIVE_LOW; single_ended = flags & OF_GPIO_SINGLE_ENDED; open_drain = flags & OF_GPIO_OPEN_DRAIN; + reset_tolerant = flags & OF_GPIO_RESET_TOLERANT; } } else if (is_acpi_node(fwnode)) { struct acpi_gpio_info info; @@ -3478,6 +3480,9 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode, lflags |= GPIO_OPEN_SOURCE; } + if (reset_tolerant) + lflags |= GPIO_RESET_TOLERANT; + ret = gpiod_configure_flags(desc, propname, lflags, dflags); if (ret < 0) { gpiod_put(desc); diff --git a/include/dt-bindings/gpio/gpio.h b/include/dt-bindings/gpio/gpio.h index 70de5b7a6c9b..01c75d9e308e 100644 --- a/include/dt-bindings/gpio/gpio.h +++ b/include/dt-bindings/gpio/gpio.h @@ -32,4 +32,8 @@ #define GPIO_SLEEP_MAINTAIN_VALUE 0 
 #define GPIO_SLEEP_MAY_LOSE_VALUE 8
 
+/* Bit 4 expresses GPIO persistence on reset */
+#define GPIO_RESET_INTOLERANT 0
+#define GPIO_RESET_TOLERANT 16
+
 #endif
diff --git a/include/linux/of_gpio.h b/include/linux/of_gpio.h
index 1fe205582111..9b34737706a7 100644
--- a/include/linux/of_gpio.h
+++ b/include/linux/of_gpio.h
@@ -32,6 +32,7 @@ enum of_gpio_flags {
 	OF_GPIO_SINGLE_ENDED = 0x2,
 	OF_GPIO_OPEN_DRAIN = 0x4,
 	OF_GPIO_SLEEP_MAY_LOSE_VALUE = 0x8,
+	OF_GPIO_RESET_TOLERANT = 0x10,
 };
 
 #ifdef CONFIG_OF_GPIO
-- 
2.11.0
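
The devicetree binding above writes the reset flag in decimal (16, i.e. bit 4)
while enum of_gpio_flags uses hexadecimal, and the two notations are easy to
mix up: bit 4 is 0x10, whereas 0x16 is decimal 22 and would also set the
single-ended and open-drain bits. A minimal user-space sketch of that check
follows. The demo_* names are hypothetical and only mirror the values shown in
the patch; this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space mirror of existing flag values from the patch. */
#define DEMO_OF_SINGLE_ENDED 0x2UL
#define DEMO_OF_OPEN_DRAIN   0x4UL

/* Build the mask for a single flag bit. */
static unsigned long demo_flag_for_bit(unsigned int bit)
{
	assert(bit < 32); /* keep the shift well-defined on 32-bit longs */
	return 1UL << bit;
}

/* True when exactly one bit is set, so the value works as a single flag. */
static bool demo_is_single_bit(unsigned long v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* True when a candidate flag value shares bits with already-used flags. */
static bool demo_overlaps(unsigned long candidate, unsigned long used)
{
	return (candidate & used) != 0;
}
```

With these helpers, demo_flag_for_bit(4) yields 16 (0x10), which passes
demo_is_single_bit(), while 0x16 fails it and overlaps with
DEMO_OF_SINGLE_ENDED | DEMO_OF_OPEN_DRAIN.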
[RFC PATCH 3/5] gpio: gpiolib: Add chardev support for maintaining GPIO values on reset
Similar to devicetree support, add flags and mappings to expose reset tolerance configuration through the chardev interface. Signed-off-by: Andrew Jeffery --- drivers/gpio/gpiolib.c| 14 +- include/uapi/linux/gpio.h | 11 ++- 2 files changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index 6b4c5df10e84..442ee5ceee08 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -357,7 +357,8 @@ struct linehandle_state { GPIOHANDLE_REQUEST_OUTPUT | \ GPIOHANDLE_REQUEST_ACTIVE_LOW | \ GPIOHANDLE_REQUEST_OPEN_DRAIN | \ - GPIOHANDLE_REQUEST_OPEN_SOURCE) + GPIOHANDLE_REQUEST_OPEN_SOURCE | \ + GPIOHANDLE_REQUEST_RESET_TOLERANT) static long linehandle_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) @@ -498,6 +499,17 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip) set_bit(FLAG_OPEN_SOURCE, &desc->flags); /* +* Unconditionally configure reset tolerance, as it's possible +* that the tolerance flag itself becomes tolerant to resets. +* Thus it could remain set from a previous environment, but +* the current environment may not expect it so. +*/ + ret = gpiod_set_reset_tolerant(desc, + !!(lflags & GPIOHANDLE_REQUEST_RESET_TOLERANT)); + if (ret < 0) + goto out_free_descs; + + /* * Lines have to be requested explicitly for input * or output, else the line will be treated "as is". 
 	 */
diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
index 333d3544c964..1b1ce1af8653 100644
--- a/include/uapi/linux/gpio.h
+++ b/include/uapi/linux/gpio.h
@@ -56,11 +56,12 @@ struct gpioline_info {
 #define GPIOHANDLES_MAX 64
 
 /* Linerequest flags */
-#define GPIOHANDLE_REQUEST_INPUT	(1UL << 0)
-#define GPIOHANDLE_REQUEST_OUTPUT	(1UL << 1)
-#define GPIOHANDLE_REQUEST_ACTIVE_LOW	(1UL << 2)
-#define GPIOHANDLE_REQUEST_OPEN_DRAIN	(1UL << 3)
-#define GPIOHANDLE_REQUEST_OPEN_SOURCE	(1UL << 4)
+#define GPIOHANDLE_REQUEST_INPUT		(1UL << 0)
+#define GPIOHANDLE_REQUEST_OUTPUT		(1UL << 1)
+#define GPIOHANDLE_REQUEST_ACTIVE_LOW		(1UL << 2)
+#define GPIOHANDLE_REQUEST_OPEN_DRAIN		(1UL << 3)
+#define GPIOHANDLE_REQUEST_OPEN_SOURCE		(1UL << 4)
+#define GPIOHANDLE_REQUEST_RESET_TOLERANT	(1UL << 5)
 
 /**
  * struct gpiohandle_request - Information about a GPIO handle request
-- 
2.11.0
[RFC PATCH 4/5] gpio: gpiolib: Add sysfs support for maintaining GPIO values on reset
Expose a new 'maintain' sysfs attribute to control both suspend and reset tolerance. Signed-off-by: Andrew Jeffery --- Documentation/gpio/sysfs.txt | 9 + drivers/gpio/gpiolib-sysfs.c | 88 ++-- 2 files changed, 93 insertions(+), 4 deletions(-) diff --git a/Documentation/gpio/sysfs.txt b/Documentation/gpio/sysfs.txt index aeab01aa4d00..f447f0746884 100644 --- a/Documentation/gpio/sysfs.txt +++ b/Documentation/gpio/sysfs.txt @@ -96,6 +96,15 @@ and have the following read/write attributes: for "rising" and "falling" edges will follow this setting. + "maintain" ... displays and controls whether the state of the GPIO is + maintained or lost on suspend or reset. The valid values take + the following meanings: + + 0: Do not maintain state on either suspend or reset + 1: Maintain state for suspend only + 2: Maintain state for reset only + 3: Maintain state for both suspend and reset + GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the controller implementing GPIOs starting at #42) and have the following read-only attributes: diff --git a/drivers/gpio/gpiolib-sysfs.c b/drivers/gpio/gpiolib-sysfs.c index 3f454eaf2101..bfa186e73e26 100644 --- a/drivers/gpio/gpiolib-sysfs.c +++ b/drivers/gpio/gpiolib-sysfs.c @@ -289,6 +289,74 @@ static ssize_t edge_store(struct device *dev, } static DEVICE_ATTR_RW(edge); +#define GPIOLIB_SYSFS_MAINTAIN_SUSPEND BIT(0) +#define GPIOLIB_SYSFS_MAINTAIN_RESET BIT(1) +#define GPIOLIB_SYSFS_MAINTAIN_ALL GENMASK(1, 0) +static ssize_t maintain_show(struct device *dev, struct device_attribute *attr, +char *buf) +{ + struct gpiod_data *data = dev_get_drvdata(dev); + ssize_t status = 0; + int val = 0; + + mutex_lock(&data->mutex); + + if (!test_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &data->desc->flags)) + val |= GPIOLIB_SYSFS_MAINTAIN_SUSPEND; + + if (test_bit(FLAG_RESET_TOLERANT, &data->desc->flags)) + val |= GPIOLIB_SYSFS_MAINTAIN_RESET; + + status = sprintf(buf, "%d\n", val); + + mutex_unlock(&data->mutex); + + return status; +} + +static 
ssize_t maintain_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf,
+			      size_t size)
+{
+	struct gpiod_data *data = dev_get_drvdata(dev);
+	struct gpio_chip *chip;
+	ssize_t status;
+	long provided;
+
+	chip = data->desc->gdev->chip;
+
+	if (!chip->set_config)
+		return -ENOTSUPP;
+
+	mutex_lock(&data->mutex);
+
+	status = kstrtol(buf, 0, &provided);
+	if (status < 0)
+		goto out;
+
+	if (provided & ~GPIOLIB_SYSFS_MAINTAIN_ALL) {
+		status = -EINVAL;
+		goto out;
+	}
+
+	if (!(provided & GPIOLIB_SYSFS_MAINTAIN_SUSPEND))
+		set_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &data->desc->flags);
+	else
+		clear_bit(FLAG_SLEEP_MAY_LOSE_VALUE,
+			  &data->desc->flags);
+
+	/* Configure reset tolerance */
+	status = gpiod_set_reset_tolerant(data->desc,
+			!!(provided & GPIOLIB_SYSFS_MAINTAIN_RESET));
+out:
+	mutex_unlock(&data->mutex);
+
+	return status ? : size;
+
+}
+static DEVICE_ATTR_RW(maintain);
+
 /* Caller holds gpiod-data mutex. */
 static int gpio_sysfs_set_active_low(struct device *dev, int value)
 {
@@ -378,6 +446,7 @@ static struct attribute *gpio_attrs[] = {
 	&dev_attr_edge.attr,
 	&dev_attr_value.attr,
 	&dev_attr_active_low.attr,
+	&dev_attr_maintain.attr,
 	NULL,
 };
 
@@ -474,11 +543,22 @@ static ssize_t export_store(struct class *class,
 		status = -ENODEV;
 		goto done;
 	}
-	status = gpiod_export(desc, true);
-	if (status < 0)
+
+	/*
+	 * If userspace is requesting the GPIO via sysfs, make them explicitly
+	 * configure reset tolerance each time by unconditionally disabling it
+	 * here, as the export and configuration steps are not atomic.
+	 */
+	status = gpiod_set_reset_tolerant(desc, false);
+	if (status < 0) {
 		gpiod_free(desc);
-	else
-		set_bit(FLAG_SYSFS, &desc->flags);
+	} else {
+		status = gpiod_export(desc, true);
+		if (status < 0)
+			gpiod_free(desc);
+		else
+			set_bit(FLAG_SYSFS, &desc->flags);
+	}
 
 done:
 	if (status)
-- 
2.11.0
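
The "maintain" attribute documented in this patch is a two-bit mask: bit 0
selects suspend persistence, bit 1 selects reset persistence, and anything
outside 0..3 is rejected. As a rough user-space sketch of that encoding
(the demo_* names and the struct are hypothetical, chosen only to mirror
FLAG_SLEEP_MAY_LOSE_VALUE and FLAG_RESET_TOLERANT; this is not the sysfs code
itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAINTAIN_SUSPEND (1L << 0)
#define DEMO_MAINTAIN_RESET   (1L << 1)
#define DEMO_MAINTAIN_ALL     (DEMO_MAINTAIN_SUSPEND | DEMO_MAINTAIN_RESET)

/* Hypothetical stand-in for the two descriptor flags touched by the patch. */
struct demo_maintain_state {
	bool sleep_may_lose_value; /* mirrors FLAG_SLEEP_MAY_LOSE_VALUE */
	bool reset_tolerant;       /* mirrors FLAG_RESET_TOLERANT */
};

/* Decode a written value; returns 0 on success, -1 for bits outside 0..3. */
static int demo_maintain_decode(long provided, struct demo_maintain_state *st)
{
	assert(st != NULL);

	if (provided & ~DEMO_MAINTAIN_ALL)
		return -1;

	/* Suspend persistence is stored inverted, as in the patch. */
	st->sleep_may_lose_value = !(provided & DEMO_MAINTAIN_SUSPEND);
	st->reset_tolerant = !!(provided & DEMO_MAINTAIN_RESET);
	return 0;
}

/* Encode the state back into the value that a read would report. */
static int demo_maintain_encode(const struct demo_maintain_state *st)
{
	int val = 0;

	assert(st != NULL);

	if (!st->sleep_may_lose_value)
		val |= DEMO_MAINTAIN_SUSPEND;
	if (st->reset_tolerant)
		val |= DEMO_MAINTAIN_RESET;
	return val;
}
```

Decoding 2 and encoding the result gives 2 again ("maintain state for reset
only"), while 4 is rejected, matching the -EINVAL path in maintain_store().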
[RFC PATCH 5/5] gpio: aspeed: Add support for reset tolerance
Use the new pinconf parameter for reset tolerance to expose the associated capability of the Aspeed GPIO controller. Signed-off-by: Andrew Jeffery --- drivers/gpio/gpio-aspeed.c | 39 +-- 1 file changed, 37 insertions(+), 2 deletions(-) diff --git a/drivers/gpio/gpio-aspeed.c b/drivers/gpio/gpio-aspeed.c index bfc53995064a..0492cd917178 100644 --- a/drivers/gpio/gpio-aspeed.c +++ b/drivers/gpio/gpio-aspeed.c @@ -60,6 +60,7 @@ struct aspeed_gpio_bank { uint16_tval_regs; uint16_tirq_regs; uint16_tdebounce_regs; + uint16_ttolerance_regs; const char names[4][3]; }; @@ -70,48 +71,56 @@ static const struct aspeed_gpio_bank aspeed_gpio_banks[] = { .val_regs = 0x, .irq_regs = 0x0008, .debounce_regs = 0x0040, + .tolerance_regs = 0x001c, .names = { "A", "B", "C", "D" }, }, { .val_regs = 0x0020, .irq_regs = 0x0028, .debounce_regs = 0x0048, + .tolerance_regs = 0x003c, .names = { "E", "F", "G", "H" }, }, { .val_regs = 0x0070, .irq_regs = 0x0098, .debounce_regs = 0x00b0, + .tolerance_regs = 0x00ac, .names = { "I", "J", "K", "L" }, }, { .val_regs = 0x0078, .irq_regs = 0x00e8, .debounce_regs = 0x0100, + .tolerance_regs = 0x00fc, .names = { "M", "N", "O", "P" }, }, { .val_regs = 0x0080, .irq_regs = 0x0118, .debounce_regs = 0x0130, + .tolerance_regs = 0x012c, .names = { "Q", "R", "S", "T" }, }, { .val_regs = 0x0088, .irq_regs = 0x0148, .debounce_regs = 0x0160, + .tolerance_regs = 0x015c, .names = { "U", "V", "W", "X" }, }, { .val_regs = 0x01E0, .irq_regs = 0x0178, .debounce_regs = 0x0190, + .tolerance_regs = 0x018c, .names = { "Y", "Z", "AA", "AB" }, }, { - .val_regs = 0x01E8, - .irq_regs = 0x01A8, + .val_regs = 0x01e8, + .irq_regs = 0x01a8, .debounce_regs = 0x01c0, + .tolerance_regs = 0x01bc, .names = { "AC", "", "", "" }, }, }; @@ -531,6 +540,30 @@ static int aspeed_gpio_setup_irqs(struct aspeed_gpio *gpio, return 0; } +static int aspeed_gpio_reset_tolerance(struct gpio_chip *chip, + unsigned int offset, bool enable) +{ + struct aspeed_gpio *gpio = gpiochip_get_data(chip); + const 
struct aspeed_gpio_bank *bank;
+	unsigned long flags;
+	u32 val;
+
+	bank = to_bank(offset);
+
+	spin_lock_irqsave(&gpio->lock, flags);
+	val = readl(gpio->base + bank->tolerance_regs);
+
+	if (enable)
+		val |= GPIO_BIT(offset);
+	else
+		val &= ~GPIO_BIT(offset);
+
+	writel(val, gpio->base + bank->tolerance_regs);
+	spin_unlock_irqrestore(&gpio->lock, flags);
+
+	return 0;
+}
+
 static int aspeed_gpio_request(struct gpio_chip *chip, unsigned int offset)
 {
 	if (!have_gpio(gpiochip_get_data(chip), offset))
@@ -768,6 +801,8 @@ static int aspeed_gpio_set_config(struct gpio_chip *chip, unsigned int offset,
 		 param == PIN_CONFIG_DRIVE_OPEN_SOURCE)
 		/* Return -ENOTSUPP to trigger emulation, as per datasheet */
 		return -ENOTSUPP;
+	else if (param == PIN_CONFIG_RESET_TOLERANT)
+		return aspeed_gpio_reset_tolerance(chip, offset, arg);
 
 	return -ENOTSUPP;
 }
-- 
2.11.0
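
aspeed_gpio_reset_tolerance() above is a read-modify-write of one bit in a
per-bank register. The sketch below simulates that pattern in user space with
a plain array instead of MMIO. It assumes that to_bank() selects one bank per
32 GPIOs and that GPIO_BIT() is the bit at offset modulo 32; neither helper is
shown in this patch, so treat both as assumptions. The demo_* names are
hypothetical and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_BANKS     8
#define DEMO_GPIO_BANK(x) ((x) >> 5)                     /* assumed: 32 GPIOs per bank */
#define DEMO_GPIO_BIT(x)  (UINT32_C(1) << ((x) & 0x1f))  /* assumed: bit within bank */

/* Simulated tolerance registers, one 32-bit word per bank. */
static uint32_t demo_tolerance_regs[DEMO_NR_BANKS];

/* Set or clear the reset-tolerance bit; returns -1 for an out-of-range GPIO. */
static int demo_set_reset_tolerance(unsigned int offset, bool enable)
{
	unsigned int bank = DEMO_GPIO_BANK(offset);
	uint32_t val;

	if (bank >= DEMO_NR_BANKS)
		return -1;

	val = demo_tolerance_regs[bank];   /* readl() in the driver */

	if (enable)
		val |= DEMO_GPIO_BIT(offset);
	else
		val &= ~DEMO_GPIO_BIT(offset);

	demo_tolerance_regs[bank] = val;   /* writel() in the driver */
	return 0;
}

/* Query the simulated bit; the caller must pass a valid offset. */
static bool demo_is_reset_tolerant(unsigned int offset)
{
	assert(DEMO_GPIO_BANK(offset) < DEMO_NR_BANKS);
	return (demo_tolerance_regs[DEMO_GPIO_BANK(offset)] &
		DEMO_GPIO_BIT(offset)) != 0;
}
```

Because only the selected bit changes, enabling tolerance for one line leaves
the other lines in the same bank untouched, which is why the driver holds its
spinlock across the read and the write.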
[RFC PATCH 0/5] gpio: Expose reset tolerance capability
Hello,

This series exposes a "reset tolerant" property for GPIOs. For example, the
controller implemented in Aspeed BMCs provides such a feature to allow the BMC
to be reset whilst maintaining necessary state to keep host systems alive or
status LEDs intact.

I'm sending it as an RFC because I'm not sure using pinconf is the right way
to go about it, or that expanding the sysfs interface is a good idea, or that
the approach taken is right in the context of the existing suspend support.
pinconf just ended up being a convenient abstraction whilst supporting the
sysfs change, and didn't feel unreasonable to use for devicetree or the
chardev interface either. My concern with using pinconf is that the
reset-tolerant property is (currently) GPIO-centric, but maybe that's not a
worry.

So the patches in the series support configuring the property via devicetree,
the chardev interface and the sysfs interface. The sysfs interface also
exposes the ability to configure the suspend tolerance, though there are some
ordering requirements with respect to setting the direction (the suspend
tolerance will only take if configured before setting the pin direction on
the Arizona controller).

Please review!

Cheers,

Andrew

Andrew Jeffery (5):
  gpio: gpiolib: Add core support for maintaining GPIO values on reset
  gpio: gpiolib: Add OF support for maintaining GPIO values on reset
  gpio: gpiolib: Add chardev support for maintaining GPIO values on reset
  gpio: gpiolib: Add sysfs support for maintaining GPIO values on reset
  gpio: aspeed: Add support for reset tolerance

 Documentation/gpio/sysfs.txt            |  9
 drivers/gpio/gpio-arizona.c             |  4
 drivers/gpio/gpio-aspeed.c              | 39
 drivers/gpio/gpiolib-of.c               |  2
 drivers/gpio/gpiolib-sysfs.c            | 88
 drivers/gpio/gpiolib.c                  | 74
 drivers/gpio/gpiolib.h                  |  1
 include/dt-bindings/gpio/gpio.h         |  4
 include/linux/gpio/consumer.h           |  9
 include/linux/gpio/driver.h             |  5
 include/linux/gpio/machine.h            |  2
 include/linux/of_gpio.h                 |  1
 include/linux/pinctrl/pinconf-generic.h |  2
 include/uapi/linux/gpio.h               | 11
 14 files changed, 234 insertions(+), 17 deletions(-)

-- 
2.11.0
Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver
Hi Mark/Will,

Thanks.

On 2017/10/19 23:32, Mark Rutland wrote:
> On Thu, Oct 19, 2017 at 04:28:35PM +0100, Will Deacon wrote:
>> On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
>>> Will, are you happy to queue this?
>>>
>>> There's a minor fixup [1] needed in patch 2, but otherwise this looks
>>> good to me, and builds cleanly.
>>>
>>> I've pushed out a branch [2] with that fix folded in, in case that's
>>> easier for you. Otherwise, feel free to pick these up with my Ack.
>>
>> I'm just running some build tests on these. I also tweaked your fix slightly
>> -- can you check the diff below please?
>
> That's nicer!
>
> My ack stands with that folded in.
>
> Mark.
>
>> diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c
>> b/drivers/perf/hisilicon/hisi_uncore_pmu.c
>> index 2bff43f0736b..c74542af4acf 100644
>> --- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
>> +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
>> @@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event
>> *event)
>>  	/* Include count for the event */
>>  	int counters = 1;
>>
>> -	/*
>> -	 * We must NOT create groups containing mixed PMUs, although
>> -	 * software events are acceptable
>> -	 */
>> -	if (leader->pmu != event->pmu && !is_software_event(leader))
>> -		return false;
>> +	if (!is_software_event(leader)) {
>> +		/*
>> +		 * We must NOT create groups containing mixed PMUs, although
>> +		 * software events are acceptable
>> +		 */
>> +		if (leader->pmu != event->pmu)
>> +			return false;
>>
>> -	/* Increment counter for the leader */
>> -	counters++;
>> +		/* Increment counter for the leader */
>> +		if (leader != event)
>> +			counters++;
>> +	}
>>
>>  	list_for_each_entry(sibling, &event->group_leader->sibling_list,
>>  			    group_entry) {
>
> .
>
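The effect of the tweaked hunk is easier to see outside the kernel: a software leader is no longer counted, and a leader that is the event itself is not counted twice. The following is a rough userspace sketch only. struct mock_event and validate_event_group() are hypothetical stand-ins for struct perf_event and hisi_validate_event_group(), and the sibling loop and the final comparison against the number of hardware counters are assumptions, since the quoted diff cuts off before them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for struct perf_event. */
struct mock_event {
	int pmu_id;			/* which PMU the event belongs to */
	bool software;			/* true for software events */
	struct mock_event *leader;	/* group leader, may be the event itself */
	struct mock_event **siblings;	/* the leader's siblings, if any */
	size_t nr_siblings;
};

static bool validate_event_group(const struct mock_event *event,
				 int num_counters)
{
	const struct mock_event *leader;
	int counters = 1;	/* include count for the event itself */
	size_t i;

	assert(event != NULL && event->leader != NULL);
	leader = event->leader;

	if (!leader->software) {
		/* No mixed-PMU groups; software leaders are acceptable. */
		if (leader->pmu_id != event->pmu_id)
			return false;

		/* Count the leader only if it is a different event. */
		if (leader != event)
			counters++;
	}

	/* Assumed continuation: hardware siblings each need a counter. */
	for (i = 0; i < leader->nr_siblings; i++) {
		const struct mock_event *sibling = leader->siblings[i];

		if (sibling->software)
			continue;
		if (sibling->pmu_id != event->pmu_id)
			return false;
		counters++;
	}

	return counters <= num_counters;
}
```

With the pre-tweak logic, a lone event acting as its own leader would have been charged two counters, and a hardware event under a software leader would have been charged for the software leader as well.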
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On Thursday, October 19, 2017 2:21:07 PM CEST Ulf Hansson wrote: > On 19 October 2017 at 00:12, Rafael J. Wysocki wrote: > > On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote: > >> [...] > >> > >> >> > >> >> The reason why pm_runtime_force_* needs to respects the hierarchy of > >> >> the RPM callbacks, is because otherwise it can't safely update the > >> >> runtime PM status of the device. > >> > > >> > I'm not sure I follow this requirement. Why is that so? > >> > >> If the PM domain controls some resources for the device in its RPM > >> callbacks and the driver controls some other resources in its RPM > >> callbacks - then these resources needs to be managed together. > > > > Right, but that doesn't automatically make it necessary to use runtime PM > > callbacks in the middle layer. Its system-wide PM callbacks may be > > suitable for that just fine. > > > > That is, at least in some cases, you can combine ->runtime_suspend from a > > driver and ->suspend_late from a middle layer with no problems, for example. > > > > That's why some middle layers allow drivers to point ->suspend_late and > > ->runtime_suspend to the same routine if they want to reuse that code. > > > >> This follows the behavior of when a regular call to > >> pm_runtime_get|put(), triggers the RPM callbacks to be invoked. > > > > But (a) it doesn't have to follow it and (b) in some cases it should not > > follow it. > > Of course you don't explicitly *have to* respect the hierarchy of the > RPM callbacks in pm_runtime_force_*. > > However, changing that would require some additional information > exchange between the driver and the middle-layer/PM domain, as to > instruct the middle-layer/PM domain of what to do during system-wide > PM. Especially in cases when the driver deals with wakeup, as in those > cases the instructions may change dynamically. Well, if wakeup matters, drivers can't simply point their PM callbacks to pm_runtime_force_* anyway. 
If the driver itself deals with wakeups, it clearly needs different callback
routines for system-wide PM and for runtime PM, so it can't reuse its runtime
PM callbacks at all then.

If the middle layer deals with wakeups, different callbacks are needed at
that level and so pm_runtime_force_* are unsuitable too.

Really, invoking runtime PM callbacks from the middle layer in
pm_runtime_force_* is not a good idea at all and there's no general
requirement for it whatever.

> [...]
>
> >> > In general, not if the wakeup settings are adjusted by the middle layer.
> >>
> >> Correct!
> >>
> >> To use pm_runtime_force* for these cases, one would need some
> >> additional information exchange between the driver and the
> >> middle-layer.
> >
> > Which pretty much defeats the purpose of the wrappers, doesn't it?
>
> Well, no matter if the wrappers are used or not, we need some kind of
> information exchange between the driver and the middle-layers/PM
> domains.

Right. But if that information is exchanged, then why use wrappers
*in* *addition* to that?

> Anyway, me personally think it's too early to conclude that using the
> wrappers may not be useful going forward. At this point, they clearly
> helps trivial cases to remain being trivial.

I'm not sure about that really. So far I've seen more complexity resulting
from using them than being avoided by using them, but I guess the beauty is
in the eye of the beholder. :-)

> >
> >> >
> >> >> Regarding hibernation, honestly that's not really my area of
> >> >> expertise. Although, I assume the middle-layer and driver can treat
> >> >> that as a separate case, so if it's not suitable to use
> >> >> pm_runtime_force* for that case, then they shouldn't do it.
> >> >
> >> > Well, agreed.
> >> >
> >> > In some simple cases, though, driver callbacks can be reused for
> >> > hibernation
> >> > too, so it would be good to have a common way to do that too, IMO.
> >>
> >> Okay, that makes sense!
> >> > >> > > >> >> > > >> >> > Also, quite so often other middle layers interact with PCI directly or > >> >> > indirectly (eg. a platform device may be a child or a consumer of a > >> >> > PCI > >> >> > device) and some optimizations need to take that into account (eg. > >> >> > parents > >> >> > generally need to be accessible when their childres are resumed and > >> >> > so on). > >> >> > >> >> A device's parent becomes informed when changing the runtime PM status > >> >> of the device via pm_runtime_force_suspend|resume(), as those calls > >> >> pm_runtime_set_suspended|active(). > >> > > >> > This requires the parent driver or middle layer to look at the reference > >> > counter and understand it the same way as pm_runtime_force_*. > >> > > >> >> In case that isn't that sufficient, what else is needed? Perhaps you can > >> >> point me to an example so I can understand better? > >> > > >> > Say you want to leave the parent suspended after system resume, but the > >> > child drivers use pm_runtime_force_suspend|resume(). The parent would >
Re: [PATCH 1/3] printk: Introduce per-console loglevel setting
On 09/28/2017 05:43 PM, Calvin Owens wrote: Not all consoles are created equal: depending on the actual hardware, the latency of a printk() call can vary dramatically. The worst examples are serial consoles, where it can spin for tens of milliseconds banging the UART to emit a message, which can cause application-level problems when the kernel spews onto the console. Any thoughts on this series? Happy to resend again, but if there are no objections I'd love to see it merged sooner rather than later :) Happy to resend too, just let me know. Thanks, Calvin At Facebook we use netconsole to monitor our fleet, but we still have serial consoles attached on each host for live debugging, and the latter has caused problems. An obvious solution is to disable the kernel console output to ttyS0, but this makes live debugging frustrating, since crashes become silent and opaque to the ttyS0 user. Enabling it on the fly when needed isn't feasible, since boxes you need to debug via serial are likely to be borked in ways that make this impossible. That puts us between a rock and a hard place: we'd love to set kernel.printk to KERN_INFO and get all the logs. But while netconsole is fast enough to permit that without perturbing userspace, ttyS0 is not, and we're forced to limit console logging to KERN_WARNING and higher. This patch introduces a new per-console loglevel setting, and changes console_unlock() to use max(global_level, per_console_level) when deciding whether or not to emit a given log message. This lets us have our cake and eat it too: instead of being forced to limit all consoles verbosity based on the speed of the slowest one, we can "promote" the faster console while still using a conservative system loglevel setting to avoid disturbing applications. 
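The max(global_level, per_console_level) rule described above can be tried out in isolation. The following is only a userspace sketch of the two helpers this patch adds: struct mock_console and the file-scope variables are hypothetical stand-ins for the kernel's console state, and the starting value of 5 for console_loglevel is chosen for illustration so that KERN_WARNING (4) and more severe messages print by default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric value as the kernel's LOGLEVEL_EMERG. */
#define LOGLEVEL_EMERG 0

/* Hypothetical stand-ins for kernel globals, not the real printk state. */
static int console_loglevel = 5;
static bool ignore_loglevel;

/* Hypothetical stand-in for struct console with the new field. */
struct mock_console {
	int level;
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* A console can raise its own verbosity but never drop below the global. */
static int effective_loglevel(const struct mock_console *con)
{
	return max_int(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
}

/* Lower numbers are more severe; a message prints if level < effective. */
static bool suppress_message_printing(int level, const struct mock_console *con)
{
	assert(level >= LOGLEVEL_EMERG);
	return level >= effective_loglevel(con) && !ignore_loglevel;
}
```

With a serial console left at level 0 and a netconsole set to 7, a KERN_INFO (6) message is suppressed on the serial console but emitted on netconsole, while a KERN_WARNING (4) message reaches both.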
Cc: Petr Mladek Cc: Steven Rostedt Cc: Sergey Senozhatsky Signed-off-by: Calvin Owens --- (V1: https://lkml.org/lkml/2017/4/4/783) Changes in V2: * Honor the ignore_loglevel setting in all cases * Change semantics to use max(global, console) as the loglevel for a console, instead of the previous patch where we treated the per-console one as a filter downstream of the global one. include/linux/console.h | 1 + kernel/printk/printk.c | 38 +++--- 2 files changed, 20 insertions(+), 19 deletions(-) diff --git a/include/linux/console.h b/include/linux/console.h index b8920a0..a5b5d79 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -147,6 +147,7 @@ struct console { int cflag; void*data; struct console *next; + int level; }; /* diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 512f7c2..3f1675e 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting (prints all kernel messages to the console)"); -static bool suppress_message_printing(int level) +static int effective_loglevel(struct console *con) { - return (level >= console_loglevel && !ignore_loglevel); + return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG); +} + +static bool suppress_message_printing(int level, struct console *con) +{ + return (level >= effective_loglevel(con) && !ignore_loglevel); } #ifdef CONFIG_BOOT_PRINTK_DELAY @@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level) unsigned long timeout; if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING) - || suppress_message_printing(level)) { + || suppress_message_printing(level, NULL)) { return; } @@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len) * The console_lock must be held. 
*/ static void call_console_drivers(const char *ext_text, size_t ext_len, -const char *text, size_t len) +const char *text, size_t len, int level) { struct console *con; @@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len, if (!cpu_online(smp_processor_id()) && !(con->flags & CON_ANYTIME)) continue; + if (suppress_message_printing(level, con)) + continue; if (con->flags & CON_EXTENDED) con->write(con, ext_text, ext_len); else @@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size, char *dict, size_t dict_len, char *text, size_t text_len) { return 0; } static void call_console_drivers(const char *ext_text, size_t ext_len, -const char *text, size_t len) {} +
[PATCH doc/rcu 2/2] doc: Fix various RCU docbook comment-header problems
Because many of RCU's files have not been included into docbook, a number of errors have accumulated. This commit fixes them. Signed-off-by: Paul E. McKenney --- include/linux/rculist.h | 2 +- include/linux/rcupdate.h | 22 ++ include/linux/srcu.h | 1 + kernel/rcu/srcutree.c| 2 +- kernel/rcu/sync.c| 9 ++--- kernel/rcu/tree.c| 18 ++ 6 files changed, 33 insertions(+), 21 deletions(-) diff --git a/include/linux/rculist.h b/include/linux/rculist.h index b1fd8bf85fdc..2bea1d5e9930 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -276,7 +276,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, #define list_entry_rcu(ptr, type, member) \ container_of(lockless_dereference(ptr), type, member) -/** +/* * Where are list_empty_rcu() and list_first_entry_rcu()? * * Implementing those functions following their counterparts list_empty() and diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de50d8a4cf41..1a9f70d44af9 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -523,7 +523,7 @@ static inline void rcu_preempt_sleep_check(void) { } * Return the value of the specified RCU-protected pointer, but omit * both the smp_read_barrier_depends() and the READ_ONCE(). This * is useful in cases where update-side locks prevent the value of the - * pointer from changing. Please note that this primitive does -not- + * pointer from changing. Please note that this primitive does *not* * prevent the compiler from repeating this reference or combining it * with other references, so it should not be used without protection * of appropriate locks. @@ -568,7 +568,7 @@ static inline void rcu_preempt_sleep_check(void) { } * is handed off from RCU to some other synchronization mechanism, for * example, reference counting or locking. In C11, it would map to * kill_dependency(). 
It could be used as follows: - * + * `` * rcu_read_lock(); * p = rcu_dereference(gp); * long_lived = is_long_lived(p); @@ -579,6 +579,7 @@ static inline void rcu_preempt_sleep_check(void) { } * p = rcu_pointer_handoff(p); * } * rcu_read_unlock(); + *`` */ #define rcu_pointer_handoff(p) (p) @@ -778,18 +779,21 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) /** * RCU_INIT_POINTER() - initialize an RCU protected pointer + * @p: The pointer to be initialized. + * @v: The value to initialized the pointer to. * * Initialize an RCU-protected pointer in special cases where readers * do not need ordering constraints on the CPU or the compiler. These * special cases are: * - * 1. This use of RCU_INIT_POINTER() is NULLing out the pointer -or- + * 1. This use of RCU_INIT_POINTER() is NULLing out the pointer *or* * 2. The caller has taken whatever steps are required to prevent - * RCU readers from concurrently accessing this pointer -or- + * RCU readers from concurrently accessing this pointer *or* * 3. The referenced data structure has already been exposed to - * readers either at compile time or via rcu_assign_pointer() -and- - * a. You have not made -any- reader-visible changes to - * this structure since then -or- + * readers either at compile time or via rcu_assign_pointer() *and* + * + * a. You have not made *any* reader-visible changes to + * this structure since then *or* * b. It is OK for readers accessing this structure from its * new location to see the old state of the structure. 
(For * example, the changes were to statistical counters or to @@ -805,7 +809,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) * by a single external-to-structure RCU-protected pointer, then you may * use RCU_INIT_POINTER() to initialize the internal RCU-protected * pointers, but you must use rcu_assign_pointer() to initialize the - * external-to-structure pointer -after- you have completely initialized + * external-to-structure pointer *after* you have completely initialized * the reader-accessible portions of the linked structure. * * Note that unlike rcu_assign_pointer(), RCU_INIT_POINTER() provides no @@ -819,6 +823,8 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) /** * RCU_POINTER_INITIALIZER() - statically initialize an RCU protected pointer + * @p: The pointer to be initialized. + * @v: The value to initialized the pointer to. * * GCC-style initialization for an RCU-protected pointer in a structure field. */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h index 39af9bc0f653..62be8966e837 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -78,6 +78,7 @@ void synchronize_srcu(struct srcu_struct *sp); /** * srcu_read_lock_held - might we be in SRCU read-side critical section? + * @sp: The srcu_struct structure
[PATCH doc/rcu 1/2] doc: Fix RCU's docbook options
Commit 764f80798b95 ("doc: Add RCU files to docbook-generation files")
added :external: options for RCU source files in the file
Documentation/core-api/kernel-api.rst. However, this now means nothing,
so this commit removes them.

Reported-by: Randy Dunlap
Reported-by: Akira Yokosawa
Signed-off-by: Paul E. McKenney
---
 Documentation/core-api/kernel-api.rst | 14 --
 1 file changed, 14 deletions(-)

diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 8282099e0cbf..5da10184d908 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -352,44 +352,30 @@ Read-Copy Update (RCU)
 --
 
 .. kernel-doc:: include/linux/rcupdate.h
-   :external:
 
 .. kernel-doc:: include/linux/rcupdate_wait.h
-   :external:
 
 .. kernel-doc:: include/linux/rcutree.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree.c
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree_plugin.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree_exp.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/update.c
-   :external:
 
 .. kernel-doc:: include/linux/srcu.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/srcutree.c
-   :external:
 
 .. kernel-doc:: include/linux/rculist_bl.h
-   :external:
 
 .. kernel-doc:: include/linux/rculist.h
-   :external:
 
 .. kernel-doc:: include/linux/rculist_nulls.h
-   :external:
 
 .. kernel-doc:: include/linux/rcu_sync.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/sync.c
-   :external:
-- 
2.5.2
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 10/19/2017 01:11 PM, Ulf Hansson wrote: > On 19 October 2017 at 20:04, Ulf Hansson wrote: >> On 19 October 2017 at 19:21, Grygorii Strashko >> wrote: >>> >>> >>> On 10/19/2017 03:33 AM, Ulf Hansson wrote: On 18 October 2017 at 23:48, Rafael J. Wysocki wrote: > On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote: >> >> On 10/18/2017 09:11 AM, Ulf Hansson wrote: > > [...] > > That's the point. We know pm_runtime_force_* works nicely for the > trivial middle-layer cases. In which cases the middle-layer callbacks don't exist, so it's just like reusing driver callbacks directly. :-) >> >> I'd like to ask you clarify one point here and provide some info which I >> hope can be useful - >> what's exactly means "trivial middle-layer cases"? >> >> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock >> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or >> OMAP >> device framework struct dev_pm_domain omap_device_pm_domain >> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops >> tegra_aconnect_pm_ops? >> >> if yes all above have PM runtime callbacks. > > Trivial ones don't actually do anything meaningful in their PM callbacks. > > Things like the platform bus type, spi bus type, i2c bus type and similar. > > If the middle-layer callbacks manipulate devices in a significant way, > then > they aren't trivial. I fully agree with Rafael's description above, but let me also clarify one more thing. We have also been discussing PM domains as being trivial and non-trivial. In some statements I even think the PM domain has been a part the middle-layer terminology, which may have been a bit confusing. In this regards as we consider genpd being a trivial PM domain, those examples your bring up above is too me also examples of trivial PM domains. Especially because they don't deal with wakeups, as that is taken care of by the drivers, right!? 
>>> >>> Not directly, for example, omap device framework has noirq callback >>> implemented >>> which forcibly disable all devices which are not PM runtime suspended. >>> while doing this it calls drivers PM .runtime_suspend() which may return >>> non 0 value and in this case device will be left enabled (powered) at >>> suspend for >>> wake up purposes (see _od_suspend_noirq()). >>> >> >> Yeah, I had that feeling that omap has some trickyness going on. :-) >> >> I sure that can be fixed in the omap PM domain, although > > ...slipped with my fingers.. here is the rest of the reply... > > ..of course that require us to use another way for drivers to signal > to the omap PM domain that it needs to stay powered as to deal with > wakeup. > > I can have a look at that more closely, to see if it makes sense to change. > Also, additional note here. some IPs are reused between OMAP/Davinci/Keystone, OMAP PM domain have some code running at noirq time to dial with devices left in PM runtime enabled state (OMAP PM runtime centric), while Davinci/Keystone haven't (clock_ops.c), so pm_runtime_force_* API is actually possibility now to make the same driver work on all these platforms. -- regards, -grygorii -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH doc/rcu 0/2] Fix docbook regression
Hello, Linus,

Commit 764f80798b95 ("doc: Add RCU files to docbook-generation files"),
which is in v4.14-rc1, added :external: options for RCU source files in
the file Documentation/core-api/kernel-api.rst. However, this now means
nothing, and furthermore breaks builds of the docbook, which has led to
popular demand for this to be fixed in v4.14:

lkml.kernel.org/r/20171018100340.7f34a...@lwn.net

This series therefore contains the following two patches:

1.	Remove the erroneous :external: options.

2.	Fix the many docbook build complaints that have crept into RCU's
	docbook comment headers. These fixes include one non-comment
	change where the name of rcu_sync_func()'s argument is changed
	to match RCU convention.

							Thanx, Paul

 Documentation/core-api/kernel-api.rst |   14 --
 include/linux/rculist.h               |    2 +-
 include/linux/rcupdate.h              |   22 ++
 include/linux/srcu.h                  |    1 +
 kernel/rcu/srcutree.c                 |    2 +-
 kernel/rcu/sync.c                     |    9 ++---
 kernel/rcu/tree.c                     |   18 ++
 7 files changed, 33 insertions(+), 35 deletions(-)
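For readers unfamiliar with the comment-header conventions that the second patch repairs, here is a small illustration. It is not taken from the RCU sources: struct demo_counter and demo_counter_add() are hypothetical names, and the comment only shows the general kernel-doc shape (a "/**" opener, one "@name:" line per parameter, asterisks for emphasis).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure, here only so the comment has something to document. */
struct demo_counter {
	long value;
};

/**
 * demo_counter_add() - add a delta to a counter
 * @c: The counter to update.
 * @delta: The amount to add, which may be negative.
 *
 * Every parameter has a matching "@name:" line above; a missing one is
 * the sort of thing the documentation build complains about. Emphasis
 * is written as *this* rather than -this-.
 *
 * Return: The new value of the counter.
 */
static long demo_counter_add(struct demo_counter *c, long delta)
{
	assert(c != NULL);
	c->value += delta;
	return c->value;
}

/*
 * A comment that is not API documentation opens with a single asterisk,
 * so that kernel-doc does not try to parse it as a function header.
 */
```

The same three kinds of change appear in the 2/2 patch: added "@p:" and "@v:" style parameter lines, "-not-" style emphasis rewritten with asterisks, and a "/**" opener demoted to "/*" on a block that documents no function.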
Re: [RESEND v12 0/6] cgroup-aware OOM killer
On Thu 19-10-17 15:45:34, Johannes Weiner wrote:
> On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> > This patchset makes the OOM killer cgroup-aware.
>
> Hi Andrew,
>
> I believe this code is ready for merging upstream, and it seems Michal
> is in agreement. There are two main things to consider, however.
>
> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.

agreed

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users.

and just to make it clear I have also said I will _not_ nack if that is
not the case.

> I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced;

We do not agree here though. I am not convinced that respecting the cgroup
topography is a universal win. It is true that there is no best OOM victim
selection strategy but what we have currently is the simplest option and as
such the most robust one. I can tell from the past year experience that
many of those clever heuristics actually contributed to lockups and
non-deterministic behavior.

> where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2.

which basically means that the behavior will change under many users'
feet because the respective cgroup configuration is chosen by somebody else
(e.g. systemd) so I do not really buy "only v2 behavior"

> Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.

-- 
Michal Hocko
SUSE Labs
Re: [RESEND v12 0/6] cgroup-aware OOM killer
On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> This patchset makes the OOM killer cgroup-aware.

Hi Andrew,

I believe this code is ready for merging upstream, and it seems Michal
is in agreement. There are two main things to consider, however.

David would have really liked for this patchset to include knobs to
influence how the algorithm picks cgroup victims. The rest of us
agreed that this is beyond the scope of these patches, that the
patches don't need it to be useful, and that there is nothing
preventing anyone from adding configurability later on. David
subsequently nacked the series as he considers it incomplete. Neither
Michal nor I see technical merit in David's nack.

Michal acked the implementation, but on the condition that the new
behavior be opt-in, to not surprise existing users. I *think* we agree
that respecting the cgroup topography during global OOM is what we
should have been doing when cgroups were initially introduced; where
we disagree is that I think users shouldn't have to opt in to
improvements. We have done much more invasive changes to the victim
selection without actual regressions in the past. Further, this change
only applies to mounts of the new cgroup2. Tejun also wasn't convinced
of the risk for regression, and too would prefer cgroup-awareness to
be the default in cgroup2. I would ask for patch 5/6 to be dropped.

Thanks
Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
On Thu 19-10-17 19:52:15, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
>
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf memory cgroups.
> Due to memcg statistics implementation a special approximation
> is used for estimating oom_score of root memory cgroup: we sum
> oom_score of the belonging processes (or, to be more precise,
> tasks owning their mm structures).
>
> Signed-off-by: Roman Gushchin
> Acked-by: Michal Hocko

Just to make it clear. My ack is conditional on the opt-in which is
implemented later in the series. Strictly speaking, the system would
behave differently during bisection and that might lead to confusion.
I guess it would be better to simply disable this feature until we have
means to enable it. But I do not really care strongly here.

There is another thing that I am more concerned about. Usually you
should drop an ack when making further changes, or at least call the
changes out so that the reviewer is aware of them. In this particular
case I am worried about the fallback code we have discussed previously

[...]
> @@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc)
>       current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>       current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>           get_task_struct(current);
> -         oc->chosen = current;
> +         oc->chosen_task = current;
>           oom_kill_process(oc, "Out of memory
> (oom_kill_allocating_task)");
>           return true;
>   }
>
> + if (mem_cgroup_select_oom_victim(oc)) {
> +         if (oom_kill_memcg_victim(oc))
> +                 delay = true;
> +
> +         goto out;
> + }
> +
[...]
> +out:
> + /*
> +  * Give the killed process a good chance to exit before trying
> +  * to allocate memory again.
> +  */
> + if (delay)
> +         schedule_timeout_killable(1);
> +
> + return !!oc->chosen_task;
> }

this basically means that if you manage to select a memcg victim but
then you won't be able to select any task in that memcg then you would
return false from out_of_memory and that has other consequences. Namely
__alloc_pages_may_oom will not set did_some_progress and so the
allocation path will fail. While this scenario is not very likely we
should behave better. Your previous implementation (which I've acked)
did fall back to the standard oom killer path which is the safest
option. Maybe we can do better but let's try robust and be clever later.

--
Michal Hocko
SUSE Labs
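The difference between the quoted v12 flow and the fallback asked for above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins, "killing" just records the chosen task, and a non-NULL chosen_memcg plays the role of mem_cgroup_select_oom_victim() returning true. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures; every name here is hypothetical. */
struct toy_task {
	int pid;
	bool killable;
};

struct toy_memcg {
	struct toy_task *tasks;
	size_t nr_tasks;
};

struct toy_oom_control {
	struct toy_memcg *chosen_memcg;	/* memcg selected as victim, or NULL */
	struct toy_task *chosen_task;	/* task actually "killed", or NULL */
};

/* "Kill" the first killable task in the array; NULL if there is none. */
static struct toy_task *toy_kill_first(struct toy_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].killable)
			return &tasks[i];
	return NULL;
}

/* Quoted v12 flow: once a memcg victim is selected, never fall back. */
static bool toy_oom_no_fallback(struct toy_oom_control *oc,
				struct toy_task *all, size_t n)
{
	assert(oc != NULL);
	if (oc->chosen_memcg)
		oc->chosen_task = toy_kill_first(oc->chosen_memcg->tasks,
						 oc->chosen_memcg->nr_tasks);
	else
		oc->chosen_task = toy_kill_first(all, n);
	return oc->chosen_task != NULL;
}

/* Requested flow: retry the per-process path if the memcg kill fails. */
static bool toy_oom_with_fallback(struct toy_oom_control *oc,
				  struct toy_task *all, size_t n)
{
	assert(oc != NULL);
	oc->chosen_task = NULL;
	if (oc->chosen_memcg)
		oc->chosen_task = toy_kill_first(oc->chosen_memcg->tasks,
						 oc->chosen_memcg->nr_tasks);
	if (!oc->chosen_task)
		oc->chosen_task = toy_kill_first(all, n);
	return oc->chosen_task != NULL;
}
```

With a victim memcg that contains only unkillable tasks, the first variant reports no progress even though a killable task exists elsewhere, while the second one still finds it.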
Re: [PATCH] documentation: kernel-api: add more info on bitmap functions
On Mon, 16 Oct 2017 16:32:51 -0700 Randy Dunlap wrote:
> There are some good comments about bitmap operations in lib/bitmap.c
> and include/linux/bitmap.h, so format them for document generation and
> pull them into core-api/kernel-api.rst.
>
> I converted the "tables" of functions from using tabs to using spaces
> so that they are more readable in the source file and in the generated
> output.

Looks good, thanks, applied. Hopefully Linus won't yell at me about
touching all that stuff in lib/...

jon
Re: [PATCH 0/8] Documentation: fix invalid Documentation refs (2)
On Thu, 12 Oct 2017 15:23:26 -0500 Tom Saeger wrote:
> Batch (2) set of simple document ref fixes.
>
>
> Tom Saeger (8):
> Documentation: fix locking rt-mutex doc refs
> Documentation: fix ref to sphinx/kerneldoc.py
> Documentation: fix ref to workqueue content
> Documentation: fix ref to coccinelle content
> Documentation: fix ref to trace stm content
> Documentation: fix ref to power basic-pm-debugging
> Documentation: fix selftests related file refs
> Documentation: fix ref to gpio.txt

I've applied the set (except 8/8, which Linus W. already grabbed).

Thanks,

jon
Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
>>> Something like:
>>>
>>> "because there is a dump_stack() done on allocation failures
>>> without __GFP_NOWARN"
>>
>> How do you think about to convert such a description into a special format
>> for further reference documentation?
>
> I think it's a bad idea if it's a "special" format.

Would it be nice to represent the corresponding details as better
“restructured text”?

> Always write _why_ some code is being changed.
>
> People could read the commit descriptions and would not need
> to take extra time to lookup external references.

I would appreciate it if I could copy a widely accepted explanation.

> Maybe add something like
> "see (commit or )" for additional details"

Are there any related extensions possible besides other background
information?

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf

Regards,
Markus
[RESEND v12 4/6] mm, oom: introduce memory.oom_group
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which do not tolerate such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of system-wide OOM or
memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user defined configuration might lead to data corruption or other
misbehavior.

The default value is 0.
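The kill scope described above can be sketched as a small userspace model. It is an illustration only, under these assumptions: the toy_* names are hypothetical, a plain "footprint" number stands in for the kernel's per-task badness score, and victim selection has already happened, so the model only decides which tasks inside the chosen cgroup die.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical userspace stand-ins; not the kernel's data structures. */
struct toy_task {
	int oom_score_adj;
	long footprint;
	bool killed;
};

struct toy_cgroup {
	bool oom_group;			/* models memory.oom_group */
	struct toy_task *tasks;
	size_t nr_tasks;
	struct toy_cgroup **children;
	size_t nr_children;
};

/* Tasks with oom_score_adj == -1000 are never killed. */
static bool toy_task_killable(const struct toy_task *t)
{
	return t->oom_score_adj != TOY_OOM_SCORE_ADJ_MIN;
}

/* oom_group set: kill every eligible task in the whole sub-tree. */
static size_t toy_kill_subtree(struct toy_cgroup *cg)
{
	size_t killed = 0;

	for (size_t i = 0; i < cg->nr_tasks; i++) {
		if (toy_task_killable(&cg->tasks[i])) {
			cg->tasks[i].killed = true;
			killed++;
		}
	}
	for (size_t i = 0; i < cg->nr_children; i++)
		killed += toy_kill_subtree(cg->children[i]);
	return killed;
}

/* oom_group clear: kill only the biggest eligible task in the cgroup. */
static size_t toy_kill_biggest(struct toy_cgroup *cg)
{
	struct toy_task *biggest = NULL;

	for (size_t i = 0; i < cg->nr_tasks; i++) {
		struct toy_task *t = &cg->tasks[i];

		if (toy_task_killable(t) &&
		    (!biggest || t->footprint > biggest->footprint))
			biggest = t;
	}
	if (!biggest)
		return 0;
	biggest->killed = true;
	return 1;
}

/* Returns the number of tasks killed in the selected victim cgroup. */
static size_t toy_kill_victim(struct toy_cgroup *victim)
{
	assert(victim != NULL);
	return victim->oom_group ? toy_kill_subtree(victim)
				 : toy_kill_biggest(victim);
}
```

For a cgroup A with oom_group set and a child A/B, toy_kill_victim(&A) marks every task in A and A/B except those with oom_score_adj of -1000; with oom_group clear, only one task is marked.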
Signed-off-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: David Rientjes Cc: Andrew Morton Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 17 +++ mm/memcontrol.c| 75 +++--- mm/oom_kill.c | 49 +++--- 3 files changed, 127 insertions(+), 14 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 75b63b68846e..84ac10d7e67d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -200,6 +200,13 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* +* Treat the sub-tree as an indivisible memory consumer, +* kill all belonging tasks if the memory cgroup selected +* as OOM victim. +*/ + bool oom_group; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait); bool mem_cgroup_select_oom_victim(struct oom_control *oc); +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return memcg->oom_group; +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) { return false; } + +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f364bfed745f..ad10dbdf723b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2785,19 +2785,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) { - struct mem_cgroup *iter; + struct mem_cgroup *iter, *group = NULL; + long group_score = 0; oc->chosen_memcg = NULL; oc->chosen_points = 0; /* +* If OOM is memcg-wide, and the memcg has 
the oom_group flag set, +* all tasks belonging to the memcg should be killed. +* So, we mark the memcg as a victim. +*/ + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) { + oc->chosen_memcg = oc->memcg; + css_get(&oc->chosen_memcg->css); + return; + } + + /* * The oom_score is calculated for leaf memory cgroups (including * the root memcg). +* Non-leaf oom_group cgroups accumulating score of descendant +* leaf memory cgroups. */ rcu_read_lock(); for_each_mem_cgroup_tree(iter, root) { long score; + /* +* We don't consider non-leaf non-oom_group memory cgroups +* as OOM victims. +*/ + if (memcg_has_children(iter) && iter != root_mem_cgroup && + !mem_cgroup_oom_group(iter)) + continue; + + /* +* If group is not set or we've ran out of the group's sub-tree, +* we should set group and reset group_score. +*/ + if (!group || group == root_mem_cgroup || + !mem_cgroup_is_descendant(iter, group)) { +
[RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
Traditionally, the OOM killer operates on a process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running containers well:

1) There is no fairness between containers. A small container with
   a few large processes will be chosen over a large one with a huge
   number of small processes.

2) Containers often do not expect that some random process inside
   will be killed. In many cases a much safer behavior is to kill
   all tasks in the container. Traditionally, this was implemented
   in userspace, but doing it in the kernel has some advantages,
   especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with other leaf memory cgroups.
Due to the memcg statistics implementation a special approximation
is used for estimating the oom_score of the root memory cgroup: we sum
the oom_score of the belonging processes (or, to be more precise,
tasks owning their mm structures).
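The fairness problem in point 1 can be illustrated with a toy comparison of the two selection policies. This is a sketch under stated assumptions, not the patch's algorithm: containers are a flat array rather than a cgroup tree, task sizes are plain numbers standing in for oom scores, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat model: one entry per container (leaf cgroup). */
struct toy_container {
	const long *task_sizes;
	size_t nr_tasks;
};

/* Total footprint of a container: the sum of its task sizes. */
static long toy_total(const struct toy_container *c)
{
	long sum = 0;

	for (size_t i = 0; i < c->nr_tasks; i++)
		sum += c->task_sizes[i];
	return sum;
}

/* Size of the largest single task in a container (0 if empty). */
static long toy_biggest_task(const struct toy_container *c)
{
	long max = 0;

	for (size_t i = 0; i < c->nr_tasks; i++)
		if (c->task_sizes[i] > max)
			max = c->task_sizes[i];
	return max;
}

/* Per-process policy: the container owning the single largest task. */
static size_t toy_pick_per_process(const struct toy_container *cs, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (toy_biggest_task(&cs[i]) > toy_biggest_task(&cs[best]))
			best = i;
	return best;
}

/* Cgroup-aware policy: the container with the largest total footprint. */
static size_t toy_pick_cgroup_aware(const struct toy_container *cs, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (toy_total(&cs[i]) > toy_total(&cs[best]))
			best = i;
	return best;
}
```

With one container running two tasks of size 400 and another running ten tasks of size 100, the per-process policy targets the first container although the second one consumes more memory in total; the cgroup-aware policy targets the second.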
Signed-off-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: David Rientjes Cc: Andrew Morton Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 17 + include/linux/oom.h| 12 ++- mm/memcontrol.c| 181 + mm/oom_kill.c | 72 +- 4 files changed, 262 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..75b63b68846e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -936,6 +948,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or 
node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..ca78e2d5956e 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -9,6 +9,13 @@ #include /* MMF_* */ #include /* VM_FAULT* */ + +/* + * Special value returned by victim selection functions to indicate + * that are inflight OOM victims. + */ +#define INFLIGHT_VICTIM ((void *)-1UL) + struct zonelist; struct notifier_block; struct mem_cgroup; @@ -39,7 +46,8 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; - struct task_struct *chosen; + struct task_struct *chosen_task; + struct mem_cgroup *chosen_memcg; unsigned long chosen_points; }; @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1d30a45a4bbe..f364bfed745f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(str
[RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention of re-using the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or to play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Johannes Weiner Acked-by: David Rientjes Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: David Rientjes Cc: Andrew Morton Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- mm/oom_kill.c | 123 +++--- 1 file changed, 65 insertions(+), 58 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 26add8a0d1f7..0b9f36117989 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -842,68 +842,12 @@ static bool task_will_free_mem(struct task_struct *task) return ret; } -static void oom_kill_process(struct oom_control *oc, const char *message) +static void __oom_kill_process(struct task_struct *victim) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *p; struct mm_struct *mm; - unsigned int victim_points = 0; - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); bool can_oom_reap = true; - /* -* If the task is already exiting, don't alarm the sysadmin or kill -* its children or threads, just give it access to memory reserves -* so it can die quickly -*/ - task_lock(p);
- if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); - - if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* -* If any of p's children has a different mm and is eligible for kill, -* the one with the highest oom_badness() score is sacrificed for its -* parent. This attempts to lose the minimal amount of work done while -* still freeing memory. -*/ - read_lock(&tasklist_lock); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* -* oom_badness() returns 0 if the thread is unkillable -*/ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - read_unlock(&tasklist_lock); - p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -977,6 +921,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } #undef K +static void oom_kill_process(struct oom_control *oc, const char *message) +{ + struct task_struct *p = oc->chosen; + unsigned int points = oc->chosen_points; + struct task_struct *victim = p; + struct task_struct *child; + struct task_struct *t; + unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + /* +* If the task is already exiting, don't alarm the sysadmin or kill +* its children or threads, just give it access to memory reserves +* so it can die quickly +*/ + task_lock(p); + if (task_will_free_mem(p)) { + mark_oom_victim(p); + wake_oom_reaper(p); + task_unlock(p); + put_task_struct(p); + r
Re: [PATCH] docs: dev-tools: correct Coccinelle version number
On Sun, 15 Oct 2017 11:24:08 +0200 Julia Lawall wrote:
> There is no Coccinelle version 1.2. 1.0.2 must be what was intended.
>
> Signed-off-by: Julia Lawall

Applied, thanks.

jon
[RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
Implement mem_cgroup_scan_tasks() functionality for the root memory
cgroup, so that the cgroup-aware OOM killer can use this function to
look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Johannes Weiner Acked-by: David Rientjes Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: Andrew Morton Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- mm/memcontrol.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 50e6906314f8..1d30a45a4bbe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg) * value, the function breaks the iteration loop and returns the value. * Otherwise, it will iterate over all tasks and return 0. * - * This function must not be called for the root memory cgroup. + * If memcg is the root memory cgroup, this function will iterate only + * over tasks belonging directly to the root memory cgroup.
*/ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*fn)(struct task_struct *, void *), void *arg) @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, struct mem_cgroup *iter; int ret = 0; - BUG_ON(memcg == root_mem_cgroup); - for_each_mem_cgroup_tree(iter, memcg) { struct css_task_iter it; struct task_struct *task; @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, while (!ret && (task = css_task_iter_next(&it))) ret = fn(task, arg); css_task_iter_end(&it); - if (ret) { + if (ret || memcg == root_mem_cgroup) { mem_cgroup_iter_break(memcg, iter); break; } -- 2.13.6
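The effect of the extra "memcg == root_mem_cgroup" condition can be modelled outside the kernel. The following is a sketch under stated assumptions: the toy_* names are hypothetical, tasks are plain pids, and a recursive walk that visits the starting cgroup first stands in for for_each_mem_cgroup_tree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a memory cgroup and its tasks. */
struct toy_memcg {
	bool is_root;
	const int *pids;
	size_t nr_pids;
	struct toy_memcg **children;
	size_t nr_children;
};

typedef int (*toy_task_fn)(int pid, void *arg);

/*
 * Call fn for each task in memcg's sub-tree, starting with memcg itself.
 * A non-zero return from fn stops the walk and is passed back. For the
 * root cgroup the walk stops after its own tasks, so only tasks that
 * belong directly to the root are visited.
 */
static int toy_scan_tasks(struct toy_memcg *memcg, toy_task_fn fn, void *arg)
{
	int ret = 0;

	assert(memcg != NULL && fn != NULL);
	for (size_t i = 0; !ret && i < memcg->nr_pids; i++)
		ret = fn(memcg->pids[i], arg);
	if (ret || memcg->is_root)
		return ret;
	for (size_t i = 0; !ret && i < memcg->nr_children; i++)
		ret = toy_scan_tasks(memcg->children[i], fn, arg);
	return ret;
}

/* Example callback: count the visited tasks via a size_t counter. */
static int toy_count_task(int pid, void *arg)
{
	(void)pid;
	++*(size_t *)arg;
	return 0;
}
```

Scanning a root cgroup that has two direct tasks and a child with three tasks visits two tasks; scanning the same tree with is_root cleared visits all five.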
[RESEND v12 0/6] cgroup-aware OOM killer
This patchset makes the OOM killer cgroup-aware.

v12:
  - Root memory cgroup is evaluated based on sum of the oom scores
    of belonging tasks
  - Do not fall back to the per-process behavior if it wasn't
    possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
    (discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
    patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
    (proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
    if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
    memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
    make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
    way OOM victims gain access to the memory reserves. Dropped
    corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969

Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: Tetsuo Handa
Cc: David Rientjes
Cc: Andrew Morton
Cc: Tejun Heo
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org

Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt | 51 +
 include/linux/cgroup-defs.h | 5 +
 include/linux/memcontrol.h | 34 ++
 include/linux/oom.h | 12 ++-
 kernel/cgroup/cgroup.c | 10 ++
 mm/memcontrol.c | 258 +++-
 mm/oom_kill.c | 212
 7 files changed, 506 insertions(+), 76 deletions(-)

--
2.13.6
[RESEND v12 6/6] mm, oom, docs: describe the cgroup-aware OOM killer
Document the cgroup-aware OOM killer. Signed-off-by: Roman Gushchin Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: Andrew Morton Cc: David Rientjes Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- Documentation/cgroup-v2.txt | 51 + 1 file changed, 51 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index 0bbdc720dd7c..69db5bf9c580 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership + 5-2-4. OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -1031,6 +1032,28 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_group + + A read-write single value file which exists on non-root + cgroups. The default is "0". + + If set, OOM killer will consider the memory cgroup as an + indivisible memory consumers and compare it with other memory + consumers by it's memory footprint. + If such memory cgroup is selected as an OOM victim, all + processes belonging to it or it's descendants will be killed. + + This applies to system-wide OOM conditions and reaching + the hard memory limit of the cgroup and their ancestor. + If OOM condition happens in a descendant cgroup with it's own + memory limit, the memory cgroup can't be considered + as an OOM victim, and OOM killer will not kill all belonging + tasks. + + Also, OOM killer respects the /proc/pid/oom_score_adj value -1000, + and will never kill the unkillable task, even if memory.oom_group + is set. + memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. 
Unless specified @@ -1234,6 +1257,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +OOM Killer +~~ + +Cgroup v2 memory controller implements a cgroup-aware OOM killer. +It means that it treats cgroups as first class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choice of a victim, looking for a memory cgroup with the largest +memory footprint, considering leaf cgroups and cgroups with the +memory.oom_group option set, which are considered to be an indivisible +memory consumers. + +By default, OOM killer will kill the biggest task in the selected +memory cgroup. A user can change this behavior by enabling +the per-cgroup memory.oom_group option. If set, it causes +the OOM killer to kill all processes attached to the cgroup, +except processes with oom_score_adj set to -1000. + +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + +The root cgroup is treated as a leaf memory cgroup, so it's compared +with other leaf memory cgroups and cgroups with oom_group option set. + +If there are no cgroups with the enabled memory controller, +the OOM killer is using the "traditional" process-based approach. + IO -- -- 2.13.6 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: Tetsuo Handa
Cc: David Rientjes
Cc: Andrew Morton
Cc: Tejun Heo
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c      | 10 ++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c7086c8835da..0e1685ca1d7b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad10dbdf723b..eb1e15385782 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2875,6 +2875,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
--
2.13.6
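The option handling in the patch above (parse the mount options into a flag word, then set or clear each flag on remount) can be tried out in user space with a small model. This is a hedged sketch, not kernel code: the `model_*` names are hypothetical, the flag values are only copied for illustration, and unlike the kernel parser this version silently ignores empty tokens.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative flag values; this is a user-space model, not kernel code. */
enum {
	MODEL_ROOT_NS_DELEGATE = (1 << 3),
	MODEL_GROUP_OOM        = (1 << 5),
};

/*
 * Parse a comma-separated option string such as "nsdelegate,groupoom".
 * Returns 0 on success, -1 on the first unknown option.
 * Simplification: empty tokens (e.g. "a,,b") are skipped.
 */
static int model_parse_root_flags(const char *data, unsigned int *root_flags)
{
	*root_flags = 0;
	if (!data)
		return 0;

	while (*data) {
		size_t len = strcspn(data, ",");

		if (len == 10 && !strncmp(data, "nsdelegate", len))
			*root_flags |= MODEL_ROOT_NS_DELEGATE;
		else if (len == 8 && !strncmp(data, "groupoom", len))
			*root_flags |= MODEL_GROUP_OOM;
		else if (len != 0)
			return -1;

		data += len;
		if (*data == ',')
			data++;
	}
	return 0;
}

/* Remount semantics: each managed flag is set or cleared to match the request. */
static unsigned int model_apply_root_flags(unsigned int current,
					   unsigned int requested)
{
	const unsigned int managed = MODEL_ROOT_NS_DELEGATE | MODEL_GROUP_OOM;

	return (current & ~managed) | (requested & managed);
}
```

Because the apply step clears a flag that is absent from the new option string, remounting without "groupoom" switches back to the per-process selection, which matches the "can be changed dynamically by remounting" statement in the changelog.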
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 19 October 2017 at 20:04, Ulf Hansson wrote: > On 19 October 2017 at 19:21, Grygorii Strashko > wrote: >> >> >> On 10/19/2017 03:33 AM, Ulf Hansson wrote: >>> On 18 October 2017 at 23:48, Rafael J. Wysocki wrote: On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote: > > On 10/18/2017 09:11 AM, Ulf Hansson wrote: [...] That's the point. We know pm_runtime_force_* works nicely for the trivial middle-layer cases. >>> >>> In which cases the middle-layer callbacks don't exist, so it's just like >>> reusing driver callbacks directly. :-) > > I'd like to ask you clarify one point here and provide some info which I > hope can be useful - > what's exactly means "trivial middle-layer cases"? > > Is it when systems use "drivers/base/power/clock_ops.c - Generic clock > manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or > OMAP > device framework struct dev_pm_domain omap_device_pm_domain > (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops > tegra_aconnect_pm_ops? > > if yes all above have PM runtime callbacks. Trivial ones don't actually do anything meaningful in their PM callbacks. Things like the platform bus type, spi bus type, i2c bus type and similar. If the middle-layer callbacks manipulate devices in a significant way, then they aren't trivial. >>> >>> I fully agree with Rafael's description above, but let me also clarify >>> one more thing. >>> >>> We have also been discussing PM domains as being trivial and >>> non-trivial. In some statements I even think the PM domain has been a >>> part the middle-layer terminology, which may have been a bit >>> confusing. >>> >>> In this regards as we consider genpd being a trivial PM domain, those >>> examples your bring up above is too me also examples of trivial PM >>> domains. Especially because they don't deal with wakeups, as that is >>> taken care of by the drivers, right!? 
>> >> Not directly, for example, omap device framework has noirq callback >> implemented >> which forcibly disable all devices which are not PM runtime suspended. >> while doing this it calls drivers PM .runtime_suspend() which may return >> non 0 value and in this case device will be left enabled (powered) at >> suspend for >> wake up purposes (see _od_suspend_noirq()). >> > > Yeah, I had that feeling that omap has some trickyness going on. :-) > > I sure that can be fixed in the omap PM domain, although ...slipped with my fingers.. here is the rest of the reply... ..of course that require us to use another way for drivers to signal to the omap PM domain that it needs to stay powered as to deal with wakeup. I can have a look at that more closely, to see if it makes sense to change. Kind regards Uffe -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 19 October 2017 at 19:21, Grygorii Strashko wrote: > > > On 10/19/2017 03:33 AM, Ulf Hansson wrote: >> On 18 October 2017 at 23:48, Rafael J. Wysocki wrote: >>> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote: On 10/18/2017 09:11 AM, Ulf Hansson wrote: >>> >>> [...] >>> >>> That's the point. We know pm_runtime_force_* works nicely for the >>> trivial middle-layer cases. >> >> In which cases the middle-layer callbacks don't exist, so it's just like >> reusing driver callbacks directly. :-) I'd like to ask you clarify one point here and provide some info which I hope can be useful - what's exactly means "trivial middle-layer cases"? Is it when systems use "drivers/base/power/clock_ops.c - Generic clock manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP device framework struct dev_pm_domain omap_device_pm_domain (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops tegra_aconnect_pm_ops? if yes all above have PM runtime callbacks. >>> >>> Trivial ones don't actually do anything meaningful in their PM callbacks. >>> >>> Things like the platform bus type, spi bus type, i2c bus type and similar. >>> >>> If the middle-layer callbacks manipulate devices in a significant way, then >>> they aren't trivial. >> >> I fully agree with Rafael's description above, but let me also clarify >> one more thing. >> >> We have also been discussing PM domains as being trivial and >> non-trivial. In some statements I even think the PM domain has been a >> part the middle-layer terminology, which may have been a bit >> confusing. >> >> In this regards as we consider genpd being a trivial PM domain, those >> examples your bring up above is too me also examples of trivial PM >> domains. Especially because they don't deal with wakeups, as that is >> taken care of by the drivers, right!? 
> > Not directly, for example, omap device framework has noirq callback > implemented > which forcibly disable all devices which are not PM runtime suspended. > while doing this it calls drivers PM .runtime_suspend() which may return > non 0 value and in this case device will be left enabled (powered) at suspend > for > wake up purposes (see _od_suspend_noirq()). > Yeah, I had that feeling that omap has some trickyness going on. :-) I sure that can be fixed in the omap PM domain, although -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
[...]

>>> > Say you want to leave the parent suspended after system resume, but the
>>> > child drivers use pm_runtime_force_suspend|resume(). The parent would then
>>> > need to use pm_runtime_force_suspend|resume() too, no?
>>>
>>> Actually no.
>>>
>>> Currently the other options of "deferring resume" (not using
>>> pm_runtime_force_*), is either using the "direct_complete" path or
>>> similar to the approach you took for the i2c designware driver.
>>>
>>> Both cases should play nicely in combination of a child being managed
>>> by pm_runtime_force_*. That's because only when the parent device is
>>> kept runtime suspended during system suspend, resuming can be
>>> deferred.
>>
>> And because the parent remains in runtime suspend late enough in the
>> system suspend path, its children also are guaranteed to be suspended.
>
> Yes.
>
>>
>> But then all of them need to be left in runtime suspend during system
>> resume too, which is somewhat restrictive, because some drivers may
>> want their devices to be resumed then.
>
> Actually, this scenario is also addressed when using the pm_runtime_force_*.
>
> The driver for the child would only need to bump the runtime PM usage
> count (pm_runtime_get_noresume()) before calling
> pm_runtime_force_suspend() at system suspend. That then also
> propagates to the parent, leading to that both the parent and the
> child will be resumed when pm_runtime_force_resume() is called for
> them.

I need to correct myself here. The above currently only works if the
child is runtime resumed while pm_runtime_force_suspend() is called.

The logic in pm_runtime_force_* needs to be improved to take care of
such scenarios. However I think that should be rather easy to fix, if
we want that.

[...]

Kind regards
Uffe
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 10/19/2017 03:33 AM, Ulf Hansson wrote: > On 18 October 2017 at 23:48, Rafael J. Wysocki wrote: >> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote: >>> >>> On 10/18/2017 09:11 AM, Ulf Hansson wrote: >> >> [...] >> >> That's the point. We know pm_runtime_force_* works nicely for the >> trivial middle-layer cases. > > In which cases the middle-layer callbacks don't exist, so it's just like > reusing driver callbacks directly. :-) >>> >>> I'd like to ask you clarify one point here and provide some info which I >>> hope can be useful - >>> what's exactly means "trivial middle-layer cases"? >>> >>> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock >>> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP >>> device framework struct dev_pm_domain omap_device_pm_domain >>> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops >>> tegra_aconnect_pm_ops? >>> >>> if yes all above have PM runtime callbacks. >> >> Trivial ones don't actually do anything meaningful in their PM callbacks. >> >> Things like the platform bus type, spi bus type, i2c bus type and similar. >> >> If the middle-layer callbacks manipulate devices in a significant way, then >> they aren't trivial. > > I fully agree with Rafael's description above, but let me also clarify > one more thing. > > We have also been discussing PM domains as being trivial and > non-trivial. In some statements I even think the PM domain has been a > part the middle-layer terminology, which may have been a bit > confusing. > > In this regards as we consider genpd being a trivial PM domain, those > examples your bring up above is too me also examples of trivial PM > domains. Especially because they don't deal with wakeups, as that is > taken care of by the drivers, right!? Not directly, for example, omap device framework has noirq callback implemented which forcibly disable all devices which are not PM runtime suspended. 
while doing this it calls drivers PM .runtime_suspend() which may return non 0 value and in this case device will be left enabled (powered) at suspend for wake up purposes (see _od_suspend_noirq()). -- regards, -grygorii -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kbuild doc: a bundle of fixes on makefiles.txt
2017-10-19 12:17 GMT+09:00 Cao jin:
> It does several fixes:
> 1. move the displaced ld example to its reasonable place.
> 2. add new example for command gzip.
> 3. fix 2 number errors.
> 4. fix format of chapter 7.x, make it look the same as other chapters.
>
> Signed-off-by: Cao jin
> ---

Applied to linux-kbuild/fixes.

Thanks!

--
Best Regards
Masahiro Yamada
Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver
On Thu, Oct 19, 2017 at 04:28:35PM +0100, Will Deacon wrote:
> On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
> > Will, are you happy to queue this?
> >
> > There's a minor fixup [1] needed in patch 2, but otherwise this looks
> > good to me, and builds cleanly.
> >
> > I've pushed out a branch [2] with that fix folded in, in case that's
> > easier for you. Otherwise, feel free to pick these up with my Ack.
>
> I'm just running some build tests on these. I also tweaked your fix slightly
> -- can you check the diff below please?

That's nicer! My ack stands with that folded in.

Mark.

> diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> index 2bff43f0736b..c74542af4acf 100644
> --- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
> +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> @@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event *event)
>  	/* Include count for the event */
>  	int counters = 1;
>  
> -	/*
> -	 * We must NOT create groups containing mixed PMUs, although
> -	 * software events are acceptable
> -	 */
> -	if (leader->pmu != event->pmu && !is_software_event(leader))
> -		return false;
> +	if (!is_software_event(leader)) {
> +		/*
> +		 * We must NOT create groups containing mixed PMUs, although
> +		 * software events are acceptable
> +		 */
> +		if (leader->pmu != event->pmu)
> +			return false;
>  
> -	/* Increment counter for the leader */
> -	counters++;
> +		/* Increment counter for the leader */
> +		if (leader != event)
> +			counters++;
> +	}
>  
>  	list_for_each_entry(sibling, &event->group_leader->sibling_list,
>  			    group_entry) {
Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver
On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
> Will, are you happy to queue this?
>
> There's a minor fixup [1] needed in patch 2, but otherwise this looks
> good to me, and builds cleanly.
>
> I've pushed out a branch [2] with that fix folded in, in case that's
> easier for you. Otherwise, feel free to pick these up with my Ack.

I'm just running some build tests on these. I also tweaked your fix slightly
-- can you check the diff below please?

Will

--->8

diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
index 2bff43f0736b..c74542af4acf 100644
--- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
+++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
@@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event *event)
 	/* Include count for the event */
 	int counters = 1;
 
-	/*
-	 * We must NOT create groups containing mixed PMUs, although
-	 * software events are acceptable
-	 */
-	if (leader->pmu != event->pmu && !is_software_event(leader))
-		return false;
+	if (!is_software_event(leader)) {
+		/*
+		 * We must NOT create groups containing mixed PMUs, although
+		 * software events are acceptable
+		 */
+		if (leader->pmu != event->pmu)
+			return false;
 
-	/* Increment counter for the leader */
-	counters++;
+		/* Increment counter for the leader */
+		if (leader != event)
+			counters++;
+	}
 
 	list_for_each_entry(sibling, &event->group_leader->sibling_list,
 			    group_entry) {
Re: [PATCH 1/2] mm, thp: introduce dedicated transparent huge page allocation interfaces
On Wed 18-10-17 19:00:26, Du, Changbin wrote:
> Hi Hocko,
>
> On Tue, Oct 17, 2017 at 12:20:52PM +0200, Michal Hocko wrote:
> > [CC Kirill]
> >
> > On Mon 16-10-17 17:19:16, changbin...@intel.com wrote:
> > > From: Changbin Du
> > >
> > > This patch introduced 4 new interfaces to allocate a prepared
> > > transparent huge page.
> > > - alloc_transhuge_page_vma
> > > - alloc_transhuge_page_nodemask
> > > - alloc_transhuge_page_node
> > > - alloc_transhuge_page
> > >
> > > The aim is to remove duplicated code and simplify transparent
> > > huge page allocation. These are similar to alloc_hugepage_xxx
> > > which are for hugetlbfs pages. This patch does below changes:
> > > - define alloc_transhuge_page_xxx interfaces
> > > - apply them to all existing code
> > > - declare prep_transhuge_page as static since no others use it
> > > - remove alloc_hugepage_vma definition since it no longer has users
> >
> > So what exactly is the advantage of the new API? The diffstat doesn't
> > sound very convincing to me.
> >
> The caller only needs one step to allocate thp. Several LOCs removed for all the
> caller side with this change. So it's a little more convenient.

Yeah, but the overall result is more code. So I am not really convinced.

--
Michal Hocko
SUSE Labs
Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver
Will, are you happy to queue this? There's a minor fixup [1] needed in patch 2, but otherwise this looks good to me, and builds cleanly. I've pushed out a branch [2] with that fix folded in, in case that's easier for you. Otherwise, feel free to pick these up with my Ack. Thanks, Mark. [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/538016.html [2] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git perf-drivers/hisilicon-soc On Thu, Oct 19, 2017 at 07:05:15PM +0800, Shaokun Zhang wrote: > This patchset adds support for HiSilicon SoC uncore PMUs driver. It > includes L3C, Hydra Home Agent (HHA) and DDRC. > > Changes in v6: > * remove redundant member hisi_pmu::oneline_cpus > * rename member hisi_pmu::id > * add event code check when event init > * fix online/offline notifier for L3C/HHA/DDRC > > Changes in v5: > * remove unnecessary name/num_events member in hisi_pmu > * refactor hisi_pmu_hwevents structure > * remove hisi_pmu_alloc function > * revise cpuhotplug for L3C PMUs > * add cpuhotplug for HHA/DDRC PMUs > * fix the name format of uncore PMUs > * remove unnecessary variants > > Changes in v4: > * remove redundant code and comments > * reverse the functions order in exit function > * remove some GPL information > * revise including header file > * fix Jonathan's other comments > > Changes in v3: > * rebase to 4.13-rc1 > * add dev_err if ioremap fails for PMUs > > Changes in v2: > * fix kbuild test robot error > * make hisi_uncore_ops static > > Shaokun Zhang (6): > Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver > perf: hisi: Add support for HiSilicon SoC uncore PMU driver > perf: hisi: Add support for HiSilicon SoC L3C PMU driver > perf: hisi: Add support for HiSilicon SoC HHA PMU driver > perf: hisi: Add support for HiSilicon SoC DDRC PMU driver > arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support > > Documentation/perf/hisi-pmu.txt | 53 +++ > MAINTAINERS | 7 + > drivers/perf/Kconfig | 7 + > 
drivers/perf/Makefile | 1 + > drivers/perf/hisilicon/Makefile | 1 + > drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 + > drivers/perf/hisilicon/hisi_uncore_hha_pmu.c | 473 > ++ > drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 463 + > drivers/perf/hisilicon/hisi_uncore_pmu.c | 444 > drivers/perf/hisilicon/hisi_uncore_pmu.h | 102 ++ > include/linux/cpuhotplug.h| 3 + > 11 files changed, 2017 insertions(+) > create mode 100644 Documentation/perf/hisi-pmu.txt > create mode 100644 drivers/perf/hisilicon/Makefile > create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c > create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c > create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c > create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c > create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h > > -- > 1.9.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 19 October 2017 at 00:12, Rafael J. Wysocki wrote: > On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote: >> [...] >> >> >> >> >> The reason why pm_runtime_force_* needs to respects the hierarchy of >> >> the RPM callbacks, is because otherwise it can't safely update the >> >> runtime PM status of the device. >> > >> > I'm not sure I follow this requirement. Why is that so? >> >> If the PM domain controls some resources for the device in its RPM >> callbacks and the driver controls some other resources in its RPM >> callbacks - then these resources needs to be managed together. > > Right, but that doesn't automatically make it necessary to use runtime PM > callbacks in the middle layer. Its system-wide PM callbacks may be > suitable for that just fine. > > That is, at least in some cases, you can combine ->runtime_suspend from a > driver and ->suspend_late from a middle layer with no problems, for example. > > That's why some middle layers allow drivers to point ->suspend_late and > ->runtime_suspend to the same routine if they want to reuse that code. > >> This follows the behavior of when a regular call to >> pm_runtime_get|put(), triggers the RPM callbacks to be invoked. > > But (a) it doesn't have to follow it and (b) in some cases it should not > follow it. Of course you don't explicitly *have to* respect the hierarchy of the RPM callbacks in pm_runtime_force_*. However, changing that would require some additional information exchange between the driver and the middle-layer/PM domain, as to instruct the middle-layer/PM domain of what to do during system-wide PM. Especially in cases when the driver deals with wakeup, as in those cases the instructions may change dynamically. [...] >> > In general, not if the wakeup settings are adjusted by the middle layer. >> >> Correct! >> >> To use pm_runtime_force* for these cases, one would need some >> additional information exchange between the driver and the >> middle-layer. 
> > Which pretty much defeats the purpose of the wrappers, doesn't it? Well, no matter if the wrappers are used or not, we need some kind of information exchange between the driver and the middle-layers/PM domains. Anyway, me personally think it's too early to conclude that using the wrappers may not be useful going forward. At this point, they clearly helps trivial cases to remain being trivial. > >> > >> >> Regarding hibernation, honestly that's not really my area of >> >> expertise. Although, I assume the middle-layer and driver can treat >> >> that as a separate case, so if it's not suitable to use >> >> pm_runtime_force* for that case, then they shouldn't do it. >> > >> > Well, agreed. >> > >> > In some simple cases, though, driver callbacks can be reused for >> > hibernation >> > too, so it would be good to have a common way to do that too, IMO. >> >> Okay, that makes sense! >> >> > >> >> > >> >> > Also, quite so often other middle layers interact with PCI directly or >> >> > indirectly (eg. a platform device may be a child or a consumer of a PCI >> >> > device) and some optimizations need to take that into account (eg. >> >> > parents >> >> > generally need to be accessible when their childres are resumed and so >> >> > on). >> >> >> >> A device's parent becomes informed when changing the runtime PM status >> >> of the device via pm_runtime_force_suspend|resume(), as those calls >> >> pm_runtime_set_suspended|active(). >> > >> > This requires the parent driver or middle layer to look at the reference >> > counter and understand it the same way as pm_runtime_force_*. >> > >> >> In case that isn't that sufficient, what else is needed? Perhaps you can >> >> point me to an example so I can understand better? >> > >> > Say you want to leave the parent suspended after system resume, but the >> > child drivers use pm_runtime_force_suspend|resume(). The parent would then >> > need to use pm_runtime_force_suspend|resume() too, no? >> >> Actually no. 
>> >> Currently the other options of "deferring resume" (not using >> pm_runtime_force_*), is either using the "direct_complete" path or >> similar to the approach you took for the i2c designware driver. >> >> Both cases should play nicely in combination of a child being managed >> by pm_runtime_force_*. That's because only when the parent device is >> kept runtime suspended during system suspend, resuming can be >> deferred. > > And because the parent remains in runtime suspend late enough in the > system suspend path, its children also are guaranteed to be suspended. Yes. > > But then all of them need to be left in runtime suspend during system > resume too, which is somewhat restrictive, because some drivers may > want their devices to be resumed then. Actually, this scenario is also addressed when using the pm_runtime_force_*. The driver for the child would only need to bump the runtime PM usage count (pm_runtime_get_noresume()) before calling pm_runtime_force_suspend() at system suspend. That then also propag
Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
On Thu, 2017-10-19 at 13:35 +0200, SF Markus Elfring wrote:
> > > > > Omit an extra message for a memory allocation failure in this function.
> > > > >
> > > > > This issue was detected by using the Coccinelle software.
[]
> > > Do you see any need that I should extend subsequent commit messages
> > > for this software transformation pattern?
> >
> > Add a description of _why_ this is being done.
> >
> > Something like:
> >
> > "because there is a dump_stack() done on allocation failures
> > without __GFP_NOWARN"
>
> How do you think about to convert such a description into a special format
> for further reference documentation?

I think it's a bad idea if it's a "special" format.

Always write _why_ some code is being changed.

People could read the commit descriptions and would not need
to take extra time to look up external references.

Maybe add something like "see (commit or )" for additional details"
Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
>>>> Omit an extra message for a memory allocation failure in this function.
>>>>
>>>> This issue was detected by using the Coccinelle software.
>>>>
>>>> Signed-off-by: Markus Elfring
>>>
>>> Applied to modules-next, thanks.
>>
>> Thanks for your acceptance of this update suggestion after a bit of
>> clarification.
>>
>> Do you see any need that I should extend subsequent commit messages
>> for this software transformation pattern?
>
> Add a description of _why_ this is being done.
>
> Something like:
>
> "because there is a dump_stack() done on allocation failures
> without __GFP_NOWARN"

How do you think about to convert such a description into a special format
for further reference documentation?

Regards,
Markus
Re: [PATCH v6 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver
On Thu, Oct 19, 2017 at 07:05:17PM +0800, Shaokun Zhang wrote:
> This patch adds support HiSilicon SoC uncore PMU driver framework and
> interfaces.

> +static bool hisi_validate_event_group(struct perf_event *event)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +	struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu);
> +	/* Include count for the event */
> +	int counters = 1;
> +
> +	/*
> +	 * We must NOT create groups containing mixed PMUs, although
> +	 * software events are acceptable
> +	 */
> +	if (leader->pmu != event->pmu && !is_software_event(leader))
> +		return false;
> +
> +	/* Increment counter for the leader */
> +	counters++;

Sorry I didn't spot this before, but I believe this should be:

	if (event != leader && !is_software_event(leader))
		counters++;

Since the leader can be a SW event, and for the group leader itself,
event == leader.

Assuming there aren't any major issues elsewhere, I can fix this up when
applying the series.

Thanks,
Mark.
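The over-count described in the review above can be reproduced with a small user-space model. This is only a sketch under stated assumptions: `struct model_event` and the `count_*` helpers are hypothetical stand-ins for `struct perf_event` and the driver's validation logic, and a return value of -1 models "mixed PMUs, reject the group".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct perf_event: only what the count needs. */
struct model_event {
	int pmu_id;			/* which PMU the event belongs to */
	bool is_software;		/* models is_software_event() */
	const struct model_event *leader;
};

/* Count hardware siblings on the event's PMU; -1 if a foreign PMU is found. */
static int count_siblings(const struct model_event *event,
			  const struct model_event *const *siblings, size_t n)
{
	int counters = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (siblings[i]->is_software)
			continue;
		if (siblings[i]->pmu_id != event->pmu_id)
			return -1;
		counters++;
	}
	return counters;
}

/* Counting as posted in v6: the leader is always added on top of the event. */
static int count_as_posted(const struct model_event *event,
			   const struct model_event *const *siblings, size_t n)
{
	const struct model_event *leader = event->leader;
	int counters = 1;	/* the event itself */
	int sib;

	if (leader->pmu_id != event->pmu_id && !leader->is_software)
		return -1;
	counters++;		/* also hit when event == leader or leader is SW */

	sib = count_siblings(event, siblings, n);
	return sib < 0 ? -1 : counters + sib;
}

/* Counting with the suggested fix: only a distinct hardware leader counts. */
static int count_with_fix(const struct model_event *event,
			  const struct model_event *const *siblings, size_t n)
{
	const struct model_event *leader = event->leader;
	int counters = 1;	/* the event itself */
	int sib;

	if (leader->pmu_id != event->pmu_id && !leader->is_software)
		return -1;
	if (event != leader && !leader->is_software)
		counters++;

	sib = count_siblings(event, siblings, n);
	return sib < 0 ? -1 : counters + sib;
}
```

In this model, an event that is its own group leader needs one counter but is charged two by the posted version; the same happens for a hardware event grouped under a software leader.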
Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
>> This is a small allocation so it can't fail in current kernels. I can't
>> imagine a situation where this could fail and it wasn't dead easy to
>> debug. Most modules are loaded at boot so it's not likely to fail, but
>> if it did, it would be easy to reproduce. If it's not loaded at boot
>> it's probably really easy to tell which module we're loading.
>
> Yeah, good points. And on second thought, we normally don't print
> warnings for every small alloc failure in the kernel anyway (that
> would be utterly superfluous), the error code itself is sufficient.
> And in the module loader this seems to be the only printk out of the
> dozen alloc calls we do, so I'm OK with removing this one.

Thanks for your constructive feedback.

Can it help to improve the corresponding documentation for Linux programming
interfaces a bit more?

Regards,
Markus
[PATCH v6 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver
This patch adds support HiSilicon SoC uncore PMU driver framework and interfaces. Reviewed-by: Jonathan Cameron Signed-off-by: Shaokun Zhang Signed-off-by: Anurup M --- drivers/perf/Kconfig | 7 + drivers/perf/Makefile| 1 + drivers/perf/hisilicon/Makefile | 1 + drivers/perf/hisilicon/hisi_uncore_pmu.c | 444 +++ drivers/perf/hisilicon/hisi_uncore_pmu.h | 102 +++ 5 files changed, 555 insertions(+) create mode 100644 drivers/perf/hisilicon/Makefile create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig index e5197ff..b1a3894 100644 --- a/drivers/perf/Kconfig +++ b/drivers/perf/Kconfig @@ -17,6 +17,13 @@ config ARM_PMU_ACPI depends on ARM_PMU && ACPI def_bool y +config HISI_PMU + bool "HiSilicon SoC PMU" + depends on ARM64 && ACPI + help + Support for HiSilicon SoC uncore performance monitoring + unit (PMU), such as: L3C, HHA and DDRC. + config QCOM_L2_PMU bool "Qualcomm Technologies L2-cache PMU" depends on ARCH_QCOM && ARM64 && ACPI diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index 6420bd4..41d3342 100644 --- a/drivers/perf/Makefile +++ b/drivers/perf/Makefile @@ -1,5 +1,6 @@ obj-$(CONFIG_ARM_PMU) += arm_pmu.o arm_pmu_platform.o obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o +obj-$(CONFIG_HISI_PMU) += hisilicon/ obj-$(CONFIG_QCOM_L2_PMU) += qcom_l2_pmu.o obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile new file mode 100644 index 000..2783bb3 --- /dev/null +++ b/drivers/perf/hisilicon/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c new file mode 100644 index 000..2bff43f --- /dev/null +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c @@ -0,0 +1,444 @@ +/* + * HiSilicon SoC Hardware event counters support + * + * 
Copyright (C) 2017 Hisilicon Limited + * Author: Anurup M + * Shaokun Zhang + * + * This code is based on the uncore PMUs like arm-cci and arm-ccn. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ +#include +#include +#include +#include +#include +#include + +#include + +#include "hisi_uncore_pmu.h" + +#define HISI_GET_EVENTID(ev) (ev->hw.config_base & 0xff) +#define HISI_MAX_PERIOD(nr) (BIT_ULL(nr) - 1) + +/* + * PMU format attributes + */ +ssize_t hisi_format_sysfs_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_ext_attribute *eattr; + + eattr = container_of(attr, struct dev_ext_attribute, attr); + + return sprintf(buf, "%s\n", (char *)eattr->var); +} + +/* + * PMU event attributes + */ +ssize_t hisi_event_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct dev_ext_attribute *eattr; + + eattr = container_of(attr, struct dev_ext_attribute, attr); + + return sprintf(page, "config=0x%lx\n", (unsigned long)eattr->var); +} + +/* + * sysfs cpumask attributes. 
For uncore PMU, we only have a single CPU to show + */ +ssize_t hisi_cpumask_sysfs_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct hisi_pmu *hisi_pmu = to_hisi_pmu(dev_get_drvdata(dev)); + + return sprintf(buf, "%d\n", hisi_pmu->on_cpu); +} + +static bool hisi_validate_event_group(struct perf_event *event) +{ + struct perf_event *sibling, *leader = event->group_leader; + struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu); + /* Include count for the event */ + int counters = 1; + + /* +* We must NOT create groups containing mixed PMUs, although +* software events are acceptable +*/ + if (leader->pmu != event->pmu && !is_software_event(leader)) + return false; + + /* Increment counter for the leader */ + counters++; + + list_for_each_entry(sibling, &event->group_leader->sibling_list, + group_entry) { + if (is_software_event(sibling)) + continue; + if (sibling->pmu != event->pmu) + return false; + /* Increment counter for each sibling */ + counters++; + } + + /* The group can not count events more than the counters in the HW */ + return counters <= hisi_pmu->num_counters; +} + +int hisi_uncore_pmu_counter_valid(struct hisi_pmu *hisi_pmu, int idx) +{ + return idx >= 0 && idx <
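The group check in hisi_validate_event_group() above can be modelled outside the kernel. The following is only a sketch under stated assumptions: struct model_event and model_group_fits() are hypothetical stand-ins for struct perf_event and the driver function, not kernel API. It also differs from the posted code on purpose in one place: the leader is counted only when it is a separate hardware event, because the posted code increments the count for the leader even when the event being validated is its own leader.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct perf_event. */
struct model_event {
	int pmu_id;		/* identifies the PMU the event belongs to */
	bool is_software;	/* software events need no hardware counter */
};

/*
 * Return true if the new event, its group leader and all hardware
 * siblings fit into num_counters hardware counters.
 */
static bool model_group_fits(const struct model_event *event,
			     const struct model_event *leader,
			     const struct model_event *siblings,
			     size_t nr_siblings, int num_counters)
{
	int counters = 1;	/* the event being validated */
	size_t i;

	/* Mixed hardware PMUs in one group are rejected. */
	if (leader->pmu_id != event->pmu_id && !leader->is_software)
		return false;

	/* Assumption: count the leader only if it is a distinct HW event. */
	if (leader != event && !leader->is_software)
		counters++;

	for (i = 0; i < nr_siblings; i++) {
		if (siblings[i].is_software)
			continue;
		if (siblings[i].pmu_id != event->pmu_id)
			return false;
		counters++;
	}

	return counters <= num_counters;
}
```

With this counting, a single event that is its own leader needs exactly one counter, which matters most on PMUs with few counters.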
[PATCH v6 3/6] perf: hisi: Add support for HiSilicon SoC L3C PMU driver
This patch adds support for the L3C PMU driver in the HiSilicon SoC chip.
Each L3C has its own control, counter and interrupt registers and is a
separate PMU. Each L3C PMU has 8 programmable counters, and each counter
is free-running. An interrupt is supported to handle counter (48-bit)
overflow.

Reviewed-by: Jonathan Cameron
Signed-off-by: Shaokun Zhang
Signed-off-by: Anurup M
---
 drivers/perf/hisilicon/Makefile              |   2 +-
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 463 +++
 include/linux/cpuhotplug.h                   |   1 +
 3 files changed, 465 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 2783bb3..4a3d3e6 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
new file mode 100644
index 000..0bde5d9
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
@@ -0,0 +1,463 @@
+/*
+ * HiSilicon SoC L3C uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Anurup M
+ *         Shaokun Zhang
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_uncore_pmu.h" + +/* L3C register definition */ +#define L3C_PERF_CTRL 0x0408 +#define L3C_INT_MASK 0x0800 +#define L3C_INT_STATUS 0x0808 +#define L3C_INT_CLEAR 0x080c +#define L3C_EVENT_CTRL 0x1c00 +#define L3C_EVENT_TYPE00x1d00 +/* + * Each counter is 48-bits and [48:63] are reserved + * which are Read-As-Zero and Writes-Ignored. + */ +#define L3C_CNTR0_LOWER0x1e00 + +/* L3C has 8-counters */ +#define L3C_NR_COUNTERS0x8 + +#define L3C_PERF_CTRL_EN 0x2 +#define L3C_EVTYPE_NONE0xff + +/* + * Select the counter register offset using the counter index + */ +static u32 hisi_l3c_pmu_get_counter_offset(int cntr_idx) +{ + return (L3C_CNTR0_LOWER + (cntr_idx * 8)); +} + +static u64 hisi_l3c_pmu_read_counter(struct hisi_pmu *l3c_pmu, +struct hw_perf_event *hwc) +{ + u32 idx = hwc->idx; + + if (!hisi_uncore_pmu_counter_valid(l3c_pmu, idx)) { + dev_err(l3c_pmu->dev, "Unsupported event index:%d!\n", idx); + return 0; + } + + /* Read 64-bits and the upper 16 bits are RAZ */ + return readq(l3c_pmu->base + hisi_l3c_pmu_get_counter_offset(idx)); +} + +static void hisi_l3c_pmu_write_counter(struct hisi_pmu *l3c_pmu, + struct hw_perf_event *hwc, u64 val) +{ + u32 idx = hwc->idx; + + if (!hisi_uncore_pmu_counter_valid(l3c_pmu, idx)) { + dev_err(l3c_pmu->dev, "Unsupported event index:%d!\n", idx); + return; + } + + /* Write 64-bits and the upper 16 bits are WI */ + writeq(val, l3c_pmu->base + hisi_l3c_pmu_get_counter_offset(idx)); +} + +static void hisi_l3c_pmu_write_evtype(struct hisi_pmu *l3c_pmu, int idx, + u32 type) +{ + u32 reg, reg_idx, shift, val; + + /* +* Select the appropriate event select register(L3C_EVENT_TYPE0/1). +* There are 2 event select registers for the 8 hardware counters. +* Event code is 8-bits and for the former 4 hardware counters, +* L3C_EVENT_TYPE0 is chosen. For the latter 4 hardware counters, +* L3C_EVENT_TYPE1 is chosen. 
+*/ + reg = L3C_EVENT_TYPE0 + (idx / 4) * 4; + reg_idx = idx % 4; + shift = 8 * reg_idx; + + /* Write event code to L3C_EVENT_TYPEx Register */ + val = readl(l3c_pmu->base + reg); + val &= ~(L3C_EVTYPE_NONE << shift); + val |= (type << shift); + writel(val, l3c_pmu->base + reg); +} + +static void hisi_l3c_pmu_start_counters(struct hisi_pmu *l3c_pmu) +{ + u32 val; + + /* +* Set perf_enable bit in L3C_PERF_CTRL register to start counting +* for all enabled counters. +*/ + val = readl(l3c_pmu->base + L3C_PERF_CTRL); + val |= L3C_PERF_CTRL_EN; + writel(val, l3c_pmu->base + L3C_PERF_CTRL); +} + +static void hisi_l3c_pmu_stop_counters(struct hisi_pmu *l3c_pmu) +{ + u32 val; + + /* +* Clear perf_enable bit in L3C_PERF_CTRL register to stop counting +* for all enabled counters. +*/ + val = readl(l3c_pmu->base + L3
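The register selection in hisi_l3c_pmu_write_evtype() above packs four 8-bit event codes into each 32-bit event select register. As a rough illustration, the arithmetic can be separated from the MMIO accesses as below. The function names l3c_evtype_reg() and l3c_evtype_update() are hypothetical, and masking the event code to 8 bits is an extra precaution that the posted code does not take.

```c
#include <stdint.h>

#define L3C_EVENT_TYPE0	0x1d00u
#define L3C_EVTYPE_NONE	0xffu

/* Offset of the event select register holding counter idx (4 per register). */
static uint32_t l3c_evtype_reg(unsigned int idx)
{
	return L3C_EVENT_TYPE0 + (idx / 4) * 4;
}

/*
 * Value to write back after placing 'type' into the byte lane that
 * belongs to counter idx; the other three lanes are left untouched.
 */
static uint32_t l3c_evtype_update(uint32_t old, unsigned int idx, uint32_t type)
{
	unsigned int shift = 8 * (idx % 4);

	old &= ~(L3C_EVTYPE_NONE << shift);
	old |= (type & L3C_EVTYPE_NONE) << shift;	/* assumption: clamp to 8 bits */
	return old;
}
```

Counters 0 to 3 therefore share L3C_EVENT_TYPE0 and counters 4 to 7 share the next register, four bytes further on.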
[PATCH v6 6/6] arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support
Add support for the HiSilicon SoC uncore PMU driver.

Signed-off-by: Shaokun Zhang
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a74227a..96c583c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6242,6 +6242,13 @@ S: Maintained
 F: drivers/net/ethernet/hisilicon/
 F: Documentation/devicetree/bindings/net/hisilicon*.txt
 
+HISILICON PMU DRIVER
+M: Shaokun Zhang
+W: http://www.hisilicon.com
+S: Supported
+F: drivers/perf/hisilicon
+F: Documentation/perf/hisi-pmu.txt
+
 HISILICON ROCE DRIVER
 M: Lijun Ou
 M: Wei Hu(Xavier)
-- 
1.9.1
[PATCH v6 5/6] perf: hisi: Add support for HiSilicon SoC DDRC PMU driver
This patch adds support for the DDRC PMU driver in the HiSilicon SoC chip.
Each DDRC has its own control, counter and interrupt registers and is a
separate PMU. Each DDRC PMU has 8 fixed-purpose counters, which the
hardware maps to 8 events; the DDRC PMU driver assumes that the counter
index equals the event code (0 - 7). An interrupt is supported to handle
counter (32-bit) overflow.

Reviewed-by: Jonathan Cameron
Signed-off-by: Shaokun Zhang
Signed-off-by: Anurup M
---
 drivers/perf/hisilicon/Makefile               |   2 +-
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 ++
 include/linux/cpuhotplug.h                    |   1 +
 3 files changed, 465 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index a72afe8..2621d51 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o hisi_uncore_ddrc_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
new file mode 100644
index 000..1b10ea0
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
@@ -0,0 +1,463 @@
+/*
+ * HiSilicon SoC DDRC uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Shaokun Zhang
+ *         Anurup M
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_uncore_pmu.h" + +/* DDRC register definition */ +#define DDRC_PERF_CTRL 0x010 +#define DDRC_FLUX_WR 0x380 +#define DDRC_FLUX_RD 0x384 +#define DDRC_FLUX_WCMD 0x388 +#define DDRC_FLUX_RCMD 0x38c +#define DDRC_PRE_CMD0x3c0 +#define DDRC_ACT_CMD0x3c4 +#define DDRC_BNK_CHG0x3c8 +#define DDRC_RNK_CHG0x3cc +#define DDRC_EVENT_CTRL 0x6C0 +#define DDRC_INT_MASK 0x6c8 +#define DDRC_INT_STATUS0x6cc +#define DDRC_INT_CLEAR 0x6d0 + +/* DDRC has 8-counters */ +#define DDRC_NR_COUNTERS 0x8 +#define DDRC_PERF_CTRL_EN 0x2 + +/* + * For DDRC PMU, there are eight-events and every event has been mapped + * to fixed-purpose counters which register offset is not consistent. + * Therefore there is no write event type and we assume that event + * code (0 to 7) is equal to counter index in PMU driver. + */ +#define GET_DDRC_EVENTID(hwc) (hwc->config_base & 0x7) + +static const u32 ddrc_reg_off[] = { + DDRC_FLUX_WR, DDRC_FLUX_RD, DDRC_FLUX_WCMD, DDRC_FLUX_RCMD, + DDRC_PRE_CMD, DDRC_ACT_CMD, DDRC_BNK_CHG, DDRC_RNK_CHG +}; + +/* + * Select the counter register offset using the counter index. + * In DDRC there are no programmable counter, the count + * is readed form the statistics counter register itself. 
+ */ +static u32 hisi_ddrc_pmu_get_counter_offset(int cntr_idx) +{ + return ddrc_reg_off[cntr_idx]; +} + +static u64 hisi_ddrc_pmu_read_counter(struct hisi_pmu *ddrc_pmu, + struct hw_perf_event *hwc) +{ + /* Use event code as counter index */ + u32 idx = GET_DDRC_EVENTID(hwc); + + if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) { + dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx); + return 0; + } + + return readl(ddrc_pmu->base + hisi_ddrc_pmu_get_counter_offset(idx)); +} + +static void hisi_ddrc_pmu_write_counter(struct hisi_pmu *ddrc_pmu, + struct hw_perf_event *hwc, u64 val) +{ + u32 idx = GET_DDRC_EVENTID(hwc); + + if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) { + dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx); + return; + } + + writel((u32)val, + ddrc_pmu->base + hisi_ddrc_pmu_get_counter_offset(idx)); +} + +/* + * For DDRC PMU, event has been mapped to fixed-purpose counter by hardware, + * so there is no need to write event type. + */ +static void hisi_ddrc_pmu_write_evtype(struct hisi_pmu *hha_pmu, int idx, + u32 type) +{ +} + +static void hisi_ddrc_pmu_start_counters(struct hisi_pmu *ddrc_pmu) +{ + u32 val; + + /* Set perf_enable in DDRC_PERF_CTRL to start event counting */ + val = readl(ddrc_pmu->base + DDRC_PERF_CTRL); + val |= DDRC_PERF_CTRL_EN; + writel(val, ddrc_pmu->base + DDRC_PERF_CTRL); +} + +static void hisi_ddrc_pmu_stop_counters(st
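Because the DDRC counters are fixed-purpose and their registers are not evenly spaced, the event code doubles as an index into an offset table. A minimal userspace sketch of that lookup follows; it reuses the register offsets quoted in the patch, but ddrc_counter_offset() is a hypothetical name, and the explicit range check is an assumption of this sketch (the posted code masks the event code with 0x7 instead).

```c
#include <stdint.h>

/* Statistics counter register offsets, as listed in the patch. */
#define DDRC_FLUX_WR	0x380
#define DDRC_FLUX_RD	0x384
#define DDRC_FLUX_WCMD	0x388
#define DDRC_FLUX_RCMD	0x38c
#define DDRC_PRE_CMD	0x3c0
#define DDRC_ACT_CMD	0x3c4
#define DDRC_BNK_CHG	0x3c8
#define DDRC_RNK_CHG	0x3cc

#define DDRC_NR_COUNTERS 8

static const uint32_t ddrc_reg_off[DDRC_NR_COUNTERS] = {
	DDRC_FLUX_WR, DDRC_FLUX_RD, DDRC_FLUX_WCMD, DDRC_FLUX_RCMD,
	DDRC_PRE_CMD, DDRC_ACT_CMD, DDRC_BNK_CHG, DDRC_RNK_CHG
};

/*
 * Map an event code (0 to 7) to its counter register offset.
 * Returns -1 for an out-of-range code rather than wrapping it.
 */
static int ddrc_counter_offset(unsigned int event_code)
{
	if (event_code >= DDRC_NR_COUNTERS)
		return -1;
	return (int)ddrc_reg_off[event_code];
}
```

Rejecting an out-of-range code, rather than masking it, keeps an invalid config value from silently aliasing to a valid counter.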
[PATCH v6 4/6] perf: hisi: Add support for HiSilicon SoC HHA PMU driver
L3 cache coherence is maintained by the Hydra Home Agent (HHA) in the
HiSilicon SoC. This patch adds support for the HHA PMU driver. Each HHA
has its own control, counter and interrupt registers and is a separate
PMU. Each HHA PMU has 16 programmable counters, and each counter is
free-running. An interrupt is supported to handle counter (48-bit)
overflow.

Reviewed-by: Jonathan Cameron
Signed-off-by: Shaokun Zhang
Signed-off-by: Anurup M
---
 drivers/perf/hisilicon/Makefile              |   2 +-
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c | 473 +++
 include/linux/cpuhotplug.h                   |   1 +
 3 files changed, 475 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 4a3d3e6..a72afe8 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
new file mode 100644
index 000..443906e
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
@@ -0,0 +1,473 @@
+/*
+ * HiSilicon SoC HHA uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Shaokun Zhang
+ *         Anurup M
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_uncore_pmu.h" + +/* HHA register definition */ +#define HHA_INT_MASK 0x0804 +#define HHA_INT_STATUS 0x0808 +#define HHA_INT_CLEAR 0x080C +#define HHA_PERF_CTRL 0x1E00 +#define HHA_EVENT_CTRL 0x1E04 +#define HHA_EVENT_TYPE00x1E80 +/* + * Each counter is 48-bits and [48:63] are reserved + * which are Read-As-Zero and Writes-Ignored. + */ +#define HHA_CNT0_LOWER 0x1F00 + +/* HHA has 16-counters */ +#define HHA_NR_COUNTERS0x10 + +#define HHA_PERF_CTRL_EN 0x1 +#define HHA_EVTYPE_NONE0xff + +/* + * Select the counter register offset using the counter index + * each counter is 48-bits. + */ +static u32 hisi_hha_pmu_get_counter_offset(int cntr_idx) +{ + return (HHA_CNT0_LOWER + (cntr_idx * 8)); +} + +static u64 hisi_hha_pmu_read_counter(struct hisi_pmu *hha_pmu, +struct hw_perf_event *hwc) +{ + u32 idx = hwc->idx; + + if (!hisi_uncore_pmu_counter_valid(hha_pmu, idx)) { + dev_err(hha_pmu->dev, "Unsupported event index:%d!\n", idx); + return 0; + } + + /* Read 64 bits and like L3C, top 16 bits are RAZ */ + return readq(hha_pmu->base + hisi_hha_pmu_get_counter_offset(idx)); +} + +static void hisi_hha_pmu_write_counter(struct hisi_pmu *hha_pmu, + struct hw_perf_event *hwc, u64 val) +{ + u32 idx = hwc->idx; + + if (!hisi_uncore_pmu_counter_valid(hha_pmu, idx)) { + dev_err(hha_pmu->dev, "Unsupported event index:%d!\n", idx); + return; + } + + /* Write 64 bits and like L3C, top 16 bits are WI */ + writeq(val, hha_pmu->base + hisi_hha_pmu_get_counter_offset(idx)); +} + +static void hisi_hha_pmu_write_evtype(struct hisi_pmu *hha_pmu, int idx, + u32 type) +{ + u32 reg, reg_idx, shift, val; + + /* +* Select the appropriate event select register(HHA_EVENT_TYPEx). +* There are 4 event select registers for the 16 hardware counters. +* Event code is 8-bits and for the first 4 hardware counters, +* HHA_EVENT_TYPE0 is chosen. 
For the next 4 hardware counters, +* HHA_EVENT_TYPE1 is chosen and so on. +*/ + reg = HHA_EVENT_TYPE0 + 4 * (idx / 4); + reg_idx = idx % 4; + shift = 8 * reg_idx; + + /* Write event code to HHA_EVENT_TYPEx register */ + val = readl(hha_pmu->base + reg); + val &= ~(HHA_EVTYPE_NONE << shift); + val |= (type << shift); + writel(val, hha_pmu->base + reg); +} + +static void hisi_hha_pmu_start_counters(struct hisi_pmu *hha_pmu) +{ + u32 val; + + /* +* Set perf_enable bit in HHA_PERF_CTRL to start event +* counting for all enabled counters. +*/ + val = readl(hha_pmu->base + HHA_PERF_CTRL); + val |= HHA_PERF_CTRL_EN; + writel(val, hha_pmu->base + HHA_PERF_CTRL); +} + +static void hisi_hha_pmu_stop_counters(struct hisi_pmu *hha_pmu) +{ + u32 val; + + /* +* Clear perf_enable bit
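The HHA counters are 48 bits wide but sit in 64-bit register slots, so two details are worth spelling out: the 8-byte stride between counters and the masking needed when a counter wraps. The sketch below is an assumption-laden illustration, not the driver code: counter_max_period() mirrors the HISI_MAX_PERIOD() macro from patch 2/6, while counter_delta() shows one common way a perf driver might compute the events elapsed between two reads; the truncated patch does not show how the driver itself does this.

```c
#include <stdint.h>

#define HHA_CNT0_LOWER	0x1F00u

/* Register offset of counter idx; each counter occupies a 64-bit slot. */
static uint32_t hha_counter_offset(unsigned int idx)
{
	return HHA_CNT0_LOWER + idx * 8;
}

/* Largest value a counter of the given width can hold (bits < 64). */
static uint64_t counter_max_period(unsigned int bits)
{
	return (UINT64_C(1) << bits) - 1;
}

/*
 * Events counted between two reads of a free-running counter.
 * Masking the unsigned difference to the counter width gives the
 * right answer even when the counter wrapped once in between.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned int bits)
{
	return (now - prev) & counter_max_period(bits);
}
```

The same helper would work for the 32-bit DDRC counters by passing 32 as the width.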
[PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver
This patchset adds support for the HiSilicon SoC uncore PMU drivers. It
covers the L3C, Hydra Home Agent (HHA) and DDRC PMUs.

Changes in v6:
* remove redundant member hisi_pmu::oneline_cpus
* rename member hisi_pmu::id
* add event code check when event init
* fix online/offline notifier for L3C/HHA/DDRC

Changes in v5:
* remove unnecessary name/num_events member in hisi_pmu
* refactor hisi_pmu_hwevents structure
* remove hisi_pmu_alloc function
* revise cpuhotplug for L3C PMUs
* add cpuhotplug for HHA/DDRC PMUs
* fix the name format of uncore PMUs
* remove unnecessary variants

Changes in v4:
* remove redundant code and comments
* reverse the functions order in exit function
* remove some GPL information
* revise including header file
* fix Jonathan's other comments

Changes in v3:
* rebase to 4.13-rc1
* add dev_err if ioremap fails for PMUs

Changes in v2:
* fix kbuild test robot error
* make hisi_uncore_ops static

Shaokun Zhang (6):
  Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver
  perf: hisi: Add support for HiSilicon SoC uncore PMU driver
  perf: hisi: Add support for HiSilicon SoC L3C PMU driver
  perf: hisi: Add support for HiSilicon SoC HHA PMU driver
  perf: hisi: Add support for HiSilicon SoC DDRC PMU driver
  arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support

 Documentation/perf/hisi-pmu.txt               |  53 +++
 MAINTAINERS                                   |   7 +
 drivers/perf/Kconfig                          |   7 +
 drivers/perf/Makefile                         |   1 +
 drivers/perf/hisilicon/Makefile               |   1 +
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 +
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c  | 473 ++
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c  | 463 +
 drivers/perf/hisilicon/hisi_uncore_pmu.c      | 444
 drivers/perf/hisilicon/hisi_uncore_pmu.h      | 102 ++
 include/linux/cpuhotplug.h                    |   3 +
 11 files changed, 2017 insertions(+)
 create mode 100644 Documentation/perf/hisi-pmu.txt
 create mode 100644 drivers/perf/hisilicon/Makefile
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h
-- 
1.9.1
[PATCH v6 1/6] Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver
This patch adds documentation for the uncore PMUs on HiSilicon SoC.

Reviewed-by: Jonathan Cameron
Signed-off-by: Shaokun Zhang
Signed-off-by: Anurup M
---
 Documentation/perf/hisi-pmu.txt | 53 +
 1 file changed, 53 insertions(+)
 create mode 100644 Documentation/perf/hisi-pmu.txt

diff --git a/Documentation/perf/hisi-pmu.txt b/Documentation/perf/hisi-pmu.txt
new file mode 100644
index 000..267a028
--- /dev/null
+++ b/Documentation/perf/hisi-pmu.txt
@@ -0,0 +1,53 @@
+HiSilicon SoC uncore Performance Monitoring Unit (PMU)
+======================================================
+The HiSilicon SoC chip includes various independent system device PMUs
+such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
+independent and have hardware logic to gather statistics and performance
+information.
+
+The HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster
+(CCL) is made up of 4 cpu cores sharing one L3 cache; each CPU die is
+called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has
+two HHAs (0 - 1) and four DDRCs (0 - 3), respectively.
+
+HiSilicon SoC uncore PMU driver
+-------------------------------
+Each device PMU has separate registers for event counting, control and
+interrupt, and the PMU driver shall register perf PMU drivers like L3C,
+HHA and DDRC etc. The available events and configuration options shall
+be described in the sysfs, see :
+/sys/devices/hisi_sccl{X}_/, or
+/sys/bus/event_source/devices/hisi_sccl{X}_.
+The "perf list" command shall list the available events from sysfs.
+
+Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU
+name will appear in event listing as hisi_sccl<sccl-id>_module<index-id>.
+where "sccl-id" is the identifier of the SCCL and "index-id" is the index of
+module.
+e.g. hisi_sccl3_l3c0/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 in
+SCCL ID #3.
+e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in
+SCCL ID #1.
+
+The driver also provides a "cpumask" sysfs attribute, which shows the CPU core
+ID used to count the uncore PMU event.
+
+Example usage of perf:
+$# perf list
+hisi_sccl3_l3c0/rd_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl3_l3c0/wr_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl1_l3c0/rd_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl1_l3c0/wr_hit_cpipe/ [kernel PMU event]
+--
+
+$# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5
+$# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5
+
+The current driver does not support sampling. So "perf record" is unsupported.
+Also attach to a task is unsupported as the events are all uncore.
+
+Note: Please contact the maintainer for a complete list of events supported for
+the PMU devices in the SoC and its information if needed.
-- 
1.9.1
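The PMU names used in the examples above (hisi_sccl3_l3c0, hisi_sccl1_hha0) follow one pattern: the SCCL identifier, then the module name, then the module index. The sketch below shows how such a name could be composed; hisi_pmu_name() is a hypothetical helper written for illustration and is not taken from the driver.

```c
#include <stdio.h>

/*
 * Build a PMU name such as "hisi_sccl3_l3c0" from the SCCL identifier,
 * the module name ("l3c", "hha" or "ddrc") and the module index.
 * Returns the snprintf() result, i.e. the length the full name needs.
 */
static int hisi_pmu_name(char *buf, size_t len, unsigned int sccl_id,
			 const char *module, unsigned int index_id)
{
	return snprintf(buf, len, "hisi_sccl%u_%s%u", sccl_id, module, index_id);
}
```

A name built this way is what would appear before the slash in a perf event string such as hisi_sccl3_l3c0/rd_hit_cpipe/.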
[PATCH 3/4] document: change the document for the extended movable_node
Document the extended movable_node=nn[KMG]@ss[KMG] kernel parameter.

Cc: Jonathan Corbet
Cc: linux-doc@vger.kernel.org
Signed-off-by: Chao Fan
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ead7f4066ea4..226560667d84 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2332,6 +2332,15 @@
 			allocations which rules out almost all kernel
 			allocations. Use with caution!
 
+	movable_node=nn[KMG]@ss[KMG]
+			[KNL] Force usage of a specific region of memory.
+			Extend movable_node to work well with KASLR.
+			Region of memory in immovable node is from ss to ss+nn.
+			Multiple regions can be specified, comma delimited.
+			Notice: we support 4 regions at most now.
+			Example:
+				movable_node=100M@2G,1G@4G
+
 	MTD_Partition=	[MTD]
 			Format: ,,,
-- 
2.13.6
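To make the nn[KMG]@ss[KMG] syntax concrete, here is a small userspace sketch of a parser for it. It is an illustration only: the patch that implements the parsing is not part of this message, so struct mem_region, parse_size() and parse_movable_node() are hypothetical names, and parse_size() is a simplified stand-in for the kernel's memparse() helper. The limit of four regions comes from the documentation text above.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_REGIONS 4	/* "we support 4 regions at most now" */

struct mem_region {
	uint64_t start;	/* ss: start address of the region */
	uint64_t size;	/* nn: length of the region */
};

/* Parse a number with an optional K/M/G suffix and advance *s past it. */
static uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t v = strtoull(*s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*s = end;
	return v;
}

/*
 * Parse "nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG]...]" into out[].
 * Returns the number of regions, or -1 on a syntax error or when
 * more than MAX_REGIONS regions are given.
 */
static int parse_movable_node(const char *arg, struct mem_region *out)
{
	int n = 0;

	while (*arg) {
		const char *before;

		if (n == MAX_REGIONS)
			return -1;

		before = arg;
		out[n].size = parse_size(&arg);
		if (arg == before || *arg != '@')
			return -1;
		arg++;

		before = arg;
		out[n].start = parse_size(&arg);
		if (arg == before)
			return -1;
		n++;

		if (*arg == ',') {
			arg++;
			if (!*arg)
				return -1;	/* trailing comma */
		} else if (*arg) {
			return -1;		/* unexpected character */
		}
	}
	return n;
}
```

For the documented example, movable_node=100M@2G,1G@4G yields two regions: 100 MiB starting at 2 GiB, and 1 GiB starting at 4 GiB.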
Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume
On 18 October 2017 at 23:48, Rafael J. Wysocki wrote:
> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:
>>
>> On 10/18/2017 09:11 AM, Ulf Hansson wrote:
>
> [...]
>
>> >>> That's the point. We know pm_runtime_force_* works nicely for the
>> >>> trivial middle-layer cases.
>> >>
>> >> In which cases the middle-layer callbacks don't exist, so it's just like
>> >> reusing driver callbacks directly. :-)
>>
>> I'd like to ask you clarify one point here and provide some info which I
>> hope can be useful -
>> what's exactly means "trivial middle-layer cases"?
>>
>> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
>> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP
>> device framework struct dev_pm_domain omap_device_pm_domain
>> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
>> tegra_aconnect_pm_ops?
>>
>> if yes all above have PM runtime callbacks.
>
> Trivial ones don't actually do anything meaningful in their PM callbacks.
>
> Things like the platform bus type, spi bus type, i2c bus type and similar.
>
> If the middle-layer callbacks manipulate devices in a significant way, then
> they aren't trivial.

I fully agree with Rafael's description above, but let me also clarify
one more thing.

We have also been discussing PM domains as being trivial and
non-trivial. In some statements I even think the PM domain has been a
part of the middle-layer terminology, which may have been a bit
confusing.

In this regard, as we consider genpd to be a trivial PM domain, the
examples you bring up above are, to me, also examples of trivial PM
domains. Especially because they don't deal with wakeups, as that is
taken care of by the drivers, right!?

Kind regards
Uffe
Re: [Update][PATCH v2 01/12] PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags
On Thu, Oct 19, 2017 at 01:17:31AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> The motivation for this change is to provide a way to work around
> a problem with the direct-complete mechanism used for avoiding
> system suspend/resume handling for devices in runtime suspend.
>
> The problem is that some middle layer code (the PCI bus type and
> the ACPI PM domain in particular) returns positive values from its
> system suspend ->prepare callbacks regardless of whether the driver's
> ->prepare returns a positive value or 0, which effectively prevents
> drivers from being able to control the direct-complete feature.
> Some drivers need that control, however, and the PCI bus type has
> grown its own flag to deal with this issue, but since it is not
> limited to PCI, it is better to address it by adding driver flags at
> the core level.
>
> To that end, add a driver_flags field to struct dev_pm_info for flags
> that can be set by device drivers at the probe time to inform the PM
> core and/or bus types, PM domains and so on on the capabilities and/or
> preferences of device drivers. Also add two static inline helpers
> for setting that field and testing it against a given set of flags
> and make the driver core clear it automatically on driver remove
> and probe failures.
>
> Define and document two PM driver flags related to the direct-
> complete feature: NEVER_SKIP and SMART_PREPARE that can be used,
> respectively, to indicate to the PM core that the direct-complete
> mechanism should never be used for the device and to inform the
> middle layer code (bus types, PM domains etc) that it can only
> request the PM core to use the direct-complete mechanism for
> the device (by returning a positive value from its ->prepare
> callback) if it also has been requested by the driver.
>
> While at it, make the core check pm_runtime_suspended() when
> setting power.direct_complete so that it doesn't need to be
> checked by ->prepare callbacks.
>
> Signed-off-by: Rafael J. Wysocki

Acked-by: Greg Kroah-Hartman
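The interplay of the two flags described in the quoted changelog can be modelled in a few lines of plain C. This is a sketch of the decision logic only, written from the changelog text: the MODEL_FLAG_* constants and both functions are hypothetical names, not the identifiers the patch adds, and the real PM core considers more state than is shown here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the NEVER_SKIP and SMART_PREPARE flags. */
#define MODEL_FLAG_NEVER_SKIP		(1u << 0)
#define MODEL_FLAG_SMART_PREPARE	(1u << 1)

/*
 * Model of a middle layer's ->prepare callback. Without SMART_PREPARE
 * it asks for direct-complete (returns 1) whatever the driver said;
 * with SMART_PREPARE it does so only if the driver's ->prepare
 * returned a positive value too.
 */
static int model_middle_layer_prepare(unsigned int driver_flags,
				      int driver_prepare_ret)
{
	if (driver_prepare_ret < 0)
		return driver_prepare_ret;	/* propagate errors */
	if ((driver_flags & MODEL_FLAG_SMART_PREPARE) &&
	    driver_prepare_ret == 0)
		return 0;
	return 1;
}

/*
 * Model of the core's decision: direct-complete is used only when it
 * was requested, the device is runtime-suspended and the driver did
 * not set NEVER_SKIP.
 */
static bool model_direct_complete(unsigned int driver_flags, int prepare_ret,
				  bool runtime_suspended)
{
	if (driver_flags & MODEL_FLAG_NEVER_SKIP)
		return false;
	return prepare_ret > 0 && runtime_suspended;
}
```

In this model a driver with no flags set has no say at all, which is the problem the changelog describes; either flag gives the driver a way to veto direct-complete.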