Re: [PATCH 0/2] Make squashfs fragments' cache size more configurable

2017-10-19 Thread Phillip Lougher
On Thu, Oct 19, 2017 at 12:50 AM, Qixuan Wu  wrote:
> Hi All,
>
> Currently, squashfs fragments' cache size is only determined by
> config option CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE. Users have
> no way to change the value when they get the binary kernel.

Thank you for the patches, but they're both pointless and dangerous.
Let's be clear here: you're trying to change an "expert only" kernel
configuration option into a user-changeable option.  This is stupid
because, for good reason, it is not meant for non-experts to change.

The fragment cache size isn't some small tweak to the operation of
Squashfs, it fundamentally affects both the performance and memory
overhead of Squashfs.  As such right from its introduction in 2003, it
has been an "expert only" configuration option at build time.  Even
then it is made clear that the default has been carefully chosen, and
it should only be changed in exceptional circumstances.  This
basically means don't change the default unless you really know what
you're doing, and this means tracing of Squashfs against your use-case
to determine caching behaviour.  There is absolutely no other reason
why you'd want to change the default.  This also means it should be
restricted to kernel configuration time only.

Let's be clear again, very few people should ever want to change the
default, and for the "experts" that do want to do so, they can do so
when configuring the kernel.  If you're not in a position to change it
at kernel configuration time then by definition you're not an expert,
and you shouldn't be able to change it anyway and certainly not as a
user.

There is absolutely no use-case here to make this a user changeable
option.  I can see no upsides in doing this, only downsides.

Frankly if you need to change this value at module insert time then
there is something wrong with your system or build process.   If you
want this because you want to build the kernel/modules once, and then
post-facto configure them for various products then it is your build
process that is broken.   If you want this because you want to
dynamically change Squashfs memory usage/caching behaviour post kernel
configuration time it suggests you're trying to adapt Squashfs's
footprint based on available memory.   This is an abuse of the option
as it's only meant to be used after detailed tracing/analysis and
certainly not used to accommodate unforeseen dynamic low memory
situations, and if that's the reason for needing this option, you
should be looking to solve it elsewhere.

Ultimately this has been an "expert" kernel-configuration-only option
since its introduction in 2003, and I have never been asked to change it,
and I believe this is because people recognise it as such.  I suspect
you're trying to change this for fundamentally bogus reasons.
Moreover Squashfs is used in many different use-cases and
distributions, and I'm not going to make this a user-changeable option
allowing users to insert the Squashfs module in such a way that will
break its performance.

So NACK.

Phillip Lougher (Squashfs maintainer)

> This series makes it configurable at boot or at module insertion.
> Actually, it would be better if any numeric config option in the
> .config file could be reconfigured at boot or at module
> insertion.
>
> Thanks
> Qixuan
>
> Qixuan Wu (2):
>   Squashfs: Let the number of fragments cached configurable
>   Documentation/kernel-parameters.txt: Add kernel parameter of squashfs
> fragments' cache size
>
>  Documentation/admin-guide/kernel-parameters.txt |  7 
>  fs/squashfs/super.c | 43 
> -
>  2 files changed, 49 insertions(+), 1 deletion(-)
>
> --
> 2.7.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
[...]

> In this regard, as we consider genpd to be a trivial PM domain, the
> examples you bring up above are to me also examples of trivial PM
> domains, especially because they don't deal with wakeups, as that is
> taken care of by the drivers, right!?

 Not directly. For example, the omap device framework has a noirq callback
 implemented which forcibly disables all devices that are not PM runtime
 suspended. While doing this it calls the driver's PM .runtime_suspend(),
 which may return a non-zero value, and in that case the device will be left
 enabled (powered) at suspend for wake-up purposes (see _od_suspend_noirq()).

>>>
>>> Yeah, I had that feeling that omap has some trickiness going on. :-)
>>>
>>> I'm sure that can be fixed in the omap PM domain, although
>>
>> ...slipped with my fingers.. here is the rest of the reply...
>>
>> ..of course that requires us to use another way for drivers to signal
>> to the omap PM domain that it needs to stay powered so as to deal with
>> wakeup.
>>
>> I can have a look at that more closely, to see if it makes sense to change.
>>
>
> Also, an additional note here: some IPs are reused between OMAP/Davinci/Keystone.
> The OMAP PM domain has some code running at noirq time to deal with devices left
> in the PM runtime enabled state (OMAP is PM runtime centric), while
> Davinci/Keystone don't (clock_ops.c), so the pm_runtime_force_* API is actually
> a possibility now to make the same driver work on all these platforms.

That sounds great!

Also, in the end it would be nice to also convert the OMAP PM domain
to genpd. I think most of the needed infrastructure is already there
to do that.

Kind regards
Uffe


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
On 20 October 2017 at 03:19, Rafael J. Wysocki  wrote:
> On Thursday, October 19, 2017 2:21:07 PM CEST Ulf Hansson wrote:
>> On 19 October 2017 at 00:12, Rafael J. Wysocki  wrote:
>> > On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote:
>> >> [...]
>> >>
>> >> >>
>> >> >> The reason why pm_runtime_force_* needs to respect the hierarchy of
>> >> >> the RPM callbacks, is because otherwise it can't safely update the
>> >> >> runtime PM status of the device.
>> >> >
>> >> > I'm not sure I follow this requirement.  Why is that so?
>> >>
>> >> If the PM domain controls some resources for the device in its RPM
>> >> callbacks and the driver controls some other resources in its RPM
>> >> callbacks - then these resources need to be managed together.
>> >
>> > Right, but that doesn't automatically make it necessary to use runtime PM
>> > callbacks in the middle layer.  Its system-wide PM callbacks may be
>> > suitable for that just fine.
>> >
>> > That is, at least in some cases, you can combine ->runtime_suspend from a
>> > driver and ->suspend_late from a middle layer with no problems, for 
>> > example.
>> >
>> > That's why some middle layers allow drivers to point ->suspend_late and
>> > ->runtime_suspend to the same routine if they want to reuse that code.
>> >
>> >> This follows the behavior of when a regular call to
>> >> pm_runtime_get|put(), triggers the RPM callbacks to be invoked.
>> >
>> > But (a) it doesn't have to follow it and (b) in some cases it should not
>> > follow it.
>>
>> Of course you don't explicitly *have to* respect the hierarchy of the
>> RPM callbacks in pm_runtime_force_*.
>>
>> However, changing that would require some additional information
>> exchange between the driver and the middle-layer/PM domain, as to
>> instruct the middle-layer/PM domain of what to do during system-wide
>> PM. Especially in cases when the driver deals with wakeup, as in those
>> cases the instructions may change dynamically.
>
> Well, if wakeup matters, drivers can't simply point their PM callbacks
> to pm_runtime_force_* anyway.
>
> If the driver itself deals with wakeups, it clearly needs different callback
> routines for system-wide PM and for runtime PM, so it can't reuse its runtime
> PM callbacks at all then.

It can still re-use its runtime PM callbacks, simply by calling
pm_runtime_force_* from its system sleep callbacks.

Drivers already do that today, not only to deal with wakeups, but
generally when they need to deal with some additional operations.
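
As a rough illustration of that pattern, here is a small userspace model
(not kernel code): the system sleep callback does its sleep-specific work,
such as arming wakeup, and then falls through to the runtime PM path. All
model_* names are made up for this sketch, and the "force suspend" helper
only mimics the general idea of pm_runtime_force_suspend(), namely skipping
the callback when the device is already runtime suspended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: hypothetical stand-ins, not struct device or the
 * kernel's runtime PM core. */
struct model_dev {
	bool rpm_suspended;   /* models the runtime PM status */
	bool wakeup_armed;    /* driver-private wakeup configuration */
	int hw_suspend_calls; /* how many times the hardware was powered down */
};

/* The driver's runtime PM callback: powers the device down. */
static int model_runtime_suspend(struct model_dev *dev)
{
	dev->hw_suspend_calls++;
	return 0;
}

/* Models the idea behind a "force suspend" helper: do nothing if the device
 * is already runtime suspended, otherwise run the runtime PM callback and
 * record the new status. */
static int model_force_suspend(struct model_dev *dev)
{
	int ret;

	assert(dev != NULL);
	if (dev->rpm_suspended)
		return 0;

	ret = model_runtime_suspend(dev);
	if (ret == 0)
		dev->rpm_suspended = true;
	return ret;
}

/* System sleep callback: handle the sleep-specific part (wakeup here),
 * then re-use the runtime PM path instead of open coding it. */
static int model_system_suspend(struct model_dev *dev, bool may_wakeup)
{
	dev->wakeup_armed = may_wakeup;
	return model_force_suspend(dev);
}
```

Calling model_system_suspend() twice on the same device powers the hardware
down only once, which is the property the wrappers are meant to give for
free.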

>
> If the middle layer deals with wakeups, different callbacks are needed at
> that level and so pm_runtime_force_* are unsuitable too.
>
> Really, invoking runtime PM callbacks from the middle layer in
> pm_runtime_force_* is not a good idea at all and there's no general requirement
> for it whatsoever.
>
>> [...]
>>
>> >> > In general, not if the wakeup settings are adjusted by the middle layer.
>> >>
>> >> Correct!
>> >>
>> >> To use pm_runtime_force* for these cases, one would need some
>> >> additional information exchange between the driver and the
>> >> middle-layer.
>> >
>> > Which pretty much defeats the purpose of the wrappers, doesn't it?
>>
>> Well, no matter if the wrappers are used or not, we need some kind of
>> information exchange between the driver and the middle-layers/PM
>> domains.
>
> Right.
>
> But if that information is exchanged, then why use wrappers *in* *addition*
> to that?

If we can find a different method that both avoids open coding
and offers the optimized system-wide PM path at resume, I am open to
that.

>
>> Anyway, I personally think it's too early to conclude that using the
>> wrappers may not be useful going forward. At this point, they clearly
>> help trivial cases remain trivial.
>
> I'm not sure about that really.  So far I've seen more complexity resulting
> from using them than being avoided by using them, but I guess the beauty is
> in the eye of the beholder. :-)

Hehe, yeah you may be right. :-)

>
>> >
>> >> >
>> >> >> Regarding hibernation, honestly that's not really my area of
>> >> >> expertise. Although, I assume the middle-layer and driver can treat
>> >> >> that as a separate case, so if it's not suitable to use
>> >> >> pm_runtime_force* for that case, then they shouldn't do it.
>> >> >
>> >> > Well, agreed.
>> >> >
>> >> > In some simple cases, though, driver callbacks can be reused for
>> >> > hibernation too, so it would be good to have a common way to do
>> >> > that too, IMO.
>> >>
>> >> Okay, that makes sense!
>> >>
>> >> >
>> >> >> >
>> >> >> > Also, quite often other middle layers interact with PCI directly or
>> >> >> > indirectly (e.g. a platform device may be a child or a consumer of a
>> >> >> > PCI device) and some optimizations need to take that into account
>> >> >> > (e.g. parents generally need to be accessible when their children
>> >> >> > are resumed and so on).
>> >> >>
>> >> >> A device's parent becomes informed

Re: [RFC PATCH] kbuild: Allow specifying some base host CFLAGS

2017-10-19 Thread Doug Anderson
Hi,

On Wed, Oct 18, 2017 at 9:45 AM, Masahiro Yamada
 wrote:
> 2017-10-14 3:02 GMT+09:00 Douglas Anderson :
>> Right now there is a way to add some CFLAGS that affect target builds,
>> but no way to add CFLAGS that affect host builds.  Let's add a way.
>> We'll document two environment variables: CFLAGS_HOST and
>> CXXFLAGS_HOST.
>>
>> We'll document that these variables get appended to by the kernel to
>> make the final CFLAGS.  That means that, though the environment can
>> specify some flags, if there is a conflict the kernel can override and
>> win.  This works differently than KCFLAGS, which is appended to (and thus
>> can override) the kernel-specified CFLAGS.
>>
>> Why would I make KCFLAGS and CFLAGS_HOST work differently in this way?
>> My argument is that it's about expected usage.  Typically the build
>> system invoking the kernel has some idea about some basic CFLAGS that
>> it wants to use to build things for the host and things for the
>> target.  In general the build system would expect that its flags can
>> be overridden if necessary (perhaps we need to turn off a warning when
>> compiling a certain file, for instance).  So, all other things being
>> equal, the way I'm making CFLAGS_HOST is the way I'd expect things to
>> work.
>>
>> So, if it's expected that the build system can pass in a base set of
>> flags, why didn't we make KCFLAGS work that way?  The short answer is:
>> when building for the target the kernel is just "special".  The build
>> system's "target" CFLAGS are likely intended for userspace programs
>> and likely make very little sense to use as a basis.  This was talked
>> about in the seminal commit 69ee0b352242 ("kbuild: do not pick up
>> CFLAGS from the environment").  Basically: if the build system REALLY
>> knows what it's doing then it can pass in flags that the kernel will
>> use, but otherwise it should butt out.  Presumably this build system
>> that really knows what it's doing knows better than the kernel so
>> KCFLAGS comes after the kernel's normal flags.
>>
>> One last note: I chose to add new variables rather than just having
>> the build system try to pass HOSTCFLAGS in somehow (either through the
>> environment or the command line) to avoid weird interactions with
>> recursive invocations of make.
>>
>> Signed-off-by: Douglas Anderson 
>> ---
>
> I'd like to know for-instance cases where this is useful.

I'm not sure I have any exact use cases.  I know vapier@ (CCed) was
pushing for making sure that these flags get passed from the portage
ebuild into the kernel build, so maybe he has some cases?  Right now
we have the "-pipe" flag that ought to be passed in to the host
compiler but we're dropping it on the floor, but that doesn't seem
terribly critical.

...but in general the Linux kernel doesn't have all the details about
the host system.  That means it can't necessarily build the tools
quite as optimally (it can't pass "-mtune", right?).  I could also
imagine that there could be ABI flags that need to be specified?  Like
if we had floating point math in a host tool it would be important
that the build system could tell the kernel what to use for
"-mfloat-abi".

...so basically: it's all theoretical at this point in time from my
point of view, but I can definitely understand how it could be
necessary in the right environment.


-Doug


[RFC PATCH 1/5] gpio: gpiolib: Add core support for maintaining GPIO values on reset

2017-10-19 Thread Andrew Jeffery
GPIO state reset tolerance is implemented in gpiolib through the
addition of a new pinconf parameter. With that, some renaming of helpers
is done to clarify the scope of the already existing
gpiochip_line_is_persistent(), as it's now ambiguous as to whether that
means on suspend, reset or both. This in turn impacts gpio-arizona, but
not in any complicated way.

This change lays the groundwork for implementing reset tolerance support
in all of the external interfaces that can influence GPIOs.

Signed-off-by: Andrew Jeffery 
---
 drivers/gpio/gpio-arizona.c |  4 +--
 drivers/gpio/gpiolib.c  | 55 +++--
 drivers/gpio/gpiolib.h  |  1 +
 include/linux/gpio/consumer.h   |  9 ++
 include/linux/gpio/driver.h |  5 ++-
 include/linux/gpio/machine.h|  2 ++
 include/linux/pinctrl/pinconf-generic.h |  2 ++
 7 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/drivers/gpio/gpio-arizona.c b/drivers/gpio/gpio-arizona.c
index d4e6ba0301bc..d3fe23569811 100644
--- a/drivers/gpio/gpio-arizona.c
+++ b/drivers/gpio/gpio-arizona.c
@@ -33,7 +33,7 @@ static int arizona_gpio_direction_in(struct gpio_chip *chip, unsigned offset)
 {
struct arizona_gpio *arizona_gpio = gpiochip_get_data(chip);
struct arizona *arizona = arizona_gpio->arizona;
-   bool persistent = gpiochip_line_is_persistent(chip, offset);
+   bool persistent = gpiochip_line_is_persistent_suspend(chip, offset);
bool change;
int ret;
 
@@ -99,7 +99,7 @@ static int arizona_gpio_direction_out(struct gpio_chip *chip,
 {
struct arizona_gpio *arizona_gpio = gpiochip_get_data(chip);
struct arizona *arizona = arizona_gpio->arizona;
-   bool persistent = gpiochip_line_is_persistent(chip, offset);
+   bool persistent = gpiochip_line_is_persistent_suspend(chip, offset);
unsigned int val;
int ret;
 
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index a56b29fd8bb1..d9dc7e588699 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -2414,6 +2414,40 @@ int gpiod_set_debounce(struct gpio_desc *desc, unsigned debounce)
 EXPORT_SYMBOL_GPL(gpiod_set_debounce);
 
 /**
+ * gpiod_set_reset_tolerant - Hold state across controller reset
+ * @desc: descriptor of the GPIO for which to set reset tolerance
+ * @tolerant: True to hold state across a controller reset, false otherwise
+ *
+ * Returns:
+ * 0 on success, %-ENOTSUPP if the controller doesn't support setting the
+ * reset tolerance or less than zero on other failures.
+ */
+int gpiod_set_reset_tolerant(struct gpio_desc *desc, bool tolerant)
+{
+   struct gpio_chip *chip;
+   unsigned long packed;
+   int rc;
+
+   chip = desc->gdev->chip;
+   if (!chip->set_config)
+   return -ENOTSUPP;
+
+   packed = pinconf_to_config_packed(PIN_CONFIG_RESET_TOLERANT, tolerant);
+
+   rc = chip->set_config(chip, gpio_chip_hwgpio(desc), packed);
+   if (rc < 0)
+   return rc;
+
+   if (tolerant)
+   set_bit(FLAG_RESET_TOLERANT, &desc->flags);
+   else
+   clear_bit(FLAG_RESET_TOLERANT, &desc->flags);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(gpiod_set_reset_tolerant);
+
+/**
  * gpiod_is_active_low - test whether a GPIO is active-low or not
  * @desc: the gpio descriptor to test
  *
@@ -2885,7 +2919,8 @@ bool gpiochip_line_is_open_source(struct gpio_chip *chip, unsigned int offset)
 }
 EXPORT_SYMBOL_GPL(gpiochip_line_is_open_source);
 
-bool gpiochip_line_is_persistent(struct gpio_chip *chip, unsigned int offset)
+bool gpiochip_line_is_persistent_suspend(struct gpio_chip *chip,
+unsigned int offset)
 {
if (offset >= chip->ngpio)
return false;
@@ -2893,7 +2928,18 @@ bool gpiochip_line_is_persistent(struct gpio_chip *chip, unsigned int offset)
return !test_bit(FLAG_SLEEP_MAY_LOSE_VALUE,
 &chip->gpiodev->descs[offset].flags);
 }
-EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent);
+EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent_suspend);
+
+bool gpiochip_line_is_persistent_reset(struct gpio_chip *chip,
+  unsigned int offset)
+{
+   if (offset >= chip->ngpio)
+   return false;
+
+   return test_bit(FLAG_RESET_TOLERANT,
+   &chip->gpiodev->descs[offset].flags);
+}
+EXPORT_SYMBOL_GPL(gpiochip_line_is_persistent_reset);
 
 /**
  * gpiod_get_raw_value_cansleep() - return a gpio's raw value
@@ -3271,6 +3317,11 @@ int gpiod_configure_flags(struct gpio_desc *desc, const char *con_id,
if (lflags & GPIO_SLEEP_MAY_LOSE_VALUE)
set_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &desc->flags);
 
+   status = gpiod_set_reset_tolerant(desc,
+ !!(lflags & GPIO_RESET_TOLERANT));
+   if (status < 0)
+   return status;
+
/*
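
gpiod_set_reset_tolerant() above hands the setting to the driver as a single
packed pinconf word. As a rough standalone sketch of that packing: my
understanding of the generic pinconf helpers is that the parameter sits in
the low 8 bits and the argument in the bits above, so treat that layout as
an assumption here. The model_* names and the parameter number are invented
for illustration; the real PIN_CONFIG_RESET_TOLERANT value comes from the
enum in pinconf-generic.h, which is not shown in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter number, for illustration only. */
#define MODEL_PARAM_RESET_TOLERANT 0x2aUL

/* Pack a (parameter, argument) pair into one word: parameter in bits 0-7,
 * argument shifted up by 8. Assumed layout, see the note above. */
static unsigned long model_config_pack(unsigned long param, uint32_t arg)
{
	assert(param <= 0xffUL);
	return ((unsigned long)arg << 8) | (param & 0xffUL);
}

/* Recover the parameter from a packed word. */
static unsigned long model_config_param(unsigned long packed)
{
	return packed & 0xffUL;
}

/* Recover the argument (up to 24 bits in this model) from a packed word. */
static uint32_t model_config_arg(unsigned long packed)
{
	return (uint32_t)((packed >> 8) & 0xffffffUL);
}
```

A driver's set_config() callback would do the reverse: unpack the parameter,
and if it is the reset-tolerance one, treat a non-zero argument as "enable".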

[RFC PATCH 2/5] gpio: gpiolib: Add OF support for maintaining GPIO values on reset

2017-10-19 Thread Andrew Jeffery
Add flags and the associated flag mappings between interfaces to enable
GPIO reset tolerance to be specified via devicetree.

Signed-off-by: Andrew Jeffery 
---
 drivers/gpio/gpiolib-of.c   | 2 ++
 drivers/gpio/gpiolib.c  | 5 +
 include/dt-bindings/gpio/gpio.h | 4 
 include/linux/of_gpio.h | 1 +
 4 files changed, 12 insertions(+)

diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c
index e0d59e61b52f..4a268ba52998 100644
--- a/drivers/gpio/gpiolib-of.c
+++ b/drivers/gpio/gpiolib-of.c
@@ -155,6 +155,8 @@ struct gpio_desc *of_find_gpio(struct device *dev, const char *con_id,
 
if (of_flags & OF_GPIO_SLEEP_MAY_LOSE_VALUE)
*flags |= GPIO_SLEEP_MAY_LOSE_VALUE;
+   if (of_flags & OF_GPIO_RESET_TOLERANT)
+   *flags |= GPIO_RESET_TOLERANT;
 
return desc;
 }
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index d9dc7e588699..6b4c5df10e84 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -3434,6 +3434,7 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
bool active_low = false;
bool single_ended = false;
bool open_drain = false;
+   bool reset_tolerant = false;
int ret;
 
if (!fwnode)
@@ -3448,6 +3449,7 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
active_low = flags & OF_GPIO_ACTIVE_LOW;
single_ended = flags & OF_GPIO_SINGLE_ENDED;
open_drain = flags & OF_GPIO_OPEN_DRAIN;
+   reset_tolerant = flags & OF_GPIO_RESET_TOLERANT;
}
} else if (is_acpi_node(fwnode)) {
struct acpi_gpio_info info;
@@ -3478,6 +3480,9 @@ struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
lflags |= GPIO_OPEN_SOURCE;
}
 
+   if (reset_tolerant)
+   lflags |= GPIO_RESET_TOLERANT;
+
ret = gpiod_configure_flags(desc, propname, lflags, dflags);
if (ret < 0) {
gpiod_put(desc);
diff --git a/include/dt-bindings/gpio/gpio.h b/include/dt-bindings/gpio/gpio.h
index 70de5b7a6c9b..01c75d9e308e 100644
--- a/include/dt-bindings/gpio/gpio.h
+++ b/include/dt-bindings/gpio/gpio.h
@@ -32,4 +32,8 @@
 #define GPIO_SLEEP_MAINTAIN_VALUE 0
 #define GPIO_SLEEP_MAY_LOSE_VALUE 8
 
+/* Bit 4 expresses GPIO persistence on reset */
+#define GPIO_RESET_INTOLERANT 0
+#define GPIO_RESET_TOLERANT 16
+
 #endif
diff --git a/include/linux/of_gpio.h b/include/linux/of_gpio.h
index 1fe205582111..9b34737706a7 100644
--- a/include/linux/of_gpio.h
+++ b/include/linux/of_gpio.h
@@ -32,6 +32,7 @@ enum of_gpio_flags {
OF_GPIO_SINGLE_ENDED = 0x2,
OF_GPIO_OPEN_DRAIN = 0x4,
OF_GPIO_SLEEP_MAY_LOSE_VALUE = 0x8,
+   OF_GPIO_RESET_TOLERANT = 0x10,
 };
 
 #ifdef CONFIG_OF_GPIO
-- 
2.11.0
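
The devicetree binding in this patch puts reset tolerance in bit 4 (decimal
16), so the kernel-side mask needs to be that same single bit, which is 0x10
in hex. The small standalone check below shows why the decimal and hex
spellings are easy to confuse: 0x16 is decimal 22 and would overlap two of
the existing flags. The model_* names are made up for this sketch; the
existing flag values are the ones visible in the hunk context above.

```c
#include <assert.h>

/* Existing flag values as they appear in the enum of_gpio_flags context. */
enum model_of_flags {
	MODEL_SINGLE_ENDED         = 0x2,
	MODEL_OPEN_DRAIN           = 0x4,
	MODEL_SLEEP_MAY_LOSE_VALUE = 0x8,
};

/* Devicetree binding value from this patch: bit 4. */
#define MODEL_DT_RESET_TOLERANT 16

_Static_assert(MODEL_DT_RESET_TOLERANT == (1 << 4),
	       "binding value must be bit 4");

/* Nonzero when exactly one bit is set in mask. */
static int model_is_single_bit(unsigned int mask)
{
	return mask != 0 && (mask & (mask - 1)) == 0;
}

/* Returns the bits of mask that collide with the existing flags above. */
static unsigned int model_overlap(unsigned int mask)
{
	return mask & (MODEL_SINGLE_ENDED | MODEL_OPEN_DRAIN |
		       MODEL_SLEEP_MAY_LOSE_VALUE);
}
```

With these helpers, 0x10 is a single bit with no overlap, while 0x16 is not
a single bit and collides with the single-ended and open-drain flags.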



[RFC PATCH 3/5] gpio: gpiolib: Add chardev support for maintaining GPIO values on reset

2017-10-19 Thread Andrew Jeffery
Similar to devicetree support, add flags and mappings to expose reset
tolerance configuration through the chardev interface.

Signed-off-by: Andrew Jeffery 
---
 drivers/gpio/gpiolib.c| 14 +-
 include/uapi/linux/gpio.h | 11 ++-
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index 6b4c5df10e84..442ee5ceee08 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -357,7 +357,8 @@ struct linehandle_state {
GPIOHANDLE_REQUEST_OUTPUT | \
GPIOHANDLE_REQUEST_ACTIVE_LOW | \
GPIOHANDLE_REQUEST_OPEN_DRAIN | \
-   GPIOHANDLE_REQUEST_OPEN_SOURCE)
+   GPIOHANDLE_REQUEST_OPEN_SOURCE | \
+   GPIOHANDLE_REQUEST_RESET_TOLERANT)
 
 static long linehandle_ioctl(struct file *filep, unsigned int cmd,
 unsigned long arg)
@@ -498,6 +499,17 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip)
set_bit(FLAG_OPEN_SOURCE, &desc->flags);
 
/*
+* Unconditionally configure reset tolerance, as it's possible
+* that the tolerance flag itself becomes tolerant to resets.
+* Thus it could remain set from a previous environment, but
+* the current environment may not expect it so.
+*/
+   ret = gpiod_set_reset_tolerant(desc,
+   !!(lflags & GPIOHANDLE_REQUEST_RESET_TOLERANT));
+   if (ret < 0)
+   goto out_free_descs;
+
+   /*
 * Lines have to be requested explicitly for input
 * or output, else the line will be treated "as is".
 */
diff --git a/include/uapi/linux/gpio.h b/include/uapi/linux/gpio.h
index 333d3544c964..1b1ce1af8653 100644
--- a/include/uapi/linux/gpio.h
+++ b/include/uapi/linux/gpio.h
@@ -56,11 +56,12 @@ struct gpioline_info {
 #define GPIOHANDLES_MAX 64
 
 /* Linerequest flags */
-#define GPIOHANDLE_REQUEST_INPUT   (1UL << 0)
-#define GPIOHANDLE_REQUEST_OUTPUT  (1UL << 1)
-#define GPIOHANDLE_REQUEST_ACTIVE_LOW  (1UL << 2)
-#define GPIOHANDLE_REQUEST_OPEN_DRAIN  (1UL << 3)
-#define GPIOHANDLE_REQUEST_OPEN_SOURCE (1UL << 4)
+#define GPIOHANDLE_REQUEST_INPUT   (1UL << 0)
+#define GPIOHANDLE_REQUEST_OUTPUT  (1UL << 1)
+#define GPIOHANDLE_REQUEST_ACTIVE_LOW  (1UL << 2)
+#define GPIOHANDLE_REQUEST_OPEN_DRAIN  (1UL << 3)
+#define GPIOHANDLE_REQUEST_OPEN_SOURCE (1UL << 4)
+#define GPIOHANDLE_REQUEST_RESET_TOLERANT  (1UL << 5)
 
 /**
  * struct gpiohandle_request - Information about a GPIO handle request
-- 
2.11.0



[RFC PATCH 4/5] gpio: gpiolib: Add sysfs support for maintaining GPIO values on reset

2017-10-19 Thread Andrew Jeffery
Expose a new 'maintain' sysfs attribute to control both suspend and
reset tolerance.

Signed-off-by: Andrew Jeffery 
---
 Documentation/gpio/sysfs.txt |  9 +
 drivers/gpio/gpiolib-sysfs.c | 88 ++--
 2 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/Documentation/gpio/sysfs.txt b/Documentation/gpio/sysfs.txt
index aeab01aa4d00..f447f0746884 100644
--- a/Documentation/gpio/sysfs.txt
+++ b/Documentation/gpio/sysfs.txt
@@ -96,6 +96,15 @@ and have the following read/write attributes:
for "rising" and "falling" edges will follow this
setting.
 
+   "maintain" ... displays and controls whether the state of the GPIO is
+   maintained or lost on suspend or reset. The valid values take
+   the following meanings:
+
+   0: Do not maintain state on either suspend or reset
+   1: Maintain state for suspend only
+   2: Maintain state for reset only
+   3: Maintain state for both suspend and reset
+
 GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
 controller implementing GPIOs starting at #42) and have the following
 read-only attributes:
diff --git a/drivers/gpio/gpiolib-sysfs.c b/drivers/gpio/gpiolib-sysfs.c
index 3f454eaf2101..bfa186e73e26 100644
--- a/drivers/gpio/gpiolib-sysfs.c
+++ b/drivers/gpio/gpiolib-sysfs.c
@@ -289,6 +289,74 @@ static ssize_t edge_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(edge);
 
+#define GPIOLIB_SYSFS_MAINTAIN_SUSPEND BIT(0)
+#define GPIOLIB_SYSFS_MAINTAIN_RESET   BIT(1)
+#define GPIOLIB_SYSFS_MAINTAIN_ALL GENMASK(1, 0)
+static ssize_t maintain_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct gpiod_data *data = dev_get_drvdata(dev);
+   ssize_t status = 0;
+   int val = 0;
+
+   mutex_lock(&data->mutex);
+
+   if (!test_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &data->desc->flags))
+   val |= GPIOLIB_SYSFS_MAINTAIN_SUSPEND;
+
+   if (test_bit(FLAG_RESET_TOLERANT, &data->desc->flags))
+   val |= GPIOLIB_SYSFS_MAINTAIN_RESET;
+
+   status = sprintf(buf, "%d\n", val);
+
+   mutex_unlock(&data->mutex);
+
+   return status;
+}
+
+static ssize_t maintain_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf,
+ size_t size)
+{
+   struct gpiod_data *data = dev_get_drvdata(dev);
+   struct gpio_chip *chip;
+   ssize_t status;
+   long provided;
+
+   mutex_lock(&data->mutex);
+
+   chip = data->desc->gdev->chip;
+
+   if (!chip->set_config) {
+           /* Leave through "out" so the mutex taken above is released */
+           status = -ENOTSUPP;
+           goto out;
+   }
+
+   status = kstrtol(buf, 0, &provided);
+   if (status < 0)
+   goto out;
+
+   if (provided & ~GPIOLIB_SYSFS_MAINTAIN_ALL) {
+   status = -EINVAL;
+   goto out;
+   }
+
+   if (!(provided & GPIOLIB_SYSFS_MAINTAIN_SUSPEND))
+   set_bit(FLAG_SLEEP_MAY_LOSE_VALUE, &data->desc->flags);
+   else
+   clear_bit(FLAG_SLEEP_MAY_LOSE_VALUE,
+ &data->desc->flags);
+
+   /* Configure reset tolerance */
+   status = gpiod_set_reset_tolerant(data->desc,
+   !!(provided & GPIOLIB_SYSFS_MAINTAIN_RESET));
+out:
+   mutex_unlock(&data->mutex);
+
+   return status ? : size;
+
+}
+static DEVICE_ATTR_RW(maintain);
+
 /* Caller holds gpiod-data mutex. */
 static int gpio_sysfs_set_active_low(struct device *dev, int value)
 {
@@ -378,6 +446,7 @@ static struct attribute *gpio_attrs[] = {
&dev_attr_edge.attr,
&dev_attr_value.attr,
&dev_attr_active_low.attr,
+   &dev_attr_maintain.attr,
NULL,
 };
 
@@ -474,11 +543,22 @@ static ssize_t export_store(struct class *class,
status = -ENODEV;
goto done;
}
-   status = gpiod_export(desc, true);
-   if (status < 0)
+
+   /*
+* If userspace is requesting the GPIO via sysfs, make them explicitly
+* configure reset tolerance each time by unconditionally disabling it
+* here, as the export and configuration steps are not atomic.
+*/
+   status = gpiod_set_reset_tolerant(desc, false);
+   if (status < 0) {
gpiod_free(desc);
-   else
-   set_bit(FLAG_SYSFS, &desc->flags);
+   } else {
+   status = gpiod_export(desc, true);
+   if (status < 0)
+   gpiod_free(desc);
+   else
+   set_bit(FLAG_SYSFS, &desc->flags);
+   }
 
 done:
if (status)
-- 
2.11.0
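
The "maintain" values documented in this patch form a two-bit mask: bit 0
for suspend, bit 1 for reset. A minimal userspace sketch of decoding a value
written to the attribute might look like the following. The model_* names
are invented for illustration, and the -1 return simply stands in for the
-EINVAL that maintain_store() returns for out-of-range input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAINTAIN_SUSPEND (1u << 0)
#define MODEL_MAINTAIN_RESET   (1u << 1)
#define MODEL_MAINTAIN_ALL     (MODEL_MAINTAIN_SUSPEND | MODEL_MAINTAIN_RESET)

struct model_maintain {
	bool suspend; /* state is kept across suspend */
	bool reset;   /* state is kept across controller reset */
};

/* Decode a "maintain" value (0..3). Returns 0 on success, -1 if any bit
 * outside the two defined ones is set (this includes negative input). */
static int model_maintain_decode(long value, struct model_maintain *out)
{
	assert(out != NULL);

	if (value & ~(long)MODEL_MAINTAIN_ALL)
		return -1;

	out->suspend = (value & MODEL_MAINTAIN_SUSPEND) != 0;
	out->reset = (value & MODEL_MAINTAIN_RESET) != 0;
	return 0;
}
```

For example, decoding 2 yields suspend = false and reset = true, matching
"Maintain state for reset only" in the documentation hunk.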



[RFC PATCH 5/5] gpio: aspeed: Add support for reset tolerance

2017-10-19 Thread Andrew Jeffery
Use the new pinconf parameter for reset tolerance to expose the
associated capability of the Aspeed GPIO controller.

Signed-off-by: Andrew Jeffery 
---
 drivers/gpio/gpio-aspeed.c | 39 +--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/gpio/gpio-aspeed.c b/drivers/gpio/gpio-aspeed.c
index bfc53995064a..0492cd917178 100644
--- a/drivers/gpio/gpio-aspeed.c
+++ b/drivers/gpio/gpio-aspeed.c
@@ -60,6 +60,7 @@ struct aspeed_gpio_bank {
uint16_t val_regs;
uint16_t irq_regs;
uint16_t debounce_regs;
+   uint16_t tolerance_regs;
const char  names[4][3];
 };
 
@@ -70,48 +71,56 @@ static const struct aspeed_gpio_bank aspeed_gpio_banks[] = {
.val_regs = 0x0000,
.irq_regs = 0x0008,
.debounce_regs = 0x0040,
+   .tolerance_regs = 0x001c,
.names = { "A", "B", "C", "D" },
},
{
.val_regs = 0x0020,
.irq_regs = 0x0028,
.debounce_regs = 0x0048,
+   .tolerance_regs = 0x003c,
.names = { "E", "F", "G", "H" },
},
{
.val_regs = 0x0070,
.irq_regs = 0x0098,
.debounce_regs = 0x00b0,
+   .tolerance_regs = 0x00ac,
.names = { "I", "J", "K", "L" },
},
{
.val_regs = 0x0078,
.irq_regs = 0x00e8,
.debounce_regs = 0x0100,
+   .tolerance_regs = 0x00fc,
.names = { "M", "N", "O", "P" },
},
{
.val_regs = 0x0080,
.irq_regs = 0x0118,
.debounce_regs = 0x0130,
+   .tolerance_regs = 0x012c,
.names = { "Q", "R", "S", "T" },
},
{
.val_regs = 0x0088,
.irq_regs = 0x0148,
.debounce_regs = 0x0160,
+   .tolerance_regs = 0x015c,
.names = { "U", "V", "W", "X" },
},
{
.val_regs = 0x01E0,
.irq_regs = 0x0178,
.debounce_regs = 0x0190,
+   .tolerance_regs = 0x018c,
.names = { "Y", "Z", "AA", "AB" },
},
{
-   .val_regs = 0x01E8,
-   .irq_regs = 0x01A8,
+   .val_regs = 0x01e8,
+   .irq_regs = 0x01a8,
.debounce_regs = 0x01c0,
+   .tolerance_regs = 0x01bc,
.names = { "AC", "", "", "" },
},
 };
@@ -531,6 +540,30 @@ static int aspeed_gpio_setup_irqs(struct aspeed_gpio *gpio,
return 0;
 }
 
+static int aspeed_gpio_reset_tolerance(struct gpio_chip *chip,
+   unsigned int offset, bool enable)
+{
+   struct aspeed_gpio *gpio = gpiochip_get_data(chip);
+   const struct aspeed_gpio_bank *bank;
+   unsigned long flags;
+   u32 val;
+
+   bank = to_bank(offset);
+
+   spin_lock_irqsave(&gpio->lock, flags);
+   val = readl(gpio->base + bank->tolerance_regs);
+
+   if (enable)
+   val |= GPIO_BIT(offset);
+   else
+   val &= ~GPIO_BIT(offset);
+
+   writel(val, gpio->base + bank->tolerance_regs);
+   spin_unlock_irqrestore(&gpio->lock, flags);
+
+   return 0;
+}
+
 static int aspeed_gpio_request(struct gpio_chip *chip, unsigned int offset)
 {
if (!have_gpio(gpiochip_get_data(chip), offset))
@@ -768,6 +801,8 @@ static int aspeed_gpio_set_config(struct gpio_chip *chip, unsigned int offset,
param == PIN_CONFIG_DRIVE_OPEN_SOURCE)
/* Return -ENOTSUPP to trigger emulation, as per datasheet */
return -ENOTSUPP;
+   else if (param == PIN_CONFIG_RESET_TOLERANT)
+   return aspeed_gpio_reset_tolerance(chip, offset, arg);
 
return -ENOTSUPP;
 }
-- 
2.11.0
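
For readers skimming the patch above: aspeed_gpio_reset_tolerance() is a plain read-modify-write of one bit in the bank's tolerance register. The following is a minimal userspace sketch of that pattern only, not the driver code. The names fake_tolerance_reg, PIN_BIT() and set_reset_tolerance() are hypothetical, the register is modelled as an ordinary variable, the 32-pins-per-register layout is an assumption, and the spinlock that the driver takes around the access is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one bank's 32-bit tolerance register. */
static uint32_t fake_tolerance_reg;

/* Bit position of a pin within its register (assumed 32 pins per register). */
#define PIN_BIT(offset) (UINT32_C(1) << ((offset) % 32))

/*
 * Set or clear the reset-tolerance bit for one pin.
 * All other bits in the register are preserved.
 */
static void set_reset_tolerance(unsigned int offset, bool enable)
{
	uint32_t val = fake_tolerance_reg;	/* readl() in the driver */

	if (enable)
		val |= PIN_BIT(offset);
	else
		val &= ~PIN_BIT(offset);

	fake_tolerance_reg = val;		/* writel() in the driver */
}
```

In the real driver the read and the write must happen under the same lock, otherwise two callers updating different pins in one bank could lose each other's change.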

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/5] gpio: Expose reset tolerance capability

2017-10-19 Thread Andrew Jeffery
Hello,

This series exposes a "reset tolerant" property for GPIOs. For example, the
controller implemented in Aspeed BMCs provides such a feature to allow the BMC
to be reset whilst maintaining necessary state to keep host systems alive or
status LEDs intact.

I'm sending it as an RFC because I'm not sure using pinconf is the right way
to go about it, or that expanding the sysfs interface is a good idea, or that
the approach taken is right in the context of the existing suspend support.
pinconf just ended up being a convenient abstraction whilst supporting the
sysfs change, and didn't feel unreasonable to use for devicetree or the chardev
interface either. My concern with using pinconf is that the reset-tolerant
property is (currently) GPIO-centric, but maybe that's not a worry.

So the patches in the series support configuring the property via devicetree,
the chardev interface and the sysfs interface. The sysfs interface also exposes
the ability to configure the suspend tolerance, though there are some ordering
requirements with respect to setting the direction (the suspend tolerance will
only take effect if configured before setting the pin direction on the Arizona
controller).

Please review!

Cheers,

Andrew

Andrew Jeffery (5):
  gpio: gpiolib: Add core support for maintaining GPIO values on reset
  gpio: gpiolib: Add OF support for maintaining GPIO values on reset
  gpio: gpiolib: Add chardev support for maintaining GPIO values on
reset
  gpio: gpiolib: Add sysfs support for maintaining GPIO values on reset
  gpio: aspeed: Add support for reset tolerance

 Documentation/gpio/sysfs.txt|  9 
 drivers/gpio/gpio-arizona.c |  4 +-
 drivers/gpio/gpio-aspeed.c  | 39 ++-
 drivers/gpio/gpiolib-of.c   |  2 +
 drivers/gpio/gpiolib-sysfs.c| 88 +++--
 drivers/gpio/gpiolib.c  | 74 +--
 drivers/gpio/gpiolib.h  |  1 +
 include/dt-bindings/gpio/gpio.h |  4 ++
 include/linux/gpio/consumer.h   |  9 
 include/linux/gpio/driver.h |  5 +-
 include/linux/gpio/machine.h|  2 +
 include/linux/of_gpio.h |  1 +
 include/linux/pinctrl/pinconf-generic.h |  2 +
 include/uapi/linux/gpio.h   | 11 +++--
 14 files changed, 234 insertions(+), 17 deletions(-)

-- 
2.11.0



Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver

2017-10-19 Thread Zhangshaokun
Hi Mark/Will,

Thanks.

On 2017/10/19 23:32, Mark Rutland wrote:
> On Thu, Oct 19, 2017 at 04:28:35PM +0100, Will Deacon wrote:
>> On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
>>> Will, are you happy to queue this?
>>>
>>> There's a minor fixup [1] needed in patch 2, but otherwise this looks
>>> good to me, and builds cleanly.
>>>
>>> I've pushed out a branch [2] with that fix folded in, in case that's
>>> easier for you. Otherwise, feel free to pick these up with my Ack.
>>
>> I'm just running some build tests on these. I also tweaked your fix slightly
>> -- can you check the diff below please?
> 
> That's nicer!
> 
> My ack stands with that folded in.
> 
> Mark.
> 
> >> diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> >> index 2bff43f0736b..c74542af4acf 100644
> >> --- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
> >> +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> >> @@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event *event)
> >>  	/* Include count for the event */
> >>  	int counters = 1;
> >>  
> >> -	/*
> >> -	 * We must NOT create groups containing mixed PMUs, although
> >> -	 * software events are acceptable
> >> -	 */
> >> -	if (leader->pmu != event->pmu && !is_software_event(leader))
> >> -		return false;
> >> +	if (!is_software_event(leader)) {
> >> +		/*
> >> +		 * We must NOT create groups containing mixed PMUs, although
> >> +		 * software events are acceptable
> >> +		 */
> >> +		if (leader->pmu != event->pmu)
> >> +			return false;
> >>  
> >> -	/* Increment counter for the leader */
> >> -	counters++;
> >> +		/* Increment counter for the leader */
> >> +		if (leader != event)
> >> +			counters++;
> >> +	}
> >>  
> >>  	list_for_each_entry(sibling, &event->group_leader->sibling_list,
> >>  			    group_entry) {
> 
> .
> 
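
The effect of the tweaked fix quoted above can be sketched in plain C. This is only an illustration under stated assumptions: struct ev, validate_group() and max_counters are hypothetical names, the sibling list is modelled as an array, and the sibling loop plus the final comparison against the number of hardware counters are assumed, since the quoted hunk ends before them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct perf_event. */
struct ev {
	int pmu;	/* identifies the owning PMU */
	bool software;	/* true for a software event */
};

/*
 * Return true if the group can be scheduled: no mixed hardware PMUs,
 * and no more hardware events than there are counters.
 */
static bool validate_group(const struct ev *leader, const struct ev *event,
			   const struct ev *siblings, size_t nr_siblings,
			   int max_counters)
{
	int counters = 1;	/* the event being added */
	size_t i;

	if (!leader->software) {
		if (leader->pmu != event->pmu)
			return false;
		/* Do not count the event twice when it is its own leader. */
		if (leader != event)
			counters++;
	}

	for (i = 0; i < nr_siblings; i++) {
		if (siblings[i].software)
			continue;	/* software events use no counter */
		if (siblings[i].pmu != event->pmu)
			return false;
		counters++;
	}

	return counters <= max_counters;
}
```

The two behavioural changes are visible here: a software leader no longer consumes a counter, and an event that is its own group leader is counted once rather than twice.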



Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Rafael J. Wysocki
On Thursday, October 19, 2017 2:21:07 PM CEST Ulf Hansson wrote:
> On 19 October 2017 at 00:12, Rafael J. Wysocki  wrote:
> > On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote:
> >> [...]
> >>
> >> >>
> >> >> The reason why pm_runtime_force_* needs to respect the hierarchy of
> >> >> the RPM callbacks, is because otherwise it can't safely update the
> >> >> runtime PM status of the device.
> >> >
> >> > I'm not sure I follow this requirement.  Why is that so?
> >>
> >> If the PM domain controls some resources for the device in its RPM
> >> callbacks and the driver controls some other resources in its RPM
> >> callbacks - then these resources need to be managed together.
> >
> > Right, but that doesn't automatically make it necessary to use runtime PM
> > callbacks in the middle layer.  Its system-wide PM callbacks may be
> > suitable for that just fine.
> >
> > That is, at least in some cases, you can combine ->runtime_suspend from a
> > driver and ->suspend_late from a middle layer with no problems, for example.
> >
> > That's why some middle layers allow drivers to point ->suspend_late and
> > ->runtime_suspend to the same routine if they want to reuse that code.
> >
> >> This follows the behavior of when a regular call to
> >> pm_runtime_get|put(), triggers the RPM callbacks to be invoked.
> >
> > But (a) it doesn't have to follow it and (b) in some cases it should not
> > follow it.
> 
> Of course you don't explicitly *have to* respect the hierarchy of the
> RPM callbacks in pm_runtime_force_*.
> 
> However, changing that would require some additional information
> exchange between the driver and the middle-layer/PM domain, as to
> instruct the middle-layer/PM domain of what to do during system-wide
> PM. Especially in cases when the driver deals with wakeup, as in those
> cases the instructions may change dynamically.

Well, if wakeup matters, drivers can't simply point their PM callbacks
to pm_runtime_force_* anyway.

If the driver itself deals with wakeups, it clearly needs different callback
routines for system-wide PM and for runtime PM, so it can't reuse its runtime
PM callbacks at all then.

If the middle layer deals with wakeups, different callbacks are needed at
that level and so pm_runtime_force_* are unsuitable too.

Really, invoking runtime PM callbacks from the middle layer in
pm_runtime_force_* is not a good idea at all and there's no general requirement
for it whatever.

> [...]
> 
> >> > In general, not if the wakeup settings are adjusted by the middle layer.
> >>
> >> Correct!
> >>
> >> To use pm_runtime_force* for these cases, one would need some
> >> additional information exchange between the driver and the
> >> middle-layer.
> >
> > Which pretty much defeats the purpose of the wrappers, doesn't it?
> 
> Well, no matter if the wrappers are used or not, we need some kind of
> information exchange between the driver and the middle-layers/PM
> domains.

Right.

But if that information is exchanged, then why use wrappers *in* *addition*
to that?

> Anyway, I personally think it's too early to conclude that using the
> wrappers may not be useful going forward. At this point, they clearly
> help trivial cases remain trivial.

I'm not sure about that really.  So far I've seen more complexity resulting
from using them than being avoided by using them, but I guess the beauty is
in the eye of the beholder. :-)

> >
> >> >
> >> >> Regarding hibernation, honestly that's not really my area of
> >> >> expertise. Although, I assume the middle-layer and driver can treat
> >> >> that as a separate case, so if it's not suitable to use
> >> >> pm_runtime_force* for that case, then they shouldn't do it.
> >> >
> >> > Well, agreed.
> >> >
> >> > In some simple cases, though, driver callbacks can be reused for hibernation
> >> > too, so it would be good to have a common way to do that too, IMO.
> >>
> >> Okay, that makes sense!
> >>
> >> >
> >> >> >
> >> >> > Also, quite often other middle layers interact with PCI directly or
> >> >> > indirectly (eg. a platform device may be a child or a consumer of a PCI
> >> >> > device) and some optimizations need to take that into account (eg. parents
> >> >> > generally need to be accessible when their children are resumed and so on).
> >> >>
> >> >> A device's parent becomes informed when changing the runtime PM status
> >> >> of the device via pm_runtime_force_suspend|resume(), as those call
> >> >> pm_runtime_set_suspended|active().
> >> >
> >> > This requires the parent driver or middle layer to look at the reference
> >> > counter and understand it the same way as pm_runtime_force_*.
> >> >
> >> >> In case that isn't sufficient, what else is needed? Perhaps you can
> >> >> point me to an example so I can understand better?
> >> >
> >> > Say you want to leave the parent suspended after system resume, but the
> >> > child drivers use pm_runtime_force_suspend|resume().  The parent would 
>

Re: [PATCH 1/3] printk: Introduce per-console loglevel setting

2017-10-19 Thread Calvin Owens

On 09/28/2017 05:43 PM, Calvin Owens wrote:

Not all consoles are created equal: depending on the actual hardware,
the latency of a printk() call can vary dramatically. The worst examples
are serial consoles, where it can spin for tens of milliseconds banging
the UART to emit a message, which can cause application-level problems
when the kernel spews onto the console.


Any thoughts on this series? Happy to resend again, but if there are no
objections I'd love to see it merged sooner rather than later :)

Happy to resend too, just let me know.

Thanks,
Calvin


At Facebook we use netconsole to monitor our fleet, but we still have
serial consoles attached on each host for live debugging, and the latter
has caused problems. An obvious solution is to disable the kernel
console output to ttyS0, but this makes live debugging frustrating,
since crashes become silent and opaque to the ttyS0 user. Enabling it on
the fly when needed isn't feasible, since boxes you need to debug via
serial are likely to be borked in ways that make this impossible.

That puts us between a rock and a hard place: we'd love to set
kernel.printk to KERN_INFO and get all the logs. But while netconsole is
fast enough to permit that without perturbing userspace, ttyS0 is not,
and we're forced to limit console logging to KERN_WARNING and higher.

This patch introduces a new per-console loglevel setting, and changes
console_unlock() to use max(global_level, per_console_level) when
deciding whether or not to emit a given log message.

This lets us have our cake and eat it too: instead of being forced to
limit all consoles verbosity based on the speed of the slowest one, we
can "promote" the faster console while still using a conservative system
loglevel setting to avoid disturbing applications.
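
The max(global, per-console) rule described above can be sketched in userspace C. This mirrors the shape of the patch but is not the kernel code: struct console_sketch, global_loglevel, ignore_loglevel_flag and suppress() are hypothetical names, and a console with no level set is assumed to contribute 0 (LOGLEVEL_EMERG), as in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct console with the new field. */
struct console_sketch {
	int level;	/* per-console loglevel, 0 if unset */
};

static int global_loglevel = 5;		/* stand-in for console_loglevel */
static bool ignore_loglevel_flag;	/* stand-in for ignore_loglevel */

/* The larger (more verbose) of the global and per-console settings wins. */
static int effective_level(const struct console_sketch *con)
{
	int per_console = con ? con->level : 0;

	return global_loglevel > per_console ? global_loglevel : per_console;
}

/* A message is dropped when its level is not below the effective level. */
static bool suppress(int msg_level, const struct console_sketch *con)
{
	return msg_level >= effective_level(con) && !ignore_loglevel_flag;
}
```

With a global level of 5, a KERN_INFO message (level 6) is suppressed on a console whose level is left at 0 but still reaches a console whose level is set to 7, which is the "promote the faster console" case from the description.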

Cc: Petr Mladek 
Cc: Steven Rostedt 
Cc: Sergey Senozhatsky 
Signed-off-by: Calvin Owens 
---
(V1: https://lkml.org/lkml/2017/4/4/783)

Changes in V2:
* Honor the ignore_loglevel setting in all cases
* Change semantics to use max(global, console) as the loglevel
  for a console, instead of the previous patch where we treated
  the per-console one as a filter downstream of the global one.

  include/linux/console.h |  1 +
  kernel/printk/printk.c  | 38 +++---
  2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index b8920a0..a5b5d79 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -147,6 +147,7 @@ struct console {
int cflag;
void*data;
struct   console *next;
+   int level;
  };
  
  /*

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 512f7c2..3f1675e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1141,9 +1141,14 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR);
  MODULE_PARM_DESC(ignore_loglevel,
 "ignore loglevel setting (prints all kernel messages to the 
console)");
  
-static bool suppress_message_printing(int level)

+static int effective_loglevel(struct console *con)
  {
-   return (level >= console_loglevel && !ignore_loglevel);
+   return max(console_loglevel, con ? con->level : LOGLEVEL_EMERG);
+}
+
+static bool suppress_message_printing(int level, struct console *con)
+{
+   return (level >= effective_loglevel(con) && !ignore_loglevel);
  }
  
  #ifdef CONFIG_BOOT_PRINTK_DELAY

@@ -1175,7 +1180,7 @@ static void boot_delay_msec(int level)
unsigned long timeout;
  
  	if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)

-   || suppress_message_printing(level)) {
+   || suppress_message_printing(level, NULL)) {
return;
}
  
@@ -1549,7 +1554,7 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)

   * The console_lock must be held.
   */
  static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len)
+const char *text, size_t len, int level)
  {
struct console *con;
  
@@ -1568,6 +1573,8 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,

if (!cpu_online(smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
+   if (suppress_message_printing(level, con))
+   continue;
if (con->flags & CON_EXTENDED)
con->write(con, ext_text, ext_len);
else
@@ -1856,10 +1863,9 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
  char *dict, size_t dict_len,
  char *text, size_t text_len) { return 0; }
  static void call_console_drivers(const char *ext_text, size_t ext_len,
-const char *text, size_t len) {}
+

[PATCH doc/rcu 2/2] doc: Fix various RCU docbook comment-header problems

2017-10-19 Thread Paul E. McKenney
Because many of RCU's files have not been included into docbook, a
number of errors have accumulated.  This commit fixes them.

Signed-off-by: Paul E. McKenney 
---
 include/linux/rculist.h  |  2 +-
 include/linux/rcupdate.h | 22 ++
 include/linux/srcu.h |  1 +
 kernel/rcu/srcutree.c|  2 +-
 kernel/rcu/sync.c|  9 ++---
 kernel/rcu/tree.c| 18 ++
 6 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index b1fd8bf85fdc..2bea1d5e9930 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -276,7 +276,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 #define list_entry_rcu(ptr, type, member) \
container_of(lockless_dereference(ptr), type, member)
 
-/**
+/*
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de50d8a4cf41..1a9f70d44af9 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -523,7 +523,7 @@ static inline void rcu_preempt_sleep_check(void) { }
  * Return the value of the specified RCU-protected pointer, but omit
  * both the smp_read_barrier_depends() and the READ_ONCE().  This
  * is useful in cases where update-side locks prevent the value of the
- * pointer from changing.  Please note that this primitive does -not-
+ * pointer from changing.  Please note that this primitive does *not*
  * prevent the compiler from repeating this reference or combining it
  * with other references, so it should not be used without protection
  * of appropriate locks.
@@ -568,7 +568,7 @@ static inline void rcu_preempt_sleep_check(void) { }
  * is handed off from RCU to some other synchronization mechanism, for
  * example, reference counting or locking.  In C11, it would map to
  * kill_dependency().  It could be used as follows:
- *
+ * ``
  * rcu_read_lock();
  * p = rcu_dereference(gp);
  * long_lived = is_long_lived(p);
@@ -579,6 +579,7 @@ static inline void rcu_preempt_sleep_check(void) { }
  * p = rcu_pointer_handoff(p);
  * }
  * rcu_read_unlock();
+ *``
  */
 #define rcu_pointer_handoff(p) (p)
 
@@ -778,18 +779,21 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 
 /**
  * RCU_INIT_POINTER() - initialize an RCU protected pointer
+ * @p: The pointer to be initialized.
+ * @v: The value to initialize the pointer to.
  *
  * Initialize an RCU-protected pointer in special cases where readers
  * do not need ordering constraints on the CPU or the compiler.  These
  * special cases are:
  *
- * 1.  This use of RCU_INIT_POINTER() is NULLing out the pointer -or-
+ * 1.  This use of RCU_INIT_POINTER() is NULLing out the pointer *or*
  * 2.  The caller has taken whatever steps are required to prevent
- * RCU readers from concurrently accessing this pointer -or-
+ * RCU readers from concurrently accessing this pointer *or*
  * 3.  The referenced data structure has already been exposed to
- * readers either at compile time or via rcu_assign_pointer() -and-
- * a.  You have not made -any- reader-visible changes to
- * this structure since then -or-
+ * readers either at compile time or via rcu_assign_pointer() *and*
+ *
+ * a.  You have not made *any* reader-visible changes to
+ * this structure since then *or*
  * b.  It is OK for readers accessing this structure from its
  * new location to see the old state of the structure.  (For
  * example, the changes were to statistical counters or to
@@ -805,7 +809,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
  * by a single external-to-structure RCU-protected pointer, then you may
  * use RCU_INIT_POINTER() to initialize the internal RCU-protected
  * pointers, but you must use rcu_assign_pointer() to initialize the
- * external-to-structure pointer -after- you have completely initialized
+ * external-to-structure pointer *after* you have completely initialized
  * the reader-accessible portions of the linked structure.
  *
  * Note that unlike rcu_assign_pointer(), RCU_INIT_POINTER() provides no
@@ -819,6 +823,8 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 
 /**
  * RCU_POINTER_INITIALIZER() - statically initialize an RCU protected pointer
+ * @p: The pointer to be initialized.
+ * @v: The value to initialize the pointer to.
  *
  * GCC-style initialization for an RCU-protected pointer in a structure field.
  */
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 39af9bc0f653..62be8966e837 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -78,6 +78,7 @@ void synchronize_srcu(struct srcu_struct *sp);
 
 /**
  * srcu_read_lock_held - might we be in SRCU read-side critical section?
+ * @sp: The srcu_struct structure

[PATCH doc/rcu 1/2] doc: Fix RCU's docbook options

2017-10-19 Thread Paul E. McKenney
Commit 764f80798b95 ("doc: Add RCU files to docbook-generation files")
added :external: options for RCU source files in the file
Documentation/core-api/kernel-api.rst.  However, this now means
nothing, so this commit removes them.

Reported-by: Randy Dunlap 
Reported-by: Akira Yokosawa 
Signed-off-by: Paul E. McKenney 
---
 Documentation/core-api/kernel-api.rst | 14 --
 1 file changed, 14 deletions(-)

diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 8282099e0cbf..5da10184d908 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -352,44 +352,30 @@ Read-Copy Update (RCU)
 --
 
 .. kernel-doc:: include/linux/rcupdate.h
-   :external:
 
 .. kernel-doc:: include/linux/rcupdate_wait.h
-   :external:
 
 .. kernel-doc:: include/linux/rcutree.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree.c
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree_plugin.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/tree_exp.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/update.c
-   :external:
 
 .. kernel-doc:: include/linux/srcu.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/srcutree.c
-   :external:
 
 .. kernel-doc:: include/linux/rculist_bl.h
-   :external:
 
 .. kernel-doc:: include/linux/rculist.h
-   :external:
 
 .. kernel-doc:: include/linux/rculist_nulls.h
-   :external:
 
 .. kernel-doc:: include/linux/rcu_sync.h
-   :external:
 
 .. kernel-doc:: kernel/rcu/sync.c
-   :external:
 
-- 
2.5.2



Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Grygorii Strashko


On 10/19/2017 01:11 PM, Ulf Hansson wrote:
> On 19 October 2017 at 20:04, Ulf Hansson  wrote:
>> On 19 October 2017 at 19:21, Grygorii Strashko  
>> wrote:
>>>
>>>
>>> On 10/19/2017 03:33 AM, Ulf Hansson wrote:
 On 18 October 2017 at 23:48, Rafael J. Wysocki  wrote:
> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:
>>
>> On 10/18/2017 09:11 AM, Ulf Hansson wrote:
>
> [...]
>
> That's the point. We know pm_runtime_force_* works nicely for the
> trivial middle-layer cases.

 In which cases the middle-layer callbacks don't exist, so it's just like
 reusing driver callbacks directly. :-)
>>
>> I'd like to ask you to clarify one point here and provide some info which I
>> hope can be useful: what exactly does "trivial middle-layer cases" mean?
>>
>> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
>> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or the OMAP
>> device framework struct dev_pm_domain omap_device_pm_domain
>> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
>> tegra_aconnect_pm_ops?
>>
>> If yes, all of the above have PM runtime callbacks.
>
> Trivial ones don't actually do anything meaningful in their PM callbacks.
>
> Things like the platform bus type, spi bus type, i2c bus type and similar.
>
> If the middle-layer callbacks manipulate devices in a significant way, then
> they aren't trivial.

 I fully agree with Rafael's description above, but let me also clarify
 one more thing.

 We have also been discussing PM domains as being trivial and
 non-trivial. In some statements I even think the PM domain has been a
 part of the middle-layer terminology, which may have been a bit
 confusing.

 In this regard, as we consider genpd to be a trivial PM domain, the
 examples you bring up above are to me also examples of trivial PM
 domains. Especially because they don't deal with wakeups, as that is
 taken care of by the drivers, right!?
>>>
>>> Not directly, for example, omap device framework has noirq callback implemented
>>> which forcibly disables all devices which are not PM runtime suspended.
>>> While doing this it calls the driver's PM .runtime_suspend(), which may return
>>> a non-zero value, and in this case the device will be left enabled (powered) at
>>> suspend for wake up purposes (see _od_suspend_noirq()).
>>>
>>
>> Yeah, I had that feeling that omap has some trickiness going on. :-)
>>
>> I'm sure that can be fixed in the omap PM domain, although
> 
> ...slipped with my fingers.. here is the rest of the reply...
> 
> ..of course that requires us to use another way for drivers to signal
> to the omap PM domain that it needs to stay powered as to deal with
> wakeup.
> 
> I can have a look at that more closely, to see if it makes sense to change.
> 

Also, an additional note here: some IPs are reused between OMAP/Davinci/Keystone.
The OMAP PM domain has some code running at noirq time to deal with devices left
in the PM runtime enabled state (OMAP is PM runtime centric), while Davinci/Keystone
don't (clock_ops.c), so the pm_runtime_force_* API is actually a possibility now to
make the same driver work on all these platforms.

-- 
regards,
-grygorii


[PATCH doc/rcu 0/2] Fix docbook regression

2017-10-19 Thread Paul E. McKenney
Hello, Linus,

Commit 764f80798b95 ("doc: Add RCU files to docbook-generation files"),
which is in v4.14-rc1, added :external: options for RCU source files
in the file Documentation/core-api/kernel-api.rst.  However, this now
means nothing, and furthermore breaks builds of the docbook, which has
led to popular demand for this to be fixed in v4.14:

lkml.kernel.org/r/20171018100340.7f34a...@lwn.net

This series therefore contains the following two patches:

1.  Remove the erroneous :external: options.

2.  Fix the many docbook build complaints that have crept into RCU's
    docbook comment headers.  These fixes include one non-comment
    change where the name of rcu_sync_func()'s argument is changed
    to match RCU convention.

Thanx, Paul



 Documentation/core-api/kernel-api.rst |   14 --
 include/linux/rculist.h               |    2 +-
 include/linux/rcupdate.h              |   22 ++
 include/linux/srcu.h                  |    1 +
 kernel/rcu/srcutree.c                 |    2 +-
 kernel/rcu/sync.c                     |    9 ++---
 kernel/rcu/tree.c                     |   18 ++
 7 files changed, 33 insertions(+), 35 deletions(-)



Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-19 Thread Michal Hocko
On Thu 19-10-17 15:45:34, Johannes Weiner wrote:
> On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> > This patchset makes the OOM killer cgroup-aware.
> 
> Hi Andrew,
> 
> I believe this code is ready for merging upstream, and it seems Michal
> is in agreement. There are two main things to consider, however.
> 
> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.

agreed

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users.

and just to make it clear I have also said I will _not_ nack if that is
not the case.

> I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced;

We do not agree here though. I am not convinced that respecting the
cgroup topography is a universal win. It is true that there is no best
OOM victim selection strategy but what we have currently is the simplest
option and as such the most robust one. I can tell from the past year
experience that many of those clever heuristics actually contributed to
lockups and non-deterministic behavior.

> where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2.

which basically means that the behavior will change under many users'
feet because the respective cgroup configuration is chosen by somebody
else (e.g. systemd), so I do not really buy "only v2 behavior"

> Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.

-- 
Michal Hocko
SUSE Labs


Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-19 Thread Johannes Weiner
On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> This patchset makes the OOM killer cgroup-aware.

Hi Andrew,

I believe this code is ready for merging upstream, and it seems Michal
is in agreement. There are two main things to consider, however.

David would have really liked for this patchset to include knobs to
influence how the algorithm picks cgroup victims. The rest of us
agreed that this is beyond the scope of these patches, that the
patches don't need it to be useful, and that there is nothing
preventing anyone from adding configurability later on. David
subsequently nacked the series as he considers it incomplete. Neither
Michal nor I see technical merit in David's nack.

Michal acked the implementation, but on the condition that the new
behavior be opt-in, to not surprise existing users. I *think* we agree
that respecting the cgroup topography during global OOM is what we
should have been doing when cgroups were initially introduced; where
we disagree is that I think users shouldn't have to opt in to
improvements. We have done much more invasive changes to the victim
selection without actual regressions in the past. Further, this change
only applies to mounts of the new cgroup2. Tejun also wasn't convinced
of the risk for regression, and too would prefer cgroup-awareness to
be the default in cgroup2. I would ask for patch 5/6 to be dropped.

Thanks


Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-19 Thread Michal Hocko
On Thu 19-10-17 19:52:15, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
> 
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
> 
> > The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with other leaf memory cgroups.
> Due to memcg statistics implementation a special approximation
> is used for estimating oom_score of root memory cgroup: we sum
> oom_score of the belonging processes (or, to be more precise,
> tasks owning their mm structures).
> 
> Signed-off-by: Roman Gushchin 
> Acked-by: Michal Hocko 

Just to make it clear. My ack is conditional on the opt-in which is
implemented later in the series. Strictly speaking, the system would
behave differently during bisection, and that might lead to
confusion. I guess it would be better to simply disable this feature
until we have means to enable it. But I do not really care strongly
here.

There is another thing that I am more concerned about. Usually you
should drop ack when making further changes or at least call them out
so that the reviewer is aware of them.  In this particular case I am
worried about the fallback code we have discussed previously

[...]
> @@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc)
>   current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>   current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>   get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
>   oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
>   return true;
>   }
>  
> + if (mem_cgroup_select_oom_victim(oc)) {
> + if (oom_kill_memcg_victim(oc))
> + delay = true;
> +
> + goto out;
> + }
> +
[...]
> +out:
> + /*
> +  * Give the killed process a good chance to exit before trying
> +  * to allocate memory again.
> +  */
> + if (delay)
> + schedule_timeout_killable(1);
> +
> + return !!oc->chosen_task;
>  }

this basically means that if you manage to select a memcg victim but
then you won't be able to select any task in that memcg then you would
return false from out_of_memory and that has other consequences. Namely
__alloc_pages_may_oom will not set did_some_progress and so the
allocation path will fail. While this scenario is not very likely we
should behave better. Your previous implementation (which I've acked)
did fall back to the standard oom killer path which is the safest
option. Maybe we can do better, but let's be robust first and clever later.
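
The concern can be modelled with a minimal user-space sketch. This is not
kernel code: struct oom_model, struct task_model and both function names
are made up for illustration, and the real out_of_memory() does a lot more.
It only shows how the return value differs when a memcg victim is selected
but no killable task is found in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; not the real definitions. */
struct task_model { int pid; };

struct oom_model {
	struct task_model *chosen_task; /* task that ends up being killed */
	bool memcg_selected;            /* a memcg victim was picked */
	struct task_model *memcg_task;  /* killable task in that memcg, or NULL */
	struct task_model *global_task; /* pick of the per-process selection */
};

/* No fallback: once a memcg victim is selected, the result is final. */
static bool oom_without_fallback(struct oom_model *oc)
{
	assert(oc != NULL);
	oc->chosen_task = oc->memcg_selected ? oc->memcg_task : oc->global_task;
	return oc->chosen_task != NULL;
}

/* Fallback: use the per-process selection if the memcg path found nothing. */
static bool oom_with_fallback(struct oom_model *oc)
{
	assert(oc != NULL);
	oc->chosen_task = oc->memcg_selected ? oc->memcg_task : NULL;
	if (!oc->chosen_task)
		oc->chosen_task = oc->global_task;
	return oc->chosen_task != NULL;
}
```

With memcg_selected set and memcg_task NULL, the first variant reports
failure even though the per-process selection had a candidate, while the
second one still finds a victim.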
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] documentation: kernel-api: add more info on bitmap functions

2017-10-19 Thread Jonathan Corbet
On Mon, 16 Oct 2017 16:32:51 -0700
Randy Dunlap  wrote:

> There are some good comments about bitmap operations in lib/bitmap.c
> and include/linux/bitmap.h, so format them for document generation and
> pull them into core-api/kernel-api.rst.
> 
> I converted the "tables" of functions from using tabs to using spaces
> so that they are more readable in the source file and in the generated
> output.

Looks good, thanks, applied.  Hopefully Linus won't yell at me about
touching all that stuff in lib/...

jon


Re: [PATCH 0/8] Documentation: fix invalid Documentation refs (2)

2017-10-19 Thread Jonathan Corbet
On Thu, 12 Oct 2017 15:23:26 -0500
Tom Saeger  wrote:

> Batch (2) set of simple document ref fixes.
> 
> 
> Tom Saeger (8):
>   Documentation: fix locking rt-mutex doc refs
>   Documentation: fix ref to sphinx/kerneldoc.py
>   Documentation: fix ref to workqueue content
>   Documentation: fix ref to coccinelle content
>   Documentation: fix ref to trace stm content
>   Documentation: fix ref to power basic-pm-debugging
>   Documentation: fix selftests related file refs
>   Documentation: fix ref to gpio.txt

I've applied the set (except 8/8, which Linus W. already grabbed).

Thanks,

jon


Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()

2017-10-19 Thread SF Markus Elfring
>>> Something like:
>>>
>>> "because there is a dump_stack() done on allocation failures
>>>  without __GFP_NOWARN"
>>
>> What do you think about converting such a description into a special format
>> for further reference documentation?
> 
> I think it's a bad idea if it's a "special" format.

Would it be nicer to represent the corresponding details as
reStructuredText?


> Always write _why_ some code is being changed.
> 
> People could read the commit descriptions and would not need
> to take extra time to lookup external references.

I would appreciate it if I could copy a widely accepted explanation.


> Maybe add something like
> "see (commit  or )" for additional details"

Are there any related extensions possible besides other background information?
Link: 
http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf

Regards,
Markus


[RESEND v12 4/6] mm, oom: introduce memory.oom_group

2017-10-19 Thread Roman Gushchin
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads that do not tolerate such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all tasks belonging
to it should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or a
memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long-established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user-defined configuration might lead to data corruption or other
misbehavior.

The default value is 0.
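
The scoping rule above can be illustrated with a small user-space sketch.
This is one reading of the description, not the code from this patch:
struct cg_model, cg_in_subtree() and kill_unit() are hypothetical names,
and the model assumes that the topmost cgroup with oom_group set inside
the OOM scope is the one treated as the indivisible consumer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup; not struct mem_cgroup. */
struct cg_model {
	const struct cg_model *parent; /* NULL for the root */
	bool oom_group;                /* models memory.oom_group */
};

/* Is 'c' inside the sub-tree rooted at 'root' (inclusive)? */
static bool cg_in_subtree(const struct cg_model *c, const struct cg_model *root)
{
	for (; c; c = c->parent)
		if (c == root)
			return true;
	return false;
}

/*
 * Return the memory consumer that 'leaf' is accounted to during an OOM
 * scoped to 'scope', or NULL if 'leaf' lies outside of the scope.
 * If the result is 'leaf' itself and oom_group is not set on it, only
 * one task in it would be killed.
 */
static const struct cg_model *kill_unit(const struct cg_model *leaf,
					const struct cg_model *scope)
{
	const struct cg_model *c, *unit = leaf;

	assert(leaf != NULL && scope != NULL);
	if (!cg_in_subtree(leaf, scope))
		return NULL;

	/* Walk up to the scope, remembering the topmost oom_group cgroup. */
	for (c = leaf; c; c = c->parent) {
		if (c->oom_group)
			unit = c;
		if (c == scope)
			break;
	}
	return unit;
}
```

For a hierarchy root/A/B with oom_group set on A, a system-wide OOM that
picks B resolves to A, while an OOM scoped to B stays within B.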

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c            | 75 +++---
 mm/oom_kill.c              | 49 +++---
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup selected
+* as OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f364bfed745f..ad10dbdf723b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2785,19 +2785,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
/*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
+   /*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulating score of descendant
+* leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've ran out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group == root_mem_cgroup ||
+   !mem_cgroup_is_descendant(iter, group)) {
+ 

[RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior is not well suited to a system with many running
containers:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with other leaf memory cgroups.
Due to the memcg statistics implementation, a special approximation
is used for estimating the oom_score of the root memory cgroup: we sum
the oom_score of the processes belonging to it (or, to be more precise,
tasks owning their mm structures).
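
As a rough illustration of that approximation, here is a user-space
sketch. It is not the memcg_oom_badness() code from this patch; struct
task_model, its owns_mm flag and root_memcg_score() are hypothetical and
only model "sum the oom_score of the tasks that own their mm".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task attached directly to the root cgroup. */
struct task_model {
	long oom_score; /* per-task oom score */
	bool owns_mm;   /* assumption: true if the task owns its mm struct */
};

/* Approximate the root memcg score as the sum over mm-owning tasks. */
static long root_memcg_score(const struct task_model *tasks, size_t n)
{
	long sum = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].owns_mm)
			sum += tasks[i].oom_score;
	return sum;
}
```

Tasks that share an mm owned by another task are skipped, so a
multi-threaded process is counted once rather than once per thread.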

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 181 +
 mm/oom_kill.c              |  72 +-
 4 files changed, 262 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d30a45a4bbe..f364bfed745f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
 }
 
+static long memcg_oom_badness(str

[RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-19 Thread Roman Gushchin
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention of re-using the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, nor play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a0d1f7..0b9f36117989 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -842,68 +842,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -977,6 +921,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+   wake_oom_reaper(p);
+   task_unlock(p);
+   put_task_struct(p);
+   r

Re: [PATCH] docs: dev-tools: correct Coccinelle version number

2017-10-19 Thread Jonathan Corbet
On Sun, 15 Oct 2017 11:24:08 +0200
Julia Lawall  wrote:

> There is no Coccinelle version 1.2.  1.0.2 must be what was intended.
> 
> Signed-off-by: Julia Lawall 

Applied, thanks.

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-19 Thread Roman Gushchin
Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup, so that the cgroup-aware OOM killer can use this
function to look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
that belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.
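
The effect of the changed loop condition can be sketched in user space.
The names below (struct cg_model, scan_tasks_model(), count_task()) are
hypothetical, and the pre-order array merely stands in for
for_each_mem_cgroup_tree(); the point is only that the walk stops after
the first cgroup when the scan starts at the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical models; not the kernel's task_struct or mem_cgroup. */
struct task_model { int pid; };

struct cg_model {
	const struct task_model *tasks; /* tasks attached directly */
	size_t nr_tasks;
};

typedef int (*scan_fn)(const struct task_model *task, void *arg);

/*
 * tree[0] is the cgroup being scanned, tree[1..n-1] are its descendants
 * in pre-order. Stops early on a non-zero callback result, or after the
 * first cgroup when scanning the root.
 */
static int scan_tasks_model(const struct cg_model *tree, size_t n,
			    bool is_root, scan_fn fn, void *arg)
{
	int ret = 0;

	assert(tree != NULL && fn != NULL);
	for (size_t i = 0; i < n; i++) {
		for (size_t j = 0; !ret && j < tree[i].nr_tasks; j++)
			ret = fn(&tree[i].tasks[j], arg);
		if (ret || is_root)
			break;
	}
	return ret;
}

/* Example callback: count visited tasks, never abort the scan. */
static int count_task(const struct task_model *task, void *arg)
{
	(void)task;
	++*(int *)arg;
	return 0;
}
```

Scanning a two-level tree as the root visits only the tasks of the first
cgroup; scanning the same tree as a non-root memcg visits all of them.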

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50e6906314f8..1d30a45a4bbe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.13.6



[RESEND v12 0/6] cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
This patchset makes the OOM killer cgroup-aware.

v12:
  - Root memory cgroup is evaluated based on sum of the oom scores
of belonging tasks
  - Do not fall back to the per-process behavior if
it wasn't possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
way OOM victims get access to the memory
reserves. Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
synchronization is based on finding tasks with OOM victim marker,
rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 258 +++-
 mm/oom_kill.c               | 212
 7 files changed, 506 insertions(+), 76 deletions(-)

-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND v12 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0bbdc720dd7c..69db5bf9c580 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1031,6 +1032,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, the OOM killer will consider the memory cgroup as an
+   indivisible memory consumer and compare it with other memory
+   consumers by its memory footprint.
+   If such a memory cgroup is selected as an OOM victim, all
+   processes belonging to it or its descendants will be killed.
+
+   This applies to system-wide OOM conditions and reaching
+   the hard memory limit of the cgroup and its ancestors.
+   If an OOM condition happens in a descendant cgroup with its own
+   memory limit, the memory cgroup can't be considered
+   as an OOM victim, and the OOM killer will not kill all of its
+   tasks.
+
+   Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
+   and will never kill the unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1234,6 +1257,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6
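
For readers following the documentation text above, the selection rule
can be sketched in plain userspace C. This is only a flattened model
under stated assumptions: struct demo_memcg and both helper functions
are hypothetical, the cgroup hierarchy is ignored, and none of this is
the kernel implementation. It only illustrates "largest footprint among
leaf cgroups and cgroups with memory.oom_group set".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of a memory cgroup; not a kernel structure. */
struct demo_memcg {
	const char *name;
	unsigned long footprint;	/* memory footprint, arbitrary units */
	bool is_leaf;			/* has no child cgroups */
	bool oom_group;			/* memory.oom_group is set */
};

/*
 * Return the candidate with the largest footprint, considering only
 * leaf cgroups and cgroups with oom_group set. Returns NULL if there
 * are no candidates.
 */
static const struct demo_memcg *
demo_select_victim(const struct demo_memcg *cgs, size_t n)
{
	const struct demo_memcg *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Non-leaf cgroups only compete when oom_group is set. */
		if (!cgs[i].is_leaf && !cgs[i].oom_group)
			continue;
		if (!victim || cgs[i].footprint > victim->footprint)
			victim = &cgs[i];
	}
	return victim;
}

/* True if all killable tasks in the victim go, not just the biggest one. */
static bool demo_kill_all_tasks(const struct demo_memcg *victim)
{
	return victim->oom_group;
}
```

In this model a non-leaf cgroup without oom_group never competes by
itself, however large it is; only its leaves do. Tasks with
oom_score_adj set to -1000 would still be skipped, which the sketch
does not model.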



[RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c7086c8835da..0e1685ca1d7b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad10dbdf723b..eb1e15385782 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2875,6 +2875,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.13.6
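
The option handling in this patch follows a simple pattern: each
recognised token sets a bit, and applying the flags either sets or
clears that bit, which is why a remount without "groupoom" switches the
feature off again. A minimal userspace sketch of that pattern follows.
The DEMO_* names and both functions are hypothetical, the value used
for the nsdelegate bit is an assumption (only the groupoom bit, 1 << 5,
appears in the patch), and strtok() is used here purely for brevity.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_ROOT_NS_DELEGATE	(1u << 3)	/* assumed value, not shown in the patch */
#define DEMO_GROUP_OOM		(1u << 5)	/* value taken from the patch */

/*
 * Parse a comma-separated option string such as "nsdelegate,groupoom".
 * Returns 0 on success and -1 on an unknown option. The input buffer
 * is modified in place by strtok().
 */
static int demo_parse_root_flags(char *data, unsigned int *root_flags)
{
	char *token;

	*root_flags = 0;
	if (!data)
		return 0;

	for (token = strtok(data, ","); token; token = strtok(NULL, ",")) {
		if (!strcmp(token, "nsdelegate")) {
			*root_flags |= DEMO_ROOT_NS_DELEGATE;
			continue;
		}
		if (!strcmp(token, "groupoom")) {
			*root_flags |= DEMO_GROUP_OOM;
			continue;
		}
		fprintf(stderr, "unknown option \"%s\"\n", token);
		return -1;
	}
	return 0;
}

/* Set or clear the groupoom bit on the current flags, as a remount would. */
static void demo_apply_root_flags(unsigned int *flags, unsigned int root_flags)
{
	if (root_flags & DEMO_GROUP_OOM)
		*flags |= DEMO_GROUP_OOM;
	else
		*flags &= ~DEMO_GROUP_OOM;
}
```

Calling demo_apply_root_flags() with a flag word that lacks the
groupoom bit clears it, which corresponds to falling back to the
per-process selection described in the commit message.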



Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
On 19 October 2017 at 20:04, Ulf Hansson  wrote:
> On 19 October 2017 at 19:21, Grygorii Strashko  
> wrote:
>>
>>
>> On 10/19/2017 03:33 AM, Ulf Hansson wrote:
>>> On 18 October 2017 at 23:48, Rafael J. Wysocki  wrote:
 On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:
>
> On 10/18/2017 09:11 AM, Ulf Hansson wrote:

 [...]

 That's the point. We know pm_runtime_force_* works nicely for the
 trivial middle-layer cases.
>>>
>>> In which cases the middle-layer callbacks don't exist, so it's just like
>>> reusing driver callbacks directly. :-)
>
> I'd like to ask you clarify one point here and provide some info which I 
> hope can be useful -
> what's exactly means  "trivial middle-layer cases"?
>
> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or 
> OMAP
> device framework struct dev_pm_domain omap_device_pm_domain
> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
> tegra_aconnect_pm_ops?
>
> if yes all above have PM runtime callbacks.

 Trivial ones don't actually do anything meaningful in their PM callbacks.

 Things like the platform bus type, spi bus type, i2c bus type and similar.

 If the middle-layer callbacks manipulate devices in a significant way, then
 they aren't trivial.
>>>
>>> I fully agree with Rafael's description above, but let me also clarify
>>> one more thing.
>>>
>>> We have also been discussing PM domains as being trivial and
>>> non-trivial. In some statements I even think the PM domain has been a
>>> part the middle-layer terminology, which may have been a bit
>>> confusing.
>>>
>>> In this regards as we consider genpd being a trivial PM domain, those
>>> examples your bring up above is too me also examples of trivial PM
>>> domains. Especially because they don't deal with wakeups, as that is
>>> taken care of by the drivers, right!?
>>
>> Not directly, for example, omap device framework has noirq callback 
>> implemented
>> which forcibly disable all devices which are not PM runtime suspended.
>> while doing this it calls drivers PM .runtime_suspend() which may return
>> non 0 value and in this case device will be left enabled (powered) at 
>> suspend for
>> wake up purposes (see _od_suspend_noirq()).
>>
>
> Yeah, I had that feeling that omap has some trickyness going on. :-)
>
> I sure that can be fixed in the omap PM domain, although

...my fingers slipped; here is the rest of the reply...

...of course, that requires us to use another way for drivers to signal
to the omap PM domain that it needs to stay powered so as to deal with
wakeup.

I can have a look at that more closely, to see if it makes sense to change.

Kind regards
Uffe


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
On 19 October 2017 at 19:21, Grygorii Strashko  wrote:
>
>
> On 10/19/2017 03:33 AM, Ulf Hansson wrote:
>> On 18 October 2017 at 23:48, Rafael J. Wysocki  wrote:
>>> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:

 On 10/18/2017 09:11 AM, Ulf Hansson wrote:
>>>
>>> [...]
>>>
>>> That's the point. We know pm_runtime_force_* works nicely for the
>>> trivial middle-layer cases.
>>
>> In which cases the middle-layer callbacks don't exist, so it's just like
>> reusing driver callbacks directly. :-)

 I'd like to ask you clarify one point here and provide some info which I 
 hope can be useful -
 what's exactly means  "trivial middle-layer cases"?

 Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
 manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP
 device framework struct dev_pm_domain omap_device_pm_domain
 (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
 tegra_aconnect_pm_ops?

 if yes all above have PM runtime callbacks.
>>>
>>> Trivial ones don't actually do anything meaningful in their PM callbacks.
>>>
>>> Things like the platform bus type, spi bus type, i2c bus type and similar.
>>>
>>> If the middle-layer callbacks manipulate devices in a significant way, then
>>> they aren't trivial.
>>
>> I fully agree with Rafael's description above, but let me also clarify
>> one more thing.
>>
>> We have also been discussing PM domains as being trivial and
>> non-trivial. In some statements I even think the PM domain has been a
>> part the middle-layer terminology, which may have been a bit
>> confusing.
>>
>> In this regards as we consider genpd being a trivial PM domain, those
>> examples your bring up above is too me also examples of trivial PM
>> domains. Especially because they don't deal with wakeups, as that is
>> taken care of by the drivers, right!?
>
> Not directly, for example, omap device framework has noirq callback 
> implemented
> which forcibly disable all devices which are not PM runtime suspended.
> while doing this it calls drivers PM .runtime_suspend() which may return
> non 0 value and in this case device will be left enabled (powered) at suspend 
> for
> wake up purposes (see _od_suspend_noirq()).
>

Yeah, I had that feeling that omap has some trickiness going on. :-)

I'm sure that can be fixed in the omap PM domain, although


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
[...]

>>> > Say you want to leave the parent suspended after system resume, but the
>>> > child drivers use pm_runtime_force_suspend|resume().  The parent would 
>>> > then
>>> > need to use pm_runtime_force_suspend|resume() too, no?
>>>
>>> Actually no.
>>>
>>> Currently the other options of "deferring resume" (not using
>>> pm_runtime_force_*), is either using the "direct_complete" path or
>>> similar to the approach you took for the i2c designware driver.
>>>
>>> Both cases should play nicely in combination of a child being managed
>>> by pm_runtime_force_*. That's because only when the parent device is
>>> kept runtime suspended during system suspend, resuming can be
>>> deferred.
>>
>> And because the parent remains in runtime suspend late enough in the
>> system suspend path, its children also are guaranteed to be suspended.
>
> Yes.
>
>>
>> But then all of them need to be left in runtime suspend during system
>> resume too, which is somewhat restrictive, because some drivers may
>> want their devices to be resumed then.
>
> Actually, this scenario is also addressed when using the pm_runtime_force_*.
>
> The driver for the child would only need to bump the runtime PM usage
> count (pm_runtime_get_noresume()) before calling
> pm_runtime_force_suspend() at system suspend. That then also
> propagates to the parent, leading to that both the parent and the
> child will be resumed when pm_runtime_force_resume() is called for
> them.

I need to correct myself here. The above currently only works if the
child is runtime resumed while pm_runtime_force_suspend() is called.

The logic in pm_runtime_force_* needs to be improved to take care of
such scenarios. However I think that should be rather easy to fix, if
we want that.

[...]

Kind regards
Uffe


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Grygorii Strashko


On 10/19/2017 03:33 AM, Ulf Hansson wrote:
> On 18 October 2017 at 23:48, Rafael J. Wysocki  wrote:
>> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:
>>>
>>> On 10/18/2017 09:11 AM, Ulf Hansson wrote:
>>
>> [...]
>>
>> That's the point. We know pm_runtime_force_* works nicely for the
>> trivial middle-layer cases.
>
> In which cases the middle-layer callbacks don't exist, so it's just like
> reusing driver callbacks directly. :-)
>>>
>>> I'd like to ask you clarify one point here and provide some info which I 
>>> hope can be useful -
>>> what's exactly means  "trivial middle-layer cases"?
>>>
>>> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
>>> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP
>>> device framework struct dev_pm_domain omap_device_pm_domain
>>> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
>>> tegra_aconnect_pm_ops?
>>>
>>> if yes all above have PM runtime callbacks.
>>
>> Trivial ones don't actually do anything meaningful in their PM callbacks.
>>
>> Things like the platform bus type, spi bus type, i2c bus type and similar.
>>
>> If the middle-layer callbacks manipulate devices in a significant way, then
>> they aren't trivial.
> 
> I fully agree with Rafael's description above, but let me also clarify
> one more thing.
> 
> We have also been discussing PM domains as being trivial and
> non-trivial. In some statements I even think the PM domain has been a
> part the middle-layer terminology, which may have been a bit
> confusing.
> 
> In this regards as we consider genpd being a trivial PM domain, those
> examples your bring up above is too me also examples of trivial PM
> domains. Especially because they don't deal with wakeups, as that is
> taken care of by the drivers, right!?

Not directly. For example, the omap device framework implements a noirq
callback which forcibly disables all devices that are not PM runtime
suspended. While doing this, it calls the driver's PM .runtime_suspend(),
which may return a non-zero value; in that case the device will be left
enabled (powered) during suspend for wake-up purposes
(see _od_suspend_noirq()).


-- 
regards,
-grygorii


Re: [PATCH] kbuild doc: a bundle of fixes on makefiles.txt

2017-10-19 Thread Masahiro Yamada
2017-10-19 12:17 GMT+09:00 Cao jin :
> It does several fixes:
> 1. move the displaced ld example to its reasonable place.
> 2. add new example for command gzip.
> 3. fix 2 number errors.
> 4. fix format of chapter 7.x, make it look the same as other chapters.
>
> Signed-off-by: Cao jin 
> ---

Applied to linux-kbuild/fixes.  Thanks!



-- 
Best Regards
Masahiro Yamada


Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver

2017-10-19 Thread Mark Rutland
On Thu, Oct 19, 2017 at 04:28:35PM +0100, Will Deacon wrote:
> On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
> > Will, are you happy to queue this?
> > 
> > There's a minor fixup [1] needed in patch 2, but otherwise this looks
> > good to me, and builds cleanly.
> > 
> > I've pushed out a branch [2] with that fix folded in, in case that's
> > easier for you. Otherwise, feel free to pick these up with my Ack.
> 
> I'm just running some build tests on these. I also tweaked your fix slightly
> -- can you check the diff below please?

That's nicer!

My ack stands with that folded in.

Mark.

> diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> index 2bff43f0736b..c74542af4acf 100644
> --- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
> +++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
> @@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event *event)
>   /* Include count for the event */
>   int counters = 1;
>  
> - /*
> -  * We must NOT create groups containing mixed PMUs, although
> -  * software events are acceptable
> -  */
> - if (leader->pmu != event->pmu && !is_software_event(leader))
> - return false;
> + if (!is_software_event(leader)) {
> + /*
> +  * We must NOT create groups containing mixed PMUs, although
> +  * software events are acceptable
> +  */
> + if (leader->pmu != event->pmu)
> + return false;
>  
> - /* Increment counter for the leader */
> - counters++;
> + /* Increment counter for the leader */
> + if (leader != event)
> + counters++;
> + }
>  
>   list_for_each_entry(sibling, &event->group_leader->sibling_list,
>   group_entry) {


Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver

2017-10-19 Thread Will Deacon
On Thu, Oct 19, 2017 at 01:29:18PM +0100, Mark Rutland wrote:
> Will, are you happy to queue this?
> 
> There's a minor fixup [1] needed in patch 2, but otherwise this looks
> good to me, and builds cleanly.
> 
> I've pushed out a branch [2] with that fix folded in, in case that's
> easier for you. Otherwise, feel free to pick these up with my Ack.

I'm just running some build tests on these. I also tweaked your fix slightly
-- can you check the diff below please?

Will

--->8

diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
index 2bff43f0736b..c74542af4acf 100644
--- a/drivers/perf/hisilicon/hisi_uncore_pmu.c
+++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
@@ -69,15 +69,18 @@ static bool hisi_validate_event_group(struct perf_event *event)
/* Include count for the event */
int counters = 1;
 
-   /*
-* We must NOT create groups containing mixed PMUs, although
-* software events are acceptable
-*/
-   if (leader->pmu != event->pmu && !is_software_event(leader))
-   return false;
+   if (!is_software_event(leader)) {
+   /*
+* We must NOT create groups containing mixed PMUs, although
+* software events are acceptable
+*/
+   if (leader->pmu != event->pmu)
+   return false;
 
-   /* Increment counter for the leader */
-   counters++;
+   /* Increment counter for the leader */
+   if (leader != event)
+   counters++;
+   }
 
list_for_each_entry(sibling, &event->group_leader->sibling_list,
group_entry) {
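
To make the effect of the reworked check easier to see, here is a small
userspace model of the counting logic in the diff above. The old code
incremented the count for the leader unconditionally, so an event that
is its own group leader was counted twice. struct demo_event and
demo_count_counters() are hypothetical stand-ins for struct perf_event
and the driver function; siblings are left out to keep the sketch
short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct perf_event; only what the sketch needs. */
struct demo_event {
	int pmu_id;				/* which PMU the event belongs to */
	bool is_software;			/* true for software events */
	const struct demo_event *leader;	/* group leader, may be the event itself */
};

/*
 * Count the hardware counters needed by the event and its leader,
 * following the reworked check. Returns -1 for a mixed-PMU group.
 */
static int demo_count_counters(const struct demo_event *event)
{
	const struct demo_event *leader = event->leader;
	int counters = 1;	/* the event itself */

	if (!leader->is_software) {
		/* Hardware leader on another PMU: reject the group. */
		if (leader->pmu_id != event->pmu_id)
			return -1;
		/* Count the leader only if it is a distinct event. */
		if (leader != event)
			counters++;
	}
	return counters;
}
```

With this model, an event leading its own group needs one counter, a
hardware event joining a hardware leader on the same PMU needs two, and
a software leader contributes nothing.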


Re: [PATCH 1/2] mm, thp: introduce dedicated transparent huge page allocation interfaces

2017-10-19 Thread Michal Hocko
On Wed 18-10-17 19:00:26, Du, Changbin wrote:
> Hi Hocko,
> 
> On Tue, Oct 17, 2017 at 12:20:52PM +0200, Michal Hocko wrote:
> > [CC Kirill]
> > 
> > On Mon 16-10-17 17:19:16, changbin...@intel.com wrote:
> > > From: Changbin Du 
> > > 
> > > This patch introduced 4 new interfaces to allocate a prepared
> > > transparent huge page.
> > >   - alloc_transhuge_page_vma
> > >   - alloc_transhuge_page_nodemask
> > >   - alloc_transhuge_page_node
> > >   - alloc_transhuge_page
> > > 
> > > The aim is to remove duplicated code and simplify transparent
> > > huge page allocation. These are similar to alloc_hugepage_xxx
> > > which are for hugetlbfs pages. This patch does below changes:
> > >   - define alloc_transhuge_page_xxx interfaces
> > >   - apply them to all existing code
> > >   - declare prep_transhuge_page as static since no others use it
> > >   - remove alloc_hugepage_vma definition since it no longer has users
> > 
> > So what exactly is the advantage of the new API? The diffstat doesn't
> > sound very convincing to me.
> >
> The caller only needs one step to allocate a THP. Several LOCs are removed
> on the caller side with this change, so it's a little more convenient.

Yeah, but the overall result is more code. So I am not really convinced. 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver

2017-10-19 Thread Mark Rutland
Will, are you happy to queue this?

There's a minor fixup [1] needed in patch 2, but otherwise this looks
good to me, and builds cleanly.

I've pushed out a branch [2] with that fix folded in, in case that's
easier for you. Otherwise, feel free to pick these up with my Ack.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-October/538016.html
[2] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git perf-drivers/hisilicon-soc

On Thu, Oct 19, 2017 at 07:05:15PM +0800, Shaokun Zhang wrote:
> This patchset adds support for HiSilicon SoC uncore PMUs driver. It
> includes L3C, Hydra Home Agent (HHA) and DDRC.
> 
> Changes in v6:
> * remove redundant member hisi_pmu::oneline_cpus
> * rename member hisi_pmu::id
> * add event code check when event init
> * fix online/offline notifier for L3C/HHA/DDRC
> 
> Changes in v5:
> * remove unnecessary name/num_events member in hisi_pmu
> * refactor hisi_pmu_hwevents structure
> * remove hisi_pmu_alloc function
> * revise cpuhotplug for L3C PMUs
> * add cpuhotplug for HHA/DDRC PMUs
> * fix the name format of uncore PMUs
> * remove unnecessary variants
> 
> Changes in v4:
> * remove redundant code and comments
> * reverse the functions order in exit function
> * remove some GPL information
> * revise including header file
> * fix Jonathan's other comments
> 
> Changes in v3:
> * rebase to 4.13-rc1
> * add dev_err if ioremap fails for PMUs
>  
> Changes in v2:
> * fix kbuild test robot error
> * make hisi_uncore_ops static
> 
> Shaokun Zhang (6):
>   Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver
>   perf: hisi: Add support for HiSilicon SoC uncore PMU driver
>   perf: hisi: Add support for HiSilicon SoC L3C PMU driver
>   perf: hisi: Add support for HiSilicon SoC HHA PMU driver
>   perf: hisi: Add support for HiSilicon SoC DDRC PMU driver
>   arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support
> 
>  Documentation/perf/hisi-pmu.txt   |  53 +++
>  MAINTAINERS   |   7 +
>  drivers/perf/Kconfig  |   7 +
>  drivers/perf/Makefile |   1 +
>  drivers/perf/hisilicon/Makefile   |   1 +
>  drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 +
>  drivers/perf/hisilicon/hisi_uncore_hha_pmu.c  | 473 ++
>  drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c  | 463 +
>  drivers/perf/hisilicon/hisi_uncore_pmu.c  | 444 
>  drivers/perf/hisilicon/hisi_uncore_pmu.h  | 102 ++
>  include/linux/cpuhotplug.h|   3 +
>  11 files changed, 2017 insertions(+)
>  create mode 100644 Documentation/perf/hisi-pmu.txt
>  create mode 100644 drivers/perf/hisilicon/Makefile
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h
> 
> -- 
> 1.9.1
> 


Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
On 19 October 2017 at 00:12, Rafael J. Wysocki  wrote:
> On Wednesday, October 18, 2017 4:11:33 PM CEST Ulf Hansson wrote:
>> [...]
>>
>> >>
>> >> The reason why pm_runtime_force_* needs to respects the hierarchy of
>> >> the RPM callbacks, is because otherwise it can't safely update the
>> >> runtime PM status of the device.
>> >
>> > I'm not sure I follow this requirement.  Why is that so?
>>
>> If the PM domain controls some resources for the device in its RPM
>> callbacks and the driver controls some other resources in its RPM
>> callbacks - then these resources needs to be managed together.
>
> Right, but that doesn't automatically make it necessary to use runtime PM
> callbacks in the middle layer.  Its system-wide PM callbacks may be
> suitable for that just fine.
>
> That is, at least in some cases, you can combine ->runtime_suspend from a
> driver and ->suspend_late from a middle layer with no problems, for example.
>
> That's why some middle layers allow drivers to point ->suspend_late and
> ->runtime_suspend to the same routine if they want to reuse that code.
>
>> This follows the behavior of when a regular call to
>> pm_runtime_get|put(), triggers the RPM callbacks to be invoked.
>
> But (a) it doesn't have to follow it and (b) in some cases it should not
> follow it.

Of course you don't explicitly *have to* respect the hierarchy of the
RPM callbacks in pm_runtime_force_*.

However, changing that would require some additional information
exchange between the driver and the middle-layer/PM domain, so as to
instruct the middle-layer/PM domain what to do during system-wide
PM. This is especially so in cases where the driver deals with wakeup,
as in those cases the instructions may change dynamically.

[...]

>> > In general, not if the wakeup settings are adjusted by the middle layer.
>>
>> Correct!
>>
>> To use pm_runtime_force* for these cases, one would need some
>> additional information exchange between the driver and the
>> middle-layer.
>
> Which pretty much defeats the purpose of the wrappers, doesn't it?

Well, no matter if the wrappers are used or not, we need some kind of
information exchange between the driver and the middle-layers/PM
domains.

Anyway, I personally think it's too early to conclude that using the
wrappers may not be useful going forward. At this point, they clearly
help trivial cases remain trivial.

>
>> >
>> >> Regarding hibernation, honestly that's not really my area of
>> >> expertise. Although, I assume the middle-layer and driver can treat
>> >> that as a separate case, so if it's not suitable to use
>> >> pm_runtime_force* for that case, then they shouldn't do it.
>> >
>> > Well, agreed.
>> >
>> > In some simple cases, though, driver callbacks can be reused for 
>> > hibernation
>> > too, so it would be good to have a common way to do that too, IMO.
>>
>> Okay, that makes sense!
>>
>> >
>> >> >
>> >> > Also, quite so often other middle layers interact with PCI directly or
>> >> > indirectly (eg. a platform device may be a child or a consumer of a PCI
>> >> > device) and some optimizations need to take that into account (eg. 
>> >> > parents
>> >> > generally need to be accessible when their childres are resumed and so 
>> >> > on).
>> >>
>> >> A device's parent becomes informed when changing the runtime PM status
>> >> of the device via pm_runtime_force_suspend|resume(), as those calls
>> >> pm_runtime_set_suspended|active().
>> >
>> > This requires the parent driver or middle layer to look at the reference
>> > counter and understand it the same way as pm_runtime_force_*.
>> >
>> >> In case that isn't that sufficient, what else is needed? Perhaps you can
>> >> point me to an example so I can understand better?
>> >
>> > Say you want to leave the parent suspended after system resume, but the
>> > child drivers use pm_runtime_force_suspend|resume().  The parent would then
>> > need to use pm_runtime_force_suspend|resume() too, no?
>>
>> Actually no.
>>
>> Currently the other options of "deferring resume" (not using
>> pm_runtime_force_*), is either using the "direct_complete" path or
>> similar to the approach you took for the i2c designware driver.
>>
>> Both cases should play nicely in combination of a child being managed
>> by pm_runtime_force_*. That's because only when the parent device is
>> kept runtime suspended during system suspend, resuming can be
>> deferred.
>
> And because the parent remains in runtime suspend late enough in the
> system suspend path, its children also are guaranteed to be suspended.

Yes.

>
> But then all of them need to be left in runtime suspend during system
> resume too, which is somewhat restrictive, because some drivers may
> want their devices to be resumed then.

Actually, this scenario is also addressed when using the pm_runtime_force_*.

The driver for the child would only need to bump the runtime PM usage
count (pm_runtime_get_noresume()) before calling
pm_runtime_force_suspend() at system suspend. That then also
propag

Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()

2017-10-19 Thread Joe Perches
On Thu, 2017-10-19 at 13:35 +0200, SF Markus Elfring wrote:
> > > > > Omit an extra message for a memory allocation failure in this 
> > > > > function.
> > > > > 
> > > > > This issue was detected by using the Coccinelle software.
[]
> > > Do you see any need that I should extend subsequent commit messages
> > > for this software transformation pattern?
> > 
> > Add a description of _why_ this is being done.
> > 
> > Something like:
> > 
> > "because there is a dump_stack() done on allocation failures
> > >  without __GFP_NOWARN"
> 
> What do you think about converting such a description into a special format
> for further reference documentation?

I think it's a bad idea if it's a "special" format.

Always write _why_ some code is being changed.

People can then read the commit descriptions and do not need
to take extra time to look up external references.

Maybe add something like
"see (commit  or )" for additional details"

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()

2017-10-19 Thread SF Markus Elfring
 Omit an extra message for a memory allocation failure in this function.

 This issue was detected by using the Coccinelle software.

 Signed-off-by: Markus Elfring 
>>>
>>> Applied to modules-next, thanks.
>>
>> Thanks for your acceptance of this update suggestion after a bit of 
>> clarification.
>>
>> Do you see any need that I should extend subsequent commit messages
>> for this software transformation pattern?
> 
> Add a description of _why_ this is being done.
> 
> Something like:
> 
> "because there is a dump_stack() done on allocation failures
>  without __GFP_NOWARN"

What do you think about converting such a description into a special format
for further reference documentation?

Regards,
Markus


Re: [PATCH v6 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver

2017-10-19 Thread Mark Rutland
On Thu, Oct 19, 2017 at 07:05:17PM +0800, Shaokun Zhang wrote:
> This patch adds support for the HiSilicon SoC uncore PMU driver framework
> and interfaces.

> +static bool hisi_validate_event_group(struct perf_event *event)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +	struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu);
> +	/* Include count for the event */
> +	int counters = 1;
> +
> +	/*
> +	 * We must NOT create groups containing mixed PMUs, although
> +	 * software events are acceptable
> +	 */
> +	if (leader->pmu != event->pmu && !is_software_event(leader))
> +		return false;
> +
> +	/* Increment counter for the leader */
> +	counters++;

Sorry I didn't spot this before, but I believe this should be:

	if (event != leader && !is_software_event(leader))
		counters++;

Since the leader can be a SW event, and for the group leader itself,
event == leader.
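
[Illustration only, not part of the patch or of this review: a minimal
userspace sketch of the counting rule described above. The names
model_event and model_validate_group are hypothetical stand-ins, not the
kernel's perf API, and a pmu_id of 0 is assumed to mean a software event.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct perf_event: only what the rule needs. */
struct model_event {
	int pmu_id;                       /* 0 is assumed to mean "software" */
	const struct model_event *leader; /* group leader; may be the event itself */
};

static bool model_is_software(const struct model_event *e)
{
	return e->pmu_id == 0;
}

/*
 * Count the hardware counters a group would need once 'event' joins it.
 * 'siblings' lists the events already attached to the leader.
 */
static bool model_validate_group(const struct model_event *event,
				 const struct model_event *const *siblings,
				 size_t nr_siblings, int num_counters)
{
	const struct model_event *leader = event->leader;
	int counters = 1; /* the event itself */
	size_t i;

	assert(leader != NULL);

	/* Mixed hardware PMUs are rejected; a software leader is fine. */
	if (leader->pmu_id != event->pmu_id && !model_is_software(leader))
		return false;

	/* Count the leader only if it is a distinct hardware event. */
	if (event != leader && !model_is_software(leader))
		counters++;

	for (i = 0; i < nr_siblings; i++) {
		if (model_is_software(siblings[i]))
			continue;
		if (siblings[i]->pmu_id != event->pmu_id)
			return false;
		counters++;
	}

	return counters <= num_counters;
}
```

With the unconditional increment, a lone event that is its own leader
would be charged two counters; in this sketch it is charged one.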

Assuming there aren't any major issues elsewhere, I can fix this up when
applying the series.

Thanks,
Mark.


Re: kernel/module: Delete an error message for a failed memory allocation in add_module_usage()

2017-10-19 Thread SF Markus Elfring
>> This is a small allocation so it can't fail in current kernels.  I can't
>> imagine a situation where this could fail and it wasn't dead easy to
>> debug.  Most modules are loaded at boot so it's not likely to fail, but
>> if it did, it would be easy to reproduce.  If it's not loaded at boot
>> it's probably really easy to tell which module we're loading.
> 
> Yeah, good points. And on second thought, we normally don't print
> warnings for every small alloc failure in the kernel anyway (that
> would be utterly superfluous), the error code itself is sufficient.
> And in the module loader this seems to be the only printk out of the
> dozen alloc calls we do, so I'm OK with removing this one.

Thanks for your constructive feedback.

Would it help to improve the corresponding documentation for Linux
programming interfaces a bit more?

Regards,
Markus


[PATCH v6 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver

2017-10-19 Thread Shaokun Zhang
This patch adds support for the HiSilicon SoC uncore PMU driver framework
and interfaces.

Reviewed-by: Jonathan Cameron 
Signed-off-by: Shaokun Zhang 
Signed-off-by: Anurup M 
---
 drivers/perf/Kconfig |   7 +
 drivers/perf/Makefile|   1 +
 drivers/perf/hisilicon/Makefile  |   1 +
 drivers/perf/hisilicon/hisi_uncore_pmu.c | 444 +++
 drivers/perf/hisilicon/hisi_uncore_pmu.h | 102 +++
 5 files changed, 555 insertions(+)
 create mode 100644 drivers/perf/hisilicon/Makefile
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index e5197ff..b1a3894 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -17,6 +17,13 @@ config ARM_PMU_ACPI
depends on ARM_PMU && ACPI
def_bool y
 
+config HISI_PMU
+   bool "HiSilicon SoC PMU"
+   depends on ARM64 && ACPI
+   help
+ Support for HiSilicon SoC uncore performance monitoring
+ unit (PMU), such as: L3C, HHA and DDRC.
+
 config QCOM_L2_PMU
bool "Qualcomm Technologies L2-cache PMU"
depends on ARCH_QCOM && ARM64 && ACPI
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 6420bd4..41d3342 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_ARM_PMU) += arm_pmu.o arm_pmu_platform.o
 obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
+obj-$(CONFIG_HISI_PMU) += hisilicon/
 obj-$(CONFIG_QCOM_L2_PMU)  += qcom_l2_pmu.o
 obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
new file mode 100644
index 000..2783bb3
--- /dev/null
+++ b/drivers/perf/hisilicon/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_pmu.c b/drivers/perf/hisilicon/hisi_uncore_pmu.c
new file mode 100644
index 000..2bff43f
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_pmu.c
@@ -0,0 +1,444 @@
+/*
+ * HiSilicon SoC Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Anurup M 
+ * Shaokun Zhang 
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+#define HISI_GET_EVENTID(ev) (ev->hw.config_base & 0xff)
+#define HISI_MAX_PERIOD(nr) (BIT_ULL(nr) - 1)
+
+/*
+ * PMU format attributes
+ */
+ssize_t hisi_format_sysfs_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   struct dev_ext_attribute *eattr;
+
+   eattr = container_of(attr, struct dev_ext_attribute, attr);
+
+   return sprintf(buf, "%s\n", (char *)eattr->var);
+}
+
+/*
+ * PMU event attributes
+ */
+ssize_t hisi_event_sysfs_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+   struct dev_ext_attribute *eattr;
+
+   eattr = container_of(attr, struct dev_ext_attribute, attr);
+
+   return sprintf(page, "config=0x%lx\n", (unsigned long)eattr->var);
+}
+
+/*
+ * sysfs cpumask attributes. For uncore PMU, we only have a single CPU to show
+ */
+ssize_t hisi_cpumask_sysfs_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct hisi_pmu *hisi_pmu = to_hisi_pmu(dev_get_drvdata(dev));
+
+   return sprintf(buf, "%d\n", hisi_pmu->on_cpu);
+}
+
+static bool hisi_validate_event_group(struct perf_event *event)
+{
+   struct perf_event *sibling, *leader = event->group_leader;
+   struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu);
+   /* Include count for the event */
+   int counters = 1;
+
+   /*
+* We must NOT create groups containing mixed PMUs, although
+* software events are acceptable
+*/
+   if (leader->pmu != event->pmu && !is_software_event(leader))
+   return false;
+
+   /* Increment counter for the leader */
+   counters++;
+
+   list_for_each_entry(sibling, &event->group_leader->sibling_list,
+   group_entry) {
+   if (is_software_event(sibling))
+   continue;
+   if (sibling->pmu != event->pmu)
+   return false;
+   /* Increment counter for each sibling */
+   counters++;
+   }
+
+   /* The group can not count events more than the counters in the HW */
+   return counters <= hisi_pmu->num_counters;
+}
+
+int hisi_uncore_pmu_counter_valid(struct hisi_pmu *hisi_pmu, int idx)
+{
+   return idx >= 0 && idx < 

[PATCH v6 3/6] perf: hisi: Add support for HiSilicon SoC L3C PMU driver

2017-10-19 Thread Shaokun Zhang
This patch adds support for the L3C PMU driver in the HiSilicon SoC chip.
Each L3C has its own control, counter and interrupt registers and is a
separate PMU. Each L3C PMU has 8 programmable counters, and each counter
is free-running. An interrupt is supported to handle counter (48-bit)
overflow.
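
[Illustration only, not taken from the patch: a minimal userspace sketch
of how a delta between two reads of a free-running counter narrower than
64 bits can be computed so that a single wrap-around is handled. The
helper names are hypothetical; the mask mirrors the BIT_ULL(nr) - 1 form
of the HISI_MAX_PERIOD() macro in patch 2/6.]

```c
#include <assert.h>
#include <stdint.h>

/* Largest value a counter of the given width can hold (bits in 1..63). */
static inline uint64_t counter_max_period(unsigned int bits)
{
	assert(bits >= 1 && bits <= 63);
	return (UINT64_C(1) << bits) - 1;
}

/*
 * Events counted between two raw reads of a free-running counter.
 * The unsigned subtraction wraps modulo 2^64; masking to the counter
 * width then yields the right delta across one wrap-around.
 */
static inline uint64_t counter_delta(uint64_t prev, uint64_t now,
				     unsigned int bits)
{
	return (now - prev) & counter_max_period(bits);
}
```

For a 48-bit counter, a previous read of 0xFFFFFFFFFFF0 followed by a
read of 0x10 gives a delta of 0x20 in this sketch.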

Reviewed-by: Jonathan Cameron 
Signed-off-by: Shaokun Zhang 
Signed-off-by: Anurup M 
---
 drivers/perf/hisilicon/Makefile  |   2 +-
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 463 +++
 include/linux/cpuhotplug.h   |   1 +
 3 files changed, 465 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 2783bb3..4a3d3e6 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
new file mode 100644
index 000..0bde5d9
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
@@ -0,0 +1,463 @@
+/*
+ * HiSilicon SoC L3C uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Anurup M 
+ * Shaokun Zhang 
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+/* L3C register definition */
+#define L3C_PERF_CTRL		0x0408
+#define L3C_INT_MASK		0x0800
+#define L3C_INT_STATUS		0x0808
+#define L3C_INT_CLEAR		0x080c
+#define L3C_EVENT_CTRL		0x1c00
+#define L3C_EVENT_TYPE0		0x1d00
+/*
+ * Each counter is 48-bits and [48:63] are reserved
+ * which are Read-As-Zero and Writes-Ignored.
+ */
+#define L3C_CNTR0_LOWER		0x1e00
+
+/* L3C has 8-counters */
+#define L3C_NR_COUNTERS		0x8
+
+#define L3C_PERF_CTRL_EN	0x2
+#define L3C_EVTYPE_NONE		0xff
+
+/*
+ * Select the counter register offset using the counter index
+ */
+static u32 hisi_l3c_pmu_get_counter_offset(int cntr_idx)
+{
+   return (L3C_CNTR0_LOWER + (cntr_idx * 8));
+}
+
+static u64 hisi_l3c_pmu_read_counter(struct hisi_pmu *l3c_pmu,
+struct hw_perf_event *hwc)
+{
+   u32 idx = hwc->idx;
+
+   if (!hisi_uncore_pmu_counter_valid(l3c_pmu, idx)) {
+   dev_err(l3c_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return 0;
+   }
+
+   /* Read 64-bits and the upper 16 bits are RAZ */
+   return readq(l3c_pmu->base + hisi_l3c_pmu_get_counter_offset(idx));
+}
+
+static void hisi_l3c_pmu_write_counter(struct hisi_pmu *l3c_pmu,
+  struct hw_perf_event *hwc, u64 val)
+{
+   u32 idx = hwc->idx;
+
+   if (!hisi_uncore_pmu_counter_valid(l3c_pmu, idx)) {
+   dev_err(l3c_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return;
+   }
+
+   /* Write 64-bits and the upper 16 bits are WI */
+   writeq(val, l3c_pmu->base + hisi_l3c_pmu_get_counter_offset(idx));
+}
+
+static void hisi_l3c_pmu_write_evtype(struct hisi_pmu *l3c_pmu, int idx,
+ u32 type)
+{
+   u32 reg, reg_idx, shift, val;
+
+   /*
+* Select the appropriate event select register(L3C_EVENT_TYPE0/1).
+* There are 2 event select registers for the 8 hardware counters.
+* Event code is 8-bits and for the former 4 hardware counters,
+* L3C_EVENT_TYPE0 is chosen. For the latter 4 hardware counters,
+* L3C_EVENT_TYPE1 is chosen.
+*/
+   reg = L3C_EVENT_TYPE0 + (idx / 4) * 4;
+   reg_idx = idx % 4;
+   shift = 8 * reg_idx;
+
+   /* Write event code to L3C_EVENT_TYPEx Register */
+   val = readl(l3c_pmu->base + reg);
+   val &= ~(L3C_EVTYPE_NONE << shift);
+   val |= (type << shift);
+   writel(val, l3c_pmu->base + reg);
+}
+
+static void hisi_l3c_pmu_start_counters(struct hisi_pmu *l3c_pmu)
+{
+   u32 val;
+
+   /*
+* Set perf_enable bit in L3C_PERF_CTRL register to start counting
+* for all enabled counters.
+*/
+   val = readl(l3c_pmu->base + L3C_PERF_CTRL);
+   val |= L3C_PERF_CTRL_EN;
+   writel(val, l3c_pmu->base + L3C_PERF_CTRL);
+}
+
+static void hisi_l3c_pmu_stop_counters(struct hisi_pmu *l3c_pmu)
+{
+   u32 val;
+
+   /*
+* Clear perf_enable bit in L3C_PERF_CTRL register to stop counting
+* for all enabled counters.
+*/
+   val = readl(l3c_pmu->base + L3

[PATCH v6 6/6] arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support

2017-10-19 Thread Shaokun Zhang
Add a MAINTAINERS entry for the HiSilicon SoC uncore PMU driver.

Signed-off-by: Shaokun Zhang 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a74227a..96c583c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6242,6 +6242,13 @@ S:   Maintained
 F: drivers/net/ethernet/hisilicon/
 F: Documentation/devicetree/bindings/net/hisilicon*.txt
 
+HISILICON PMU DRIVER
+M: Shaokun Zhang 
+W: http://www.hisilicon.com
+S: Supported
+F: drivers/perf/hisilicon
+F: Documentation/perf/hisi-pmu.txt
+
 HISILICON ROCE DRIVER
 M: Lijun Ou 
 M: Wei Hu(Xavier) 
-- 
1.9.1



[PATCH v6 5/6] perf: hisi: Add support for HiSilicon SoC DDRC PMU driver

2017-10-19 Thread Shaokun Zhang
This patch adds support for the DDRC PMU driver in the HiSilicon SoC chip.
Each DDRC has its own control, counter and interrupt registers and is a
separate PMU. Each DDRC PMU has 8 fixed-purpose counters, which hardware
has mapped to 8 events, so the DDRC PMU driver assumes that the counter
index is equal to the event code (0 - 7). An interrupt is supported to
handle counter (32-bit) overflow.
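
[Illustration only, not the patch itself: a minimal userspace sketch of
the "event code equals counter index" mapping, with a bounds check on
the event code. The offsets are the DDRC_FLUX_* and DDRC_*_CMD/CHG
register offsets defined in this patch; the function and table names are
hypothetical.]

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-purpose statistics registers, indexed by event code 0..7. */
static const uint32_t ddrc_reg_off_sketch[] = {
	0x380, /* DDRC_FLUX_WR */
	0x384, /* DDRC_FLUX_RD */
	0x388, /* DDRC_FLUX_WCMD */
	0x38c, /* DDRC_FLUX_RCMD */
	0x3c0, /* DDRC_PRE_CMD */
	0x3c4, /* DDRC_ACT_CMD */
	0x3c8, /* DDRC_BNK_CHG */
	0x3cc, /* DDRC_RNK_CHG */
};

#define DDRC_SKETCH_NR_EVENTS \
	(sizeof(ddrc_reg_off_sketch) / sizeof(ddrc_reg_off_sketch[0]))

/*
 * Translate an event code into its counter register offset.
 * Returns 0 and stores the offset on success, -1 for an unknown code.
 */
static int ddrc_event_to_offset(uint32_t event_code, uint32_t *offset)
{
	if (event_code >= DDRC_SKETCH_NR_EVENTS)
		return -1;
	*offset = ddrc_reg_off_sketch[event_code];
	return 0;
}
```

Because the offsets are not evenly spaced (0x38c is followed by 0x3c0),
a lookup table is used here rather than a base-plus-stride computation.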

Reviewed-by: Jonathan Cameron 
Signed-off-by: Shaokun Zhang 
Signed-off-by: Anurup M 
---
 drivers/perf/hisilicon/Makefile   |   2 +-
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 ++
 include/linux/cpuhotplug.h|   1 +
 3 files changed, 465 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index a72afe8..2621d51 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o hisi_uncore_ddrc_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
new file mode 100644
index 000..1b10ea0
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
@@ -0,0 +1,463 @@
+/*
+ * HiSilicon SoC DDRC uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Shaokun Zhang 
+ * Anurup M 
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+/* DDRC register definition */
+#define DDRC_PERF_CTRL		0x010
+#define DDRC_FLUX_WR		0x380
+#define DDRC_FLUX_RD		0x384
+#define DDRC_FLUX_WCMD		0x388
+#define DDRC_FLUX_RCMD		0x38c
+#define DDRC_PRE_CMD		0x3c0
+#define DDRC_ACT_CMD		0x3c4
+#define DDRC_BNK_CHG		0x3c8
+#define DDRC_RNK_CHG		0x3cc
+#define DDRC_EVENT_CTRL		0x6C0
+#define DDRC_INT_MASK		0x6c8
+#define DDRC_INT_STATUS		0x6cc
+#define DDRC_INT_CLEAR		0x6d0
+
+/* DDRC has 8-counters */
+#define DDRC_NR_COUNTERS	0x8
+#define DDRC_PERF_CTRL_EN	0x2
+
+/*
+ * For the DDRC PMU there are eight events, each mapped to a
+ * fixed-purpose counter whose register offsets are not evenly spaced.
+ * Therefore there is no event type to write, and we assume that the
+ * event code (0 to 7) is equal to the counter index in the PMU driver.
+ */
+#define GET_DDRC_EVENTID(hwc)  (hwc->config_base & 0x7)
+
+static const u32 ddrc_reg_off[] = {
+   DDRC_FLUX_WR, DDRC_FLUX_RD, DDRC_FLUX_WCMD, DDRC_FLUX_RCMD,
+   DDRC_PRE_CMD, DDRC_ACT_CMD, DDRC_BNK_CHG, DDRC_RNK_CHG
+};
+
+/*
+ * Select the counter register offset using the counter index.
+ * In DDRC there are no programmable counters; the count
+ * is read from the statistics counter register itself.
+ */
+static u32 hisi_ddrc_pmu_get_counter_offset(int cntr_idx)
+{
+   return ddrc_reg_off[cntr_idx];
+}
+
+static u64 hisi_ddrc_pmu_read_counter(struct hisi_pmu *ddrc_pmu,
+ struct hw_perf_event *hwc)
+{
+   /* Use event code as counter index */
+   u32 idx = GET_DDRC_EVENTID(hwc);
+
+   if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) {
+   dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return 0;
+   }
+
+   return readl(ddrc_pmu->base + hisi_ddrc_pmu_get_counter_offset(idx));
+}
+
+static void hisi_ddrc_pmu_write_counter(struct hisi_pmu *ddrc_pmu,
+   struct hw_perf_event *hwc, u64 val)
+{
+   u32 idx = GET_DDRC_EVENTID(hwc);
+
+   if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) {
+   dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return;
+   }
+
+   writel((u32)val,
+  ddrc_pmu->base + hisi_ddrc_pmu_get_counter_offset(idx));
+}
+
+/*
+ * For DDRC PMU, event has been mapped to fixed-purpose counter by hardware,
+ * so there is no need to write event type.
+ */
+static void hisi_ddrc_pmu_write_evtype(struct hisi_pmu *ddrc_pmu, int idx,
+  u32 type)
+{
+}
+
+static void hisi_ddrc_pmu_start_counters(struct hisi_pmu *ddrc_pmu)
+{
+   u32 val;
+
+   /* Set perf_enable in DDRC_PERF_CTRL to start event counting */
+   val = readl(ddrc_pmu->base + DDRC_PERF_CTRL);
+   val |= DDRC_PERF_CTRL_EN;
+   writel(val, ddrc_pmu->base + DDRC_PERF_CTRL);
+}
+
+static void hisi_ddrc_pmu_stop_counters(st

[PATCH v6 4/6] perf: hisi: Add support for HiSilicon SoC HHA PMU driver

2017-10-19 Thread Shaokun Zhang
L3 cache coherence is maintained by the Hydra Home Agent (HHA) in the
HiSilicon SoC. This patch adds support for the HHA PMU driver. Each HHA
has its own control, counter and interrupt registers and is a separate
PMU. Each HHA PMU has 16 programmable counters, and each counter is
free-running. An interrupt is supported to handle counter (48-bit)
overflow.
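
[Illustration only, not the patch itself: a minimal userspace sketch of
how an 8-bit event code for one of the 16 programmable counters can be
placed into the right event-select register, following the layout the
patch describes in its comments (four counters per 32-bit register,
HHA_EVENT_TYPE0 at 0x1E80). The struct and function names are
hypothetical.]

```c
#include <stdint.h>

#define HHA_SKETCH_EVENT_TYPE0	0x1E80u
#define HHA_SKETCH_EVTYPE_MASK	0xffu

struct evtype_slot {
	uint32_t reg;   /* offset of the HHA_EVENT_TYPEx register */
	uint32_t shift; /* bit position of this counter's 8-bit field */
};

/* Locate the event-select field for counter 'idx' (0..15). */
static struct evtype_slot hha_evtype_slot(unsigned int idx)
{
	struct evtype_slot slot;

	slot.reg = HHA_SKETCH_EVENT_TYPE0 + 4 * (idx / 4);
	slot.shift = 8 * (idx % 4);
	return slot;
}

/* Read-modify-write step: replace one 8-bit field, keep the other three. */
static uint32_t evtype_update(uint32_t old, uint32_t shift, uint32_t type)
{
	old &= ~(HHA_SKETCH_EVTYPE_MASK << shift);
	old |= (type & HHA_SKETCH_EVTYPE_MASK) << shift;
	return old;
}
```

For example, counter 5 lands in the second register (0x1E84) at bit 8,
so writing event code 0x12 over 0xffffffff yields 0xffff12ff.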

Reviewed-by: Jonathan Cameron 
Signed-off-by: Shaokun Zhang 
Signed-off-by: Anurup M 
---
 drivers/perf/hisilicon/Makefile  |   2 +-
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c | 473 +++
 include/linux/cpuhotplug.h   |   1 +
 3 files changed, 475 insertions(+), 1 deletion(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c

diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
index 4a3d3e6..a72afe8 100644
--- a/drivers/perf/hisilicon/Makefile
+++ b/drivers/perf/hisilicon/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o
+obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o
diff --git a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
new file mode 100644
index 000..443906e
--- /dev/null
+++ b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
@@ -0,0 +1,473 @@
+/*
+ * HiSilicon SoC HHA uncore Hardware event counters support
+ *
+ * Copyright (C) 2017 Hisilicon Limited
+ * Author: Shaokun Zhang 
+ * Anurup M 
+ *
+ * This code is based on the uncore PMUs like arm-cci and arm-ccn.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+/* HHA register definition */
+#define HHA_INT_MASK		0x0804
+#define HHA_INT_STATUS		0x0808
+#define HHA_INT_CLEAR		0x080C
+#define HHA_PERF_CTRL		0x1E00
+#define HHA_EVENT_CTRL		0x1E04
+#define HHA_EVENT_TYPE0		0x1E80
+/*
+ * Each counter is 48-bits and [48:63] are reserved
+ * which are Read-As-Zero and Writes-Ignored.
+ */
+#define HHA_CNT0_LOWER		0x1F00
+
+/* HHA has 16-counters */
+#define HHA_NR_COUNTERS		0x10
+
+#define HHA_PERF_CTRL_EN	0x1
+#define HHA_EVTYPE_NONE		0xff
+
+/*
+ * Select the counter register offset using the counter index
+ * each counter is 48-bits.
+ */
+static u32 hisi_hha_pmu_get_counter_offset(int cntr_idx)
+{
+   return (HHA_CNT0_LOWER + (cntr_idx * 8));
+}
+
+static u64 hisi_hha_pmu_read_counter(struct hisi_pmu *hha_pmu,
+struct hw_perf_event *hwc)
+{
+   u32 idx = hwc->idx;
+
+   if (!hisi_uncore_pmu_counter_valid(hha_pmu, idx)) {
+   dev_err(hha_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return 0;
+   }
+
+   /* Read 64 bits and like L3C, top 16 bits are RAZ */
+   return readq(hha_pmu->base + hisi_hha_pmu_get_counter_offset(idx));
+}
+
+static void hisi_hha_pmu_write_counter(struct hisi_pmu *hha_pmu,
+  struct hw_perf_event *hwc, u64 val)
+{
+   u32 idx = hwc->idx;
+
+   if (!hisi_uncore_pmu_counter_valid(hha_pmu, idx)) {
+   dev_err(hha_pmu->dev, "Unsupported event index:%d!\n", idx);
+   return;
+   }
+
+   /* Write 64 bits and like L3C, top 16 bits are WI */
+   writeq(val, hha_pmu->base + hisi_hha_pmu_get_counter_offset(idx));
+}
+
+static void hisi_hha_pmu_write_evtype(struct hisi_pmu *hha_pmu, int idx,
+ u32 type)
+{
+   u32 reg, reg_idx, shift, val;
+
+   /*
+* Select the appropriate event select register(HHA_EVENT_TYPEx).
+* There are 4 event select registers for the 16 hardware counters.
+* Event code is 8-bits and for the first 4 hardware counters,
+* HHA_EVENT_TYPE0 is chosen. For the next 4 hardware counters,
+* HHA_EVENT_TYPE1 is chosen and so on.
+*/
+   reg = HHA_EVENT_TYPE0 + 4 * (idx / 4);
+   reg_idx = idx % 4;
+   shift = 8 * reg_idx;
+
+   /* Write event code to HHA_EVENT_TYPEx register */
+   val = readl(hha_pmu->base + reg);
+   val &= ~(HHA_EVTYPE_NONE << shift);
+   val |= (type << shift);
+   writel(val, hha_pmu->base + reg);
+}
+
+static void hisi_hha_pmu_start_counters(struct hisi_pmu *hha_pmu)
+{
+   u32 val;
+
+   /*
+* Set perf_enable bit in HHA_PERF_CTRL to start event
+* counting for all enabled counters.
+*/
+   val = readl(hha_pmu->base + HHA_PERF_CTRL);
+   val |= HHA_PERF_CTRL_EN;
+   writel(val, hha_pmu->base + HHA_PERF_CTRL);
+}
+
+static void hisi_hha_pmu_stop_counters(struct hisi_pmu *hha_pmu)
+{
+   u32 val;
+
+   /*
+* Clear perf_enable bit 

[PATCH v6 0/6] Add HiSilicon SoC uncore Performance Monitoring Unit driver

2017-10-19 Thread Shaokun Zhang
This patchset adds support for the HiSilicon SoC uncore PMU drivers. It
covers L3C, Hydra Home Agent (HHA) and DDRC.

Changes in v6:
* remove redundant member hisi_pmu::oneline_cpus
* rename member hisi_pmu::id
* add event code check when event init
* fix online/offline notifier for L3C/HHA/DDRC

Changes in v5:
* remove unnecessary name/num_events member in hisi_pmu
* refactor hisi_pmu_hwevents structure
* remove hisi_pmu_alloc function
* revise cpuhotplug for L3C PMUs
* add cpuhotplug for HHA/DDRC PMUs
* fix the name format of uncore PMUs
* remove unnecessary variants

Changes in v4:
* remove redundant code and comments
* reverse the functions order in exit function
* remove some GPL information
* revise including header file
* fix Jonathan's other comments

Changes in v3:
* rebase to 4.13-rc1
* add dev_err if ioremap fails for PMUs
 
Changes in v2:
* fix kbuild test robot error
* make hisi_uncore_ops static

Shaokun Zhang (6):
  Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver
  perf: hisi: Add support for HiSilicon SoC uncore PMU driver
  perf: hisi: Add support for HiSilicon SoC L3C PMU driver
  perf: hisi: Add support for HiSilicon SoC HHA PMU driver
  perf: hisi: Add support for HiSilicon SoC DDRC PMU driver
  arm64: MAINTAINERS: hisi: Add HiSilicon SoC PMU support

 Documentation/perf/hisi-pmu.txt   |  53 +++
 MAINTAINERS   |   7 +
 drivers/perf/Kconfig  |   7 +
 drivers/perf/Makefile |   1 +
 drivers/perf/hisilicon/Makefile   |   1 +
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 463 +
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c  | 473 ++
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c  | 463 +
 drivers/perf/hisilicon/hisi_uncore_pmu.c  | 444 
 drivers/perf/hisilicon/hisi_uncore_pmu.h  | 102 ++
 include/linux/cpuhotplug.h|   3 +
 11 files changed, 2017 insertions(+)
 create mode 100644 Documentation/perf/hisi-pmu.txt
 create mode 100644 drivers/perf/hisilicon/Makefile
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_pmu.h

-- 
1.9.1



[PATCH v6 1/6] Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver

2017-10-19 Thread Shaokun Zhang
This patch adds documentation for the uncore PMUs on HiSilicon SoC.

Reviewed-by: Jonathan Cameron 
Signed-off-by: Shaokun Zhang 
Signed-off-by: Anurup M 
---
 Documentation/perf/hisi-pmu.txt | 53 +
 1 file changed, 53 insertions(+)
 create mode 100644 Documentation/perf/hisi-pmu.txt

diff --git a/Documentation/perf/hisi-pmu.txt b/Documentation/perf/hisi-pmu.txt
new file mode 100644
index 000..267a028
--- /dev/null
+++ b/Documentation/perf/hisi-pmu.txt
@@ -0,0 +1,53 @@
+HiSilicon SoC uncore Performance Monitoring Unit (PMU)
+======================================================
+The HiSilicon SoC chip includes various independent system device PMUs
+such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
+independent and have hardware logic to gather statistics and performance
+information.
+
+The HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster
+(CCL) is made up of 4 cpu cores sharing one L3 cache; each CPU die is
+called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has
+two HHAs (0 - 1) and four DDRCs (0 - 3), respectively.
+
+HiSilicon SoC uncore PMU driver
+-------------------------------
+Each device PMU has separate registers for event counting, control and
+interrupt, and the PMU driver shall register perf PMU drivers like L3C,
+HHA and DDRC etc. The available events and configuration options shall
+be described in the sysfs, see :
+/sys/devices/hisi_sccl{X}_/, or
+/sys/bus/event_source/devices/hisi_sccl{X}_.
+The "perf list" command shall list the available events from sysfs.
+
+Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU
+name will appear in event listing as hisi_sccl_module.
+where "sccl-id" is the identifier of the SCCL and "index-id" is the index of
+module.
+e.g. hisi_sccl3_l3c0/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 in
+SCCL ID #3.
+e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in
+SCCL ID #1.
+
+The driver also provides a "cpumask" sysfs attribute, which shows the CPU core
+ID used to count the uncore PMU event.
+
+Example usage of perf:
+$# perf list
+hisi_sccl3_l3c0/rd_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl3_l3c0/wr_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl1_l3c0/rd_hit_cpipe/ [kernel PMU event]
+--
+hisi_sccl1_l3c0/wr_hit_cpipe/ [kernel PMU event]
+--
+
+$# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5
+$# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5
+
+The current driver does not support sampling, so "perf record" is unsupported.
+Attaching to a task is also unsupported, as the events are all uncore.
+
+Note: Please contact the maintainer for a complete list of events supported for
+the PMU devices in the SoC and its information if needed.
-- 
1.9.1



[PATCH 3/4] document: change the document for the extended movable_node

2017-10-19 Thread Chao Fan
Document the extended movable_node=nn[KMG]@ss[KMG] kernel parameter.
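
[Illustration only, not part of this series: a minimal userspace sketch
of parsing the nn[KMG]@ss[KMG] syntax, including the comma-delimited list
and the limit of 4 regions. All names here are hypothetical, and in-kernel
code would more likely rely on memparse() than on strtoull().]

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_REGIONS 4 /* "we support 4 regions at most now" */

struct mem_region {
	uint64_t start; /* ss */
	uint64_t size;  /* nn */
};

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	if (e == s) { /* no digits at all */
		*end = s;
		return 0;
	}
	switch (*e) {
	case 'G': case 'g':
		v <<= 10; /* fall through */
	case 'M': case 'm':
		v <<= 10; /* fall through */
	case 'K': case 'k':
		v <<= 10;
		e++;
		break;
	default:
		break;
	}
	*end = e;
	return v;
}

/* Returns the number of regions parsed into 'out', or -1 on bad input. */
static int parse_movable_node(const char *arg, struct mem_region *out)
{
	const char *p = arg;
	int n = 0;

	while (*p) {
		const char *end;
		uint64_t size, start;

		if (n == SKETCH_MAX_REGIONS)
			return -1;
		size = parse_size(p, &end);
		if (end == p || *end != '@')
			return -1;
		p = end + 1;
		start = parse_size(p, &end);
		if (end == p)
			return -1;
		out[n].start = start;
		out[n].size = size;
		n++;
		if (*end == ',') {
			p = end + 1;
			if (!*p)
				return -1; /* trailing comma */
		} else if (*end == '\0') {
			p = end;
		} else {
			return -1;
		}
	}
	return n;
}
```

With the example from the documentation, "100M@2G,1G@4G" yields two
regions in this sketch: 100 MiB starting at 2 GiB and 1 GiB starting at
4 GiB.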

Cc: Jonathan Corbet 
Cc: linux-doc@vger.kernel.org
Signed-off-by: Chao Fan 
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ead7f4066ea4..226560667d84 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2332,6 +2332,15 @@
allocations which rules out almost all kernel
allocations. Use with caution!
 
+   movable_node=nn[KMG]@ss[KMG]
+   [KNL] Force usage of a specific region of memory.
+   Extend movable_node to work well with KASLR.
+   Region of memory in immovable node is from ss to ss+nn.
+   Multiple regions can be specified, comma delimited.
+   Notice: we support 4 regions at most now.
+   Example:
+   movable_node=100M@2G,1G@4G
+
MTD_Partition=  [MTD]
Format: ,,,
 
-- 
2.13.6





Re: [PATCH 0/12] PM / sleep: Driver flags for system suspend/resume

2017-10-19 Thread Ulf Hansson
On 18 October 2017 at 23:48, Rafael J. Wysocki  wrote:
> On Wednesday, October 18, 2017 9:45:11 PM CEST Grygorii Strashko wrote:
>>
>> On 10/18/2017 09:11 AM, Ulf Hansson wrote:
>
> [...]
>
>> >>> That's the point. We know pm_runtime_force_* works nicely for the
>> >>> trivial middle-layer cases.
>> >>
>> >> In which cases the middle-layer callbacks don't exist, so it's just like
>> >> reusing driver callbacks directly. :-)
>>
>> I'd like to ask you to clarify one point here and provide some info which I
>> hope can be useful:
>> what exactly does "trivial middle-layer cases" mean?
>>
>> Is it when systems use "drivers/base/power/clock_ops.c - Generic clock
>> manipulation PM callbacks" as dev_pm_domain (arm davinci/keystone), or OMAP
>> device framework struct dev_pm_domain omap_device_pm_domain
>> (arm/mach-omap2/omap_device.c) or static const struct dev_pm_ops
>> tegra_aconnect_pm_ops?
>>
>> If yes, all of the above have runtime PM callbacks.
>
> Trivial ones don't actually do anything meaningful in their PM callbacks.
>
> Things like the platform bus type, spi bus type, i2c bus type and similar.
>
> If the middle-layer callbacks manipulate devices in a significant way, then
> they aren't trivial.

I fully agree with Rafael's description above, but let me also clarify
one more thing.

We have also been discussing PM domains as being trivial and
non-trivial. In some statements I even think the PM domain has been
treated as part of the middle-layer terminology, which may have been a
bit confusing.

In this regard, as we consider genpd to be a trivial PM domain, the
examples you bring up above are to me also examples of trivial PM
domains, especially because they don't deal with wakeups, as that is
taken care of by the drivers, right?

Kind regards
Uffe


Re: [Update][PATCH v2 01/12] PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags

2017-10-19 Thread Greg Kroah-Hartman
On Thu, Oct 19, 2017 at 01:17:31AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The motivation for this change is to provide a way to work around
> a problem with the direct-complete mechanism used for avoiding
> system suspend/resume handling for devices in runtime suspend.
> 
> The problem is that some middle layer code (the PCI bus type and
> the ACPI PM domain in particular) returns positive values from its
> system suspend ->prepare callbacks regardless of whether the driver's
> ->prepare returns a positive value or 0, which effectively prevents
> drivers from being able to control the direct-complete feature.
> Some drivers need that control, however, and the PCI bus type has
> grown its own flag to deal with this issue, but since it is not
> limited to PCI, it is better to address it by adding driver flags at
> the core level.
> 
> To that end, add a driver_flags field to struct dev_pm_info for flags
> that can be set by device drivers at probe time to inform the PM
> core and/or bus types, PM domains and so on of the capabilities and/or
> preferences of device drivers.  Also add two static inline helpers
> for setting that field and testing it against a given set of flags
> and make the driver core clear it automatically on driver remove
> and probe failures.
> 
> Define and document two PM driver flags related to the direct-
> complete feature: NEVER_SKIP and SMART_PREPARE that can be used,
> respectively, to indicate to the PM core that the direct-complete
> mechanism should never be used for the device and to inform the
> middle layer code (bus types, PM domains etc) that it can only
> request the PM core to use the direct-complete mechanism for
> the device (by returning a positive value from its ->prepare
> callback) if it also has been requested by the driver.
> 
> While at it, make the core check pm_runtime_suspended() when
> setting power.direct_complete so that it doesn't need to be
> checked by ->prepare callbacks.
> 
> Signed-off-by: Rafael J. Wysocki 

Acked-by: Greg Kroah-Hartman 
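
To make the changelog above easier to follow, here is a small self-contained model of the described behavior, compiled in user space rather than taken from the patch. All names (FLAG_NEVER_SKIP, FLAG_SMART_PREPARE, set_driver_flags(), test_driver_flags(), middle_layer_prepare(), core_prepare()) are hypothetical stand-ins, and the structs are reduced to the few fields the description mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values modeled on the NEVER_SKIP/SMART_PREPARE text. */
#define FLAG_NEVER_SKIP		(1u << 0)
#define FLAG_SMART_PREPARE	(1u << 1)

struct pm_info {
	unsigned int driver_flags;	/* set by the driver at probe time */
	bool runtime_suspended;		/* stand-in for pm_runtime_suspended() */
	bool direct_complete;		/* result computed by the core */
};

struct device {
	struct pm_info power;
	int (*drv_prepare)(struct device *dev);	/* driver's ->prepare */
};

/* The two helpers: one sets the field, one tests it against a flag mask. */
static inline void set_driver_flags(struct device *dev, unsigned int flags)
{
	dev->power.driver_flags = flags;
}

static inline bool test_driver_flags(struct device *dev, unsigned int flags)
{
	return (dev->power.driver_flags & flags) != 0;
}

/* Sample driver callbacks returning 0 and a positive value. */
static int drv_prepare_zero(struct device *dev) { (void)dev; return 0; }
static int drv_prepare_one(struct device *dev) { (void)dev; return 1; }

/*
 * Middle layer (bus type or PM domain): returns a positive value by
 * default, but with SMART_PREPARE set it does so only if the driver's
 * own ->prepare also returned a positive value.
 */
static int middle_layer_prepare(struct device *dev)
{
	if (dev->drv_prepare) {
		int ret = dev->drv_prepare(dev);

		if (ret < 0)
			return ret;
		if (ret == 0 && test_driver_flags(dev, FLAG_SMART_PREPARE))
			return 0;
	}
	return 1;
}

/*
 * Core: direct-complete is used only if the middle layer asked for it,
 * the device is runtime-suspended, and NEVER_SKIP is not set.
 */
static int core_prepare(struct device *dev)
{
	int ret;

	assert(dev != NULL);
	ret = middle_layer_prepare(dev);
	if (ret < 0)
		return ret;

	dev->power.direct_complete = ret > 0 &&
		dev->power.runtime_suspended &&
		!test_driver_flags(dev, FLAG_NEVER_SKIP);
	return 0;
}
```

In this model, a runtime-suspended device whose driver returns 0 from ->prepare still ends up with direct_complete set when no flags are used, which is the problem the changelog describes. Setting FLAG_SMART_PREPARE hands that decision back to the driver, and FLAG_NEVER_SKIP rules out direct-complete regardless of what the middle layer returns.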