date:20171005

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 15:02:18, David Rientjes wrote:
[...]
> I would need to add patches to add the "evaluate as a whole but do not 
> kill all" knob and a knob for "oom priority" so that userspace has the 
> same influence over a cgroup based comparison that it does with a process 
> based comparison to meet business goals.

I do not think 2 knobs would be really necessary for your usecase. If we
allow priorities on non-leaf memcgs then a non 0 priority on such a
memcg would mean that we have to check the cumulative consumption. You
can safely use kill all knob on top of that if you need.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] PM: docs: Fix stale reference in kernel-parameters.txt

2017-10-05 Thread Rafael J. Wysocki

From: Rafael J. Wysocki 

Commit 7aa7a0360a66 (PM: docs: Delete the obsolete states.txt
document) forgot to update kernel-parameters.txt with a reference
to the new sleep-states.rst document and it still points to
states.txt that was dropped, so fix it now.

Fixes: 7aa7a0360a66 (PM: docs: Delete the obsolete states.txt document)
Signed-off-by: Rafael J. Wysocki 
---
 Documentation/admin-guide/kernel-parameters.txt |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-pm/Documentation/admin-guide/kernel-parameters.txt
===
--- linux-pm.orig/Documentation/admin-guide/kernel-parameters.txt
+++ linux-pm/Documentation/admin-guide/kernel-parameters.txt
@@ -2248,7 +2248,7 @@
s2idle  - Suspend-To-Idle
shallow - Power-On Suspend or equivalent (if supported)
deep- Suspend-To-RAM or equivalent (if supported)
-   See Documentation/power/states.txt.
+   See Documentation/admin-guide/pm/sleep-states.rst.
 
meye.*= [HW] Set MotionEye Camera parameters
See Documentation/video4linux/meye.txt.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] dt-bindings: hwmon: Add max-expected-current property to ina2xx

2017-10-05 Thread Rob Herring

On Thu, Oct 05, 2017 at 05:15:07PM +0200, Maciej Purski wrote:
> On 10/01/2017 12:31 PM, Jonathan Cameron wrote:
> > On Thu, 28 Sep 2017 14:50:14 +0200
> > Maciej Purski  wrote:
> > 
> > > Add optional max expected current property which allows calibrating
> > > the ina sensor in order to achieve requested precision. Document
> > > the changes in Documentation/hwmon/ina2xx.
> > > 
> > 
> > This is introducing new generic devicetree bindings..
> > I'm not totally sure we want to do this rather than having a
> > manufacturer specific binding...  I don't have a good feeling for
> > how common this will be in other devices.
> > 
> > If it's generic, then this isn't necessarily the place to define it.
> > 
> > Jonathan
> > 
> 
> I agree, this should be manufacturer-specific property. Thanks.
> 
> > > Signed-off-by: Maciej Purski 
> > > ---
> > >   Documentation/devicetree/bindings/hwmon/ina2xx.txt | 4 +++-
> > >   Documentation/hwmon/ina2xx | 9 +
> > >   2 files changed, 8 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/Documentation/devicetree/bindings/hwmon/ina2xx.txt 
> > > b/Documentation/devicetree/bindings/hwmon/ina2xx.txt
> > > index 02af0d9..25ba372 100644
> > > --- a/Documentation/devicetree/bindings/hwmon/ina2xx.txt
> > > +++ b/Documentation/devicetree/bindings/hwmon/ina2xx.txt
> > > @@ -14,11 +14,13 @@ Optional properties:
> > >   - shunt-resistor
> > >   Shunt resistor value in micro-Ohm
> > > -
> > > +- max-expected-current
> > > + Max expected current value in mA

Also needs a unit suffix as defined in property-units.txt.

> > >   Example:
> > >   ina220@44 {
> > >   compatible = "ti,ina220";
> > >   reg = <0x44>;
> > >   shunt-resistor = <1000>;
> > > + max-expected-current = <3000>;
> > >   };
> > > diff --git a/Documentation/hwmon/ina2xx b/Documentation/hwmon/ina2xx
> > > index cfd31d9..e9ca838 100644
> > > --- a/Documentation/hwmon/ina2xx
> > > +++ b/Documentation/hwmon/ina2xx
> > > @@ -51,10 +51,11 @@ INA230 and INA231 are high or low side current shunt 
> > > and power monitors
> > >   with an I2C interface. The chips monitor both a shunt voltage drop and
> > >   bus supply voltage.
> > > -The shunt value in micro-ohms can be set via platform data or device 
> > > tree at
> > > -compile-time or via the shunt_resistor attribute in sysfs at run-time. 
> > > Please
> > > -refer to the Documentation/devicetree/bindings/i2c/ina2xx.txt for 
> > > bindings
> > > -if the device tree is used.
> > > +The shunt value in micro-ohms and max expected current in mA can be set
> > > +via platform data or device tree at compile-time or via the 
> > > shunt_resistor
> > > +and max_expected_current attributes in sysfs at run-time. Please refer 
> > > to the
> > > +Documentation/devicetree/bindings/i2c/ina2xx.txt for bindings if the
> > > +device tree is used.
> > >   Additionally ina226 supports update_interval attribute as described in
> > >   Documentation/hwmon/sysfs-interface. Internally the interval is the sum 
> > > of
> > 
> > 
> > 
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes

On Thu, 5 Oct 2017, Roman Gushchin wrote:

> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> Overcommit is real, but configuring the system in a way that system-wide OOM
> happens often is a strange idea.

I wouldn't consider 1% of the time to be often, but the incident rate 
depends on many variables and who is sharing the same machine.  We can be 
smart about it and limit the potential for it in many ways, but the end 
result is that we still do overcommit and the system oom killer can be 
used to free memory from a low priority process.

> As we all know, the system can barely work
> adequate under global memory shortage: network packets are dropped, latency
> is bad, weird kernel issues are revealed periodically, etc.
> I do not see, why you can't overcommit on deeper layers of cgroup hierarchy,
> avoiding system-wide OOM to happen.
> 

Whether it's a system oom or whether its part of the cgroup hierarchy 
doesn't really matter, what matters is that overcommit occurs and we'd 
like to kill based on cgroup usage for each cgroup and its subtree, much 
like your earlier iterations, and also have the ability for userspace to 
influence that.

Without a cgroup-aware oom killer, I can prefer against killing an 
important job that uses 80% of memory and I want it to continue using 80% 
of memory.  We don't have that control over the cgroup-aware oom killer 
although we want to consider cgroup and subtree usage when choosing 
amongst cgroups with the same priority.  If you are not interested in 
defining the oom priority, all can remain at the default and there is no 
compatibility issue.

> > I know exactly why earlier versions of this patchset iterated that usage 
> > up the tree so you would pick from students, pick from this troublemaking 
> > student, and then oom kill from his hierarchy.  Roman has made that point 
> > himself.  My suggestion was to add userspace influence to it so that 
> > enterprise users and users with business goals can actually define that we 
> > really do want 80% of memory to be used by this process or this hierarchy, 
> > it's in our best interest.
> 
> I'll repeat myself: I believe that there is a range of possible policies:
> from a complete flat (what Johannes did suggested few weeks ago), to a very
> hierarchical (as in v8). Each with their pros and cons.
> (Michal did provide a clear example of bad behavior of the hierarchical 
> approach).
> 
> I assume, that v10 is a good middle point, and it's good because it doesn't
> prevent further development. Just for example, you can introduce a third state
> of oom_group knob, which will mean "evaluate as a whole, but do not kill all".
> And this is what will solve your particular case, right?
> 

I would need to add patches to add the "evaluate as a whole but do not 
kill all" knob and a knob for "oom priority" so that userspace has the 
same influence over a cgroup based comparison that it does with a process 
based comparison to meet business goals.  I'm not sure I'd be happy to 
pollute the mem cgroup v2 filesystem with such knobs when you can easily 
just not set the priority if you don't want to, and increase the priority 
if you have a leaf cgroup that should be preferred to be killed because of 
excess usage.  It has worked quite well in practice.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes

On Thu, 5 Oct 2017, Johannes Weiner wrote:

> > It is, because it can quite clearly be a DoSand was prevented with 
> > Roman's earlier design of iterating usage up the hierarchy and comparing 
> > siblings based on that criteria.  I know exactly why he chose that 
> > implementation detail early on, and it was to prevent cases such as this 
> > and to not let userspace hide from the oom killer.
> 
> This doesn't address how it's different from a single process
> following the same pattern right now.
> 

Are you referring to a single process being rewritten into N different 
subprocesses that do the same work as the single process but is separated 
in this manner to avoid having a large rss for any single process to avoid  
being oom killed?

This is solved by a cgroup-aware oom killer because these subprocesses 
should not be able to escape their own chargable entity.  It's exactly the 
usecase that Roman is addressing, correct?  My suggestion is to continue 
to iterate the usage up the hierarchy so that users can't easily defeat 
this by creating N subcontainers instead.

> > Let's resolve that global oom is a real condition and getting into that 
> > situation is not a userspace problem.  It's the result of overcommiting 
> > the system, and is used in the enterprise to address business goals.  If 
> > the above is true, and its up to memcg to prevent global oom in the first 
> > place, then this entire patchset is absolutely pointless.  Limit userspace 
> > to 95% of memory and when usage is approaching that limit, let userspace 
> > attached to the root memcg iterate the hierarchy itself and kill from the 
> > largest consumer.
> > 
> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> We overcommit too, but our workloads organize themselves based on
> managing their resources, not based on evading the OOM killer. I'd
> wager that's true for many if not most users.
> 

No workloads are based on evading the oom killer, we are specifically 
trying to avoid that with oom priorities.  They have the power over 
increasing their own priority to be preferred to kill, but not decreasing 
their oom priority that was set by an activity manager.  This is exactly 
the same as how /proc/pid/oom_score_adj works.  With a cgroup-aware oom 
killer, which we'd love, nothing can possibly evade the oom killer if 
there are oom priorities.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/2] kbuild: Add a cache for generated variables

2017-10-05 Thread Doug Anderson

Hi,

On Thu, Oct 5, 2017 at 12:26 AM, Masahiro Yamada
 wrote:
> As far as I tested, I always see only one space after ":=" in v2.
>
> I did not consider this deeply,
> but something is working nicely behind the scene.

Try adding this to the end of the main Makefile:

+$(info LDFLAGS_BUILD_ID = $(LDFLAGS_BUILD_ID))
+$(info KBUILD_ARFLAGS = $(KBUILD_ARFLAGS))
+$(info KBUILD_CFLAGS = $(KBUILD_CFLAGS))
+$(info KBUILD_AFLAGS = $(KBUILD_AFLAGS))
+$(info KBUILD_CPPFLAGS = $(KBUILD_CPPFLAGS))
+$(info REALMODE_CFLAGS = $(REALMODE_CFLAGS))
+
 endif  # skip-makefile

Record what you see.  Then apply my patches and run your build again
(actually, run it twice and look at the 2nd time, just to be sure).  I
think you'll see slightly different spacing in the flags for the two
runs.  I don't think this is terribly important, though.

-Doug
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] doc: Fix typo "8023.ad" in bonding documentation

2017-10-05 Thread Axel Beckert

Hi Jon,

thanks for the prompt reply.

On Thu, Oct 05, 2017 at 01:15:04PM -0600, Jonathan Corbet wrote:
> On Thu, 5 Oct 2017 20:26:24 +0200
> Axel Beckert  wrote:
> 
> > Should be "802.3ad" like everywhere else in the document.
> 
> Could you do me a favor, please, and submit this one to Dave Miller and
> netdev?  He prefers to handle networking docs patches himself.

Will do.

Might be an idea to document this in MAINTAINERS, i.e. adding an "X:
Documentation/networking/" and "F: Documentation/networking/" or so in
the appropriate places.

Kind regards, Axel
-- 
/~\  Plain Text Ribbon Campaign   | Axel Beckert
\ /  Say No to HTML in E-Mail and News| a...@deuxchevaux.org  (Mail)
 X   See http://www.nonhtmlmail.org/campaign.html | a...@noone.org (Mail+Jabber)
/ \  I love long mails: http://email.is-not-s.ms/ | http://abe.noone.org/ (Web)
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] doc: Fix typo "8023.ad" in bonding documentation

2017-10-05 Thread Jonathan Corbet

On Thu, 5 Oct 2017 20:26:24 +0200
Axel Beckert  wrote:

> Should be "802.3ad" like everywhere else in the document.

Could you do me a favor, please, and submit this one to Dave Miller and
netdev?  He prefers to handle networking docs patches himself.

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 1/7] media: add glossary.rst with a glossary of terms used at V4L2 spec

2017-10-05 Thread Mauro Carvalho Chehab

Em Thu, 5 Oct 2017 11:21:07 +0300
Sakari Ailus  escreveu:

> Hi Mauro,
> 
> My apologies for the late reply.
> 
> On Tue, Aug 29, 2017 at 10:07:50AM -0300, Mauro Carvalho Chehab wrote:
> > Em Tue, 29 Aug 2017 10:47:48 +0300
> > Sakari Ailus  escreveu:
> >   
> > > Hi Mauro,
> > > 
> > > Thanks for the update. A few comments below.
> > > 
> > > On Mon, Aug 28, 2017 at 09:53:55AM -0300, Mauro Carvalho Chehab wrote:  
> > > > Add a glossary of terms for V4L2, as several concepts are complex
> > > > enough to cause misunderstandings.
> > > > 
> > > > Signed-off-by: Mauro Carvalho Chehab 
> > > > ---
> > > >  Documentation/media/uapi/v4l/glossary.rst | 147 
> > > > ++
> > > >  Documentation/media/uapi/v4l/v4l2.rst |   1 +
> > > >  2 files changed, 148 insertions(+)
> > > >  create mode 100644 Documentation/media/uapi/v4l/glossary.rst
> > > > 
> > > > diff --git a/Documentation/media/uapi/v4l/glossary.rst 
> > > > b/Documentation/media/uapi/v4l/glossary.rst
> > > > new file mode 100644
> > > > index ..0b6ab5adec81
> > > > --- /dev/null
> > > > +++ b/Documentation/media/uapi/v4l/glossary.rst
> > > > @@ -0,0 +1,147 @@
> > > > +
> > > > +Glossary
> > > > +
> > > > +
> > > > +.. note::
> > > > +
> > > > +   This goal of section is to standardize the terms used within the 
> > > > V4L2
> > > > +   documentation. It is written incrementally as they are standardized 
> > > > in
> > > > +   the V4L2 documentation. So, it is a Work In Progress.
> > > 
> > > I'd leave the WiP part out.  
> > 
> > IMO, it is important to mention it, as the glossary, right now, covers
> > only what's used on the first two sections of the API book. There are
> > a lot more to be covered.  
> 
> Works for me.
> 
> >   
> > >   
> > > > +
> > > > +.. Please keep the glossary entries in alphabetical order
> > > > +
> > > > +.. glossary::
> > > > +
> > > > +Bridge driver
> > > > +   The same as V4L2 main driver.
> > > 
> > > I've understood bridges being essentially a bus receiver + DMA. Most ISPs
> > > contain both but have more than that. How about:
> > > 
> > > A driver for a bus (e.g. parallel, CSI-2) receiver and DMA. Bridge drivers
> > > typically act as V4L2 main drivers.  
> > 
> > No, only on some drivers the bridge driver has DMA. A vast amount of
> > drivers (USB ones) don't implement any DMA inside the driver, as it is
> > up to the USB host driver to implement support for DMA.
> > 
> > There are even some USB host drivers that don't always use DMA for I/O 
> > transfers, using direct I/O if the message is smaller than a threshold
> > or not multiple of the bus word. This is pretty common on SoC USB host
> > drivers.
> > 
> > In any case, for the effect of this spec, and for all discussions we
> > ever had about it, bridge driver == V4L2 main driver. I don't
> > see any reason why to distinguish between them.  
> 
> I think you should precisely define what a bridge driver means. Generally
> ISP drivers aren't referred to as bridge drivers albeit they, too, function
> as V4L2 main drivers.

Btw, this is already defined, currently, at v4l2-subdev.h:

 * Sub-devices are devices that are connected somehow to the main bridge
 * device. These devices are usually audio/video muxers/encoders/decoders or
 * sensors and webcam controllers.
 *
 * Usually these devices are controlled through an i2c bus, but other busses
 * may also be used.

Please notice that there it says: "main bridge" :-)

Such definition was added since the beginning of the subdev concept, back in
2008 and was reviewed by several V4L core developers:

commit 2a1fcdf08230522bd5024f91da24aaa6e8d81f59
Author: Hans Verkuil 
Date:   Sat Nov 29 21:36:58 2008 -0300

V4L/DVB (9820): v4l2: add v4l2_device and v4l2_subdev structs to the v4l2 
framework.

Start implementing a proper v4l2 framework as discussed during the
Linux Plumbers Conference 2008.

Introduces v4l2_device (for device instances) and v4l2_subdev (representing
sub-device instances).

Signed-off-by: Hans Verkuil 
Reviewed-by: Laurent Pinchart 
Reviewed-by: Guennadi Liakhovetski 
Reviewed-by: Andy Walls 
Reviewed-by: David Brownell 
Signed-off-by: Mauro Carvalho Chehab 


Thanks,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] doc: Fix typo "8023.ad" in bonding documentation

2017-10-05 Thread Axel Beckert

Should be "802.3ad" like everywhere else in the document.

Signed-off-by: Axel Beckert 
---
 Documentation/networking/bonding.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/bonding.txt 
b/Documentation/networking/bonding.txt
index 57f52cdce32e..9ba04c0bab8d 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -2387,7 +2387,7 @@ broadcast: Like active-backup, there is not much 
advantage to this
and packet type ID), so in a "gatewayed" configuration, all
outgoing traffic will generally use the same device.  Incoming
traffic may also end up on a single device, but that is
-   dependent upon the balancing policy of the peer's 8023.ad
+   dependent upon the balancing policy of the peer's 802.3ad
implementation.  In a "local" configuration, traffic will be
distributed across the devices in the bond.
 
-- 
2.14.2


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 10:54:01, Johannes Weiner wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > > 
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > > 
> > > That's not good from an interface perspective.
> > 
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
> 
> A boot time knob makes less sense to me than the mount option. It
> doesn't require a reboot to change this behavior, we shouldn't force
> the user to reboot when a runtime configuration is possible.

Do we need such a runtime configurability, though? If yes, what is the
usecase?

> But I don't see how dropping this patch as part of this series would
> prevent adding modular oom policies in the future?

I didn't say that dropping this patch would prevent further oom policies.
My point was that a command line option could be more generic to allow
more policies in future.

> That said, selectable OOM policies sound like a total deadend to
> me. The kernel OOM happens way too late to be useful for any kind of
> resource policy already. Even now it won't prevent you from thrashing
> indefinitely, with only 5% of your workload's time spent productively.
> 
> What kind of service quality do you have at this point?

The OOM killer is a disruptive operation which can be really costly
from the workload perspective (you are losing work) and as such the
victim selection really depends on the workload. Most of them are just
fine with the most rudimentary kill-the-largest approach but think of
workloads where the amount or type of work really matters much more
(think of a long running computational jobs taking weeks). We cannot
really handle all of those so I really expect that we will eventually
have to provide a way to allow different policies _somehow_.

> The *minority* of our OOM situations (in terms of "this isn't making
> real progress anymore due to a lack of memory") is even *seeing* OOM
> kills at this point. And it'll get worse as storage gets faster and
> memory bigger.

This is imho a separate problem which is independent on the oom victim
selection.

> How is that useful as a resource arbitration point?
> 
> Then there is the question of reliability. I mean, we still don't have
> a global OOM killer that is actually free from deadlocks.

Well, I believe that we should be deadlock free now.

> We don't
> have reserves measured to the exact requirements of reclaim that would
> guarantee recovery, the OOM reaper requires a lock that we hope isn't
> taken, etc. I wouldn't want any of my fleet to rely on this for
> regular operation - I'm just glad that, when we do mess up and hit
> this event, we don't have to reboot.
> 
> It makes much more sense to monitor memory pressure from userspace and
> smartly intervene when things turn unproductive, which is a long way
> from the point where the kernel is about to *deadlock* due to memory.

again this is independent on the oom selection policy.
 
> Global OOM kills can still happen, but their goal should really be 1)
> to save the kernel, 2) respect the integrity of a memory consumer and
> 3) be comprehensible to userspace. (These patches are about 2 and 3.)

I agree on these but I would add 4) make sure that the impact on the
system is acceptable/least disruptive possible.

> But abstracting such a rudimentary and fragile deadlock avoidance
> mechanism into higher-level resource management, or co-opting it as a
> policy enforcement tool, is crazy to me.
> 
> And it seems reckless to present it as those things to our users by
> encoding any such elaborate policy interfaces.
> 
> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > > 
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> > 
> > yes we have changed that without a deeper considerations. Some of those
> > changes are arguable (e.g. child scarification). The oom badness
> > heuristic rewrite has triggered quite some complains AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> 
>

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Tejun Heo

Hello, Michal.

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

While boot params are fine for development and debugging, as a
user-interface, they aren't great.

* The user can't easily confirm whether the config they input is
  correct and when they get it wrong what's wrong can be pretty
  mysterious.

* While kernel params can be made r/w through /proc, people usually
  don't expect that and using that can become really confusing because
  a lot of people use "dmesg|grep" to confirm the boot params and that
  won't agree with the setting written later.

* It can't be scoped.  What if we want to choose different policies
  per delegated subtree?

* Boot params aren't the easiest (again, if you're a developer,
  they're but most aren't developers) to play with and prone to cause
  deployment issues.

* In this case, even worse because it ends up silently ignoring a
  clearly explicit configuration in an interface file.

If the behavior differences we get from group oom code isn't critical
(and it doesn't seem to be), I'd greatly prefer just enabling it when
cgroup2 is in use.  If it absolutely must be opt-in even on cgroup2,
we can discuss other ways but I'd really like to see stronger
rationales before going that route.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] dt-bindings: hwmon: Add max-expected-current property to ina2xx

2017-10-05 Thread Maciej Purski


On 10/01/2017 12:31 PM, Jonathan Cameron wrote:

On Thu, 28 Sep 2017 14:50:14 +0200
Maciej Purski  wrote:


Add optional max expected current property which allows calibrating
the ina sensor in order to achieve requested precision. Document
the changes in Documentation/hwmon/ina2xx.



This is introducing new generic devicetree bindings..
I'm not totally sure we want to do this rather than having a
manufacturer specific binding...  I don't have a good feeling for
how common this will be in other devices.

If it's generic, then this isn't necessarily the place to define it.

Jonathan



I agree, this should be manufacturer-specific property. Thanks.


Signed-off-by: Maciej Purski 
---
  Documentation/devicetree/bindings/hwmon/ina2xx.txt | 4 +++-
  Documentation/hwmon/ina2xx | 9 +
  2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/hwmon/ina2xx.txt 
b/Documentation/devicetree/bindings/hwmon/ina2xx.txt
index 02af0d9..25ba372 100644
--- a/Documentation/devicetree/bindings/hwmon/ina2xx.txt
+++ b/Documentation/devicetree/bindings/hwmon/ina2xx.txt
@@ -14,11 +14,13 @@ Optional properties:
  
  - shunt-resistor

Shunt resistor value in micro-Ohm
-
+- max-expected-current
+   Max expected current value in mA
  Example:
  
  ina220@44 {

compatible = "ti,ina220";
reg = <0x44>;
shunt-resistor = <1000>;
+   max-expected-current = <3000>;
  };
diff --git a/Documentation/hwmon/ina2xx b/Documentation/hwmon/ina2xx
index cfd31d9..e9ca838 100644
--- a/Documentation/hwmon/ina2xx
+++ b/Documentation/hwmon/ina2xx
@@ -51,10 +51,11 @@ INA230 and INA231 are high or low side current shunt and 
power monitors
  with an I2C interface. The chips monitor both a shunt voltage drop and
  bus supply voltage.
  
-The shunt value in micro-ohms can be set via platform data or device tree at

-compile-time or via the shunt_resistor attribute in sysfs at run-time. Please
-refer to the Documentation/devicetree/bindings/i2c/ina2xx.txt for bindings
-if the device tree is used.
+The shunt value in micro-ohms and max expected current in mA can be set
+via platform data or device tree at compile-time or via the shunt_resistor
+and max_expected_current attributes in sysfs at run-time. Please refer to the
+Documentation/devicetree/bindings/i2c/ina2xx.txt for bindings if the
+device tree is used.
  
  Additionally ina226 supports update_interval attribute as described in

  Documentation/hwmon/sysfs-interface. Internally the interval is the sum of






--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Johannes Weiner

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

A boot time knob makes less sense to me than the mount option. It
doesn't require a reboot to change this behavior, we shouldn't force
the user to reboot when a runtime configuration is possible.

But I don't see how dropping this patch as part of this series would
prevent adding modular oom policies in the future?

That said, selectable OOM policies sound like a total deadend to
me. The kernel OOM happens way too late to be useful for any kind of
resource policy already. Even now it won't prevent you from thrashing
indefinitely, with only 5% of your workload's time spent productively.

What kind of service quality do you have at this point?

The *minority* of our OOM situations (in terms of "this isn't making
real progress anymore due to a lack of memory") is even *seeing* OOM
kills at this point. And it'll get worse as storage gets faster and
memory bigger.

How is that useful as a resource arbitration point?

Then there is the question of reliability. I mean, we still don't have
a global OOM killer that is actually free from deadlocks. We don't
have reserves measured to the exact requirements of reclaim that would
guarantee recovery, the OOM reaper requires a lock that we hope isn't
taken, etc. I wouldn't want any of my fleet to rely on this for
regular operation - I'm just glad that, when we do mess up and hit
this event, we don't have to reboot.

It makes much more sense to monitor memory pressure from userspace and
smartly intervene when things turn unproductive, which is a long way
from the point where the kernel is about to *deadlock* due to memory.

Global OOM kills can still happen, but their goal should really be 1)
to save the kernel, 2) respect the integrity of a memory consumer and
3) be comprehensible to userspace. (These patches are about 2 and 3.)

But abstracting such a rudimentary and fragile deadlock avoidance
mechanism into higher-level resource management, or co-opting it as a
policy enforcement tool, is crazy to me.

And it seems reckless to present it as those things to our users by
encoding any such elaborate policy interfaces.

> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. child scarification). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.

Whatever grumbling might have come up, it has not resulted in a revert
or a way to switch back to the old behavior. So I don't think this can
be considered an actual regression.

We change heuristics in the MM all the time. If you track for example
allocator behavior over different kernel versions, you can see how
much our caching policy, our huge page policy etc. fluctuates. The
impact of that is way bigger to regular workloads than how we go about
choosing an OOM victim.

We don't want to regress anybody, but let's also keep perspective here
and especially consider the userspace interfaces we are willing to put
in for at least the next few years, the promises we want to make, the
further fragmentation of the config space, for such a negligible risk.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v11 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Michal Hocko

Btw. here is how I would do the recursive oom badness. The diff is not
the nicest one because there is some code moving but the resulting code
is smaller and imho easier to grasp. Only compile tested though
---
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 085056e562b1..9cdba4682198 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p);
 int cgroup_init_early(void);
 int cgroup_init(void);
 
+static bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+   return cgrp->nr_populated_csets;
+}
+
 /*
  * Iteration helpers and macros.
  */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8dacf73ad57e..a2dd7e3ffe23 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id)
spin_unlock_bh(&cgroup_idr_lock);
 }
 
-static bool cgroup_has_tasks(struct cgroup *cgrp)
-{
-   return cgrp->nr_populated_csets;
-}
-
 bool cgroup_is_threaded(struct cgroup *cgrp)
 {
return cgrp->dom_cgrp != cgrp;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3848bce4c86..012b2216266f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2671,59 +2671,63 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
 }
 
 static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask,
- unsigned long totalpages)
+ const nodemask_t *nodemask)
 {
+   struct mem_cgroup *iter;
+   struct css_task_iter it;
+   struct task_struct *task;
long points = 0;
+   int eligible = 0;
int nid;
pg_data_t *pgdat;
 
-   /*
-* We don't have necessary stats for the root memcg,
-* so we define it's oom_score as the maximum oom_score
-* of the belonging tasks.
-*
-* As tasks in the root memcg unlikely are parts of a
-* single workload, and we don't have to implement
-* group killing, this approximation is reasonable.
-*
-* But if we will have necessary stats for the root memcg,
-* we might switch to the approach which is used for all
-* other memcgs.
-*/
-   if (memcg == root_mem_cgroup) {
-   struct css_task_iter it;
-   struct task_struct *task;
-   long score, max_score = 0;
-
+   for_each_mem_cgroup_tree(iter, memcg) {
+   /*
+* Memcg is OOM eligible if there are OOM killable tasks inside.
+*
+* We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+* as unkillable.
+*
+* If there are inflight OOM victim tasks inside the memcg,
+* we return -1.
+*/
css_task_iter_start(&memcg->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
-   score = oom_badness(task, memcg, nodemask,
-   totalpages);
-   if (score > max_score)
-   max_score = score;
+   if (!eligible &&
+   task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+   eligible = 1;
+
+   if (tsk_is_oom_victim(task) &&
+   !test_bit(MMF_OOM_SKIP, 
&task->signal->oom_mm->flags)) {
+   eligible = -1;
+   break;
+   }
}
css_task_iter_end(&it);
 
-   return max_score;
-   }
+   if (eligible <= 0) {
+   mem_cgroup_iter_break(memcg, iter);
+   points = -1;
+   break;
+   }
 
-   for_each_node_state(nid, N_MEMORY) {
-   if (nodemask && !node_isset(nid, *nodemask))
-   continue;
+   for_each_node_state(nid, N_MEMORY) {
+   if (nodemask && !node_isset(nid, *nodemask))
+   continue;
 
-   points += mem_cgroup_node_nr_lru_pages(memcg, nid,
-   LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+   points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+   LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
 
-   pgdat = NODE_DATA(nid);
-   points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
-   NR_SLAB_UNRECLAIMABLE);
-   }
+   pgdat = NODE_DATA(nid);
+   points += lruvec_page_state(mem_cgroup_lruvec(pgdat, 
memcg),
+   NR_SLAB_UNRECLAIMABLE);
+   }
 
-   points += memcg_page_state(m

Re: [v11 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 14:04:52, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
> 
> But there are workloads, which are not tolerant to a such behavior.
> Killing a random task may leave the workload in a broken state.
> 
> To solve this problem, memory.oom_group knob is introduced.
> It will define, whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
> 
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> belonging to the sub-tree of A will be killed. If OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
> 
> Also, tasks with oom_score_adj set to -1000 will not be killed because
> this has been a long established way to protect a particular process
> from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
> user defined configuration might lead to data corruptions or other
> misbehavior.
> 
> The default value is 0.

I still believe that  oc->chosen_task == INFLIGHT_VICTIM check in
oom_kill_memcg_victim should go away.
> 
> Signed-off-by: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Tetsuo Handa 
> Cc: David Rientjes 
> Cc: Andrew Morton 
> Cc: Tejun Heo 
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 17 +++
>  mm/memcontrol.c| 75 
> +++---
>  mm/oom_kill.c  | 49 +++---
>  3 files changed, 127 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 75b63b68846e..84ac10d7e67d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,13 @@ struct mem_cgroup {
>   /* OOM-Killer disable */
>   int oom_kill_disable;
>  
> + /*
> +  * Treat the sub-tree as an indivisible memory consumer,
> +  * kill all belonging tasks if the memory cgroup selected
> +  * as OOM victim.
> +  */
> + bool oom_group;
> +
>   /* handle for "memory.events" */
>   struct cgroup_file events_file;
>  
> @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
>  
>  bool mem_cgroup_select_oom_victim(struct oom_control *oc);
>  
> +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
> +{
> + return memcg->oom_group;
> +}
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct 
> oom_control *oc)
>  {
>   return false;
>  }
> +
> +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 191b70735f1f..d5acb278b11a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup 
> *memcg,
>  
>  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
>  {
> - struct mem_cgroup *iter;
> + struct mem_cgroup *iter, *group = NULL;
> + long group_score = 0;
>  
>   oc->chosen_memcg = NULL;
>   oc->chosen_points = 0;
>  
>   /*
> +  * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> +  * all tasks belonging to the memcg should be killed.
> +  * So, we mark the memcg as a victim.
> +  */
> + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + return;
> + }
> +
> + /*
>* The oom_score is calculated for leaf memory cgroups (including
>* the root memcg).
> +  * Non-leaf oom_group cgroups accumulating score of descendant
> +  * leaf memory cgroups.
>*/
>   rcu_read_lock();
>   for_each_mem_cgroup_tree(iter, root) {
>   long score;
>  
> + /*
> +  * We don't consider non-leaf non-oom_group memory cgroups
> +  * as OOM victims.
> +  */
> + if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> + !mem_cgroup_oom_group(it

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 14:04:51, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.
> 
> Signed-off-by: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Tetsuo Handa 
> Cc: David Rientjes 
> Cc: Andrew Morton 
> Cc: Tejun Heo 
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org

Assuming this is an opt-in
Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h |  17 +
>  include/linux/oom.h|  12 +++-
>  mm/memcontrol.c| 172 
> +
>  mm/oom_kill.c  |  70 +-
>  4 files changed, 251 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 69966c461d1c..75b63b68846e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct oom_control;
>  
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
> cgroup_subsys_state *css){
>   return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
>  #define mem_cgroup_from_counter(counter, member) \
>   container_of(counter, struct mem_cgroup, member)
>  
> @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct 
> *p)
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
> *task,
>   return true;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>   struct mem_cgroup *prev,
> @@ -936,6 +948,11 @@ static inline
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 76aac4ce39bc..ca78e2d5956e 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -9,6 +9,13 @@
>  #include  /* MMF_* */
>  #include  /* VM_FAULT* */
>  
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
>  struct zonelist;
>  struct notifier_block;
>  struct mem_cgroup;
> @@ -39,7 +46,8 @@ struct oom_control {
>  
>   /* Used by oom implementation, do not set */
>   unsigned long totalpages;
> - struct task_struct *chosen;
> + struct task_struct *chosen_task;
> + struct mem_cgroup *chosen_memcg;
>   unsigned long chosen_points;
>  };
>  
> @@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 41d71f665550..191b70735f1f 100644
> --- a/mm/memcontr

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 14:41:13, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > > 
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > > 
> > > That's not good from an interface perspective.
> > 
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
> 
> So, we would rely on grub config to set up OOM policy? Sounds weird.
> 
> We use boot options, when it's hard to implement on the fly switching
> (like turning on/off socket memory accounting), but here is not this case.

Well we define global policies with kernel command line so I do not
think it would be something unusual. An advantage is that you do not
have deal with semantic of the policy change during the runtime which is
something I am not sure we need or even want.
 
> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > > 
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> > 
> > yes we have changed that without a deeper considerations. Some of those
> > changes are arguable (e.g. child scarification). The oom badness
> > heuristic rewrite has triggered quite some complains AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> > 
> > More importantly I do not think that the current (non-memcg aware) OOM
> > policy is somehow obsolete and many people expect it to behave
> > consistently. As I've said already, I have seen many complains that the
> > OOM killer doesn't kill the right task. Most of them were just NUMA
> > related issues where the oom report was not clear enough. I do not want
> > to repeat that again now. Memcg awareness is certainly a useful
> > heuristic but I do not see it universally applicable to all workloads.
> > 
> > > Let's not make the userspace interface crap because of some misguided
> > > idea that the OOM heuristic is a hard promise to userspace. It's never
> > > been, and nobody has complained about changes in the past.
> > > 
> > > This case is doubly silly, as the behavior change only applies to
> > > cgroup2, which doesn't exactly have a large base of legacy users yet.
> > 
> > I agree on the interface part but I disagree with making it default just
> > because v2 is not largerly adopted yet.
> 
> I believe that the only real regression can be caused by active using of
> oom_score_adj. I really don't know how many cgroup v2 users are relying
> on it (hopefully, 0).

Not only. A memcg with many small tasks could regress as well.

> So, personally I would prefer to have an opt-out cgroup v2 mount option
> (sane new behavior for most users, 100% backward compatibility for rare
> strange setups), but I don't have a very strong opinion here.

I fail to see why should people disable the feature after they see an
unexpected behavior rather than other way around when the feature is
enabled when it is really wanted. The opt-in is more correct just from
the "least surprise POV".
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

So, we would rely on grub config to set up OOM policy? Sounds weird.

We use boot options, when it's hard to implement on the fly switching
(like turning on/off socket memory accounting), but here is not this case.

> 
> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. child scarification). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.
> 
> More importantly I do not think that the current (non-memcg aware) OOM
> policy is somehow obsolete and many people expect it to behave
> consistently. As I've said already, I have seen many complains that the
> OOM killer doesn't kill the right task. Most of them were just NUMA
> related issues where the oom report was not clear enough. I do not want
> to repeat that again now. Memcg awareness is certainly a useful
> heuristic but I do not see it universally applicable to all workloads.
> 
> > Let's not make the userspace interface crap because of some misguided
> > idea that the OOM heuristic is a hard promise to userspace. It's never
> > been, and nobody has complained about changes in the past.
> > 
> > This case is doubly silly, as the behavior change only applies to
> > cgroup2, which doesn't exactly have a large base of legacy users yet.
> 
> I agree on the interface part but I disagree with making it default just
> because v2 is not largerly adopted yet.

I believe that the only real regression can be caused by active using of
oom_score_adj. I really don't know how many cgroup v2 users are relying
on it (hopefully, 0).

So, personally I would prefer to have an opt-out cgroup v2 mount option
(sane new behavior for most users, 100% backward compatibility for rare
strange setups), but I don't have a very strong opinion here.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
[...]
> That will silently ignore what the user writes to the memory.oom_group
> control files across the system's cgroup tree.
> 
> We'll have a knob that lets the workload declare itself an indivisible
> memory consumer, that it would like to get killed in one piece, and
> it's silently ignored because of a mount option they forgot to pass.
> 
> That's not good from an interface perspective.

Yes and that is why I think a boot time knob would be the most simple
way. It will also open doors for more oom policies in future which I
believe come sooner or later.

> On the other hand, the only benefit of this patch is to shield users
> from changes to the OOM killing heuristics. Yet, it's really hard to
> imagine that modifying the victim selection process slightly could be
> called a regression in any way. We have done that many times over,
> without a second thought on backwards compatibility:
> 
> 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> a63d83f427fb oom: badness heuristic rewrite
> 778c14affaf9 mm, oom: base root bonus on current usage

yes we have changed that without a deeper considerations. Some of those
changes are arguable (e.g. child scarification). The oom badness
heuristic rewrite has triggered quite some complains AFAIR (I remember
Kosaki has made several attempts to revert it). I think that we are
trying to be more careful about user visible changes than we used to be.

More importantly I do not think that the current (non-memcg aware) OOM
policy is somehow obsolete and many people expect it to behave
consistently. As I've said already, I have seen many complains that the
OOM killer doesn't kill the right task. Most of them were just NUMA
related issues where the oom report was not clear enough. I do not want
to repeat that again now. Memcg awareness is certainly a useful
heuristic but I do not see it universally applicable to all workloads.

> Let's not make the userspace interface crap because of some misguided
> idea that the OOM heuristic is a hard promise to userspace. It's never
> been, and nobody has complained about changes in the past.
> 
> This case is doubly silly, as the behavior change only applies to
> cgroup2, which doesn't exactly have a large base of legacy users yet.

I agree on the interface part but I disagree with making it default just
because v2 is not largerly adopted yet.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v11 0/6] cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

This patchset makes the OOM killer cgroup-aware.

v11:
  - Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memmory.oom_group
patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
way how OOM victims becoming an access to the memory
reserves. Dropped corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
synchronization is based on finding tasks with OOM victim marker,
rather then on global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h |  12 ++-
 kernel/cgroup/cgroup.c  |  10 ++
 mm/memcontrol.c | 249 +++-
 mm/oom_kill.c   | 210 -
 7 files changed, 495 insertions(+), 76 deletions(-)

-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-05 Thread Roman Gushchin

Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup to use this function for looking for a OOM victim
task in the root memory cgroup by the cgroup-ware OOM killer.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which are directly belonging to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7410636fadf..41d71f665550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup 
*dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned 
int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int 
root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, 
struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5acb278b11a..fe6155d827c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2866,6 +2866,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit well the system with many running
containers:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest leaf memory cgroup
and kills the biggest task belonging to it. The following patches
will extend this functionality to consider non-leaf memory cgroups
as well, and also provide an ability to kill all tasks belonging
to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so it's score
is compared with leaf memory cgroups.
Due to memcg statistics implementation a special algorithm
is used for estimating it's oom_score: we define it as maximum
oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h|  12 +++-
 mm/memcontrol.c| 172 +
 mm/oom_kill.c  |  70 +-
 4 files changed, 251 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
*task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 41d71f665550..191b70735f1f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+   long points = 0;
+   int nid;
+   pg_data_t *pgd

[v11 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..28429e62b0ea 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, OOM killer will consider the memory cgroup as an
+   indivisible memory consumers and compare it with other memory
+   consumers by it's memory footprint.
+   If such memory cgroup is selected as an OOM victim, all
+   processes belonging to it or it's descendants will be killed.
+
+   This applies to system-wide OOM conditions and reaching
+   the hard memory limit of the cgroup and their ancestor.
+   If OOM condition happens in a descendant cgroup with it's own
+   memory limit, the memory cgroup can't be considered
+   as an OOM victim, and OOM killer will not kill all belonging
+   tasks.
+
+   Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
+   and will never kill the unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make 
sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be an indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v11 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Roman Gushchin

The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads, which are not tolerant to a such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, memory.oom_group knob is introduced.
It will define, whether a memory group should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of system-wide OOM or
memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
belonging to the sub-tree of A will be killed. If OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.

The default value is 0.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c| 75 +++---
 mm/oom_kill.c  | 49 +++---
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup selected
+* as OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct 
oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 191b70735f1f..d5acb278b11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
*oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
/*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
+   /*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulating score of descendant
+* leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've ran out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group == root_mem_cgroup ||
+   !mem_cgroup_is_descendant(iter, group)) {
+   gr

[v11 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-05 Thread Roman Gushchin

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f642a45b7f14..ccdb7d34cd13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -845,68 +845,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -980,6 +924,69 @@ static void oom_kill_process(struct oom_control *oc, const 
char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+   wake_oom_reaper(p);
+   task_unlock(p);
+   put_task_struct(p);
+   return;

Re: [v10 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 13:32:14, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > > consumption entities and performs the victim selection by comparing
> > > them based on their memory footprint. Then it kills the biggest task
> > > inside the selected memory cgroup.
> > > 
> > > But there are workloads, which are not tolerant to a such behavior.
> > > Killing a random task may leave the workload in a broken state.
> > > 
> > > To solve this problem, memory.oom_group knob is introduced.
> > > It will define, whether a memory group should be treated as an
> > > indivisible memory consumer, compared by total memory consumption
> > > with other memory consumers (leaf memory cgroups and other memory
> > > cgroups with memory.oom_group set), and whether all belonging tasks
> > > should be killed if the cgroup is selected.
> > > 
> > > If set on memcg A, it means that in case of system-wide OOM or
> > > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> > > belonging to the sub-tree of A will be killed. If OOM event is
> > > scoped to a descendant cgroup (A/B, for example), only tasks in
> > > that cgroup can be affected. OOM killer will never touch any tasks
> > > outside of the scope of the OOM event.
> > > 
> > > Also, tasks with oom_score_adj set to -1000 will not be killed.
> > 
> > I would extend the last sentence with an explanation. What about the
> > following:
> > "
> > Also, tasks with oom_score_adj set to -1000 will not be killed because
> > this has been a long established way to protect a particular process
> > from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> > user defined configuration might lead to data corruptions or other
> > misbehavior.
> > "
> 
> Added, thanks!
> 
> > 
> > few mostly nit picks below but this looks good other than that. Once the
> > fix mentioned in patch 3 is folded I will ack this.
> > 
> > [...]
> > 
> > >  static void select_victim_memcg(struct mem_cgroup *root, struct 
> > > oom_control *oc)
> > >  {
> > > - struct mem_cgroup *iter;
> > > + struct mem_cgroup *iter, *group = NULL;
> > > + long group_score = 0;
> > >  
> > >   oc->chosen_memcg = NULL;
> > >   oc->chosen_points = 0;
> > >  
> > >   /*
> > > +  * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > > +  * all tasks belonging to the memcg should be killed.
> > > +  * So, we mark the memcg as a victim.
> > > +  */
> > > + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > 
> > we have is_memcg_oom() helper which is esier to read and understand than
> > the explicit oc->memcg check
> 
> It's defined in oom_kill.c and not exported, so I'm not sure.

putting it to oom.h shouldn't be a big deal.
 
> > > + oc->chosen_memcg = oc->memcg;
> > > + css_get(&oc->chosen_memcg->css);
> > > + return;
> > > + }
> > > +
> > > + /*
> > >* The oom_score is calculated for leaf memory cgroups (including
> > >* the root memcg).
> > > +  * Non-leaf oom_group cgroups accumulating score of descendant
> > > +  * leaf memory cgroups.
> > >*/
> > >   rcu_read_lock();
> > >   for_each_mem_cgroup_tree(iter, root) {
> > >   long score;
> > >  
> > > + /*
> > > +  * We don't consider non-leaf non-oom_group memory cgroups
> > > +  * as OOM victims.
> > > +  */
> > > + if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > > + continue;
> > > +
> > > + /*
> > > +  * If group is not set or we've ran out of the group's sub-tree,
> > > +  * we should set group and reset group_score.
> > > +  */
> > > + if (!group || group == root_mem_cgroup ||
> > > + !mem_cgroup_is_descendant(iter, group)) {
> > > + group = iter;
> > > + group_score = 0;
> > > + }
> > > +
> > 
> > hmm, I thought you would go with a recursive oom_evaluate_memcg
> > implementation that would result in a more readable code IMHO. It is
> > true that we would traverse oom_group more times. But I do not expect
> > we would have very deep memcg hierarchies in the majority of workloads
> > and even if we did then this is a cold path which should focus on
> > readability more than a performance. Also implementing
> > mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> > turns out a real problem.
> 
> I've tried to go this way, but I didn't like the result. These both
> loops will share a lot of code (e.g. breaking on finding a previous victim,
> skipping non-leaf non-oom-group memcgs, etc), so the result is more messy.
> And actually it's strange to start a new loop to iterate exactly over
> the same sub-tree, which you want to skip in the first loop.

As I've said, I will not insist. It just makes more sense to me to do
the hierarchical behavior in a single pl

Re: [v10 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Roman Gushchin

On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > consumption entities and performs the victim selection by comparing
> > them based on their memory footprint. Then it kills the biggest task
> > inside the selected memory cgroup.
> > 
> > But there are workloads, which are not tolerant to a such behavior.
> > Killing a random task may leave the workload in a broken state.
> > 
> > To solve this problem, memory.oom_group knob is introduced.
> > It will define, whether a memory group should be treated as an
> > indivisible memory consumer, compared by total memory consumption
> > with other memory consumers (leaf memory cgroups and other memory
> > cgroups with memory.oom_group set), and whether all belonging tasks
> > should be killed if the cgroup is selected.
> > 
> > If set on memcg A, it means that in case of system-wide OOM or
> > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> > belonging to the sub-tree of A will be killed. If OOM event is
> > scoped to a descendant cgroup (A/B, for example), only tasks in
> > that cgroup can be affected. OOM killer will never touch any tasks
> > outside of the scope of the OOM event.
> > 
> > Also, tasks with oom_score_adj set to -1000 will not be killed.
> 
> I would extend the last sentence with an explanation. What about the
> following:
> "
> Also, tasks with oom_score_adj set to -1000 will not be killed because
> this has been a long established way to protect a particular process
> from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> user defined configuration might lead to data corruptions or other
> misbehavior.
> "

Added, thanks!

> 
> few mostly nit picks below but this looks good other than that. Once the
> fix mentioned in patch 3 is folded I will ack this.
> 
> [...]
> 
> >  static void select_victim_memcg(struct mem_cgroup *root, struct 
> > oom_control *oc)
> >  {
> > -   struct mem_cgroup *iter;
> > +   struct mem_cgroup *iter, *group = NULL;
> > +   long group_score = 0;
> >  
> > oc->chosen_memcg = NULL;
> > oc->chosen_points = 0;
> >  
> > /*
> > +* If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > +* all tasks belonging to the memcg should be killed.
> > +* So, we mark the memcg as a victim.
> > +*/
> > +   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> 
> we have is_memcg_oom() helper which is esier to read and understand than
> the explicit oc->memcg check

It's defined in oom_kill.c and not exported, so I'm not sure.

> 
> > +   oc->chosen_memcg = oc->memcg;
> > +   css_get(&oc->chosen_memcg->css);
> > +   return;
> > +   }
> > +
> > +   /*
> >  * The oom_score is calculated for leaf memory cgroups (including
> >  * the root memcg).
> > +* Non-leaf oom_group cgroups accumulating score of descendant
> > +* leaf memory cgroups.
> >  */
> > rcu_read_lock();
> > for_each_mem_cgroup_tree(iter, root) {
> > long score;
> >  
> > +   /*
> > +* We don't consider non-leaf non-oom_group memory cgroups
> > +* as OOM victims.
> > +*/
> > +   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +   continue;
> > +
> > +   /*
> > +* If group is not set or we've ran out of the group's sub-tree,
> > +* we should set group and reset group_score.
> > +*/
> > +   if (!group || group == root_mem_cgroup ||
> > +   !mem_cgroup_is_descendant(iter, group)) {
> > +   group = iter;
> > +   group_score = 0;
> > +   }
> > +
> 
> hmm, I thought you would go with a recursive oom_evaluate_memcg
> implementation that would result in a more readable code IMHO. It is
> true that we would traverse oom_group more times. But I do not expect
> we would have very deep memcg hierarchies in the majority of workloads
> and even if we did then this is a cold path which should focus on
> readability more than a performance. Also implementing
> mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> turns out a real problem.

I've tried to go this way, but I didn't like the result. These both
loops will share a lot of code (e.g. breaking on finding a previous victim,
skipping non-leaf non-oom-group memcgs, etc), so the result is more messy.
And actually it's strange to start a new loop to iterate exactly over
the same sub-tree, which you want to skip in the first loop.

> 
> Anyway this is nothing really fundamental so I will leave the decision
> on you.
> 
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> > if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > return oc->chosen_memcg;
> >  
> > -   /* Kill a task in the chosen memcg with the biggest memor

Re: [PATCH v5 1/7] media: add glossary.rst with a glossary of terms used at V4L2 spec

2017-10-05 Thread Mauro Carvalho Chehab

Em Thu, 5 Oct 2017 11:21:07 +0300
Sakari Ailus  escreveu:

> Hi Mauro,
> 
> My apologies for the late reply.
> 
> On Tue, Aug 29, 2017 at 10:07:50AM -0300, Mauro Carvalho Chehab wrote:
> > Em Tue, 29 Aug 2017 10:47:48 +0300
> > Sakari Ailus  escreveu:
> >   
> > > Hi Mauro,
> > > 
> > > Thanks for the update. A few comments below.
> > > 
> > > On Mon, Aug 28, 2017 at 09:53:55AM -0300, Mauro Carvalho Chehab wrote:  
> > > > Add a glossary of terms for V4L2, as several concepts are complex
> > > > enough to cause misunderstandings.
> > > > 
> > > > Signed-off-by: Mauro Carvalho Chehab 
> > > > ---
> > > >  Documentation/media/uapi/v4l/glossary.rst | 147 
> > > > ++
> > > >  Documentation/media/uapi/v4l/v4l2.rst |   1 +
> > > >  2 files changed, 148 insertions(+)
> > > >  create mode 100644 Documentation/media/uapi/v4l/glossary.rst
> > > > 
> > > > diff --git a/Documentation/media/uapi/v4l/glossary.rst 
> > > > b/Documentation/media/uapi/v4l/glossary.rst
> > > > new file mode 100644
> > > > index ..0b6ab5adec81
> > > > --- /dev/null
> > > > +++ b/Documentation/media/uapi/v4l/glossary.rst
> > > > @@ -0,0 +1,147 @@
> > > > +
> > > > +Glossary
> > > > +
> > > > +
> > > > +.. note::
> > > > +
> > > > +   This goal of section is to standardize the terms used within the 
> > > > V4L2
> > > > +   documentation. It is written incrementally as they are standardized 
> > > > in
> > > > +   the V4L2 documentation. So, it is a Work In Progress.
> > > 
> > > I'd leave the WiP part out.  
> > 
> > IMO, it is important to mention it, as the glossary, right now, covers
> > only what's used on the first two sections of the API book. There are
> > a lot more to be covered.  
> 
> Works for me.
> 
> >   
> > >   
> > > > +
> > > > +.. Please keep the glossary entries in alphabetical order
> > > > +
> > > > +.. glossary::
> > > > +
> > > > +Bridge driver
> > > > +   The same as V4L2 main driver.
> > > 
> > > I've understood bridges being essentially a bus receiver + DMA. Most ISPs
> > > contain both but have more than that. How about:
> > > 
> > > A driver for a bus (e.g. parallel, CSI-2) receiver and DMA. Bridge drivers
> > > typically act as V4L2 main drivers.  
> > 
> > No, only on some drivers the bridge driver has DMA. A vast amount of
> > drivers (USB ones) don't implement any DMA inside the driver, as it is
> > up to the USB host driver to implement support for DMA.
> > 
> > There are even some USB host drivers that don't always use DMA for I/O 
> > transfers, using direct I/O if the message is smaller than a threshold
> > or not multiple of the bus word. This is pretty common on SoC USB host
> > drivers.
> > 
> > In any case, for the effect of this spec, and for all discussions we
> > ever had about it, bridge driver == V4L2 main driver. I don't
> > see any reason why to distinguish between them.  
> 
> I think you should precisely define what a bridge driver means. Generally
> ISP drivers aren't referred to as bridge drivers albeit they, too, function
> as V4L2 main drivers.

It would be more productive if you could reply to the v7 of this patch
series with your suggestion for such definition.

IMHO, a bridge driver is a driver for a piece of hardware that allows
sending/receiving I2C control messages to a media hardware component,
but I'm likely ok with any other definition you have in mind.

> 
> >   
> > >   
> > > > +
> > > > +Device Node
> > > > +   A character device node in the file system used to control and 
> > > > do
> > > > +   input/output data transfers from/to a Kernel driver.
> > > > +
> > > > +Digital Signal Processor - DSP
> > > > +   A specialized microprocessor, with its architecture optimized 
> > > > for
> > > > +   the operational needs of digital signal processing.
> > > > +
> > > > +Driver
> > > > +   The part of the Linux Kernel that implements support
> > > > +   for a hardware component.
> > > > +
> > > > +Field-programmable Gate Array - FPGA
> > > > +   A field-programmable gate array (FPGA) is an integrated circuit
> > > > +   designed to be configured by a customer or a designer after
> > > > +   manufacturing.
> > > > +
> > > > +   See https://en.wikipedia.org/wiki/Field-programmable_gate_array.
> > > > +
> > > > +Hardware component
> > > > +   A subset of the media hardware. For example an I²C or SPI 
> > > > device,
> > > > +   or an IP block inside an SoC or FPGA.
> > > > +
> > > > +Hardware peripheral
> > > > +   A group of hardware components that together make a larger
> > > > +   user-facing functional peripheral. For instance the SoC ISP IP
> > > > +   cores and external camera sensors together make a
> > > > +   camera hardware peripheral.
> > > > +
> > > > +   Also known as peripheral.
> > > 
> > > Aren't peripherals usually understood to be devices that you can plug into
> > > a computer? Such as a printer, o

Re: [v10 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Michal Hocko

On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
> 
> But there are workloads, which are not tolerant to a such behavior.
> Killing a random task may leave the workload in a broken state.
> 
> To solve this problem, memory.oom_group knob is introduced.
> It will define, whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
> 
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> belonging to the sub-tree of A will be killed. If OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
> 
> Also, tasks with oom_score_adj set to -1000 will not be killed.

I would extend the last sentence with an explanation. What about the
following:
"
Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the oom killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.
"

few mostly nit picks below but this looks good other than that. Once the
fix mentioned in patch 3 is folded I will ack this.

[...]

>  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
>  {
> - struct mem_cgroup *iter;
> + struct mem_cgroup *iter, *group = NULL;
> + long group_score = 0;
>  
>   oc->chosen_memcg = NULL;
>   oc->chosen_points = 0;
>  
>   /*
> +  * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> +  * all tasks belonging to the memcg should be killed.
> +  * So, we mark the memcg as a victim.
> +  */
> + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {

we have is_memcg_oom() helper which is esier to read and understand than
the explicit oc->memcg check

> + oc->chosen_memcg = oc->memcg;
> + css_get(&oc->chosen_memcg->css);
> + return;
> + }
> +
> + /*
>* The oom_score is calculated for leaf memory cgroups (including
>* the root memcg).
> +  * Non-leaf oom_group cgroups accumulating score of descendant
> +  * leaf memory cgroups.
>*/
>   rcu_read_lock();
>   for_each_mem_cgroup_tree(iter, root) {
>   long score;
>  
> + /*
> +  * We don't consider non-leaf non-oom_group memory cgroups
> +  * as OOM victims.
> +  */
> + if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> + continue;
> +
> + /*
> +  * If group is not set or we've ran out of the group's sub-tree,
> +  * we should set group and reset group_score.
> +  */
> + if (!group || group == root_mem_cgroup ||
> + !mem_cgroup_is_descendant(iter, group)) {
> + group = iter;
> + group_score = 0;
> + }
> +

hmm, I thought you would go with a recursive oom_evaluate_memcg
implementation that would result in a more readable code IMHO. It is
true that we would traverse oom_group more times. But I do not expect
we would have very deep memcg hierarchies in the majority of workloads
and even if we did then this is a cold path which should focus on
readability more than a performance. Also implementing
mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
turns out a real problem.

Anyway this is nothing really fundamental so I will leave the decision
on you.

> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
>   if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
>   return oc->chosen_memcg;
>  
> - /* Kill a task in the chosen memcg with the biggest memory footprint */
> - oc->chosen_points = 0;
> - oc->chosen_task = NULL;
> - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> -
> - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> - goto out;
> -
> - __oom_kill_process(oc->chosen_task);
> + /*
> +  * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> +  * of the chosen memory cgroup, otherwise kill the task with the biggest
> +  * memory footprint.
> +  */
> + if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_ki

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

On Thu, Oct 05, 2017 at 01:12:30PM +0200, Michal Hocko wrote:
> On Thu 05-10-17 11:27:07, Roman Gushchin wrote:
> > On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> [...]
> > > Sorry about the confusion. There are two things. First, should we do a
> > > css_get on the newly selected memcg within the for loop when we still
> > > have a reference to it?
> > 
> > We're holding rcu_read_lock, it should be enough. We're bumping css counter
> > just before releasing rcu lock.
> 
> yes
> 
> > > 
> > > Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> > > return 0 for offlined memcgs. Maybe no need to call
> > > oom_evaluate_memcg() for offlined memcgs.
> > 
> > Sounds like a good optimization, which can be done on top of the current
> > patchset.
> 
> You could achive this by checking whether a memcg has tasks rather than
> explicitly checking for children memcgs as I've suggested already.

Using cgroup_has_tasks() will require additional locking, so I'm not sure
it worth it.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Wed 04-10-17 16:46:35, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.

Thanks for separating the group_oom part. This is getting in the
mergeable state. I will ack it once the suggested fixes are folded in.
There is some clean up potential on top (I find the oc->chosen_memcg
quite ugly and will post a patch on top of yours) but that can be done
later.
 
> Signed-off-by: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Tetsuo Handa 
> Cc: David Rientjes 
> Cc: Andrew Morton 
> Cc: Tejun Heo 
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Wed 04-10-17 16:31:38, Johannes Weiner wrote:
> On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> > On Wed, 4 Oct 2017, Roman Gushchin wrote:
> > 
> > > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct 
> > > > > task_struct *victim)
> > > > >   struct mm_struct *mm;
> > > > >   bool can_oom_reap = true;
> > > > >  
> > > > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > > + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > > + put_task_struct(victim);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > >   p = find_lock_task_mm(victim);
> > > > >   if (!p) {
> > > > >   put_task_struct(victim);
> > > > 
> > > > Is this necessary? The callers of this function use oom_badness() to
> > > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > > 
> > > It is. __oom_kill_process() is used to kill all processes belonging
> > > to the selected memory cgroup, so we should perform these checks
> > > to avoid killing unkillable processes.
> > > 
> > 
> > That's only true after the next patch in the series which uses the 
> > oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> > Would it be possible to move this check to that patch so it's more 
> > obvious?
> 
> Yup, I realized it when reviewing the next patch. Moving this hunk to
> the next patch would probably make sense. Although, us reviewers have
> been made aware of this now, so I don't feel strongly about it. Won't
> make much of a difference once the patches are merged.

I think it would be better to move it because it will be less confusing
that way. Especially for those who are going to read git history in
order to understand why this is needed.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Michal Hocko

On Thu 05-10-17 11:27:07, Roman Gushchin wrote:
> On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
[...]
> > Sorry about the confusion. There are two things. First, should we do a
> > css_get on the newly selected memcg within the for loop when we still
> > have a reference to it?
> 
> We're holding rcu_read_lock, it should be enough. We're bumping css counter
> just before releasing rcu lock.

yes

> > 
> > Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> > return 0 for offlined memcgs. Maybe no need to call
> > oom_evaluate_memcg() for offlined memcgs.
> 
> Sounds like a good optimization, which can be done on top of the current
> patchset.

You could achive this by checking whether a memcg has tasks rather than
explicitly checking for children memcgs as I've suggested already.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> >

Hi, David!

> 
> It is, because it can quite clearly be a DoS, and was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.
> 
> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

Overcommit is real, but configuring the system in a way that system-wide OOM
happens often is a strange idea. As we all know, the system can barely work
adequate under global memory shortage: network packets are dropped, latency
is bad, weird kernel issues are revealed periodically, etc.
I do not see, why you can't overcommit on deeper layers of cgroup hierarchy,
avoiding system-wide OOM to happen.

> 
> I know exactly why earlier versions of this patchset iterated that usage 
> up the tree so you would pick from students, pick from this troublemaking 
> student, and then oom kill from his hierarchy.  Roman has made that point 
> himself.  My suggestion was to add userspace influence to it so that 
> enterprise users and users with business goals can actually define that we 
> really do want 80% of memory to be used by this process or this hierarchy, 
> it's in our best interest.

I'll repeat myself: I believe that there is a range of possible policies:
from a complete flat (what Johannes did suggested few weeks ago), to a very
hierarchical (as in v8). Each with their pros and cons.
(Michal did provide a clear example of bad behavior of the hierarchical 
approach).

I assume, that v10 is a good middle point, and it's good because it doesn't
prevent further development. Just for example, you can introduce a third state
of oom_group knob, which will mean "evaluate as a whole, but do not kill all".
And this is what will solve your particular case, right?

> 
> Earlier iterations of this patchset did this, and did it correctly.  
> Userspace influence over the decisionmaking makes it a very powerful 
> combination because you _can_ specify what your goals are or choose to 
> leave the priorities as default so you can compare based solely on usage.  
> It was a beautiful solution to the problem.

I did, but then I did agree with Tejun's point, that proposed semantics will
limit us further. Really, oom_priorities do not guarantee the killing order
(remember numa issues, as well as oom_score_adj), so in practice it can be even
reverted (e.g. low prio cgroup killed before high prio). We shouldn't cause
users rely on these priorities more than some hints to the kernel.
But the way how they are defined doesn't allow to change anything, it's too
rigid.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin

On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> >> > +   if (memcg_has_children(iter))
> >> > +   continue;
> >>
> >> && iter != root_mem_cgroup ?
> >
> > Oh, sure. I had a stupid bug in my test script, which prevented me from
> > catching this. Thanks!
> >
> > This should fix the problem.
> > --
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2e82625bd354..b3848bce4c86 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup 
> > *root, struct oom_control *oc)
> >  * We don't consider non-leaf non-oom_group memory cgroups
> >  * as OOM victims.
> >  */
> > -   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> > +   !mem_cgroup_oom_group(iter))
> > continue;
> 
> I think you are mixing the 3rd and 4th patch. The root_mem_cgroup
> check should be in 3rd while oom_group stuff should be in 4th.
>

Right. This "patch" should fix them both, it was just confusing to
send two patches. I'll split it before final landing.

> 
> >>
> >> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
> >> end why not css_tryget_online() here and css_put for the previous
> >> selected one.
> >
> > Hm, why do we need to check this? I do not see, how we can choose
> > an OFFLINE memcg as a victim, tbh. Please, explain the problem.
> >
> 
> Sorry about the confusion. There are two things. First, should we do a
> css_get on the newly selected memcg within the for loop when we still
> have a reference to it?

We're holding rcu_read_lock, it should be enough. We're bumping css counter
just before releasing rcu lock.

> 
> Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> return 0 for offlined memcgs. Maybe no need to call
> oom_evaluate_memcg() for offlined memcgs.

Sounds like a good optimization, which can be done on top of the current
patchset.

Thank you!
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Johannes Weiner

On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> > 
> 
> It is, because it can quite clearly be a DoSand was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.

This doesn't address how it's different from a single process
following the same pattern right now.

> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

We overcommit too, but our workloads organize themselves based on
managing their resources, not based on evading the OOM killer. I'd
wager that's true for many if not most users.

Untrusted workloads can evade the OOM killer now, and they can after
these patches are in. Nothing changed. It's not what this work tries
to address at all.

The changelogs are pretty clear on what the goal and the scope of this
is. Just because it doesn't address your highly specialized usecase
doesn't make it pointless. I think we've established that in the past.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread David Rientjes

On Wed, 4 Oct 2017, Johannes Weiner wrote:

> > By only considering leaf memcgs, does this penalize users if their memcg 
> > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > processes to be members of that memcg, which would otherwise be the 
> > standard behavior?
> > 
> > What prevents me from spreading my memcg with N processes attached over N 
> > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > each child memcg specifically to avoid being oom killed?
> 
> It's no different from forking out multiple mm to avoid being the
> biggest process.
> 

It is, because it can quite clearly be a DoS, and was prevented with 
Roman's earlier design of iterating usage up the hierarchy and comparing 
siblings based on that criteria.  I know exactly why he chose that 
implementation detail early on, and it was to prevent cases such as this 
and to not let userspace hide from the oom killer.

> It's up to the parent to enforce limits on that group and prevent you
> from being able to cause global OOM in the first place, in particular
> if you delegate to untrusted and potentially malicious users.
> 

Let's resolve that global oom is a real condition and getting into that 
situation is not a userspace problem.  It's the result of overcommiting 
the system, and is used in the enterprise to address business goals.  If 
the above is true, and its up to memcg to prevent global oom in the first 
place, then this entire patchset is absolutely pointless.  Limit userspace 
to 95% of memory and when usage is approaching that limit, let userspace 
attached to the root memcg iterate the hierarchy itself and kill from the 
largest consumer.

This patchset exists because overcommit is real, exactly the same as 
overcommit within memcg hierarchies is real.  99% of the time we don't run 
into global oom because people aren't using their limits so it just works 
out.  1% of the time we run into global oom and we need a decision to made 
based for forward progress.  Using Michal's earlier example of admins and 
students, a student can easily use all of his limit and also, with v10 of 
this patchset, 99% of the time avoid being oom killed just by forking N 
processes over N cgroups.  It's going to oom kill an admin every single 
time.

I know exactly why earlier versions of this patchset iterated that usage 
up the tree so you would pick from students, pick from this troublemaking 
student, and then oom kill from his hierarchy.  Roman has made that point 
himself.  My suggestion was to add userspace influence to it so that 
enterprise users and users with business goals can actually define that we 
really do want 80% of memory to be used by this process or this hierarchy, 
it's in our best interest.

Earlier iterations of this patchset did this, and did it correctly.  
Userspace influence over the decisionmaking makes it a very powerful 
combination because you _can_ specify what your goals are or choose to 
leave the priorities as default so you can compare based solely on usage.  
It was a beautiful solution to the problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5 1/7] media: add glossary.rst with a glossary of terms used at V4L2 spec

2017-10-05 Thread Sakari Ailus

Hi Mauro,

My apologies for the late reply.

On Tue, Aug 29, 2017 at 10:07:50AM -0300, Mauro Carvalho Chehab wrote:
> Em Tue, 29 Aug 2017 10:47:48 +0300
> Sakari Ailus  escreveu:
> 
> > Hi Mauro,
> > 
> > Thanks for the update. A few comments below.
> > 
> > On Mon, Aug 28, 2017 at 09:53:55AM -0300, Mauro Carvalho Chehab wrote:
> > > Add a glossary of terms for V4L2, as several concepts are complex
> > > enough to cause misunderstandings.
> > > 
> > > Signed-off-by: Mauro Carvalho Chehab 
> > > ---
> > >  Documentation/media/uapi/v4l/glossary.rst | 147 
> > > ++
> > >  Documentation/media/uapi/v4l/v4l2.rst |   1 +
> > >  2 files changed, 148 insertions(+)
> > >  create mode 100644 Documentation/media/uapi/v4l/glossary.rst
> > > 
> > > diff --git a/Documentation/media/uapi/v4l/glossary.rst 
> > > b/Documentation/media/uapi/v4l/glossary.rst
> > > new file mode 100644
> > > index ..0b6ab5adec81
> > > --- /dev/null
> > > +++ b/Documentation/media/uapi/v4l/glossary.rst
> > > @@ -0,0 +1,147 @@
> > > +
> > > +Glossary
> > > +
> > > +
> > > +.. note::
> > > +
> > > +   This goal of section is to standardize the terms used within the V4L2
> > > +   documentation. It is written incrementally as they are standardized in
> > > +   the V4L2 documentation. So, it is a Work In Progress.  
> > 
> > I'd leave the WiP part out.
> 
> IMO, it is important to mention it, as the glossary, right now, covers
> only what's used on the first two sections of the API book. There are
> a lot more to be covered.

Works for me.

> 
> > 
> > > +
> > > +.. Please keep the glossary entries in alphabetical order
> > > +
> > > +.. glossary::
> > > +
> > > +Bridge driver
> > > + The same as V4L2 main driver.  
> > 
> > I've understood bridges being essentially a bus receiver + DMA. Most ISPs
> > contain both but have more than that. How about:
> > 
> > A driver for a bus (e.g. parallel, CSI-2) receiver and DMA. Bridge drivers
> > typically act as V4L2 main drivers.
> 
> No, only on some drivers the bridge driver has DMA. A vast amount of
> drivers (USB ones) don't implement any DMA inside the driver, as it is
> up to the USB host driver to implement support for DMA.
> 
> There are even some USB host drivers that don't always use DMA for I/O 
> transfers, using direct I/O if the message is smaller than a threshold
> or not multiple of the bus word. This is pretty common on SoC USB host
> drivers.
> 
> In any case, for the effect of this spec, and for all discussions we
> ever had about it, bridge driver == V4L2 main driver. I don't
> see any reason why to distinguish between them.

I think you should precisely define what a bridge driver means. Generally
ISP drivers aren't referred to as bridge drivers albeit they, too, function
as V4L2 main drivers.

> 
> > 
> > > +
> > > +Device Node
> > > + A character device node in the file system used to control and do
> > > + input/output data transfers from/to a Kernel driver.
> > > +
> > > +Digital Signal Processor - DSP
> > > + A specialized microprocessor, with its architecture optimized for
> > > + the operational needs of digital signal processing.
> > > +
> > > +Driver
> > > + The part of the Linux Kernel that implements support
> > > + for a hardware component.
> > > +
> > > +Field-programmable Gate Array - FPGA
> > > + A field-programmable gate array (FPGA) is an integrated circuit
> > > + designed to be configured by a customer or a designer after
> > > + manufacturing.
> > > +
> > > + See https://en.wikipedia.org/wiki/Field-programmable_gate_array.
> > > +
> > > +Hardware component
> > > + A subset of the media hardware. For example an I²C or SPI device,
> > > + or an IP block inside an SoC or FPGA.
> > > +
> > > +Hardware peripheral
> > > + A group of hardware components that together make a larger
> > > + user-facing functional peripheral. For instance the SoC ISP IP
> > > + cores and external camera sensors together make a
> > > + camera hardware peripheral.
> > > +
> > > + Also known as peripheral.  
> > 
> > Aren't peripherals usually understood to be devices that you can plug into
> > a computer? Such as a printer, or a... floppy drive?
> 
> That is the common sense, although, technically, peripherals are any
> I/O component. It is "peripheral" in the sense that it is not part of
> the CPU's internal circuits. Instead, it is a data that should be sent
> in or out the CPU. 
> 
> On such concept, even an I/O hardware integrated inside a SoC is a
> peripheral.
> 
> Yet, Hans argued already against using it. I opened a separate
> thread for us to discuss about it.
> 
> > I certainly wouldn't call this a peripheral. How about "aggregate device"?
> > We haven't used that before, but I think it relatively well catches the
> > meaning without being excessively elaborate.
> 
> "aggregate device" means nothing to me ;-) I propose "media hardware" instead.

Hardware is substance such as milk. You can't say "a m

Re: [PATCH v2 0/2] kbuild: Cache exploratory calls to the compiler

2017-10-05 Thread Ingo Molnar


* Douglas Anderson  wrote:

> This two-patch series attempts to speed incremental builds of the
> kernel up by a bit.  How much of a speedup you get depends a lot on
> your environment, specifically the speed of your workstation and how
> fast it takes to invoke the compiler.
> 
> In the Chrome OS build environment you get a really big win.  For an
> incremental build (via emerge) I measured a speedup from ~1 minute to
> ~35 seconds.

Very impressive!

> [...]  ...but Chrome OS calls the compiler through a number of wrapper 
> scripts 
> and also calls the kernel make at least twice for an emerge (during compile 
> stage and install stage), so it's a bit of a worst case.

I don't think that's a worst case: incremental builds are very commonly used 
during kernel development and kernel testing. (I'd even argue that the 
performnace 
of incremental builds is one of the most important features of a build system.)

That it's called twice in the Chrome OS build system does not change the 
proportion of the speedup.

> Perhaps a more realistic measure of the speedup others might see is
> running "time make help > /dev/null" outside of the Chrome OS build
> environment on my system.  When I do this I see that it took more than
> 1.0 seconds before and less than 0.2 seconds after.  So presumably
> this has the ability to shave ~0.8 seconds off an incremental build
> for most folks out there.  While 0.8 seconds savings isn't huge, it
> does make incremental builds feel a lot snappier.

This is a huge deal!

FWIIW I have tested your patches and they work fine here. Here's the 
before/after 
performance testing of various styles of build times of the scheduler.

First the true worst case is a full rebuild:

[ before ]

  triton:~/tip> perf stat --null --repeat 3 --pre "make clean 2>/dev/null 2>&1" 
make kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (3 runs):

   4.693974827 seconds time elapsed 
 ( +-  0.05% )

[ after ]

  triton:~/tip> perf stat --null --repeat 3 --pre "make clean 2>/dev/null 2>&1" 
make kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (3 runs):

   4.391769610 seconds time elapsed 
 ( +-  0.21% )

Still a ~6% speedup which is nice to have.

Then the best case, a fully cached rebuild of a specific subsystem - which I 
personally do all the time when I don't remember whether I already built the 
kernel or not:

[ before ]

triton:~/tip> taskset 1 perf stat --null --pre "sync" --repeat 10 make 
kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (10 runs):

   0.439517157 seconds time elapsed 
 ( +-  0.14% )

[ after ]

  triton:~/tip> taskset 1 perf stat --null --pre "sync" --repeat 10 make 
kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (10 runs):

   0.148483807 seconds time elapsed 
 ( +-  0.57% )

A 300% speedup on my system!

So I wholeheartedly endorse the whole concept of caching build environment 
invariants:

  Tested-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/2] kbuild: Add a cache for generated variables

2017-10-05 Thread Masahiro Yamada

Hi Douglas,


2017-10-05 7:38 GMT+09:00 Doug Anderson :
> Hi,
>
> On Tue, Oct 3, 2017 at 9:05 PM, Masahiro Yamada
>  wrote:
>> Thanks for the patches.  The idea is interesting.
>>
>> I am not a Chrome developer, but cc-option could be improved somehow.
>>
>>
>> I examined two approaches to mitigate the pain.
>>
>> [1] Skip cc-option completely when we run non-build targets
>> such as "make help", "make clean", etc.
>>
>> [2] Cache the result of cc-option like this patch
>>
>>
>> I wrote some patches for [1]
>> https://patchwork.kernel.org/patch/9983825/
>> https://patchwork.kernel.org/patch/9983829/
>> https://patchwork.kernel.org/patch/9983833/
>> https://patchwork.kernel.org/patch/9983827/
>>
>> Comments are welcome.  :)
>
> OK, I'll take a look at them.  I'm not massively familiar with the
> kernel Makefiles, but hopefully I can figure some of it out.  :-P
>
>
>> [1] does not solve the slowness in the incremental build.
>> Instead, it makes non-build targets faster
>> from a clean source tree because cc-option is zero cost.
>>
>>
>> [2] solves the issues in the incremental build.
>> One funny thing is, it computes cc-option and store the cache file
>> even for "make clean", where we know the cache file will be
>> immediately deleted.
>>
>>
>> We can apply [1] or [2], or maybe both of them
>> because [1] and [2] are trying to solve the different issues.
>
> Yeah, I'm much more interested in [2] than [1].  I run incremental
> builds almost constantly and hate the slowness.  :(  I can certainly
> appreciate that the inefficient compiler setup in Chrome OS isn't your
> problem and that an extra .5 seconds for an incremental build for most
> people isn't that huge, though.  ...but I'l probably move forward with
> [2] since it still helps me a lot and still should help others.
> Solving [1] seems worthwhile too, though...


I agree.

[2] is more helpful because developers spend most of time
for the incremental build.

[1] can improve it even more, but it just addresses some minor issues.



>
>> The cache approach seemed clever, but I do not like the implementation.
>> The code is even more unreadable with lots of escape sequence.
>>
>>
>> Here is my suggestion for simpler implementation (including bike-shed)
>>
>>
>> Implement a new function "shell-cache".  It works like $(shell ...)
>>
>> The difference is
>> $(call shell-cached,...) returns the cached result, or run the command
>> if not cached.
>>
>>
>>
>> Also, add try-run-cached.  This is a cached variant of try-run.
>>
>>
>> I just played with it, and seems working.
>> (I did not have spend much time for testing, more wider test is needed.)
>>
>>
>> The code is like something like this:
>>
>>
>>
>> make-cache := $(if $(KBUILD_EXTMOD),$(KBUILD_EXTMOD)/,$(if
>> $(obj),$(obj)/)).cache.mk
>>
>> -include $(make-cache)
>
> Thanks, this is much better!  :)
>
>
>> define __run-and-store
>> ifeq ($(origin $(1)),undefined)
>>   $$(eval $(1) := $$(shell $$2))
>>   $$(shell echo '$(1) := $$($(1))' >> $(make-cache))
>> endif
>> endef
>
> I like your idea.  Essentially you're saying that we can just defer
> the shell command, which was the really essential part that needed to
> be deferred.  Nice!  It seems much nicer / cleaner.  Still a tiny bit
> of voodoo, but with all of your improvements the voodoo is at least
> contained to a very small part of the file.
>
> OK, things seem pretty nice with this and I'll submit out a new patch.
> I'm a bit conflicted about whether I should put myself as authorship
> or you, since mostly the patch is your code now.  ;-)  For now I'll
> leave my authorship but feel free to claim it and change me to just
> "Suggested-by".  :-)


No worry.
Without your patches, I would have never come up with this.
The code improvement happened in general review process.



> NOTE: one thing I noticed about your example code is that you weren't
> always consistent about $(1) vs. $1.  I've changed things to be
> consistent at the expense of extra parenthesis.  If you hate it, let
> me know.

I accept your choice.
Keeping the consistent coding style is good.


If we decide to use $1 instead of $(1),
we should convert all instances tree-wide
for consistency.



>>
>> __set-and-store takes two arguments,
>> but here, it is called with one argument __cached_$(1)
>>
>> Probably, $$(try-run, ...) will be passed as $2,
>> but I do not know why this works.
>
> Whew!  It's not just me that's confused by all this...  ;-)  It looks
> like you continued to use this in your suggestion, so I guess you
> thought it was useful, at least...  Yeah, it's a bit of magic.  The
> goal was to have one less set of evaluating and passing going around.
>
> So really '__set-and-store' takes _1_ argument.  It outputs a string
> where it uses this argument.  Also part of the output string is a
> reference to $(2).  This will refer to the caller's second variable.
>
> ===
>
> Maybe a "simpler" example?
>
> define example
> $$(eval $(1) := $$(2))
> endef
>
> ex_usage = $(eva

42 matches

Mail list logo