Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-18 Thread Johannes Weiner
On Wed, Nov 18, 2015 at 07:02:54PM +0300, Vladimir Davydov wrote:
> On Tue, Nov 17, 2015 at 05:22:17PM -0500, Johannes Weiner wrote:
> > On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> > > And with this patch it will work this way, but only if sum limits <
> > > total ram, which is rather rare in practice. On tightly packed
> > > systems it does nothing.
> > 
> > That's not true, it's still useful when things go south inside a
> > cgroup, even with overcommitted limits. See above.
> 
> I meant solely this patch here, not the rest of the patch set. In the
> overcommitted case there is no difference if we have the last patch or
> not AFAIU.

Even this patch, and even in the overcommitted case. When things go
bad inside a cgroup it can steal free memory (it's rare that machines
are at 100% utilization in practice) or memory from other groups until
it hits its own limit. I expect most users, except for some large-scale
deployments, to frequently hit memory.high (or max) in practice.

Obviously the utopian case of full utilization will be even smoother
when we make vmpressure more fine-grained, but why would that be an
argument *against* this patch here, which is useful everywhere else?

> Why can't we apply all patches but the last one (they look OK at first
> glance, but I need more time to review them carefully) and disable
> socket accounting by default for now? Then you or someone else would
> prepare a separate patch set introducing vmpressure propagation to
> socket code, so that socket accounting could be enabled by default.

This is not going to happen, and we discussed this several times
before. I really wish Michal and you would put more thought into
interface implications. It's trivial to fix up an implementation if
actual shortcomings are observed, but it's nigh impossible to fix
interfaces and user-visible behavior once they are published. It requires
enormous undertakings such as the unified hierarchy to rectify things.

Please take your time to review this series, no problem.

But I'm no longer reacting to suggestions to make interface tradeoffs
because new code is not proven to work 99% of the time. That's simply
ridiculous. Any problems will have to be fixed either way, and we're
giving users the cmdline options to work around them in the meantime.


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-18 Thread Vladimir Davydov
On Tue, Nov 17, 2015 at 05:22:17PM -0500, Johannes Weiner wrote:
> On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> > AFAIK vmpressure was designed to allow userspace to tune hard limits of
> > cgroups in accordance with their demands, in which case the way
> > vmpressure notifications work makes sense.
> 
> You can still do that when the reporting happens on the reclaim level,
> it's easy to figure out where the pressure comes from once a group is
> struggling to reclaim its LRU pages.

Right, one only needs to check usage vs limit in the cgroup that
received a notification and in all its ancestors, but I doubt existing
applications do that.
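
For illustration, such a check is just a walk up the cgroupfs directory
tree comparing memory.usage_in_bytes against memory.limit_in_bytes at
each level. A minimal userspace sketch, assuming a cgroup v1 memory
controller mounted at /sys/fs/cgroup/memory and a hypothetical group
path:

#include <limits.h>
#include <stdio.h>
#include <string.h>

static long long read_ll(const char *dir, const char *file)
{
	char path[PATH_MAX];
	long long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, file);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	const char *root = "/sys/fs/cgroup/memory";
	/* hypothetical group that just received a vmpressure event */
	char dir[PATH_MAX] = "/sys/fs/cgroup/memory/workload/job0";

	for (;;) {
		long long usage = read_ll(dir, "memory.usage_in_bytes");
		long long limit = read_ll(dir, "memory.limit_in_bytes");

		if (usage >= 0 && limit > 0)
			printf("%s: %lld of %lld bytes (%.0f%%)\n",
			       dir, usage, limit, 100.0 * usage / limit);

		if (!strcmp(dir, root))
			break;
		*strrchr(dir, '/') = '\0';	/* step up to the parent group */
	}
	return 0;
}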

> 
> Reporting on the pressure level does nothing but destroy valuable
> information that would be useful in scenarios other than tuning a
> hierarchical memory limit.

Agree.

> 
> > > But you guys were wary about the patch that changed it, and this
> > 
> > Changing vmpressure semantics as you proposed in v1 would result in
> > userspace getting notifications even if the cgroup does not hit its limit.
> > Maybe it could be useful to someone (e.g. it could help tuning
> > memory.low), but I am pretty sure this would also result in breakages
> > for others.
> 
> Maybe. I'll look into a two-layer vmpressure recording/reporting model
> that would give us reclaim-level events internally while retaining
> pressure-level events for the existing userspace interface.

It would be great. I think vmpressure, as you propose to implement it,
could be useful even in the unified hierarchy for tuning memory.low and
memory.high.

> 
> > > series has kicked up enough dust already, so I backed it out.
> > > 
> > > But this will still be useful. Yes, it won't help in rebalancing a
> > > regularly working system, which would be cool, but it'll still help
> > > contain a workload that is growing beyond expectations, which is the
> > > scenario that kickstarted this work.
> > 
> > I haven't looked through all the previous patches in the series, but
> > AFAIU they should do the trick, no? Notifying sockets about vmpressure
> > is rather meant to protect a workload from itself.
> 
> No, the only critical thing is to protect the system from OOM
> conditions caused by what should be containerized processes.
> 
> That's a correctness issue.
> 
> How much we mitigate the consequences inside the container when the
> workload screws up is secondary. But even that is already much better
> in this series compared to memcg v1, while leaving us with all the
> freedom to continue improving this internal mitigation in the future.
> 
> > And with this patch it will work this way, but only if sum limits <
> > total ram, which is rather rare in practice. On tightly packed
> > systems it does nothing.
> 
> That's not true, it's still useful when things go south inside a
> cgroup, even with overcommitted limits. See above.

I meant solely this patch here, not the rest of the patch set. In the
overcommitted case there is no difference if we have the last patch or
not AFAIU.

> 
> We can optimize the continuous global pressure rebalancing later on;
> whether that'll be based on a modified vmpressure implementation, or
> adding reclaim efficiency to the shrinker API or whatever.
> 
> > That said, I don't think we should commit this particular patch. Neither
> > do I think socket accounting should be enabled by default in the unified
> > hierarchy for now, since the implementation is still incomplete. IMHO.
> 
> I don't see a technical basis for either of those suggestions.
> 

IMHO users switching to the unified hierarchy don't expect that
something gets broken in the default setup unless it's a bug. They
expect API changes, new functionality appearing, and some features being
dropped, but not breakages.

With this patch set, one gets socket accounting enabled by default,
which would be OK if it always worked right, at least in theory. But it
does not if the node is overcommitted - one might get unexpected local
OOMs due to growing socket buffers, which have never been seen in the
legacy hierarchy.

You say that it will help cope with global OOM, which is true, but it
looks like trading an old problem for a new one, which is unacceptable in
this particular case IMHO, because the legacy hierarchy has been used
for years and people are likely used to old problems such as the lack
of socket buffer accounting - they might already work around this
problem by tuning global tcp limits, for instance. After switching to the
unified hierarchy they'll get a new problem in the default setup, which
is no good IMHO.

I'm not against enabling socket buffer accounting by default, but only
once it is expected to work in 99% of cases, at least theoretically.

Why can't we apply all patches but the last one (they look OK at first
glance, but I need more time to review them carefully) and disable
socket accounting by default for now? Then you or someone else would
prepare a separate patch set introducing vmpressure propagation to
socket code, so that socket accounting could be enabled by default.


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-17 Thread Johannes Weiner
On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> AFAIK vmpressure was designed to allow userspace to tune hard limits of
> cgroups in accordance with their demands, in which case the way
> vmpressure notifications work makes sense.

You can still do that when the reporting happens on the reclaim level,
it's easy to figure out where the pressure comes from once a group is
struggling to reclaim its LRU pages.

Reporting on the pressure level does nothing but destroy valuable
information that would be useful in scenarios other than tuning a
hierarchical memory limit.

> > But you guys were wary about the patch that changed it, and this
> 
> Changing vmpressure semantics as you proposed in v1 would result in
> userspace getting notifications even if the cgroup does not hit its limit.
> Maybe it could be useful to someone (e.g. it could help tuning
> memory.low), but I am pretty sure this would also result in breakages
> for others.

Maybe. I'll look into a two-layer vmpressure recording/reporting model
that would give us reclaim-level events internally while retaining
pressure-level events for the existing userspace interface.

> > series has kicked up enough dust already, so I backed it out.
> > 
> > But this will still be useful. Yes, it won't help in rebalancing a
> > regularly working system, which would be cool, but it'll still help
> > contain a workload that is growing beyond expectations, which is the
> > scenario that kickstarted this work.
> 
> I haven't looked through all the previous patches in the series, but
> AFAIU they should do the trick, no? Notifying sockets about vmpressure
> is rather meant to protect a workload from itself.

No, the only critical thing is to protect the system from OOM
conditions caused by what should be containerized processes.

That's a correctness issue.

How much we mitigate the consequences inside the container when the
workload screws up is secondary. But even that is already much better
in this series compared to memcg v1, while leaving us with all the
freedom to continue improving this internal mitigation in the future.

> And with this patch it will work this way, but only if sum limits <
> total ram, which is rather rare in practice. On tightly packed
> systems it does nothing.

That's not true, it's still useful when things go south inside a
cgroup, even with overcommitted limits. See above.

We can optimize the continuous global pressure rebalancing later on;
whether that'll be based on a modified vmpressure implementation, or
adding reclaim efficiency to the shrinker API or whatever.

> That said, I don't think we should commit this particular patch. Neither
> do I think socket accounting should be enabled by default in the unified
> hierarchy for now, since the implementation is still incomplete. IMHO.

I don't see a technical basis for either of those suggestions.


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-17 Thread Vladimir Davydov
On Mon, Nov 16, 2015 at 01:53:16PM -0500, Johannes Weiner wrote:
> On Sun, Nov 15, 2015 at 04:54:57PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> > > Let the networking stack know when a memcg is under reclaim pressure
> > > so that it can clamp its transmit windows accordingly.
> > > 
> > > Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> > > enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> > > pressure state in the socket and tcp memory code that tells it to curb
> > > consumption growth from sockets associated with said control group.
> > > 
> > > vmpressure events are naturally edge triggered, so for hysteresis
> > > assert socket pressure for a second to allow for subsequent vmpressure
> > > events to occur before letting the socket code return to normal.
> > 
> > AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
> > which means socket_pressure will only be set when cgroup hits its
> > high/hard limit. On a tightly packed system, this is unlikely IMO -
> > cgroups will mostly experience pressure due to memory shortage at the
> > global level and/or their low limit configuration, in which case no
> > vmpressure events will be triggered and therefore tcp window won't be
> > clamped accordingly.
> 
> Yeah, this is an inherent problem in the vmpressure design and it
> makes the feature significantly less useful than it could be IMO.

AFAIK vmpressure was designed to allow userspace to tune hard limits of
cgroups in accordance with their demands, in which case the way
vmpressure notifications work makes sense.
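
For reference, that use case relies on the existing cgroup v1
notification interface: register an eventfd against
memory.pressure_level via cgroup.event_control and react when it fires,
e.g. by raising memory.limit_in_bytes. A rough userspace sketch (the
group path is hypothetical):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical cgroup v1 memcg */
	const char *grp = "/sys/fs/cgroup/memory/workload/job0";
	char path[256], line[64];
	int efd, pfd, cfd;
	uint64_t count;

	efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.pressure_level", grp);
	pfd = open(path, O_RDONLY);

	snprintf(path, sizeof(path), "%s/cgroup.event_control", grp);
	cfd = open(path, O_WRONLY);

	if (efd < 0 || pfd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}

	/* "<eventfd> <fd of memory.pressure_level> <level>" arms the event */
	snprintf(line, sizeof(line), "%d %d medium", efd, pfd);
	if (write(cfd, line, strlen(line)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	for (;;) {
		/* blocks until a medium (or higher) vmpressure event fires */
		if (read(efd, &count, sizeof(count)) != sizeof(count))
			break;
		printf("pressure event x%llu - consider raising the limit\n",
		       (unsigned long long)count);
	}
	return 0;
}

Reading the eventfd returns the number of events since the last read,
so a manager can batch its reaction instead of waking per event.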

> 
> But you guys were wary about the patch that changed it, and this

Changing vmpressure semantics as you proposed in v1 would result in
userspace getting notifications even if the cgroup does not hit its limit.
Maybe it could be useful to someone (e.g. it could help tuning
memory.low), but I am pretty sure this would also result in breakages
for others.

> series has kicked up enough dust already, so I backed it out.
> 
> But this will still be useful. Yes, it won't help in rebalancing a
> regularly working system, which would be cool, but it'll still help
> contain a workload that is growing beyond expectations, which is the
> scenario that kickstarted this work.

I haven't looked through all the previous patches in the series, but
AFAIU they should do the trick, no? Notifying sockets about vmpressure
is rather meant to protect a workload from itself. And with this patch
it will work this way, but only if sum limits < total ram, which is
rather rare in practice. On tightly packed systems it does nothing.

That said, I don't think we should commit this particular patch. Neither
do I think socket accounting should be enabled by default in the unified
hierarchy for now, since the implementation is still incomplete. IMHO.

Thanks,
Vladimir

> 
> > Maybe we could use a per-memcg slab shrinker to detect memory
> > pressure? This looks like abusing the shrinker API, though.
> 
> Actually, I thought about doing this long-term.
> 
> Shrinkers are a nice way to export VM pressure to auxiliary allocators
> and caches. But currently, the only metric we export is LRU scan rate,
> whose application is limited to ageable caches: it doesn't make sense
> to cause auxiliary workingsets to shrink when the VM is merely picking
> up the drop-behind pages of a one-off page cache stream. I think it
> would make sense for shrinkers to include reclaim efficiency so that
> they can be used by caches that don't have 'accessed' bits and object
> rotation, but are able to shrink based on the cost they're imposing.
> 
> But a change like this is beyond the scope of this series, IMO.
> 


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-16 Thread Johannes Weiner
On Sun, Nov 15, 2015 at 04:54:57PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> > Let the networking stack know when a memcg is under reclaim pressure
> > so that it can clamp its transmit windows accordingly.
> > 
> > Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> > enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> > pressure state in the socket and tcp memory code that tells it to curb
> > consumption growth from sockets associated with said control group.
> > 
> > vmpressure events are naturally edge triggered, so for hysteresis
> > assert socket pressure for a second to allow for subsequent vmpressure
> > events to occur before letting the socket code return to normal.
> 
> AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
> which means socket_pressure will only be set when cgroup hits its
> high/hard limit. On a tightly packed system, this is unlikely IMO -
> cgroups will mostly experience pressure due to memory shortage at the
> global level and/or their low limit configuration, in which case no
> vmpressure events will be triggered and therefore tcp window won't be
> clamped accordingly.

Yeah, this is an inherent problem in the vmpressure design and it
makes the feature significantly less useful than it could be IMO.

But you guys were wary about the patch that changed it, and this
series has kicked up enough dust already, so I backed it out.

But this will still be useful. Yes, it won't help in rebalancing a
regularly working system, which would be cool, but it'll still help
contain a workload that is growing beyond expectations, which is the
scenario that kickstarted this work.

> Maybe we could use a per-memcg slab shrinker to detect memory
> pressure? This looks like abusing the shrinker API, though.

Actually, I thought about doing this long-term.

Shrinkers are a nice way to export VM pressure to auxiliary allocators
and caches. But currently, the only metric we export is LRU scan rate,
whose application is limited to ageable caches: it doesn't make sense
to cause auxiliary workingsets to shrink when the VM is merely picking
up the drop-behind pages of a one-off page cache stream. I think it
would make sense for shrinkers to include reclaim efficiency so that
they can be used by caches that don't have 'accessed' bits and object
rotation, but are able to shrink based on the cost they're imposing.

But a change like this is beyond the scope of this series, IMO.
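
For context, the interface being discussed here is the existing shrinker
API, where the only input a cache receives is a scan count derived from
the LRU scan rate. A skeleton of a hypothetical cache hooking into it
looks roughly like the sketch below (the cache and its counter are made
up; the callbacks and register_shrinker() are the existing API). Adding
reclaim efficiency would mean passing more information through struct
shrink_control than just nr_to_scan.

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/shrinker.h>

/* made-up cache: an object counter standing in for real entries */
static atomic_long_t demo_cache_objects = ATOMIC_LONG_INIT(0);

static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	/* report how many objects could be freed right now */
	return atomic_long_read(&demo_cache_objects);
}

static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	/* sc->nr_to_scan is all the VM tells us: a share of the LRU scan rate */
	unsigned long nr = min_t(unsigned long, sc->nr_to_scan,
				 atomic_long_read(&demo_cache_objects));

	atomic_long_sub(nr, &demo_cache_objects);	/* "free" the objects */
	return nr;	/* number actually reclaimed */
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
	/* a per-memcg variant would additionally set SHRINKER_MEMCG_AWARE */
};

static int __init demo_init(void)
{
	return register_shrinker(&demo_shrinker);
}

static void __exit demo_exit(void)
{
	unregister_shrinker(&demo_shrinker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The point above is that scan_objects() only ever learns how much to
scan, never whether reclaim elsewhere is actually succeeding.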


Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-15 Thread Vladimir Davydov
On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> Let the networking stack know when a memcg is under reclaim pressure
> so that it can clamp its transmit windows accordingly.
> 
> Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> pressure state in the socket and tcp memory code that tells it to curb
> consumption growth from sockets associated with said control group.
> 
> vmpressure events are naturally edge triggered, so for hysteresis
> assert socket pressure for a second to allow for subsequent vmpressure
> events to occur before letting the socket code return to normal.

AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
which means socket_pressure will only be set when cgroup hits its
high/hard limit. On a tightly packed system, this is unlikely IMO -
cgroups will mostly experience pressure due to memory shortage at the
global level and/or their low limit configuration, in which case no
vmpressure events will be triggered and therefore tcp window won't be
clamped accordingly.

Maybe we could use a per-memcg slab shrinker to detect memory
pressure? This looks like abusing the shrinker API, though.

Thanks,
Vladimir


[PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-12 Thread Johannes Weiner
Let the networking stack know when a memcg is under reclaim pressure
so that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low
enough for a MEDIUM or HIGH vmpressure event to occur, assert a
pressure state in the socket and tcp memory code that tells it to curb
consumption growth from sockets associated with said control group.

vmpressure events are naturally edge triggered, so for hysteresis
assert socket pressure for a second to allow for subsequent vmpressure
events to occur before letting the socket code return to normal.

This will likely need fine-tuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 29 +
 mm/memcontrol.c| 15 +--
 mm/vmpressure.c| 25 -
 3 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 809d6de..dba43cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -258,6 +258,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_INET
struct work_struct  socket_work;
+   unsigned long   socket_pressure;
 #endif
 
/* List of events which userspace want to receive */
@@ -303,18 +304,34 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+#define mem_cgroup_from_counter(counter, member)   \
+   container_of(counter, struct mem_cgroup, member)
+
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
   struct mem_cgroup *,
   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+/**
+ * parent_mem_cgroup - find the accounting parent of a memcg
+ * @memcg: memcg whose parent to find
+ *
+ * Returns the parent memcg, or NULL if this is the root or the memory
+ * controller is in legacy no-hierarchy mode.
+ */
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+   if (!memcg->memory.parent)
+   return NULL;
+   return mem_cgroup_from_counter(memcg->memory.parent, memory);
+}
+
 static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
  struct mem_cgroup *root)
 {
@@ -706,10 +723,14 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 #ifdef CONFIG_MEMCG_KMEM
-   return memcg->tcp_mem.memory_pressure;
-#else
-   return false;
+   if (memcg->tcp_mem.memory_pressure)
+   return true;
 #endif
+   do {
+   if (time_before(jiffies, memcg->socket_pressure))
+   return true;
+   } while ((memcg = parent_mem_cgroup(memcg)));
+   return false;
 }
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cad9525..4068662 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1091,9 +1091,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
return ret;
 }
 
-#define mem_cgroup_from_counter(counter, member)   \
-   container_of(counter, struct mem_cgroup, member)
-
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
@@ -4138,17 +4135,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
kfree(memcg);
 }
 
-/*
- * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
- */
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
-{
-   if (!memcg->memory.parent)
-   return NULL;
-   return mem_cgroup_from_counter(memcg->memory.parent, memory);
-}
-EXPORT_SYMBOL(parent_mem_cgroup);
-
 static void socket_work_func(struct work_struct *work);
 
 static struct cgroup_subsys_state * __ref
@@ -4192,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #endif
 #ifdef CONFIG_INET
INIT_WORK(&memcg->socket_work, socket_work_func);
+   memcg->socket_pressure = jiffies;
 #endif
return &memcg->css;
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 4c25e62..07e8440 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,14 +137,11 @@ struct vmpressure_event {
 };
 
 static bool vmpressure_event(struct vmpressure *vmpr,
-unsigned long scanned, unsigned long reclaimed)
+   
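
[The mm/vmpressure.c hunk is cut off by the archive at this point. Going
only by the changelog above and the socket_pressure field added to
struct mem_cgroup, the asserted behaviour amounts to something like the
sketch below - a reconstruction for readability, not the actual missing
hunk; the helper name is made up.]

static void vmpressure_assert_socket_pressure(struct mem_cgroup *memcg,
					      enum vmpressure_levels level)
{
	/*
	 * Reclaim efficiency has dropped far enough for a MEDIUM or
	 * higher event; mem_cgroup_under_socket_pressure() compares
	 * this timestamp against jiffies, so the memcg's sockets stay
	 * throttled for roughly one second of hysteresis unless a new
	 * event pushes the deadline out again.
	 */
	if (level >= VMPRESSURE_MEDIUM)
		memcg->socket_pressure = jiffies + HZ;
}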
