Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-03-02 Thread Tejun Heo
On Tue, Mar 03, 2015 at 12:31:19AM +1100, Aleksa Sarai wrote:
> > If 16-bit PID's aren't a concern anymore, then why do we still default to
> > treating it like a 16-bit signed int (the default for
> > /proc/sys/kernel/pid_max is 32768)?
> 
> I just want to emphasise that *even if* we changed to another default
> limit, the mere existence of a system-wide pid_max makes PIDs a
> resource.

We seem to fail to communicate.  The primary reason why pid promotes
itself to a global resource status is because it's globally capped way
below its backing resource's (kernel memory) limit and it is very
difficult to make it not so due to direct userland dependencies on it.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-03-02 Thread Tejun Heo
On Mon, Mar 02, 2015 at 08:13:23AM -0500, Austin S Hemmelgarn wrote:
> If 16-bit PID's aren't a concern anymore, then why do we still default to
> treating it like a 16-bit signed int (the default for
> /proc/sys/kernel/pid_max is 32768)?

Inertia.  It has to start there for backward compatibility.  Now it's
trivial to adjust dynamically and the majority of users don't need to
worry about it, so there's no pressing reason to bump it up by
default.
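
A minimal sketch of the check being discussed, assuming a Linux system where /proc/sys/kernel/pid_max is readable. The helper names are hypothetical. Per proc(5), the value is the point at which PIDs wrap around, so the largest PID handed out is pid_max - 1, which is why the 32768 default still fits in a signed 16-bit integer:

```python
from pathlib import Path
from typing import Optional

PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")
INT16_MAX = 2**15 - 1  # 32767, largest value of a signed 16-bit integer


def read_pid_max(path: Path = PID_MAX_PATH) -> Optional[int]:
    """Return the system-wide PID limit, or None if the file is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def pids_fit_in_int16(pid_max: int) -> bool:
    """pid_max is exclusive: the largest PID handed out is pid_max - 1."""
    return pid_max - 1 <= INT16_MAX


if __name__ == "__main__":
    limit = read_pid_max()
    if limit is None:
        print("pid_max is not readable on this system")
    else:
        print(f"pid_max={limit}, fits in int16: {pids_fit_in_int16(limit)}")
```

Raising the limit is a write to the same file (or `sysctl kernel.pid_max`) and requires root.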

16bit pid_t was already a dying breed on 32bit config and it never was
an option on 64bit.  Any remotely modern distros in the past decade,
whether 32 or 64bit, wouldn't have any problem with it.  The only
possibly problematic case would be legacy code which for some reason
explicitly used 16bit integer types instead of pid_t, but at this
point, we shouldn't be basing any design decisions on that.  If
anybody is still depending on that, there are different ways to deal
with the issue on their end including namespacing its pid space.
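
To illustrate the legacy failure mode mentioned above, here is a sketch (not taken from any real codebase) of what happens when a PID is stored in a signed 16-bit field once pid_max is raised past 32768. ctypes.c_int16 does no overflow checking, so the value wraps the way a C short typically would; store_as_int16 is a hypothetical name:

```python
import ctypes


def store_as_int16(pid: int) -> int:
    """Simulate legacy code that keeps a PID in a signed 16-bit field."""
    return ctypes.c_int16(pid).value


for pid in (32767, 32768, 40000):
    print(pid, "->", store_as_int16(pid))
# 32767 -> 32767
# 32768 -> -32768
# 40000 -> -25536
```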

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-03-02 Thread Aleksa Sarai
> If 16-bit PID's aren't a concern anymore, then why do we still default to
> treating it like a 16-bit signed int (the default for
> /proc/sys/kernel/pid_max is 32768)?

I just want to emphasise that *even if* we changed to another default
limit, the mere existence of a system-wide pid_max makes PIDs a
resource.

--
Aleksa Sarai (cyphar)
www.cyphar.com


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-03-02 Thread Austin S Hemmelgarn

On 2015-02-28 11:43, Tejun Heo wrote:
> Hello, Tim.
>
> On Sat, Feb 28, 2015 at 08:38:07AM -0800, Tim Hockin wrote:
> > I know there is not much concern for legacy-system problems, but it is
> > worth adding this case - there are systems that limit PIDs for other
> > reasons, eg broken infrastructure that assumes PIDs fit in a short int,
> > hypothetically.  Given such a system, PIDs become precious and limiting
> > them per job is important.
> >
> > My main point being that there are less obvious considerations in play than
> > just memory usage.
>
> Sure, there are those cases but it'd be unwise to hinge long term
> decisions on them.  It's hard to even argue 16bit pid in legacy code
> as a significant contributing factor at this point.  At any rate, it
> seems that pid is a global resource which needs to be provisioned for
> reasonable isolation which is a good reason to consider controlling it
> via cgroups.

If 16-bit PID's aren't a concern anymore, then why do we still default
to treating it like a 16-bit signed int (the default for
/proc/sys/kernel/pid_max is 32768)?




Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tim Hockin
On Feb 28, 2015 2:50 PM, "Tejun Heo"  wrote:
>
> On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> > Wow, so much anger.  I'm not even sure how to respond, so I'll just
> > say this and sign off.  All I want is a better, friendlier, more
> > useful system overall.  We clearly have different ways of looking at
> > the problem.
>
> Can you communicate anything w/o passive aggression?  If you have a
> technical point, just state that.  Can you at least agree that we
> shouldn't be making design decisions based on 16bit pid_t?

Hmm, I have screwed this thread up, I think.  I've made some remarks
that did not come through with the proper tongue-in-cheek slant.  I'm
not being passive aggressive - we DO look at this problem differently.
OF COURSE we should not make decisions based on ancient artifacts of
history.  My point was that there are secondary considerations here -
PIDs are more than just the memory that backs them.  They _ARE_ a
constrained resource, and you shouldn't assume the constraint is just
physical memory.  It is a piece of policy that is outside the control
of the kernel proper - we handed those keys to userspace a long time
ago.

Given that, I believe and have believed that the solution should model
the problem as the user perceives it - limiting PIDs - rather than
attaching to a solution-by-proxy.

Yes a solution here partially overlaps with kmemcg, but I don't think
that is a significant problem.  They are different policies governing
behavior that may result in the same condition, but for very different
reasons.  I do not think that is particularly bad for overall
comprehension, and I think the fact that this popped up yet again
indicates the existence of some nugget of user experience that is
worth paying consideration to.

I appreciate your promised consideration through a slightly refocused
lens.  I will go back to my cave and do something I hope is more
productive and less antagonistic.  I did not mean to bring out so much
vitriol.

Tim


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Johannes Weiner
On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo  wrote:
> >
> > On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> > > I am sorry that real-user problems are not perceived as substantial.  This
> > > was/is a real issue for us.  Being in limbo for years on end might not be a
> > > technical point, but I do think it matters, and that was my point.
> >
> > It's a problem which is localized to you and caused by the specific
> > problems of your setup.  This isn't a wide-spread problem at all and
> > the world doesn't revolve around you.  If your setup is so messed up
> > as to require sticking to 16bit pids, handle that locally.  If
> > something at larger scale eases that handling, you get lucky.  If not,
> > it's *your* predicament to deal with.  The rest of the world doesn't
> > exist to wipe your ass.
> 
> Wow, so much anger.

Yeah, quite surprising after such an intellectually honest discussion:

: On Fri, Feb 27, 2015 at 01:45:09PM -0800, Tim Hockin wrote:
: > At least 3 or 4 people have INDEPENDENTLY decided this is what is
: > causing them pain and tried to fix it and invested the time to send a
: > patch says that it is actually a thing.  There exists a problem that
: > you are disallowing to be fixed.  Do you recognize that users are
: > experiencing pain?  Why do you hate your users? :)

[...]

: > Are you willing to put a drop-dead date on it?  If we don't have
: > kmemcg working well enough to _actually_ bound PID usage and FD usage
: > by, say, June 1st, will you then accept a patch to this effect?  If
: > the answer is no, then I have zero faith that it's coming any time
: > soon - I heard this 2 years ago.  I believed you then.

> I'm not even sure how to respond, so I'll just say this and sign
> off.  All I want is a better, friendlier, more useful system
> overall.  We clearly have different ways of looking at the problem.

Overlapping features and inconsistent userspace interfaces are only
better for the people that pick the hacks.  They are the opposite of
friendly and useful.  They are also horrible to maintain, which could
be a reason why you constantly disagree with the people that cleaned
up this unholy mess and are now trying to keep a balance between your
short term interests and the long-term health of the Linux kernel.


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tejun Heo
On Sat, Feb 28, 2015 at 02:26:58PM -0800, Tim Hockin wrote:
> Wow, so much anger.  I'm not even sure how to respond, so I'll just
> say this and sign off.  All I want is a better, friendlier, more
> useful system overall.  We clearly have different ways of looking at
> the problem.

Can you communicate anything w/o passive aggression?  If you have a
technical point, just state that.  Can you at least agree that we
shouldn't be making design decisions based on 16bit pid_t?

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tim Hockin
On Sat, Feb 28, 2015 at 8:57 AM, Tejun Heo  wrote:
>
> On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> > I am sorry that real-user problems are not perceived as substantial.  This
> > was/is a real issue for us.  Being in limbo for years on end might not be a
> > technical point, but I do think it matters, and that was my point.
>
> It's a problem which is localized to you and caused by the specific
> problems of your setup.  This isn't a wide-spread problem at all and
> the world doesn't revolve around you.  If your setup is so messed up
> as to require sticking to 16bit pids, handle that locally.  If
> something at larger scale eases that handling, you get lucky.  If not,
> it's *your* predicament to deal with.  The rest of the world doesn't
> exist to wipe your ass.

Wow, so much anger.  I'm not even sure how to respond, so I'll just
say this and sign off.  All I want is a better, friendlier, more
useful system overall.  We clearly have different ways of looking at
the problem.

No antagonism intended

Tim


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tejun Heo
On Sat, Feb 28, 2015 at 08:48:12AM -0800, Tim Hockin wrote:
> I am sorry that real-user problems are not perceived as substantial.  This
> was/is a real issue for us.  Being in limbo for years on end might not be a
> technical point, but I do think it matters, and that was my point.

It's a problem which is localized to you and caused by the specific
problems of your setup.  This isn't a wide-spread problem at all and
the world doesn't revolve around you.  If your setup is so messed up
as to require sticking to 16bit pids, handle that locally.  If
something at larger scale eases that handling, you get lucky.  If not,
it's *your* predicament to deal with.  The rest of the world doesn't
exist to wipe your ass.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tejun Heo
Hello, Tim.

On Sat, Feb 28, 2015 at 08:38:07AM -0800, Tim Hockin wrote:
> I know there is not much concern for legacy-system problems, but it is
> worth adding this case - there are systems that limit PIDs for other
> reasons, eg broken infrastructure that assumes PIDs fit in a short int,
> hypothetically.  Given such a system, PIDs become precious and limiting
> them per job is important.
>
> My main point being that there are less obvious considerations in play than
> just memory usage.

Sure, there are those cases but it'd be unwise to hinge long term
decisions on them.  It's hard to even argue 16bit pid in legacy code
as a significant contributing factor at this point.  At any rate, it
seems that pid is a global resource which needs to be provisioned for
reasonable isolation which is a good reason to consider controlling it
via cgroups.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Tejun Heo
Hello, Aleksa.

On Sat, Feb 28, 2015 at 08:26:34PM +1100, Aleksa Sarai wrote:
> I just want to quickly echo my support for this statement. Process IDs
> aren't limited by kernel memory, they're a hard-set limit. Thus they are

Process IDs become a hard global resource because we didn't switch to
long during 64bit transition and put an artificial global limit on it,
which allows it to affect system-wide operation while its memory
consumption is staying within practical range.

> a resource like other global resources (open files, etc). Now, while you

Unlike open files.

> can argue that it is possible to limit the amount of *effective*
> processes you can use in a cgroup through kmemcg (by limiting the amount
> of memory spent in storing task_struct data) -- that isn't limiting the
> usage of the *actual* resource (the fact you're limiting the number of
> PIDs is little more than a by-product).

No, the problem is not that.  The problem is that pid_t, as a
resource, is decoupled from its backing resource - memory - by the
extra artificial and difficult-to-overcome limit put on it.  You are
saying something which is completely different from what Austin was
arguing.

> Also, If it wasn't an actual resource then why is RLIMIT_NPROC a thing?

One strong reason would be because we didn't have a way to account for
and limit the fundamental resources.  If you can fully contain and
control the consumption via rationing the underlying resource, there
isn't much point in controlling the upper layer constructs.

> To me, that indicates that PID limiting is not an esoteric usecase and it
> should be possible to use the Linux kernel's home-grown accounting
> system to limit the number of PIDs in a cgroup. Otherwise you're stuck

Again, I think it's a lot more indicative of the fact that we didn't
have any way to control kernel side memory consumption and pids and
open files were one of the things which are relatively easy to
implement policy-wise.

> in a weird world where you *can* limit the number of processes in a
> process tree but *not* the number of processes in a cgroup.

I'm not sold on the idea of replicating the features of ulimit in
cgroups.  ulimit is a mixed bag of relatively easily implementable
resource limits and their behaviors are a combination of resource
limits, per-user usage policies, and per-process behavior safetynets.
The only part translatable to cgroups is actual resource related part
and even among those we should identify what are actual resources
which can't be mapped to consumption of other fundamental resources.

> >> In general, I'm pretty strongly against adding controllers for things
> >> which aren't fundamental resources in the system.  What's next?  Open
> >> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> >> program groups?
> >>
> > PID's are a fundamental resource, you run out and it's an only marginally
> > better situation than OOM, namely, if you don't already have a shell open
> > which has kill builtin (because you can't fork), or have some other reliable
> > way to terminate processes without forking, you are stuck either waiting for
> > the problem to resolve itself, or have to reset the system.
> 
> I couldn't agree more. PIDs are a fundamental resource because there is
> a hard limit on the amount of PIDs you can have in any one system. Once
> you've exhausted that limit, there's not much you can do apart from
> doing the SYSRQ dance.

The reason why this holds is because we can hit the global limit way
earlier than practically sized kmem consumption limits can kick in.
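
A rough sketch of that comparison. The 16 KiB per-task kernel memory figure is an assumption chosen purely for illustration (the real cost depends on kernel configuration and workload), and the function names are hypothetical:

```python
PER_TASK_KMEM = 16 * 1024  # ASSUMPTION: ~16 KiB of kernel memory per task
GIB = 2**30


def tasks_allowed_by_kmem(kmem_limit_bytes: int, per_task: int = PER_TASK_KMEM) -> int:
    """How many tasks fit before a kernel-memory limit is reached."""
    return kmem_limit_bytes // per_task


def binding_limit(pid_max: int, kmem_limit_bytes: int, per_task: int = PER_TASK_KMEM) -> str:
    """Name the limit that is hit first as the task count grows."""
    by_pids = pid_max - 1  # the largest PID handed out is pid_max - 1
    by_kmem = tasks_allowed_by_kmem(kmem_limit_bytes, per_task)
    return "pid_max" if by_pids <= by_kmem else "kmem"


print(binding_limit(32768, 4 * GIB))      # pid_max
print(binding_limit(4 * 2**20, 4 * GIB))  # kmem
```

Under these assumed numbers, the default pid_max is exhausted long before a 4 GiB kernel-memory limit would trigger, while a 4M pid_max would let the memory limit bind first.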

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-28 Thread Aleksa Sarai
> I wouldn't think that preventing PID exhaustion would be all that much of a
> niche case, it's fully possible for it to happen without using excessive
> amounts of kernel memory (think about BIG server systems with terabytes of
> memory running (arguably poorly written) forking servers that handle tens of
> thousands of client requests per second, each lasting multiple tens of
> seconds), and not necessarily as trivial as you might think to handle sanely
> (especially if you want callbacks when the limits get hit).
> As far as being trivial to achieve, I'm assuming you are referring to rlimit
> and PAM's limits module, both of which have their own issues. Using
> pam_limits.so to limit processes isn't trivial because it requires calling
> through PAM to begin with, which almost no software that isn't login related
> does, and rlimits are tricky to set up properly with the granularity that
> having a cgroup would provide.

I just want to quickly echo my support for this statement. Process IDs
aren't limited by kernel memory, they're a hard-set limit. Thus they are
a resource like other global resources (open files, etc). Now, while you
can argue that it is possible to limit the amount of *effective*
processes you can use in a cgroup through kmemcg (by limiting the amount
of memory spent in storing task_struct data) -- that isn't limiting the
usage of the *actual* resource (the fact you're limiting the number of
PIDs is little more than a by-product).

Also, If it wasn't an actual resource then why is RLIMIT_NPROC a thing?
To me, that indicates that PID limiting is not an esoteric usecase and it
should be possible to use the Linux kernel's home-grown accounting
system to limit the number of PIDs in a cgroup. Otherwise you're stuck
in a weird world where you *can* limit the number of processes in a
process tree but *not* the number of processes in a cgroup.
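
For reference, a sketch of inspecting the existing limit with Python's standard resource module, assuming a Unix system that defines RLIMIT_NPROC. Per getrlimit(2), RLIMIT_NPROC counts processes belonging to the caller's real user ID, so it does not follow cgroup membership. The helper names are hypothetical:

```python
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC pair for the calling process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe(value: int) -> str:
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"RLIMIT_NPROC soft={describe(soft)} hard={describe(hard)}")
```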

>> In general, I'm pretty strongly against adding controllers for things
>> which aren't fundamental resources in the system.  What's next?  Open
>> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
>> program groups?
>>
> PID's are a fundamental resource, you run out and it's an only marginally
> better situation than OOM, namely, if you don't already have a shell open
> which has kill builtin (because you can't fork), or have some other reliable
> way to terminate processes without forking, you are stuck either waiting for
> the problem to resolve itself, or have to reset the system.

I couldn't agree more. PIDs are a fundamental resource because there is
a hard limit on the amount of PIDs you can have in any one system. Once
you've exhausted that limit, there's not much you can do apart from
doing the SYSRQ dance.
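
The "terminate without forking" escape hatch can be sketched as follows: an already-running process calls kill(2) directly, the same way a shell's builtin kill does, so no new PID is needed. This assumes a POSIX system; can_signal and terminate are hypothetical helper names:

```python
import os
import signal


def can_signal(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, only checks run."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def terminate(pid: int) -> None:
    """Send SIGTERM from this process; no fork or exec is involved."""
    os.kill(pid, signal.SIGTERM)


if __name__ == "__main__":
    print(can_signal(os.getpid()))  # True
```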

-- 
Aleksa Sarai (cyphar)
www.cyphar.com


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
On Fri, Feb 27, 2015 at 01:45:09PM -0800, Tim Hockin wrote:
> Are you willing to put a drop-dead date on it?  If we don't have
> kmemcg working well enough to _actually_ bound PID usage and FD usage
> by, say, June 1st, will you then accept a patch to this effect?  If
> the answer is no, then I have zero faith that it's coming any time
> soon - I heard this 2 years ago.  I believed you then.

Tim, cut this bullshit.  That's not how kernel development works.
Contribute to technical discussion or shut it.  I'm really getting
tired of your whining without any useful substance.

> I see further downthread that you said you'll think about it.  Thank
> you.  Just because our use cases are not normal does not mean we're
> not valid :)

And can you even see why that made progress?

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tim Hockin
On Fri, Feb 27, 2015 at 9:45 AM, Tejun Heo  wrote:
> On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote:
>> > In general, I'm pretty strongly against adding controllers for things
>> > which aren't fundamental resources in the system.  What's next?  Open
>> > files?  Pipe buffer?  Number of flocks?  Number of session leaders or
>> > program groups?
>>
>> Yes to some or all of those.  We do exactly this internally and it has
>> greatly added to the stability of our overall container management
>> system.  And while you have been telling everyone to wait for kmemcg,
>> we have had an extra 3+ years of stability.
>
> Yeah, good job.  I totally get why kernel part of memory consumption
> needs protection.  I'm not arguing against that at all.

You keep shifting the focus to be about memory, but that's not what
people are asking for.  You're letting the desire for a perfect
solution (which is years late) block good solutions that exist NOW.

>> > If you want to prevent a certain class of jobs from exhausting a given
>> > resource, protecting that resource is the obvious thing to do.
>>
>> I don't follow your argument - isn't this exactly what this patch set
>> is doing - protecting resources?
>
> If you have proper protection over kernel memory consumption, this is
> completely covered because memory is the fundamental resource here.
> Controlling distribution of those fundamental resources is what
> cgroups are primarily about.

You say that's what cgroups are about, but it's not at all obvious
that you are right.  What users, admins, systems people want is
building blocks that are usable and make sense.  Limiting kernel
memory is NOT the logical building block, here.  It's not something
people can reason about or quantify easily.  If you need to implement
the interfaces in terms of memory, go nuts, but making users think
like that is just not right.

>> > Wasn't it like a year ago?  Yeah, it's taking longer than everybody
>> > hoped but seriously kmemcg reclaimer just got merged and also did the
>> > new memcg interface which will tie kmemcg and memcg together.
>>
>> By my email it was almost 2 years ago, and that was the second or
>> third incarnation of this patch.
>
> Again, I agree this is taking a while.  Memory people had to retool
> the whole reclamation path to make this work, which is the pattern
> being repeated across the different controllers - we're refactoring a
> lot of infrastructure code so that resource control can integrate with
> the regular operation of the kernel, which BTW is what we should have
> been doing from the beginning.
>
> If your complaint is that this is taking too long, I hear you, and
> there's a certain amount of validity in arguing that upstreaming a
> temporary measure is the better trade-off, but the rationale for nproc
> (or nfds, or virtual memory, whatever) has been pretty weak otherwise.

The fact that at least 3 or 4 people have INDEPENDENTLY decided this is
what is causing them pain, tried to fix it, and invested the time to send
a patch says that it is actually a thing.  There exists a problem that
you are disallowing to be fixed.  Do you recognize that users are
experiencing pain?  Why do you hate your users? :)

> And as for the different incarnations of this patchset.  Reposting the
> same stuff repeatedly doesn't really change anything.  Why would it?

Because reasonable people might survey the ecosystem and say "hmm,
things have changed over the years - isolation has become a pretty
serious topic".  Or maybe they hope that you'll finally agree that
fixing the problem NOW is worthwhile, even if the solution is
imperfect, and that a more perfect solution will arrive.

>> >> Something like this is long overdue, IMO, and is still more
>> >> appropriate and obvious than kmemcg anyway.
>> >
>> > Thanks for chiming in again but if you aren't bringing out anything
>> > new to the table (I don't remember you doing that last time either),
>> > I'm not sure why the decision would be different this time.
>>
>> I'm just vocalizing my support for this idea in defense of practical
>> solutions that work NOW instead of "engineering ideals" that never
>> actually arrive.
>>
>> As containers take the server world by storm, stuff like this gets
>> more and more important.
>
> Again, protection of kernel side memory consumption is important.
> There's no question about that.  As for the never-arriving part, well,
> it is arriving.  If you still can't believe, just take a look at the
> code.

Are you willing to put a drop-dead date on it?  If we don't have
kmemcg working well enough to _actually_ bound PID usage and FD usage
by, say, June 1st, will you then accept a patch to this effect?  If
the answer is no, then I have zero faith that it's coming any time
soon - I heard this 2 years ago.  I believed you then.

I see further downthread that you said you'll think about it.  Thank
you.  Just because our use cases are not normal does not mean we're
not valid :)

Tim

Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
Hello, Austin.

On Fri, Feb 27, 2015 at 01:49:53PM -0500, Austin S Hemmelgarn wrote:
> As far as being trivial to achieve, I'm assuming you are referring to rlimit
> and PAM's limits module, both of which have their own issues. Using
> pam_limits.so to limit processes isn't trivial because it requires calling
> through PAM to begin with, which almost no software that isn't login related
> does, and rlimits are tricky to set up properly with the granularity that
> having a cgroup would provide.
...
> PID's are a fundamental resource, you run out and it's an only marginally
> better situation than OOM, namely, if you don't already have a shell open
> which has kill builtin (because you can't fork), or have some other reliable
> way to terminate processes without forking, you are stuck either waiting for
> the problem to resolve itself, or have to reset the system.

Right, this is a much more valid argument.  Currently, we're capping
max pid at 4M which translates to some tens of gigs of memory which
isn't a crazy amount on modern machines.  The hard(er) barrier would
be around 2^30 (2^29 from futex side, apparently) which would also be
reachable on configurations w/ terabytes of memory.
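
A back-of-the-envelope version of those figures, assuming roughly 16 KiB of kernel memory per task. That per-task number is an assumption for illustration, not something stated in this thread, which only says "some tens of gigs"; the function names are hypothetical:

```python
PER_TASK_BYTES = 16 * 1024  # ASSUMPTION: ~16 KiB of kernel memory per task


def kmem_for_pids(n_pids: int, per_task: int = PER_TASK_BYTES) -> int:
    """Kernel memory in bytes needed to back n_pids live tasks."""
    return n_pids * per_task


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


for label, n_pids in (("4M cap", 4 * 2**20), ("2^29", 2**29), ("2^30", 2**30)):
    print(f"{label}: {gib(kmem_for_pids(n_pids)):.0f} GiB")
# 4M cap: 64 GiB
# 2^29: 8192 GiB
# 2^30: 16384 GiB
```

With this assumed per-task cost, the 4M cap lands in the tens of GiB and the 2^29 to 2^30 range lands in the terabytes, which matches the orders of magnitude quoted above.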

I'll think more about it and get back.

Thanks a lot.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Austin S Hemmelgarn

On 2015-02-27 12:06, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
> > Kernel memory consumption isn't the only valid reason to want to limit the
> > number of processes in a cgroup.  Limiting the number of processes is very
> > useful to ensure that a program is working correctly (for example, the NTP
> > daemon should (usually) have an _exact_ number of children if it is
> > functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
> > children), to prevent PID number exhaustion, to head off DoS attacks against
> > forking network servers before they get to the point of causing kmem
> > exhaustion, and to limit the number of processes in a cgroup that uses lots
> > of kernel memory very infrequently.
>
> All the use cases you're listing are extremely niche and can be
> trivially achieved without introducing another cgroup controller.  Not
> only that, they're actually pretty silly.  Let's say NTP daemon is
> misbehaving (or its code changed w/o you knowing or there are corner
> cases which trigger extremely infrequently).  What do you exactly
> achieve by rejecting its fork call?  It's just adding another
> variation to the misbehavior.  It was misbehaving before and would now
> be continuing to misbehave after a failed fork.

I wouldn't think that preventing PID exhaustion would be all that much
of a niche case, it's fully possible for it to happen without using
excessive amounts of kernel memory (think about BIG server systems with
terabytes of memory running (arguably poorly written) forking servers
that handle tens of thousands of client requests per second, each
lasting multiple tens of seconds), and not necessarily as trivial as you
might think to handle sanely (especially if you want callbacks when the
limits get hit).
As far as being trivial to achieve, I'm assuming you are referring to
rlimit and PAM's limits module, both of which have their own issues.
Using pam_limits.so to limit processes isn't trivial because it requires
calling through PAM to begin with, which almost no software that isn't
login related does, and rlimits are tricky to set up properly with the
granularity that having a cgroup would provide.

> In general, I'm pretty strongly against adding controllers for things
> which aren't fundamental resources in the system.  What's next?  Open
> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> program groups?

PID's are a fundamental resource, you run out and it's an only
marginally better situation than OOM, namely, if you don't already have
a shell open which has kill builtin (because you can't fork), or have
some other reliable way to terminate processes without forking, you are
stuck either waiting for the problem to resolve itself, or have to reset
the system.

> If you want to prevent a certain class of jobs from exhausting a given
> resource, protecting that resource is the obvious thing to do.

Which is why I'm advocating something that provides a more robust method
of preventing the system from exhausting PID numbers.

> Thanks.



Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
On Fri, Feb 27, 2015 at 12:45:03PM -0500, Tejun Heo wrote:
> If your complaint is that this is taking too long, I hear you, and
> there's a certain amount of validity in arguing that upstreaming a
> temporary measure is the better trade-off, but the rationale for nproc
> (or nfds, or virtual memory, whatever) has been pretty weak otherwise.

Also, note that this is a subset of a larger problem.  e.g. there's a
patchset trying to implement writeback IO control from the filesystem
layer.  cgroup control of writeback has been a thorny issue for over
three years now and the rationale for implementing this reversed
controlling scheme is about the same - doing it properly is too
difficult, let's bolt something on the top as a practical measure.

I think it'd be seriously short-sighted to give in and merge all
those.  These sorts of shortcuts are crippling in the long term.
Again, similarly, proper cgroup writeback support is literally right
around the corner.

The situation sure can be frustrating if you need something now but we
can't make decisions solely on that.  This is a much longer-term
project and we'd better, for once, get things right.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
On Fri, Feb 27, 2015 at 09:25:10AM -0800, Tim Hockin wrote:
> > In general, I'm pretty strongly against adding controllers for things
> > which aren't fundamental resources in the system.  What's next?  Open
> > files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> > program groups?
> 
> Yes to some or all of those.  We do exactly this internally and it has
> greatly added to the stability of our overall container management
> system.  And while you have been telling everyone to wait for kmemcg,
> we have had an extra 3+ years of stability.

Yeah, good job.  I totally get why kernel part of memory consumption
needs protection.  I'm not arguing against that at all.

> > If you want to prevent a certain class of jobs from exhausting a given
> > resource, protecting that resource is the obvious thing to do.
> 
> I don't follow your argument - isn't this exactly what this patch set
> is doing - protecting resources?

If you have proper protection over kernel memory consumption, this is
completely covered because memory is the fundamental resource here.
Controlling distribution of those fundamental resources is what
cgroups are primarily about.

> > Wasn't it like a year ago?  Yeah, it's taking longer than everybody
> > hoped but seriously kmemcg reclaimer just got merged and so did the
> > new memcg interface which will tie kmemcg and memcg together.
> 
> By my email it was almost 2 years ago, and that was the second or
> third incarnation of this patch.

Again, I agree this is taking a while.  Memory people had to retool
the whole reclamation path to make this work, which is the pattern
being repeated across the different controllers - we're refactoring a
lot of infrastructure code so that resource control can integrate with
the regular operation of the kernel, which BTW is what we should have
been doing from the beginning.

If your complaint is that this is taking too long, I hear you, and
there's a certain amount of validity in arguing that upstreaming a
temporary measure is the better trade-off, but the rationale for nproc
(or nfds, or virtual memory, whatever) has been pretty weak otherwise.

And as for the different incarnations of this patchset.  Reposting the
same stuff repeatedly doesn't really change anything.  Why would it?

> >> Something like this is long overdue, IMO, and is still more
> >> appropriate and obvious than kmemcg anyway.
> >
> > Thanks for chiming in again but if you aren't bringing anything
> > new to the table (I don't remember you doing that last time either),
> > I'm not sure why the decision would be different this time.
> 
> I'm just vocalizing my support for this idea in defense of practical
> solutions that work NOW instead of "engineering ideals" that never
> actually arrive.
> 
> As containers take the server world by storm, stuff like this gets
> more and more important.

Again, protection of kernel side memory consumption is important.
There's no question about that.  As for the never-arriving part, well,
it is arriving.  If you still can't believe it, just take a look at the
code.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tim Hockin
On Fri, Feb 27, 2015 at 9:06 AM, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
>> Kernel memory consumption isn't the only valid reason to want to limit the
>> number of processes in a cgroup.  Limiting the number of processes is very
>> useful to ensure that a program is working correctly (for example, the NTP
>> daemon should (usually) have an _exact_ number of children if it is
>> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
>> children), to prevent PID number exhaustion, to head off DoS attacks against
>> forking network servers before they get to the point of causing kmem
>> exhaustion, and to limit the number of processes in a cgroup that uses lots
>> of kernel memory very infrequently.
>
> All the use cases you're listing are extremely niche and can be
> trivially achieved without introducing another cgroup controller.  Not
> only that, they're actually pretty silly.  Let's say NTP daemon is
> misbehaving (or its code changed w/o you knowing or there are corner
> cases which trigger extremely infrequently).  What do you exactly
> achieve by rejecting its fork call?  It's just adding another
> variation to the misbehavior.  It was misbehaving before and would now
> be continuing to misbehave after a failed fork.
>
> In general, I'm pretty strongly against adding controllers for things
> which aren't fundamental resources in the system.  What's next?  Open
> files?  Pipe buffer?  Number of flocks?  Number of session leaders or
> program groups?

Yes to some or all of those.  We do exactly this internally and it has
greatly added to the stability of our overall container management
system.  And while you have been telling everyone to wait for kmemcg,
we have had an extra 3+ years of stability.

> If you want to prevent a certain class of jobs from exhausting a given
> resource, protecting that resource is the obvious thing to do.

I don't follow your argument - isn't this exactly what this patch set
is doing - protecting resources?

> Wasn't it like a year ago?  Yeah, it's taking longer than everybody
> hoped but seriously kmemcg reclaimer just got merged and so did the
> new memcg interface which will tie kmemcg and memcg together.

By my email it was almost 2 years ago, and that was the second or
third incarnation of this patch.

>> Something like this is long overdue, IMO, and is still more
>> appropriate and obvious than kmemcg anyway.
>
> Thanks for chiming in again but if you aren't bringing anything
> new to the table (I don't remember you doing that last time either),
> I'm not sure why the decision would be different this time.

I'm just vocalizing my support for this idea in defense of practical
solutions that work NOW instead of "engineering ideals" that never
actually arrive.

As containers take the server world by storm, stuff like this gets
more and more important.

Tim


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
On Fri, Feb 27, 2015 at 09:12:45AM -0800, Tim Hockin wrote:
> I was told that the plan was to use kmemcg - but I was told that YEARS
> AGO.  In the meantime we all either do our own thing or we do nothing
> and suffer.

Wasn't it like a year ago?  Yeah, it's taking longer than everybody
hoped but seriously kmemcg reclaimer just got merged and so did the
new memcg interface which will tie kmemcg and memcg together.

> Something like this is long overdue, IMO, and is still more
> appropriate and obvious than kmemcg anyway.

Thanks for chiming in again but if you aren't bringing anything
new to the table (I don't remember you doing that last time either),
I'm not sure why the decision would be different this time.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tim Hockin
On Fri, Feb 27, 2015 at 8:42 AM, Austin S Hemmelgarn wrote:
> On 2015-02-27 06:49, Tejun Heo wrote:
>>
>> Hello,
>>
>> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
>>>
>>> The current state of resource limitation for the number of open
>>> processes (as well as the number of open file descriptors) requires you
>>> to use setrlimit(2), which means that you are limited to resource
>>> limiting process trees rather than resource limiting cgroups (which is
>>> the point of cgroups).
>>>
>>> There was a patch to implement this in 2011[1], but that was rejected
>>> because it implemented a general-purpose rlimit subsystem -- which meant
>>> that you couldn't control distinct resource limits in different
>>> hierarchies. This patch implements a resource controller *specifically*
>>> for the number of processes in a cgroup, overcoming this issue.
>>>
>>> There has been a similar attempt to implement a resource controller for
>>> the number of open file descriptors[2], which has not been merged
>>> because the reasons were dubious. Merely from a "sane interface"
>>> perspective, it should be possible to utilise cgroups to do such
>>> rudimentary resource management (which currently only exists for process
>>> trees).
>>
>>
>> This isn't a proper resource to control.  kmemcg just grew proper
>> reclaim support and will be useable to control kernel side of memory
>> consumption.

I was told that the plan was to use kmemcg - but I was told that YEARS
AGO.  In the meantime we all either do our own thing or we do nothing
and suffer.

Something like this is long overdue, IMO, and is still more
appropriate and obvious than kmemcg anyway.


>> Thanks.
>>
> Kernel memory consumption isn't the only valid reason to want to limit the
> number of processes in a cgroup.  Limiting the number of processes is very
> useful to ensure that a program is working correctly (for example, the NTP
> daemon should (usually) have an _exact_ number of children if it is
> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
> children), to prevent PID number exhaustion, to head off DoS attacks against
> forking network servers before they get to the point of causing kmem
> exhaustion, and to limit the number of processes in a cgroup that uses lots
> of kernel memory very infrequently.


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
Hello,

On Fri, Feb 27, 2015 at 11:42:10AM -0500, Austin S Hemmelgarn wrote:
> Kernel memory consumption isn't the only valid reason to want to limit the
> number of processes in a cgroup.  Limiting the number of processes is very
> useful to ensure that a program is working correctly (for example, the NTP
> daemon should (usually) have an _exact_ number of children if it is
> functioning correctly, and rpcbind shouldn't (AFAIK) ever have _any_
> children), to prevent PID number exhaustion, to head off DoS attacks against
> forking network servers before they get to the point of causing kmem
> exhaustion, and to limit the number of processes in a cgroup that uses lots
> of kernel memory very infrequently.

All the use cases you're listing are extremely niche and can be
trivially achieved without introducing another cgroup controller.  Not
only that, they're actually pretty silly.  Let's say NTP daemon is
misbehaving (or its code changed w/o you knowing or there are corner
cases which trigger extremely infrequently).  What do you exactly
achieve by rejecting its fork call?  It's just adding another
variation to the misbehavior.  It was misbehaving before and would now
be continuing to misbehave after a failed fork.

In general, I'm pretty strongly against adding controllers for things
which aren't fundamental resources in the system.  What's next?  Open
files?  Pipe buffer?  Number of flocks?  Number of session leaders or
program groups?

If you want to prevent a certain class of jobs from exhausting a given
resource, protecting that resource is the obvious thing to do.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Austin S Hemmelgarn

On 2015-02-27 06:49, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
>> The current state of resource limitation for the number of open
>> processes (as well as the number of open file descriptors) requires you
>> to use setrlimit(2), which means that you are limited to resource
>> limiting process trees rather than resource limiting cgroups (which is
>> the point of cgroups).
>>
>> There was a patch to implement this in 2011[1], but that was rejected
>> because it implemented a general-purpose rlimit subsystem -- which meant
>> that you couldn't control distinct resource limits in different
>> hierarchies. This patch implements a resource controller *specifically*
>> for the number of processes in a cgroup, overcoming this issue.
>>
>> There has been a similar attempt to implement a resource controller for
>> the number of open file descriptors[2], which has not been merged
>> because the reasons were dubious. Merely from a "sane interface"
>> perspective, it should be possible to utilise cgroups to do such
>> rudimentary resource management (which currently only exists for process
>> trees).
>
> This isn't a proper resource to control.  kmemcg just grew proper
> reclaim support and will be useable to control kernel side of memory
> consumption.
>
> Thanks.

Kernel memory consumption isn't the only valid reason to want to limit 
the number of processes in a cgroup.  Limiting the number of processes 
is very useful to ensure that a program is working correctly (for 
example, the NTP daemon should (usually) have an _exact_ number of 
children if it is functioning correctly, and rpcbind shouldn't (AFAIK) 
ever have _any_ children), to prevent PID number exhaustion, to head off 
DoS attacks against forking network servers before they get to the point 
of causing kmem exhaustion, and to limit the number of processes in a 
cgroup that uses lots of kernel memory very infrequently.



Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
Hello,

On Fri, Feb 27, 2015 at 02:46:13PM +0100, Richard Weinberger wrote:
> just to make sure that I understand the big picture.
> The plan is to limit kernel memory per cgroup such that fork bombs and
> stuff cannot harm other groups of processes?

Yes, the kmem part of memcg hasn't really been functional because the
reclaim part was broken and (partially consequently) kmem config being
siloed from the rest but we're very close to solving that at this
point.

Thanks.

-- 
tejun


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Richard Weinberger
Tejun,

Am 27.02.2015 um 12:49 schrieb Tejun Heo:
> This isn't a proper resource to control.  kmemcg just grew proper
> reclaim support and will be useable to control kernel side of memory
> consumption.

just to make sure that I understand the big picture.
The plan is to limit kernel memory per cgroup such that fork bombs and
stuff cannot harm other groups of processes?

Thanks,
//richard


Re: [PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-27 Thread Tejun Heo
Hello,

On Mon, Feb 23, 2015 at 02:08:09PM +1100, Aleksa Sarai wrote:
> The current state of resource limitation for the number of open
> processes (as well as the number of open file descriptors) requires you
> to use setrlimit(2), which means that you are limited to resource
> limiting process trees rather than resource limiting cgroups (which is
> the point of cgroups).
> 
> There was a patch to implement this in 2011[1], but that was rejected
> because it implemented a general-purpose rlimit subsystem -- which meant
> that you couldn't control distinct resource limits in different
> hierarchies. This patch implements a resource controller *specifically*
> for the number of processes in a cgroup, overcoming this issue.
> 
> There has been a similar attempt to implement a resource controller for
> the number of open file descriptors[2], which has not been merged
> because the reasons were dubious. Merely from a "sane interface"
> perspective, it should be possible to utilise cgroups to do such
> rudimentary resource management (which currently only exists for process
> trees).

This isn't a proper resource to control.  kmemcg just grew proper
reclaim support and will be useable to control kernel side of memory
consumption.

Thanks.

-- 
tejun


[PATCH RFC 0/2] add nproc cgroup subsystem

2015-02-22 Thread Aleksa Sarai
The current state of resource limitation for the number of open
processes (as well as the number of open file descriptors) requires you
to use setrlimit(2), which means that you are limited to resource
limiting process trees rather than resource limiting cgroups (which is
the point of cgroups).

There was a patch to implement this in 2011[1], but that was rejected
because it implemented a general-purpose rlimit subsystem -- which meant
that you couldn't control distinct resource limits in different
hierarchies. This patch implements a resource controller *specifically*
for the number of processes in a cgroup, overcoming this issue.

There has been a similar attempt to implement a resource controller for
the number of open file descriptors[2], which has not been merged
because the reasons were dubious. Merely from a "sane interface"
perspective, it should be possible to utilise cgroups to do such
rudimentary resource management (which currently only exists for process
trees).
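
For illustration only (this is not part of the patch): a minimal Python
sketch of the setrlimit(2) approach described above, using the standard
resource module.  The helper name lower_nproc_in_child is hypothetical.
It shows that RLIMIT_NPROC is a per-process attribute: a child can
lower it for itself (and for whatever it forks later), while the parent
and any unrelated, already-running processes are unaffected.  Note also
that the kernel checks this limit against the number of processes owned
by the caller's real user ID, not against a group of processes.

```python
import os
import resource


def lower_nproc_in_child(new_soft):
    """Fork a child that lowers its own RLIMIT_NPROC soft limit.

    Returns (parent_before, child_soft, parent_after).
    """
    parent_before, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed the hard limit.
        new_soft = min(new_soft, hard)

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: change only our own limit and report it to the parent.
        status = 1
        try:
            os.close(read_fd)
            resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, hard))
            soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
            os.write(write_fd, str(soft).encode())
            status = 0
        finally:
            # Never fall through into the parent's code path.
            os._exit(status)

    # Parent: collect the child's report and re-read our own limit.
    os.close(write_fd)
    data = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    if not data:
        raise RuntimeError("child failed to set RLIMIT_NPROC")
    parent_after, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    return parent_before, int(data), parent_after
```

Calling lower_nproc_in_child(64) should return the same value in the
first and third positions, which is the limitation at issue: the limit
follows the process tree that set it, and there is no handle for
applying one limit to an arbitrary set of tasks the way a cgroup would.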

Aleksa Sarai (2):
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add an nproc subsystem

 include/linux/cgroup.h        |   9 ++-
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               |  13 ++-
 kernel/cgroup_freezer.c       |   6 +-
 kernel/cgroup_nproc.c         | 181 ++
 kernel/fork.c                 |   4 +-
 kernel/sched/core.c           |   3 +-
 9 files changed, 221 insertions(+), 10 deletions(-)
 create mode 100644 kernel/cgroup_nproc.c

[1]: https://lkml.org/lkml/2011/6/19/170
[2]: https://lkml.org/lkml/2014/7/2/640

-- 
2.3.0
