Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Michal Hocko
On Thu 05-04-18 09:15:01, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > > Argh.  The comment confused me.  OK, now I've read the source and
> > > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > > as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> > > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > > cause an OOM.  (And that's why we're having a conversation)
> > 
> > Well, I can understand how this can be confusing. All the confusion
> > boils down to the small-allocations-never-fail semantic we have. So all
> > reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
> > only modify the _default_ behavior.
> 
> Now that I understand the flag, I'll try to write a clearer
> explanation.

Good luck with that. It took me several iterations to land on the
current state. It is quite hard to avoid being misleading while still
being understandable.

> > > That's a problem because we have places in the kernel that call
> > > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > > which will do the exact same thing, only it will trigger OOM all by itself
> > > (assuming the largest free chunk of address space in the vmalloc area
> > > is larger than the amount of free memory).
> > 
> > Well, the hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
> > heritage that you are not so proud of.
> 
> Certainly not, but that's not what I'm concerned about; I'm concerned
> about the allocation of the pages, not the allocation of the array
> containing the page pointers.

Those pages will use the gfp flags you give to vmalloc, IIRC. It is the
page tables that are allocated GFP_KERNEL unconditionally.
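Something like the following (a sketch only, using the 4.16-era
three-argument __vmalloc()) passes the flags through to the backing
pages, while the page-table allocations inside vmalloc stay GFP_KERNEL:

        buf = __vmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL);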

> > > We could also have a GFP flag that says to only succeed if we're further
> > > above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> > > if you like.  That would give us the desired behaviour of trying all of
> > > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > > all the memory that GFP_KERNEL allocations would take.
> > 
> > Well, I would be really careful with yet another gfp flag. They are so
> > incredibly hard to define properly, and then people kinda tend to screw
> > up your best intentions with their use cases ;)
> > Failing on the low wmark is very close to __GFP_NORETRY or even
> > __GFP_NOWAIT, btw. So let's try not to overthink this...
> 
> Oh, indeed.  We must be able to clearly communicate to users when they
> should use this flag.  I have in mind something like this:
> 
>  * __GFP_HIGH indicates that the caller is high-priority and that granting
>  *   the request is necessary before the system can make forward progress.
>  *   For example, creating an IO context to clean pages.
>  *
>  * __GFP_LOW indicates that the caller is low-priority and that it should
>  *   not be allocated pages that would cause the system to get into an
>  *   out-of-memory situation.  For example, allocating multiple individual
>  *   pages in order to satisfy a larger request.

So how exactly does __GFP_LOW fit in with GFP_NOWAIT, GFP_NORETRY and
GFP_RETRY_MAYFAIL? We _do_ have several levels of how hard to try, and we
have users relying on that. And do not forget about costly vs.
non-costly sizes.
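For reference, this is roughly how the existing levels stack up for a
non-costly GFP_KERNEL request (paraphrasing the gfp.h documentation, so
treat it as a rough sketch rather than the authoritative definition):

        /*
         *   GFP_NOWAIT                       - no direct reclaim at all
         *   GFP_KERNEL | __GFP_NORETRY       - one reclaim pass, never OOM-kill
         *   GFP_KERNEL | __GFP_RETRY_MAYFAIL - retry reclaim/compaction, but
         *                                      fail rather than OOM-kill
         *   GFP_KERNEL                       - retry, and may invoke the OOM killer
         *   GFP_KERNEL | __GFP_NOFAIL        - loop forever, cannot fail
         */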

That being said, we should not hijack this thread any further; additional
enhancements should be discussed separately. I am all for making the
whole allocation API less obscure, but keep in mind that we have really
hard historical restrictions.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Matthew Wilcox
On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > Argh.  The comment confused me.  OK, now I've read the source and
> > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > cause an OOM.  (And that's why we're having a conversation)
> 
> Well, I can understand how this can be confusing. All the confusion
> boils down to the small-allocations-never-fail semantic we have. So all
> reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
> only modify the _default_ behavior.

Now that I understand the flag, I'll try to write a clearer
explanation.

> > That's a problem because we have places in the kernel that call
> > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > which will do the exact same thing, only it will trigger OOM all by itself
> > (assuming the largest free chunk of address space in the vmalloc area
> > is larger than the amount of free memory).
> 
> Well, the hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
> heritage that you are not so proud of.

Certainly not, but that's not what I'm concerned about; I'm concerned
about the allocation of the pages, not the allocation of the array
containing the page pointers.

> > We could also have a GFP flag that says to only succeed if we're further
> > above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> > if you like.  That would give us the desired behaviour of trying all of
> > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > all the memory that GFP_KERNEL allocations would take.
> 
> Well, I would be really careful with yet another gfp flag. They are so
> incredibly hard to define properly, and then people kinda tend to screw
> up your best intentions with their use cases ;)
> Failing on the low wmark is very close to __GFP_NORETRY or even
> __GFP_NOWAIT, btw. So let's try not to overthink this...

Oh, indeed.  We must be able to clearly communicate to users when they
should use this flag.  I have in mind something like this:

 * __GFP_HIGH indicates that the caller is high-priority and that granting
 *   the request is necessary before the system can make forward progress.
 *   For example, creating an IO context to clean pages.
 *
 * __GFP_LOW indicates that the caller is low-priority and that it should
 *   not be allocated pages that would cause the system to get into an
 *   out-of-memory situation.  For example, allocating multiple individual
 *   pages in order to satisfy a larger request.

I think this should actually replace __GFP_RETRY_MAYFAIL.  It makes sense
to a user: "This is a low priority GFP_KERNEL allocation".  I doubt there's
one kernel hacker in a hundred who could explain what GFP_RETRY_MAYFAIL
does, exactly, and I'm not just saying that because I got it wrong ;-)
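As a hypothetical usage sketch (assuming __GFP_LOW existed and replaced
__GFP_RETRY_MAYFAIL), the ring buffer allocation discussed in this thread
would then simply read:

        /* Low priority: fail with -ENOMEM rather than risk an OOM kill. */
        page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_LOW, 0);
        if (!page)
                goto free_pages;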



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Michal Hocko
On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox  
> > > > wrote:
> > > > > I still don't get why you want RETRY_MAYFAIL.  You know that it tries
> > > > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > > > that seems like the exact opposite of what you want.
> 
> Argh.  The comment confused me.  OK, now I've read the source and
> understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> cause an OOM.  (And that's why we're having a conversation)

Well, I can understand how this can be confusing. All the confusion
boils down to the small-allocations-never-fail semantic we have. So all
reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
only modify the _default_ behavior.

> That's a problem because we have places in the kernel that call
> kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> which will do the exact same thing, only it will trigger OOM all by itself
> (assuming the largest free chunk of address space in the vmalloc area
> is larger than the amount of free memory).

Well, the hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
heritage that you are not so proud of.
 
> I considered an alloc_page_array(), but that doesn't fit well with the
> design of the ring buffer code.  We could have:
> 
> struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);
> 
> and link the allocated pages together through page->lru.
> 
> We could also have a GFP flag that says to only succeed if we're further
> above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> if you like.  That would give us the desired behaviour of trying all of
> the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> all the memory that GFP_KERNEL allocations would take.

Well, I would be really careful with yet another gfp flag. They are so
incredibly hard to define properly, and then people kinda tend to screw
up your best intentions with their use cases ;)
Failing on the low wmark is very close to __GFP_NORETRY or even
__GFP_NOWAIT, btw. So let's try not to overthink this...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Matthew Wilcox
On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox  
> > > wrote:
> > > > I still don't get why you want RETRY_MAYFAIL.  You know that it tries
> > > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > > that seems like the exact opposite of what you want.

Argh.  The comment confused me.  OK, now I've read the source and
understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
cause an OOM.  (And that's why we're having a conversation)

That's a problem because we have places in the kernel that call
kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
which will do the exact same thing, only it will trigger OOM all by itself
(assuming the largest free chunk of address space in the vmalloc area
is larger than the amount of free memory).

I considered an alloc_page_array(), but that doesn't fit well with the
design of the ring buffer code.  We could have:

struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);

and link the allocated pages together through page->lru.
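A rough sketch of what that could look like (hypothetical, not an
existing API; the returned page's lru field serves as the list head for
the rest of the chain):

        #include <linux/gfp.h>
        #include <linux/list.h>
        #include <linux/mm.h>

        struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr)
        {
                struct page *head = NULL, *page, *tmp;
                unsigned long i;

                for (i = 0; i < nr; i++) {
                        page = alloc_pages_node(nid, gfp_mask, 0);
                        if (!page)
                                goto free;
                        if (!head) {
                                head = page;
                                INIT_LIST_HEAD(&head->lru);
                        } else {
                                list_add_tail(&page->lru, &head->lru);
                        }
                }
                return head;

        free:
                if (head) {
                        /* Free partial progress so a failure leaves nothing behind. */
                        list_for_each_entry_safe(page, tmp, &head->lru, lru)
                                __free_page(page);
                        __free_page(head);
                }
                return NULL;
        }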

We could also have a GFP flag that says to only succeed if we're further
above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
if you like.  That would give us the desired behaviour of trying all of
the reclaim methods that GFP_KERNEL would, but not being able to exhaust
all the memory that GFP_KERNEL allocations would take.



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Steven Rostedt
On Thu, 5 Apr 2018 16:27:49 +0200
Michal Hocko  wrote:

> > I understand you don't want GFP_NORETRY.  But why is it more important for
> > this allocation to succeed than other normal GFP_KERNEL allocations?  
> 
> I guess they simply want a failure rather than an OOM, even when they can
> shoot themselves in the head by using oom_origin. It is still quite ugly
> to see an OOM report...

Exactly!

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Steven Rostedt
On Thu, 5 Apr 2018 07:22:58 -0700
Matthew Wilcox  wrote:

> I understand you don't want GFP_NORETRY.  But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?

Not sure what you mean by "more important"? Does saying "RETRY_MAYFAIL"
make it more important? The difference is, if GFP_KERNEL fails, we
don't want to trigger an OOM, but simply clean up and report -ENOMEM to
the user. It has nothing to do with being more important than other
allocations.

If there's 100 Megs of memory available, and the user requests a gig of
memory, it's going to fail. Ideally, it doesn't trigger OOM, but
instead simply reports -ENOMEM to the user.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Michal Hocko
On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox  wrote:
> > > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> > >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> > >> check (at the end of the loop though) appears to make the OOM killer
> > >> consistently kill this task.
> > >>
> > >> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> > >> nothing comes in and tries to do an allocation, but instead will fail
> > >> nicely with -ENOMEM.
> > >
> > > I still don't get why you want RETRY_MAYFAIL.  You know that it tries
> > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > that seems like the exact opposite of what you want.
> > 
> > No. We do want it to try harder, but not if it's already set up for failure.
> 
> I understand you don't want GFP_NORETRY.  But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?

I guess they simply want a failure rather than an OOM, even when they can
shoot themselves in the head by using oom_origin. It is still quite ugly
to see an OOM report...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-05 Thread Matthew Wilcox
On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox  wrote:
> > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> >> check (at the end of the loop though) appears to make the OOM killer
> >> consistently kill this task.
> >>
> >> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> >> nothing comes in and tries to do an allocation, but instead will fail
> >> nicely with -ENOMEM.
> >
> > I still don't get why you want RETRY_MAYFAIL.  You know that it tries
> > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > that seems like the exact opposite of what you want.
> 
> No. We do want it to try harder, but not if it's already set up for failure.

I understand you don't want GFP_NORETRY.  But why is it more important for
this allocation to succeed than other normal GFP_KERNEL allocations?


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Joel Fernandes
On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox  wrote:
> On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
>> I originally was going to remove the RETRY_MAYFAIL, but adding this
>> check (at the end of the loop though) appears to make the OOM killer
>> consistently kill this task.
>>
>> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
>> nothing comes in and tries to do an allocation, but instead will fail
>> nicely with -ENOMEM.
>
> I still don't get why you want RETRY_MAYFAIL.  You know that it tries
> *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> that seems like the exact opposite of what you want.

No. We do want it to try harder, but not if it's already set up for failure.


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Matthew Wilcox
On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> I originally was going to remove the RETRY_MAYFAIL, but adding this
> check (at the end of the loop though) appears to make the OOM killer
> consistently kill this task.
> 
> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> nothing comes in and tries to do an allocation, but instead will fail
> nicely with -ENOMEM.

I still don't get why you want RETRY_MAYFAIL.  You know that it tries
*harder* to allocate memory than plain GFP_KERNEL does, right?  And
that seems like the exact opposite of what you want.


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko  wrote:

> > 
> > I tried it out, I did the following:
> > 
> > set_current_oom_origin();
> > for (i = 0; i < nr_pages; i++) {
> > struct page *page;
> > /*
> >  * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> >  * gracefully without invoking oom-killer and the system is not
> >  * destabilized.
> >  */
> > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > cpu_to_node(cpu));
> > if (!bpage)
> > goto free_pages;
> > 
> > list_add(&bpage->list, pages);
> > 
> > page = alloc_pages_node(cpu_to_node(cpu),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > if (!page)
> > goto free_pages;  
> 
>   if (fatal_signal_pending(current))
>           goto free_pages;

I originally was going to remove the RETRY_MAYFAIL, but adding this
check (at the end of the loop though) appears to make the OOM killer
consistently kill this task.

I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
nothing comes in and tries to do an allocation, but instead will fail
nicely with -ENOMEM.
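Roughly, the loop then looks like this (a sketch combining the code
above with the fatal_signal_pending() check moved to the end of the
loop body):

        set_current_oom_origin();
        for (i = 0; i < nr_pages; i++) {
                struct page *page;

                bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
                                     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
                                     cpu_to_node(cpu));
                if (!bpage)
                        goto free_pages;

                list_add(&bpage->list, pages);

                page = alloc_pages_node(cpu_to_node(cpu),
                                        GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
                if (!page)
                        goto free_pages;
                bpage->page = page_address(page);
                rb_init_page(bpage->page);

                /* Bail out if the OOM killer selected us while allocating. */
                if (fatal_signal_pending(current))
                        goto free_pages;
        }
        clear_current_oom_origin();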

-- Steve


> 
> > bpage->page = page_address(page);
> > rb_init_page(bpage->page);
> > }
> > clear_current_oom_origin();  



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 17:27:13 +0200
Michal Hocko  wrote:


> I am afraid I cannot help you much more though.

No, you have actually been a great help. I'm finishing up a patch on
top of this one. I'll be posting it soon.

Thanks for your help and your time!

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Michal Hocko
On Wed 04-04-18 11:04:42, Steven Rostedt wrote:
[...]
> I'm not looking for perfect. In fact, I love what si_mem_available()
> gives me now! Sure, it can say "there's enough memory" even if I can't
> use it. That's because most of the OOMs that happen when increasing
> the size of the ring buffer aren't due to "just barely enough memory
> available", but due to "trying to allocate crazy amounts of
> memory".  That's because it does the allocation one page at a time, and
> if you try to allocate crazy amounts of memory, it will allocate all
> memory before it fails. I don't want that. I want crazy allocations to
> fail from the start. A "maybe this will allocate" is fine even if it
> will end up causing an OOM.

OK, fair enough. It's your code ;) I would recommend using the
oom_origin thingy to reduce the immediate damage and to have a clear
culprit so that I do not have to scratch my head why we see an OOM
report with a lot of unaccounted memory...

I am afraid I cannot help you much more though.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 16:42:55 +0200
Michal Hocko  wrote:

> On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 16:10:52 +0200
> > Michal Hocko  wrote:
> >   
> > > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > > [...]  
> > > > +   /*
> > > > +* Check if the available memory is there first.
> > > > +* Note, si_mem_available() only gives us a rough estimate of 
> > > > available
> > > > +* memory. It may not be accurate. But we don't care, we just 
> > > > want
> > > > +* to prevent doing any allocation when it is obvious that it is
> > > > +* not going to succeed.
> > > > +*/
> > > > +   i = si_mem_available();
> > > > +   if (i < nr_pages)
> > > > +   return -ENOMEM;
> > > > +
> > > > 
> > > > Better?
> > > 
> > > I must be really missing something here. How can that work at all for
> > > e.g. zone_{highmem/movable}? You will get false on the above tests
> > > even when you will have a hard time allocating anything from your
> > > destination zones.
> > 
> > You mean we will get true on the above tests?  Again, the current
> > method is to just say screw it and try to allocate.  
> 
> No, you will get false on that test. Say that you have a system with

Ah, I was thinking backwards. I read "false" as meaning "not enough
memory", whereas if the test (i < nr_pages) is false it means there is
enough memory. OK, we are in agreement.

> large ZONE_MOVABLE. Now your kernel allocations can fit only into
> !movable zones (say we have 1G for !movable and 3G for movable). Now say
> that !movable zones are getting close to the edge while movable zones
> are full of reclaimable pages. si_mem_available will tell you there is a
> _lot_ of memory available while your GFP_KERNEL request will happily
> consume the rest of !movable zones and trigger OOM. See?

Which is still better than what we have today. I'm fine with it.
Really, I am.

> 
> [...]
> > I'm looking for something where "yes" means "there may be enough, but
> > there may not be, buyer beware", and "no" means "forget it, don't even
> > start, because you just asked for more than possible".  
> 
> We do not have _that_ something, other than trying to opportunistically
> allocate and see what happens. Sucks? Maybe, yes, but I really cannot
> think of an interface with a sane semantic that would catch all the
> different scenarios.

And I'm fine with that too. I don't want to catch all the different
scenarios. I just want to catch the crazy ones, like trying to allocate
gigs of memory when there's only a few megs left. Those can easily
happen with the current interface, which can't change.

I'm not looking for perfect. In fact, I love what si_mem_available()
gives me now! Sure, it can say "there's enough memory" even if I can't
use it. That's because most of the OOMs that happen when increasing
the size of the ring buffer aren't due to "just barely enough memory
available", but due to "trying to allocate crazy amounts of
memory".  That's because it does the allocation one page at a time, and
if you try to allocate crazy amounts of memory, it will allocate all
memory before it fails. I don't want that. I want crazy allocations to
fail from the start. A "maybe this will allocate" is fine even if it
will end up causing an OOM.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Michal Hocko
On Wed 04-04-18 10:31:11, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:23:29 +0200
> Michal Hocko  wrote:
> 
> > On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > > On Wed, 4 Apr 2018 08:23:40 +0200
> > > Michal Hocko  wrote:
> > >   
> > > > If you are afraid of that then you can have a look at 
> > > > {set,clear}_current_oom_origin()
> > > > which will automatically select the current process as an oom victim and
> > > > kill it.  
> > > 
> > > Would it even receive the signal? Does alloc_pages_node() even respond
> > > to signals? Because the OOM happens while the allocation loop is
> > > running.  
> > 
> > Well, you would need to do something like:
> > 
> > > 
> > > I tried it out, I did the following:
> > > 
> > >   set_current_oom_origin();
> > >   for (i = 0; i < nr_pages; i++) {
> > >   struct page *page;
> > >   /*
> > >* __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > >* gracefully without invoking oom-killer and the system is not
> > >* destabilized.
> > >*/
> > >   bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > >   GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > >   cpu_to_node(cpu));
> > >   if (!bpage)
> > >   goto free_pages;
> > > 
> > >   list_add(&bpage->list, pages);
> > > 
> > >   page = alloc_pages_node(cpu_to_node(cpu),
> > >   GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > >   if (!page)
> > >   goto free_pages;  
> > 
> > if (fatal_signal_pending(current))
> >         goto free_pages;
> 
> But wouldn't page be NULL in this case?

__GFP_RETRY_MAYFAIL itself fails rather than triggering the OOM killer.
You still might get killed from another allocation context which can
trigger the OOM killer, though. In any case you would back off and fail,
no?

> > >   bpage->page = page_address(page);
> > >   rb_init_page(bpage->page);
> > >   }
> > >   clear_current_oom_origin();  
> > 
> > If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
> > triggers the OOM killer, and this user context would get killed. If you
> > drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
> > but it would still be the selected victim.
> 
> Then we guarantee to kill the process instead of just sending it
> -ENOMEM, which would change the user-space ABI, and that is a no-no.

I see. Although I would expect it to be an echo writing to a file most
of the time. But I am not really familiar with what tracers usually do,
so I will not speculate.

> Ideally, we want to avoid an OOM. I could add the above as well, for when
> si_mem_available() returns something that is greater than what is
> available, and at least this is the process that will get the OOM if it
> fails to allocate.
> 
> Would that work for you?

I have responded wrt si_mem_available in the other email, but yes, using
the oom_origin would reduce the immediate damage at least.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Michal Hocko
On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:10:52 +0200
> Michal Hocko  wrote:
> 
> > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > [...]
> > > +   /*
> > > +* Check if the available memory is there first.
> > > +* Note, si_mem_available() only gives us a rough estimate of 
> > > available
> > > +* memory. It may not be accurate. But we don't care, we just want
> > > +* to prevent doing any allocation when it is obvious that it is
> > > +* not going to succeed.
> > > +*/
> > > +   i = si_mem_available();
> > > +   if (i < nr_pages)
> > > +   return -ENOMEM;
> > > +
> > > 
> > > Better?  
> > 
> > I must be really missing something here. How can that work at all for
> > e.g. zone_{highmem/movable}? You will get false on the above tests
> > even when you will have a hard time allocating anything from your
> > destination zones.
> 
> You mean we will get true on the above tests?  Again, the current
> method is to just say screw it and try to allocate.

No, you will get false on that test. Say that you have a system with
large ZONE_MOVABLE. Now your kernel allocations can fit only into
!movable zones (say we have 1G for !movable and 3G for movable). Now say
that !movable zones are getting close to the edge while movable zones
are full of reclaimable pages. si_mem_available will tell you there is a
_lot_ of memory available while your GFP_KERNEL request will happily
consume the rest of !movable zones and trigger OOM. See?

[...]
> I'm looking for something where "yes" means "there may be enough, but
> there may not be, buyer beware", and "no" means "forget it, don't even
> start, because you just asked for more than possible".

We do not have _that_ something, other than trying to opportunistically
allocate and see what happens. Sucks? Maybe, yes, but I really cannot
think of an interface with a sane semantic that would catch all the
different scenarios.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko  wrote:

> On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 08:23:40 +0200
> > Michal Hocko  wrote:
> >   
> > > If you are afraid of that then you can have a look at 
> > > {set,clear}_current_oom_origin()
> > > which will automatically select the current process as an oom victim and
> > > kill it.  
> > 
> > Would it even receive the signal? Does alloc_pages_node() even respond
> > to signals? Because the OOM happens while the allocation loop is
> > running.  
> 
> Well, you would need to do something like:
> 
> > 
> > I tried it out, I did the following:
> > 
> > set_current_oom_origin();
> > for (i = 0; i < nr_pages; i++) {
> > struct page *page;
> > /*
> >  * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> >  * gracefully without invoking oom-killer and the system is not
> >  * destabilized.
> >  */
> > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > cpu_to_node(cpu));
> > if (!bpage)
> > goto free_pages;
> > 
> > list_add(&bpage->list, pages);
> > 
> > page = alloc_pages_node(cpu_to_node(cpu),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > if (!page)
> > goto free_pages;  
> 
>   if (fatal_signal_pending(current))
>           goto free_pages;

But wouldn't page be NULL in this case?

> 
> > bpage->page = page_address(page);
> > rb_init_page(bpage->page);
> > }
> > clear_current_oom_origin();  
> 
> If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
> triggers the OOM killer, and this user context would get killed. If you
> drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
> but it would still be the selected victim.

Then we guarantee to kill the process instead of just sending it
-ENOMEM, which would change the user-space ABI, and that is a no-no.

Ideally, we want to avoid an OOM. I could add the above as well, for when
si_mem_available() returns something that is greater than what is
available, and at least this is the process that will get the OOM if it
fails to allocate.

Would that work for you?

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 16:10:52 +0200
Michal Hocko  wrote:

> On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> [...]
> > +   /*
> > +* Check if the available memory is there first.
> > +* Note, si_mem_available() only gives us a rough estimate of 
> > available
> > +* memory. It may not be accurate. But we don't care, we just want
> > +* to prevent doing any allocation when it is obvious that it is
> > +* not going to succeed.
> > +*/
> > +   i = si_mem_available();
> > +   if (i < nr_pages)
> > +   return -ENOMEM;
> > +
> > 
> > Better?  
> 
> I must be really missing something here. How can that work at all for
> e.g. zone_{highmem/movable}? You will get false on the above tests
> even when you will have a hard time allocating anything from your
> destination zones.

You mean we will get true on the above tests?  Again, the current
method is to just say screw it and try to allocate.

I originally just used NORETRY, which would only allocate memory that is
currently available and not try to reclaim anything. But people like
Joel at Google, who need to increase the buffer when memory is mostly
taken up by page cache, had it changed from NORETRY to RETRY_MAYFAIL.

But this now causes the issue that a large allocation can take up all
memory even when the allocation requested is guaranteed to fail,
because there is not enough memory to pull this off.

We just want a way to say "hey, is there enough memory in the system to
allocate all these pages before we try? We don't need specifics, we
just want to make sure we are not allocating way too much".

The answer I want is "yes there may be enough (but you may not be able
to use it)" or "no, there is definitely not enough for that".

Currently si_mem_available() is the closest thing we have to answering
that question. I'm fine if the answer is "Yes" even if I can't allocate
that memory.

I'm looking for something where "yes" means "there may be enough, but
there may not be, buyer beware", and "no" means "forget it, don't even
start, because you just asked for more than possible".

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Michal Hocko
On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 08:23:40 +0200
> Michal Hocko  wrote:
> 
> > If you are afraid of that then you can have a look at 
> > {set,clear}_current_oom_origin()
> > which will automatically select the current process as an oom victim and
> > kill it.
> 
> Would it even receive the signal? Does alloc_pages_node() even respond
> to signals? Because the OOM happens while the allocation loop is
> running.

Well, you would need to do something like:

> 
> I tried it out, I did the following:
> 
>   set_current_oom_origin();
>   for (i = 0; i < nr_pages; i++) {
>   struct page *page;
>   /*
>* __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
>* gracefully without invoking oom-killer and the system is not
>* destabilized.
>*/
>   bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>   GFP_KERNEL | __GFP_RETRY_MAYFAIL,
>   cpu_to_node(cpu));
>   if (!bpage)
>   goto free_pages;
> 
>   list_add(&bpage->list, pages);
> 
>   page = alloc_pages_node(cpu_to_node(cpu),
>   GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
>   if (!page)
>   goto free_pages;

	if (fatal_signal_pending(current))
		goto free_pages;

>   bpage->page = page_address(page);
>   rb_init_page(bpage->page);
>   }
>   clear_current_oom_origin();

If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
triggers the OOM killer, and this user context would get killed. If you
drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
but it would still be the selected victim.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 08:23:40 +0200
Michal Hocko  wrote:

> If you are afraid of that then you can have a look at 
> {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.

Would it even receive the signal? Does alloc_pages_node() even respond
to signals? Because the OOM happens while the allocation loop is
running.

I tried it out, I did the following:

	set_current_oom_origin();
	for (i = 0; i < nr_pages; i++) {
		struct page *page;
		/*
		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
		 * gracefully without invoking oom-killer and the system is not
		 * destabilized.
		 */
		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
				     cpu_to_node(cpu));
		if (!bpage)
			goto free_pages;

		list_add(&bpage->list, pages);

		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
		if (!page)
			goto free_pages;
		bpage->page = page_address(page);
		rb_init_page(bpage->page);
	}
	clear_current_oom_origin();

The first time I ran my ring buffer memory stress test, it killed the
stress test. The second time I ran it, it killed polkitd.

Still doesn't help as much as the original patch.

You haven't convinced me that using si_mem_available() is a bad idea.
If anything, you've solidified my confidence in it.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Michal Hocko
On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
[...]
> +   /*
> +* Check if the available memory is there first.
> +* Note, si_mem_available() only gives us a rough estimate of 
> available
> +* memory. It may not be accurate. But we don't care, we just want
> +* to prevent doing any allocation when it is obvious that it is
> +* not going to succeed.
> +*/
> +   i = si_mem_available();
> +   if (i < nr_pages)
> +   return -ENOMEM;
> +
> 
> Better?

I must be really missing something here. How can that work at all for
e.g. zone_{highmem/movable}? You will get false on the above tests
even when you will have a hard time allocating anything from your
destination zones.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Steven Rostedt
On Wed, 4 Apr 2018 08:20:39 +0200
Michal Hocko  wrote:


> > Now can you please explain to me why si_mem_available is not suitable
> > for my purpose.  
> 
> Several problems. It is overly optimistic, especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be pinned
> long enough that your allocation based on that estimation will just make
> the situation worse and result in OOM. More importantly though, your
> allocations are GFP_KERNEL, right? That means that such an allocation
> cannot reach ZONE_MOVABLE or ZONE_HIGHMEM (on 32-bit systems) while the
> pagecache can. So you will get an overestimate of how much you can
> allocate.
> 
> Really, si_mem_available is for /proc/meminfo and a rough estimate of the
> free memory, because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some
> skepticism about how useful it is in practice apart from showing it in
> top or similar tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.

What you are telling me is that this is perfect for my use case.

I'm not looking for "if this tells me I have enough memory, then I have
enough memory". I'm looking for "if I screwed up and asked for an order
of magnitude more than I really need, don't OOM the system".

Really, I don't care if the number is not truly accurate. In fact, what
you tell me above is exactly what I wanted. I'm more worried it will
return a smaller number than what is available. I'd much rather have an
overestimate.

This is not about trying to get as much memory for tracing as possible,
where we slowly increase the buffer size till we have pretty much every
page for tracing. If someone does that, then the system should OOM and
become unstable.

This is about doing what I (and others) have done several times, which
is putting in one or two more zeros than I really wanted, or forgetting
that the number written to buffer_size_kb is the buffer size for each
CPU. Yes, the number you write in there is multiplied by every CPU on
the system. It is easy to over-allocate by mistake.
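For example (numbers only for illustration): writing 10000000 into
buffer_size_kb on a 64-CPU machine asks for roughly 10 GB per CPU, which
is about 640 GB of ring buffer in total, not the 10 GB one might have
intended.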

I'm looking to protect against gross mistakes where it is obvious that
the allocation isn't going to succeed before the allocating begins. I'm
not looking to be perfect here.

As I've stated before, the current method is to say F*ck You to the
rest of the system and OOM anything else.

If you want, I can change the comment above the code to be:

+   /*
+* Check if the available memory is there first.
+* Note, si_mem_available() only gives us a rough estimate of available
+* memory. It may not be accurate. But we don't care, we just want
+* to prevent doing any allocation when it is obvious that it is
+* not going to succeed.
+*/
+   i = si_mem_available();
+   if (i < nr_pages)
+   return -ENOMEM;
+

Better?

-- Steve



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Joel Fernandes
On Tue, Apr 3, 2018 at 11:20 PM, Michal Hocko  wrote:
> On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
> [...]
>> From your earlier email:
>>
>> > Except that it doesn't work. si_mem_available is not really suitable for
>> > any allocation estimations. Its only purpose is to provide a very rough
>> > estimate for userspace. Any other use is basically abuse. The
>> > situation can change really quickly. It is really hard to be
>> > clever here with the volatility that memory allocations can cause.
>>
>> Now can you please explain to me why si_mem_available is not suitable
>> for my purpose.
>
> Several problems. It is overly optimistic, especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be pinned
> long enough that your allocation based on that estimation will just make
> the situation worse and result in OOM. More importantly though, your
> allocations are GFP_KERNEL, right? That means that such an allocation
> cannot reach ZONE_MOVABLE or ZONE_HIGHMEM (on 32-bit systems) while the
> pagecache can. So you will get an overestimate of how much you can
> allocate.

Yes, but right now it is assumed that there is all the memory in the
world to allocate. Clearly an overestimate is better than that. Right
now ftrace will just allocate memory until it actually fails to
allocate, at which point it may already have made other processes
likely to OOM. As Steve pointed out, it doesn't need to be accurate,
but it does solve the problem here (or some other similar method seems
to be needed to solve it).

>
> Really, si_mem_available is for /proc/meminfo and a rough estimate of the
> free memory, because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some

But why is a rough estimate not good enough in this case? That's what I don't get.

> skepticism about how useful it is in practice apart from showing it in
> top or similar tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.

Sure, there can be false positives, but it does reduce the problem for
the most part, I feel. Maybe we can use this as an interim solution
(better than leaving the issue hanging)? Or you could propose a
solution for how to get an estimate and prevent other tasks from
experiencing memory pressure for no reason.

Another thing we could do is check how much total memory there is on
the system and cap at that (which would prevent impossibly large
allocations), but that still doesn't address the problem completely.
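Something along these lines, as a rough sketch (si_meminfo() reports
totalram in units of pages):

        struct sysinfo si;

        si_meminfo(&si);
        /* Reject requests that could never fit in RAM, however it is used. */
        if (nr_pages > si.totalram)
                return -ENOMEM;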

thanks,

- Joel


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-04 Thread Zhaoyang Huang
On Wed, Apr 4, 2018 at 2:23 PM, Michal Hocko  wrote:
> On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
>> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko  wrote:
>> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> >> On Tue, 3 Apr 2018 14:35:14 +0200
>> >> Michal Hocko  wrote:
>> > [...]
>> >> > Being clever is OK if it doesn't add tricky code. And relying on
>> >> > si_mem_available is definitely tricky and obscure.
>> >>
>> >> Can we get the mm subsystem to provide a better method to know if an
>> >> allocation will possibly succeed or not before trying it? It doesn't
>> >> have to be free of races. Just "if I allocate this many pages right
>> >> now, will it work?" If that changes from the time it asks to the time
>> >> it allocates, that's fine. I'm not trying to prevent OOM from ever
>> >> triggering. I just don't want it to trigger consistently.
>> >
>> > How do you do that without an actual allocation request? And more
>> > fundamentally, what if your _particular_ request is just fine but it
>> > will get us so close to the OOM edge that the next legit allocation
>> > request simply goes OOM? There is simply no sane interface I can think
>> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
>> >
>> The point is that the app which tries to allocate beyond the line will escape
>> the OOM killer and let other innocent tasks be sacrificed. However, the one
>> you mentioned above may well be selected by the OOM killer triggered by the
>> consequent failed allocation.
>
> If you are afraid of that then you can have a look at 
> {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.
But we cannot call the function on behalf of the current process, which
may not want to be killed for memory reasons. It is more appropriate to
tell it ENOMEM and let it make a further decision.
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko  wrote:
> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> >> On Tue, 3 Apr 2018 14:35:14 +0200
> >> Michal Hocko  wrote:
> > [...]
> >> > Being clever is OK if it doesn't add tricky code. And relying on
> >> > si_mem_available is definitely tricky and obscure.
> >>
> >> Can we get the mm subsystem to provide a better method to know if an
> >> allocation will possibly succeed or not before trying it? It doesn't
> >> have to be free of races. Just "if I allocate this many pages right
> >> now, will it work?" If that changes from the time it asks to the time
> >> it allocates, that's fine. I'm not trying to prevent OOM from ever
> >> triggering. I just don't want it to trigger consistently.
> >
> > How do you do that without an actual allocation request? And more
> > fundamentally, what if your _particular_ request is just fine but it
> > will get us so close to the OOM edge that the next legit allocation
> > request simply goes OOM? There is simply no sane interface I can think
> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
> >
> The point is that the app which tries to allocate beyond the line will escape
> the OOM killer and let other innocent tasks be sacrificed. However, the one
> you mentioned above may well be selected by the OOM killer triggered by the
> consequent failed allocation.

If you are afraid of that then you can have a look at 
{set,clear}_current_oom_origin()
which will automatically select the current process as an oom victim and
kill it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
[...]
> From your earlier email:
> 
> > Except that it doesn't work. si_mem_available is not really suitable for
> > any allocation estimations. Its only purpose is to provide a very rough
> > estimate for userspace. Any other use is basically abuse. The
> > situation can change really quickly. It is really hard to be
> > clever here with the volatility that memory allocations can cause.
> 
> Now can you please explain to me why si_mem_available is not suitable
> for my purpose.

Several problems. It is overly optimistic, especially when we are close
to OOM. The available pagecache or slab reclaimable objects might be pinned
long enough that your allocation based on that estimation will just make
the situation worse and result in OOM. More importantly though, your
allocations are GFP_KERNEL, right? That means that such an allocation
cannot reach ZONE_MOVABLE or ZONE_HIGHMEM (on 32-bit systems) while the
pagecache can. So you will get an overestimate of how much you can
allocate.

Really, si_mem_available is for /proc/meminfo and a rough estimate of the
free memory, because users tend to be confused by seeing MemFree too low
and complaining that the system has eaten all their memory. I have some
skepticism about how useful it is in practice apart from showing it in
top or similar tools. The memory is simply not usable immediately or
without an overall and visible effect on the whole system.

HTH
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Zhaoyang Huang
On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko  wrote:
> On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> On Tue, 3 Apr 2018 14:35:14 +0200
>> Michal Hocko  wrote:
> [...]
>> > Being clever is OK if it doesn't add tricky code. And relying on
>> > si_mem_available is definitely tricky and obscure.
>>
>> Can we get the mm subsystem to provide a better method to know if an
>> allocation will possibly succeed or not before trying it? It doesn't
>> have to be free of races. Just "if I allocate this many pages right
>> now, will it work?" If that changes from the time it asks to the time
>> it allocates, that's fine. I'm not trying to prevent OOM from ever
>> triggering. I just don't want it to trigger consistently.
>
> How do you do that without an actual allocation request? And more
> fundamentally, what if your _particular_ request is just fine but it
> will get us so close to the OOM edge that the next legit allocation
> request simply goes OOM? There is simply no sane interface I can think
> of that would satisfy a safe/sensible "will it cause OOM" semantic.
>
The point is that the app which tries to allocate beyond the line will escape
the OOM killer and let other innocent tasks be sacrificed. However, the one
you mentioned above may well be selected by the OOM killer triggered by the
consequent failed allocation.

>> > > Perhaps I should try to allocate a large group of pages with
>> > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
>> > > that the large allocation may reclaim some memory that would allow the
>> > > NORETRY to succeed with smaller allocations (one page at a time)?
>> >
>> > That again relies on subtle dependencies of the current
>> > implementation. So I would rather ask whether this is something that
>> > really deserves special treatment. If admin asks for a buffer of a
>> > certain size then try to do so. If we get OOM then bad luck you cannot
>> > get large memory buffers for free...
>>
>> That is not acceptable to me nor to the people asking for this.
>>
>> The problem is known. The ring buffer allocates memory page by page,
>> and this can allow it to easily take all memory in the system before it
>> fails to allocate and free everything it had done.
>
> Then do not allow buffers that are too large. How often do you need
> buffers that are larger than a few megs or a small % of the available
> memory? Consuming an excessive amount of memory just to trace a workload
> which will need some memory on its own sounds just dubious to me.
>
>> If you don't like the use of si_mem_available() I'll do the larger
>> pages method. Yes it depends on the current implementation of memory
>> allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
>> number of pages, and fail if it can't (leaving memory for other
>> allocations to succeed).
>>
>> The allocation of the ring buffer isn't critical. It can fail to
>> expand, and we can tell the user -ENOMEM. I originally had NORETRY
>> because I'd rather have it fail than cause an OOM. But there are folks
>> (like Joel) who want it to succeed when there's available memory in
>> page caches.
>
> Then implement a retry logic on top of NORETRY. You can control how hard
> to retry to satisfy the request yourself. You still risk that your
> allocation will get us close to OOM for _somebody_ else though.
>
>> I'm fine if the admin shoots herself in the foot if the ring buffer
>> gets big enough to start causing OOMs, but I don't want it to cause
>> OOMs if there's not even enough memory to fulfill the ring buffer size
>> itself.
>
> I simply do not see the difference between the two. Both have the same
> deadly effect in the end. The direct OOM has an arguable advantage that
> the effect is immediate rather than subtle with potential performance
> side effects until the machine OOMs after crawling for quite some time.
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko  wrote:

> You can do so, of course. In fact it would have some advantages over
> single pages because you would fragment the memory less but this is not
> a reliable prevention from OOM killing and the complete memory
> depletion if you allow arbitrary trace buffer sizes.

You are right that this doesn't prevent OOM killing. I tried various
methods, and the only thing that currently "works" the way I'm happy
with is this original patch.

From your earlier email:

> Except that it doesn't work. si_mem_available is not really suitable for
> any allocation estimations. Its only purpose is to provide a very rough
> estimate for userspace. Any other use is basically abuse. The
> situation can change really quickly. It is really hard to be
> clever here with the volatility that memory allocations can cause.

Now can you please explain to me why si_mem_available is not suitable
for my purpose. If it's wrong and states there is less memory than
actually exists, we simply fail to increase the buffer.

If it is wrong and states that there is more memory than actually
exists, then we do nothing different than what we do today, and trigger
an OOM.

But for the ring buffer use case, it is "good enough". Can you
please explain to me what issues you see that can happen if we apply
this patch?

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko  wrote:

> Yes, a fallback is questionable. Whether to make the batch size
> configurable is a matter of how many internal details you want to
> expose to userspace.

Well, it is tracing the guts of the kernel, so internal details are
generally exposed to userspace ;-)

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Tue 03-04-18 10:17:53, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 15:56:07 +0200
> Michal Hocko  wrote:
[...]
> > I simply do not see the difference between the two. Both have the same
> > deadly effect in the end. The direct OOM has an arguable advantage that
> > the effect is immediate rather than subtle with potential performance
> > side effects until the machine OOMs after crawling for quite some time.
> 
> The difference is if the allocation succeeds or not. If it doesn't
> succeed, we free all memory that we tried to allocate. If it succeeds
> and causes issues, then yes, that's the admin's fault.

What I am trying to say is that this is so extremely time and workload
sensitive that you can hardly have stable behavior. It becomes pure
luck whether the failure happens.

> I'm worried about
> the accidental putting in too big of a number, either by an admin by
> mistake, or some stupid script that just thinks the current machine
> has terabytes of memory.

I would argue that stupid scripts should have no business calling root-only
interfaces which can allocate a lot of memory and cause OOMs.

> I'm under the assumption that if I ask for an allocation of 32 pages
> with RETRY_MAYFAIL, and there are 2 pages available, but not 32, and
> while my allocation is reclaiming memory, and another task comes in and
> asks for a single page, it can still succeed. This would be why I would
> be using RETRY_MAYFAIL with higher orders of pages, that it doesn't
> take all memory in the system if it fails. Is this assumption incorrect?

Yes. There is no guarantee that the allocation will get the memory it
reclaimed in the direct reclaim. Pages are simply freed back into the
pool and it is a matter of timing who gets them.

> The problem with the current approach of allocating 1 page at a time with
> RETRY_MAYFAIL is that it will take any pages that are available, until
> there are none, and if some unlucky task asks for memory during that
> time, it is guaranteed to fail its allocation, triggering an OOM.
> 
> I was thinking of doing something like:
> 
>   large_pages = nr_pages / 32;
>   if (large_pages) {
>   pages = alloc_pages_node(cpu_to_node(cpu),
>   GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
>   if (pages)
>   /* break up pages */
>   else
>   /* try to allocate with NORETRY */
>   }

You can do so, of course. In fact it would have some advantages over
single pages because you would fragment the memory less but this is not
a reliable prevention from OOM killing and the complete memory
depletion if you allow arbitrary trace buffer sizes.

> Now it will allocate memory in 32 page chunks using reclaim. If it
> fails to allocate them, it would not have taken up any smaller chunks
> that were available, leaving them for other users. It would then go
> back to single pages, allocating with RETRY. Or I could just say screw
> it, and make the allocation of the ring buffer always be 32 page chunks
> (or at least make it user defined).

Yes, a fallback is questionable. Whether to make the batch size
configurable is a matter of how many internal details you want to
expose to userspace.
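For illustration, the batched approach discussed above might look
roughly like this (a sketch only; the function name is made up, order 5
means 32-page chunks, and the caller is assumed to free whatever ended
up on the list if this returns -ENOMEM):

        static int rb_alloc_pages_batched(int cpu, unsigned long nr_pages,
                                          struct list_head *pages)
        {
                unsigned long allocated = 0;
                unsigned long i;

                while (allocated + 32 <= nr_pages) {
                        struct page *chunk;

                        chunk = alloc_pages_node(cpu_to_node(cpu),
                                                 GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
                        if (!chunk)
                                break;          /* fall back to single pages */
                        split_page(chunk, 5);   /* turn it into 32 order-0 pages */
                        for (i = 0; i < 32; i++)
                                list_add(&(chunk + i)->lru, pages);
                        allocated += 32;
                }

                while (allocated < nr_pages) {
                        struct page *page;

                        page = alloc_pages_node(cpu_to_node(cpu),
                                                GFP_KERNEL | __GFP_NORETRY, 0);
                        if (!page)
                                return -ENOMEM; /* caller cleans up the list */
                        list_add(&page->lru, pages);
                        allocated++;
                }
                return 0;
        }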
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 15:56:07 +0200
Michal Hocko  wrote:

> On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> > On Tue, 3 Apr 2018 14:35:14 +0200
> > Michal Hocko  wrote:  
> [...]
> > > Being clever is OK if it doesn't add tricky code. And relying on
> > > si_mem_available is definitely tricky and obscure.  
> > 
> > Can we get the mm subsystem to provide a better method to know if an
> > allocation will possibly succeed or not before trying it? It doesn't
> > have to be free of races. Just "if I allocate this many pages right
> > now, will it work?" If that changes from the time it asks to the time
> > it allocates, that's fine. I'm not trying to prevent OOM from ever
> > triggering. I just don't want it to trigger consistently.  
> 
> How do you do that without an actual allocation request? And more
> fundamentally, what if your _particular_ request is just fine but it
> will get us so close to the OOM edge that the next legit allocation
> request simply goes OOM? There is simply no sane interface I can think
> of that would satisfy a safe/sensible "will it cause OOM" semantic.

That's where I'm fine with the admin shooting herself in the foot. If
they ask for almost all memory, and then the system needs more, that's
not our problem.

I'm more worried about putting in a couple of extra zeros by mistake,
which will pretty much guarantee an OOM on an active system.

>  
> > > > Perhaps I should try to allocate a large group of pages with
> > > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > > > that the large allocation may reclaim some memory that would allow the
> > > > NORETRY to succeed with smaller allocations (one page at a time)?
> > > 
> > > That again relies on subtle dependencies of the current
> > > implementation. So I would rather ask whether this is something that
> > > really deserves special treatment. If admin asks for a buffer of a
> > > certain size then try to do so. If we get OOM then bad luck you cannot
> > > get large memory buffers for free...  
> > 
> > That is not acceptable to me nor to the people asking for this.
> > 
> > The problem is known. The ring buffer allocates memory page by page,
> > and this can allow it to easily take all memory in the system before it
> > fails to allocate and free everything it had done.  
> 
> Then do not allow buffers that are too large. How often do you need
> buffers that are larger than a few megs or a small % of the available
> memory? Consuming an excessive amount of memory just to trace a workload
> which will need some memory of its own sounds dubious to me.

Recording hundreds of millions of events per second requires hundreds
of megs of memory. Large buffers are required for tracing. That's
extremely common.

> 
> > If you don't like the use of si_mem_available() I'll do the larger
> > pages method. Yes it depends on the current implementation of memory
> > allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
> > number of pages, and fail if it can't (leaving memory for other
> > allocations to succeed).
> > 
> > The allocation of the ring buffer isn't critical. It can fail to
> > expand, and we can tell the user -ENOMEM. I originally had NORETRY
> > because I'd rather have it fail than cause an OOM. But there's folks
> > (like Joel) that want it to succeed when there's available memory in
> > page caches.  
> 
> Then implement a retry logic on top of NORETRY. You can control how hard
> to retry to satisfy the request yourself. You still risk that your
> allocation will get us close to OOM for _somebody_ else though.
> 
> > I'm fine if the admin shoots herself in the foot if the ring buffer
> > gets big enough to start causing OOMs, but I don't want it to cause
> > OOMs if there's not even enough memory to fulfill the ring buffer size
> > itself.  
> 
> I simply do not see the difference between the two. Both have the same
> deadly effect in the end. The direct OOM has an arguable advantage that
> the effect is immediate, rather than subtle, with potential performance
> side effects until the machine OOMs after crawling for quite some time.

The difference is whether the allocation succeeds or not. If it doesn't
succeed, we free all memory that we tried to allocate. If it succeeds
and causes issues, then yes, that's the admin's fault. I'm worried about
accidentally putting in too big of a number, either by an admin by
mistake, or some stupid script that just thinks the current machine
has terabytes of memory.

I'm under the assumption that if I make an allocation of 32 pages
with RETRY_MAYFAIL, and there's 2 pages available, but not 32, and
while my allocation is reclaiming memory, another task comes in and
asks for a single page, it can still succeed. This would be why I would
be using RETRY_MAYFAIL with higher orders of pages, so that it doesn't
take all memory in the system if it fails. Is this assumption incorrect?

The current approach of allocating 1 page at a time with RETRY_MAYFAIL
is that it will succeed in getting any pages that are available, until
there are none, and if some unlucky task asks for memory during that
time, it is guaranteed to fail its allocation, triggering an OOM.

Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 14:35:14 +0200
> Michal Hocko  wrote:
[...]
> > Being clever is OK if it doesn't add a tricky code. And relying on
> > si_mem_available is definitely tricky and obscure.
> 
> Can we get the mm subsystem to provide a better method to know if an
> allocation will possibly succeed or not before trying it? It doesn't
> have to be free of races. Just "if I allocate this many pages right
> now, will it work?" If that changes from the time it asks to the time
> it allocates, that's fine. I'm not trying to prevent OOM from ever
> triggering. I just don't want it to trigger consistently.

How do you do that without an actual allocation request? And more
fundamentally, what if your _particular_ request is just fine but it
will get us so close to the OOM edge that the next legit allocation
request simply goes OOM? There is simply no sane interface I can think
of that would satisfy a safe/sensible "will it cause OOM" semantic.
 
> > > Perhaps I should try to allocate a large group of pages with
> > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > > that the large allocation may reclaim some memory that would allow the
> > > NORETRY to succeed with smaller allocations (one page at a time)?  
> > 
> > That again relies on subtle dependencies of the current
> > implementation. So I would rather ask whether this is something that
> > really deserves special treatment. If admin asks for a buffer of a
> > certain size then try to do so. If we get OOM then bad luck you cannot
> > get large memory buffers for free...
> 
> That is not acceptable to me nor to the people asking for this.
> 
> The problem is known. The ring buffer allocates memory page by page,
> and this can allow it to easily take all memory in the system before it
> fails to allocate and free everything it had done.

Then do not allow buffers that are too large. How often do you need
buffers that are larger than a few megs or a small % of the available
memory? Consuming an excessive amount of memory just to trace a workload
which will need some memory of its own sounds dubious to me.

> If you don't like the use of si_mem_available() I'll do the larger
> pages method. Yes it depends on the current implementation of memory
> allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
> number of pages, and fail if it can't (leaving memory for other
> allocations to succeed).
> 
> The allocation of the ring buffer isn't critical. It can fail to
> expand, and we can tell the user -ENOMEM. I originally had NORETRY
> because I'd rather have it fail than cause an OOM. But there's folks
> (like Joel) that want it to succeed when there's available memory in
> page caches.

Then implement a retry logic on top of NORETRY. You can control how hard
to retry to satisfy the request yourself. You still risk that your
allocation will get us close to OOM for _somebody_ else though.
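
For example, a bounded retry loop on top of __GFP_NORETRY could look
roughly like this (the retry count and the sleep are arbitrary, the
helper name is made up, untested):

    /*
     * Illustrative only: retry a __GFP_NORETRY single-page allocation
     * a bounded number of times, sleeping briefly between attempts.
     */
    static struct page *rb_alloc_page_retry(int cpu, int max_tries)
    {
            struct page *page;
            int i;

            for (i = 0; i < max_tries; i++) {
                    page = alloc_pages_node(cpu_to_node(cpu),
                                    GFP_KERNEL | __GFP_NORETRY |
                                    __GFP_NOWARN, 0);
                    if (page)
                            return page;
                    /* give kswapd and reclaim elsewhere a chance */
                    schedule_timeout_interruptible(msecs_to_jiffies(10));
            }
            return NULL;
    }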

> I'm fine if the admin shoots herself in the foot if the ring buffer
> gets big enough to start causing OOMs, but I don't want it to cause
> OOMs if there's not even enough memory to fulfill the ring buffer size
> itself.

I simply do not see the difference between the two. Both have the same
deadly effect in the end. The direct OOM has an arguable advantage that
the effect is immediate, rather than subtle, with potential performance
side effects until the machine OOMs after crawling for quite some time.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 14:35:14 +0200
Michal Hocko  wrote:

> > If we use NORETRY, then we have those that complain that we do not try
> > hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
> > issue of taking up all memory before we get what we want.  
> 
> Just try to do what the admin asks for and trust that they will not shoot
> themselves in the foot? I mean there are other ways an admin can shoot the
> machine down.

Allowing the admin to just shoot her foot is not an option.

Yes there are many ways to bring down a machine, but this shouldn't be
one of them. All one needs to do is echo too big of a number
into /sys/kernel/tracing/buffer_size_kb and OOM may kill a critical
program on a production machine. Tracing is made for production, and
should not allow an easy way to trigger OOM.

> Being clever is OK if it doesn't add a tricky code. And relying on
> si_mem_available is definitely tricky and obscure.

Can we get the mm subsystem to provide a better method to know if an
allocation will possibly succeed or not before trying it? It doesn't
have to be free of races. Just "if I allocate this many pages right
now, will it work?" If that changes from the time it asks to the time
it allocates, that's fine. I'm not trying to prevent OOM from ever
triggering. I just don't want it to trigger consistently.

> 
> > Perhaps I should try to allocate a large group of pages with
> > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > that the large allocation may reclaim some memory that would allow the
> > NORETRY to succeed with smaller allocations (one page at a time)?  
> 
> That again relies on subtle dependencies of the current
> implementation. So I would rather ask whether this is something that
> really deserves special treatment. If admin asks for a buffer of a
> certain size then try to do so. If we get OOM then bad luck you cannot
> get large memory buffers for free...

That is not acceptable to me nor to the people asking for this.

The problem is known. The ring buffer allocates memory page by page,
and this can allow it to easily take all memory in the system before it
fails to allocate and free everything it had done.

If you don't like the use of si_mem_available() I'll do the larger
pages method. Yes it depends on the current implementation of memory
allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
number of pages, and fail if it can't (leaving memory for other
allocations to succeed).

The allocation of the ring buffer isn't critical. It can fail to
expand, and we can tell the user -ENOMEM. I originally had NORETRY
because I'd rather have it fail than cause an OOM. But there's folks
(like Joel) that want it to succeed when there's available memory in
page caches.

I'm fine if the admin shoots herself in the foot if the ring buffer
gets big enough to start causing OOMs, but I don't want it to cause
OOMs if there's not even enough memory to fulfill the ring buffer size
itself.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Tue 03-04-18 08:23:48, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 14:16:14 +0200
> Michal Hocko  wrote:
> 
> > > This came up because there's scripts or programs that set the size of
> > > the ring buffer. The complaint was that the application would just set
> > > the size to something bigger than what was available and cause an OOM
> > > killing other applications. The final solution is to simply check the
> > > available memory before allocating the ring buffer:
> > > 
> > >   /* Check if the available memory is there first */
> > >   i = si_mem_available();
> > >   if (i < nr_pages)
> > >   return -ENOMEM;
> > > 
> > > And it works well.  
> > 
> > Except that it doesn't work. si_mem_available is not really suitable for
> > any allocation estimations. Its only purpose is to provide a very rough
> > estimation for userspace. Any other use is basically abuse. The
> > situation can change really quickly. Really it is really hard to be
> > clever here with the volatility the memory allocations can cause.
> 
> OK, then what do you suggest? Because currently, it appears to work. A
> rough estimate may be good enough.
> 
> If we use NORETRY, then we have those that complain that we do not try
> hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
> issue of taking up all memory before we get what we want.

Just try to do what the admin asks for and trust that they will not shoot
themselves in the foot? I mean there are other ways an admin can shoot the
machine down.
Being clever is OK if it doesn't add a tricky code. And relying on
si_mem_available is definitely tricky and obscure.

> Perhaps I should try to allocate a large group of pages with
> RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> that the large allocation may reclaim some memory that would allow the
> NORETRY to succeed with smaller allocations (one page at a time)?

That again relies on subtle dependencies of the current
implementation. So I would rather ask whether this is something that
really deserves special treatment. If admin asks for a buffer of a
certain size then try to do so. If we get OOM then bad luck you cannot
get large memory buffers for free...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 14:16:14 +0200
Michal Hocko  wrote:

> > This came up because there's scripts or programs that set the size of
> > the ring buffer. The complaint was that the application would just set
> > the size to something bigger than what was available and cause an OOM
> > killing other applications. The final solution is to simply check the
> > available memory before allocating the ring buffer:
> > 
> > /* Check if the available memory is there first */
> > i = si_mem_available();
> > if (i < nr_pages)
> > return -ENOMEM;
> > 
> > And it works well.  
> 
> Except that it doesn't work. si_mem_available is not really suitable for
> any allocation estimations. Its only purpose is to provide a very rough
> estimation for userspace. Any other use is basically abuse. The
> situation can change really quickly. Really it is really hard to be
> clever here with the volatility the memory allocations can cause.

OK, then what do you suggest? Because currently, it appears to work. A
rough estimate may be good enough.

If we use NORETRY, then we have those that complain that we do not try
hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
issue of taking up all memory before we get what we want.

Perhaps I should try to allocate a large group of pages with
RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
that the large allocation may reclaim some memory that would allow the
NORETRY to succeed with smaller allocations (one page at a time)?

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Tue 03-04-18 07:51:58, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 13:06:12 +0200
> Michal Hocko  wrote:
> 
> > > I wonder if I should have the ring buffer allocate groups of pages, to
> > > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > > may keep it from causing an OOM?  
> > 
> > I wonder why it really matters. The interface is root only and we expect
> > some sanity from an admin, right? So allocating such a large ring buffer
> > that it sends the system to the OOM is a sign that the admin should be
> > more careful. Balancing on the OOM edge is always a risk and the result
> > will highly depend on the workload running in parallel.
> 
> This came up because there's scripts or programs that set the size of
> the ring buffer. The complaint was that the application would just set
> the size to something bigger than what was available and cause an OOM
> killing other applications. The final solution is to simply check the
> available memory before allocating the ring buffer:
> 
>   /* Check if the available memory is there first */
>   i = si_mem_available();
>   if (i < nr_pages)
>   return -ENOMEM;
> 
> And it works well.

Except that it doesn't work. si_mem_available is not really suitable for
any allocation estimations. Its only purpose is to provide a very rough
estimation for userspace. Any other use is basically abuse. The
situation can change really quickly. Really it is really hard to be
clever here with the volatility the memory allocations can cause.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Steven Rostedt
On Tue, 3 Apr 2018 13:06:12 +0200
Michal Hocko  wrote:

> > I wonder if I should have the ring buffer allocate groups of pages, to
> > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > may keep it from causing an OOM?  
> 
> I wonder why it really matters. The interface is root only and we expect
> some sanity from an admin, right? So allocating such a large ring buffer
> that it sends the system to the OOM is a sign that the admin should be
> more careful. Balancing on the OOM edge is always a risk and the result
> will highly depend on the workload running in parallel.

This came up because there's scripts or programs that set the size of
the ring buffer. The complaint was that the application would just set
the size to something bigger than what was available and cause an OOM
killing other applications. The final solution is to simply check the
available memory before allocating the ring buffer:

/* Check if the available memory is there first */
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;

And it works well.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-03 Thread Michal Hocko
On Fri 30-03-18 10:20:38, Steven Rostedt wrote:
> 
> [ Adding memory management folks to discuss the issue ]
> 
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang  wrote:
> 
> > It is reported that some user app would like to echo a huge
> > number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
> >  of the available memory, which will cause the coinstantaneous
> > page allocation failed and introduce OOM. The commit checking the
> > val against the available mem first to avoid the consequence allocation.
> > 
> > Signed-off-by: Zhaoyang Huang 
> > ---
> >  kernel/trace/trace.c | 39 ++-
> >  1 file changed, 38 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 2d0ffcc..a4a4237 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -43,6 +43,8 @@
> >  #include 
> >  #include 
> >  
> > +#include 
> > +#include 
> >  #include "trace.h"
> >  #include "trace_output.h"
> >  
> > @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file 
> > *filp,
> > return ret;
> >  }
> >  
> > +static long get_available_mem(void)
> > +{
> > +   struct sysinfo i;
> > +   long available;
> > +   unsigned long pagecache;
> > +   unsigned long wmark_low = 0;
> > +   unsigned long pages[NR_LRU_LISTS];
> > +   struct zone *zone;
> > +   int lru;
> > +
> > +   si_meminfo(&i);
> > +   si_swapinfo(&i);
> > +
> > +   for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> > +   pages[lru] = global_page_state(NR_LRU_BASE + lru);
> > +
> > +   for_each_zone(zone)
> > +   wmark_low += zone->watermark[WMARK_LOW];
> > +
> > +   available = i.freeram - wmark_low;
> > +
> > +   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
> > +   pagecache -= min(pagecache / 2, wmark_low);
> > +   available += pagecache;
> > +
> > +   available += global_page_state(NR_SLAB_RECLAIMABLE) -
> > +   min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
> > +
> > +   if (available < 0)
> > +   available = 0;
> > +   return available;
> > +}
> > +
> 
> As I stated in my other reply, the above function does not belong in
> tracing.
> 
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were
> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still be allowed to fail, and isn't supposed
> to trigger an OOM. From my own tests, this is obviously not the case.
> 
> Perhaps this is because the ring buffer allocates one page at a time,
> and by doing so, it can get every last available page, and if anything
> in the mean time does an allocation without MAYFAIL, it will cause an
> OOM. For example, when I stressed this I triggered this:

Yes, this is indeed the case.

>  pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), 
> nodemask=(null), order=0, oom_score_adj=0
>  pool cpuset=/ mems_allowed=0
>  CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 
> 07/14/2016
>  Call Trace:
>   dump_stack+0x8e/0xce
>   dump_header.isra.30+0x6e/0x28f
>   ? _raw_spin_unlock_irqrestore+0x30/0x60
>   oom_kill_process+0x218/0x400
>   ? has_capability_noaudit+0x17/0x20
>   out_of_memory+0xe3/0x5c0
>   __alloc_pages_slowpath+0xa8e/0xe50
>   __alloc_pages_nodemask+0x206/0x220
>   alloc_pages_current+0x6a/0xe0
>   __page_cache_alloc+0x6a/0xa0
>   filemap_fault+0x208/0x5f0
>   ? __might_sleep+0x4a/0x80
>   ext4_filemap_fault+0x31/0x44
>   __do_fault+0x20/0xd0
>   __handle_mm_fault+0xc08/0x1160
>   handle_mm_fault+0x76/0x110
>   __do_page_fault+0x299/0x580
>   do_page_fault+0x2d/0x110
>   ? page_fault+0x2f/0x50
>   page_fault+0x45/0x50
> 
> I wonder if I should have the ring buffer allocate groups of pages, to
> avoid this. Or try to allocate with NORETRY, one page at a time, and
> when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> may keep it from causing an OOM?

I wonder why it really matters. The interface is root only and we expect
some sanity from an admin, right? So allocating such a large ring buffer
that it sends the system to the OOM is a sign that the admin should be
more careful. Balancing on the OOM edge is always a risk and the result
will highly depend on the workload running in parallel.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-04-01 Thread Zhaoyang Huang
On Sat, Mar 31, 2018 at 5:42 AM, Steven Rostedt  wrote:
> On Fri, 30 Mar 2018 17:30:31 -0400
> Steven Rostedt  wrote:
>
>> I'll take a look at si_mem_available() that Joel suggested and see if
>> we can make that work.
>
> Wow, this appears to work great! Joel and Zhaoyang, can you test this?
>
> -- Steve
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index a2fd3893cc02..32a803626ee2 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct 
> list_head *pages, int cpu)
> struct buffer_page *bpage, *tmp;
> long i;
>
> +   /* Check if the available memory is there first */
> +   i = si_mem_available();
> +   if (i < nr_pages)
> +   return -ENOMEM;
> +
> for (i = 0; i < nr_pages; i++) {
> struct page *page;
> /*
Hi Steve, It works as my previous patch does.


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Joel Fernandes
On Fri, Mar 30, 2018 at 8:07 PM, Steven Rostedt  wrote:
> On Fri, 30 Mar 2018 19:18:57 -0700
> Matthew Wilcox  wrote:
>
>> Again though, this is the same pattern as vmalloc.  There are any number
>> of places where userspace can cause an arbitrarily large vmalloc to be
>> attempted (grep for kvmalloc_array for a list of promising candidates).
>> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
>> __GFP_NOWARN will give you the exact behaviour that you want with no
>> need to grub around in the VM to find out if your huge allocation is
>> likely to succeed.
>
> Not sure how this helps. Note, I don't care about consecutive pages, so
> this is not an array. It's a linked list of thousands of pages. How do
> you suggest allocating them? The ring buffer is a linked list of pages.

Yeah, I didn't understand the suggestion either. If I remember
correctly, not using either NORETRY or RETRY_MAYFAIL, and just plain
GFP_KERNEL, was precisely what caused the buffer_size_kb write to cause an
OOM in my testing. So I think Steven's patch does the right thing in
checking in advance.

thanks,

- Joel


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 19:18:57 -0700
Matthew Wilcox  wrote:

> Again though, this is the same pattern as vmalloc.  There are any number
> of places where userspace can cause an arbitrarily large vmalloc to be
> attempted (grep for kvmalloc_array for a list of promising candidates).
> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
> __GFP_NOWARN will give you the exact behaviour that you want with no
> need to grub around in the VM to find out if your huge allocation is
> likely to succeed.

Not sure how this helps. Note, I don't care about consecutive pages, so
this is not an array. It's a linked list of thousands of pages. How do
you suggest allocating them? The ring buffer is a linked list of pages.

What I currently do is see how many more pages I need, allocate them
one at a time and put them in a temporary list; if that succeeds I add
them to the ring buffer, if not, I free the entire list (it's an
all-or-nothing operation).
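
In rough, simplified pseudo-C (not the actual ring buffer code), the
pattern is:

    /*
     * Simplified sketch of the all-or-nothing pattern: allocate
     * nr_pages one at a time into a temporary list, hand them over
     * on success, free them all on failure.
     */
    static int alloc_page_list(long nr_pages, struct list_head *out)
    {
            LIST_HEAD(tmp);
            struct page *page, *n;
            long i;

            for (i = 0; i < nr_pages; i++) {
                    page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL |
                                      __GFP_NOWARN);
                    if (!page)
                            goto out_free;
                    list_add(&page->lru, &tmp);
            }
            list_splice(&tmp, out);         /* success: hand all pages over */
            return 0;

    out_free:
            list_for_each_entry_safe(page, n, &tmp, lru)
                    __free_pages(page, 0);  /* all or nothing */
            return -ENOMEM;
    }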

The allocation I'm making doesn't warn. The problem is the
GFP_RETRY_MAYFAIL, which will try to allocate any possible memory in
the system. When it succeeds, the ring buffer allocation logic will
then try to allocate another page. If we add too many pages, we will
allocate all possible pages and then try to allocate more. This
allocation will fail without causing an OOM. That's not the problem.
The problem is if the system is active during this time, and something
else tries to do any allocation, after all memory has been consumed,
that allocation will fail. Then it will trigger an OOM.

I showed this in my Call Trace, that the allocation that failed during
my test was something completely unrelated, and that failure caused an
OOM.

What this last patch does is see if there's space available before it
even starts the process.

Maybe I'm missing something, but I don't see how NOWARN can help. My
allocations are not what is giving the warning.

-- Steve



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Matthew Wilcox
On Fri, Mar 30, 2018 at 09:41:51PM -0400, Steven Rostedt wrote:
> On Fri, 30 Mar 2018 16:38:52 -0700
> Joel Fernandes  wrote:
> 
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, 
> > > struct list_head *pages, int cpu)
> > > struct buffer_page *bpage, *tmp;
> > > long i;
> > >
> > > +   /* Check if the available memory is there first */
> > > +   i = si_mem_available();
> > > +   if (i < nr_pages)  
> > 
> > Does it make sense to add a small margin here so that after ftrace
> > finishes allocating, we still have some memory left for the system?
> But then we have to define a magic number :-|
> 
> I don't think so. The memory is allocated by user defined numbers. They
> can do "free" to see what is available. The original patch from
> Zhaoyang was due to a script that would just try a very large number
> and cause issues.
> 
> If the memory is available, I just say let them have it. This is
> borderline user space issue and not a kernel one.

Again though, this is the same pattern as vmalloc.  There are any number
of places where userspace can cause an arbitrarily large vmalloc to be
attempted (grep for kvmalloc_array for a list of promising candidates).
I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
__GFP_NOWARN will give you the exact behaviour that you want with no
need to grub around in the VM to find out if your huge allocation is
likely to succeed.



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 16:38:52 -0700
Joel Fernandes  wrote:

> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct 
> > list_head *pages, int cpu)
> > struct buffer_page *bpage, *tmp;
> > long i;
> >
> > +   /* Check if the available memory is there first */
> > +   i = si_mem_available();
> > +   if (i < nr_pages)  
> 
> Does it make sense to add a small margin here so that after ftrace
> finishes allocating, we still have some memory left for the system?
> But then we have to define a magic number :-|

I don't think so. The memory is allocated by user defined numbers. They
can do "free" to see what is available. The original patch from
Zhaoyang was due to a script that would just try a very large number
and cause issues.

If the memory is available, I just say let them have it. This is
borderline user space issue and not a kernel one.

> > +  
> 
> I tested in QEMU with 1GB memory; I am always able to get it to fail
> allocation even without this patch, without causing an OOM. Maybe I am
> not running enough allocations in parallel or something :)

Try just echoing in "100" into buffer_size_kb and see what happens.

> 
> The patch you shared using si_mem_available is working since I'm able
> to allocate till the end without a page allocation failure:
> 
> bash-4.3# echo 237800 > /d/tracing/buffer_size_kb
> bash: echo: write error: Cannot allocate memory
> bash-4.3# echo 237700 > /d/tracing/buffer_size_kb
> bash-4.3# free -m
>  total used free   shared  buffers
> Mem:   985  9777   100
> -/+ buffers:9777
> Swap:000
> bash-4.3#
> 
> I think this patch is still good to have, since IMO we should not go
> and get a page allocation failure (even if it's a non-OOM) and a subsequent
> stack dump from mm's allocator, if we can avoid it.
> 
> Tested-by: Joel Fernandes 

Great thanks! I'll make it into a formal patch.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Joel Fernandes
Hi Steven,

On Fri, Mar 30, 2018 at 2:42 PM, Steven Rostedt  wrote:
> On Fri, 30 Mar 2018 17:30:31 -0400
> Steven Rostedt  wrote:
>
>> I'll take a look at si_mem_available() that Joel suggested and see if
>> we can make that work.
>
> Wow, this appears to work great! Joel and Zhaoyang, can you test this?
>
> -- Steve
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index a2fd3893cc02..32a803626ee2 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct 
> list_head *pages, int cpu)
> struct buffer_page *bpage, *tmp;
> long i;
>
> +   /* Check if the available memory is there first */
> +   i = si_mem_available();
> +   if (i < nr_pages)

Does it make sense to add a small margin here so that after ftrace
finishes allocating, we still have some memory left for the system?
But then we have to define a magic number :-|

> +   return -ENOMEM;
> +

I tested in QEMU with 1GB memory; I am always able to get it to fail
allocation even without this patch, without causing an OOM. Maybe I am
not running enough allocations in parallel or something :)

The patch you shared using si_mem_available is working since I'm able
to allocate till the end without a page allocation failure:

bash-4.3# echo 237800 > /d/tracing/buffer_size_kb
bash: echo: write error: Cannot allocate memory
bash-4.3# echo 237700 > /d/tracing/buffer_size_kb
bash-4.3# free -m
 total used free   shared  buffers
Mem:   985  9777   100
-/+ buffers:9777
Swap:000
bash-4.3#

I think this patch is still good to have, since IMO we should not go
and get a page allocation failure (even if it's a non-OOM) and a subsequent
stack dump from mm's allocator, if we can avoid it.

Tested-by: Joel Fernandes 

thanks,

- Joel


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 17:30:31 -0400
Steven Rostedt  wrote:

> I'll take a look at si_mem_available() that Joel suggested and see if
> we can make that work.

Wow, this appears to work great! Joel and Zhaoyang, can you test this?

-- Steve

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a2fd3893cc02..32a803626ee2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct 
list_head *pages, int cpu)
struct buffer_page *bpage, *tmp;
long i;
 
+   /* Check if the available memory is there first */
+   i = si_mem_available();
+   if (i < nr_pages)
+   return -ENOMEM;
+
for (i = 0; i < nr_pages; i++) {
struct page *page;
/*


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 13:53:56 -0700
Matthew Wilcox  wrote:

> It seems to me that what you're asking for at the moment is
> lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely
> clear to me why your allocation is so much more important than other
> allocations in the kernel.

The ring buffer is for fast tracing and is allocated when a user
requests it. Usually there's plenty of memory, but when a user wants a
lot of tracing events stored, they may increase it themselves.

> 
> Also, the pattern you have is very close to that of vmalloc.  You're
> allocating one page at a time to satisfy a multi-page request.  In lieu
> of actually thinking about what you should do, I might recommend using the
> same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN
> (possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages
> when accessed from within the kernel).

When the ring buffer was first created, we couldn't use vmalloc because
vmalloc access wasn't working in NMIs (that has recently changed with
lots of work to handle faults). But the ring buffer is broken up into
pages (that are sent to the user or to the network), and allocating one
page at a time makes everything work fine.

The issue that happens when someone allocates a large ring buffer is
that it will allocate all memory in the system before it fails. This
means that there's a short time that any allocation will cause an OOM
(which is what is happening).

I think I agree with Joel and Zhaoyang, that we shouldn't allocate a
ring buffer if there's not going to be enough memory to do it. If we
can see the available memory before we start allocating one page at a
time, and if the available memory isn't going to be sufficient, there's
no reason to try to do the allocation; we can simply send the user
-ENOMEM and let them try something smaller.

I'll take a look at si_mem_available() that Joel suggested and see if
we can make that work.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Matthew Wilcox
On Fri, Mar 30, 2018 at 10:20:38AM -0400, Steven Rostedt wrote:
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were
> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still be allowed to fail, and isn't supposed
> to trigger an OOM. From my own tests, this is obviously not the case.

That's not exactly what the comment says in gfp.h:

 * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
 *   procedures that have previously failed if there is some indication
 *   that progress has been made else where.  It can wait for other
 *   tasks to attempt high level approaches to freeing memory such as
 *   compaction (which removes fragmentation) and page-out.
 *   There is still a definite limit to the number of retries, but it is
 *   a larger limit than with __GFP_NORETRY.
 *   Allocations with this flag may fail, but only when there is
 *   genuinely little unused memory. While these allocations do not
 *   directly trigger the OOM killer, their failure indicates that
 *   the system is likely to need to use the OOM killer soon.  The
 *   caller must handle failure, but can reasonably do so by failing
 *   a higher-level request, or completing it only in a much less
 *   efficient manner.
 *   If the allocation does fail, and the caller is in a position to
 *   free some non-essential memory, doing so could benefit the system
 *   as a whole.

It seems to me that what you're asking for at the moment is
lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely
clear to me why your allocation is so much more important than other
allocations in the kernel.

Also, the pattern you have is very close to that of vmalloc.  You're
allocating one page at a time to satisfy a multi-page request.  In lieu
of actually thinking about what you should do, I might recommend using the
same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN
(possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages
when accessed from within the kernel).
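
Concretely, that suggestion amounts to something like the following
sketch (hypothetical helper name, not a tested patch):

    /*
     * Allocate with the flags vmalloc() uses for its backing pages;
     * with __GFP_HIGHMEM the page must be kmap()ed before the kernel
     * touches it.
     */
    static void *rb_alloc_mapped_page(struct page **pagep)
    {
            struct page *page = alloc_page(GFP_KERNEL | __GFP_NOWARN |
                                           __GFP_HIGHMEM);

            if (!page)
                    return NULL;
            *pagep = page;
            return kmap(page);      /* caller must kunmap() when done */
    }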



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Joel Fernandes
Hi Steve,

On Fri, Mar 30, 2018 at 12:10 PM, Steven Rostedt  wrote:
[..]
>> > I wonder if I should have the ring buffer allocate groups of pages, to
>> > avoid this. Or try to allocate with NORETRY, one page at a time, and
>> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
>> > may keep it from causing an OOM?
>> >
>>
>> I don't see immediately how that can prevent an OOM in other
>> applications here? If ftrace allocates lots of memory with
>> RETRY_MAYFAIL, then we would still OOM in other applications if memory
>> isn't available. Sorry if I missed something.
>
> Here's the idea.
>
> Allocate one page at a time with NORETRY. If that fails, then allocate
> larger amounts (higher order of pages) with RETRY_MAYFAIL. Then if it
> can't get all the memory it needs, it won't take up all memory in the
> system before it finds out that it can't have any more.
>
> Or perhaps the memory management system can provide a
> get_available_mem() function that ftrace could call before it tries to
> increase the ring buffer and take up all the memory of the system
> before it realizes that it can't get all the memory it wants.
>
> The main problem I have with Zhaoyang's patch is that
> get_available_mem() does not belong in the tracing code. It should be
> something that the mm subsystem provides.
>

Cool. Personally I like the get-available-memory solution and would
use that, since it's simpler.

MM already provides it through si_mem_available since the commit
"mm/page_alloc.c: calculate 'available' memory in a separate function"
(sha d02bd27b). Maybe we could just use that?

MemAvailable was initially added in commit "/proc/meminfo: provide
estimated available memory" (sha 34e431b0ae39)

thanks,

- Joel


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 09:37:58 -0700
Joel Fernandes  wrote:

> > That said, it appears you are having issues that were caused by the
> > change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> > NORETRY was to keep allocations of the tracing ring-buffer from causing
> > OOMs. But the RETRY was too strong in that case, because there were  
> 
> Yes, this was discussed with -mm folks. Basically the problem we were
> seeing is that on devices with tonnes of free memory (free, but sitting
> in the page cache) that memory was not being used, so ring buffer
> allocation was unnecessarily failing on systems with otherwise lots of
> memory.

Right.

> 
> IIRC, the OOM that my patch was trying to avoid, was being triggered
> in the path/context of the write to buffer_size_kb itself (when not
> doing the NORETRY),  not by other processes.

Yes, that is correct.

> 
> > Perhaps this is because the ring buffer allocates one page at a time,
> > and by doing so, it can get every last available page, and if anything
> > in the mean time does an allocation without MAYFAIL, it will cause an
> > OOM. For example, when I stressed this I triggered this:
> >
> >  pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), 
> > nodemask=(null), order=0, oom_score_adj=0
> >  pool cpuset=/ mems_allowed=0
> >  CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
> >  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 
> > v03.03 07/14/2016
> >  Call Trace:
> >   dump_stack+0x8e/0xce
> >   dump_header.isra.30+0x6e/0x28f
> >   ? _raw_spin_unlock_irqrestore+0x30/0x60
> >   oom_kill_process+0x218/0x400
> >   ? has_capability_noaudit+0x17/0x20
> >   out_of_memory+0xe3/0x5c0
> >   __alloc_pages_slowpath+0xa8e/0xe50
> >   __alloc_pages_nodemask+0x206/0x220
> >   alloc_pages_current+0x6a/0xe0
> >   __page_cache_alloc+0x6a/0xa0
> >   filemap_fault+0x208/0x5f0
> >   ? __might_sleep+0x4a/0x80
> >   ext4_filemap_fault+0x31/0x44
> >   __do_fault+0x20/0xd0
> >   __handle_mm_fault+0xc08/0x1160
> >   handle_mm_fault+0x76/0x110
> >   __do_page_fault+0x299/0x580
> >   do_page_fault+0x2d/0x110
> >   ? page_fault+0x2f/0x50
> >   page_fault+0x45/0x50  
> 
> But this OOM is not in the path of the buffer_size_kb write, right? So
> then what does it have to do with buffer_size_kb write failure?

Yep. I'll explain below.

> 
> I guess the original issue reported is that the buffer_size_kb write
> causes *other* applications to fail allocation. So in that case,
> capping the amount that ftrace writes makes sense. Basically my point
> is I don't see how the patch you mentioned introduces the problem here
> - in the sense the patch just makes ftrace allocate from memory it
> couldn't before and to try harder.

The issue is that ftrace allocates its ring buffer one page at a time.
Thus, when a RETRY_MAYFAIL succeeds, that memory is allocated. Since it
does it one page at a time, even if ftrace does not get all the memory
it needs at the end, it will take all memory from the system before it
finds that out. Then, if something else (like the above splat) tries to
allocate anything, it will fail and trigger an OOM.

> 
> >
> > I wonder if I should have the ring buffer allocate groups of pages, to
> > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > may keep it from causing an OOM?
> >  
> 
> I don't see immediately how that can prevent an OOM in other
> applications here? If ftrace allocates lots of memory with
> RETRY_MAYFAIL, then we would still OOM in other applications if memory
> isn't available. Sorry if I missed something.

Here's the idea.

Allocate one page at a time with NORETRY. If that fails, then allocate
larger amounts (higher order of pages) with RETRY_MAYFAIL. Then if it
can't get all the memory it needs, it won't take up all memory in the
system before it finds out that it can't have any more.
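
As a rough, untested sketch (made-up helper name, order picked
arbitrarily; the caller would split_page() a returned block and free
everything if the full request cannot be met):

    /*
     * Hand out single __GFP_NORETRY pages while they are cheap; once
     * that starts failing, switch to order-5 blocks with
     * __GFP_RETRY_MAYFAIL so one failed large request ends the
     * expansion instead of draining the last free pages one by one.
     */
    static struct page *rb_alloc_next(int cpu, unsigned int *order)
    {
            struct page *page;

            if (*order == 0) {
                    page = alloc_pages_node(cpu_to_node(cpu),
                                    GFP_KERNEL | __GFP_NORETRY |
                                    __GFP_NOWARN, 0);
                    if (page)
                            return page;
                    *order = 5;     /* single pages ran out */
            }
            return alloc_pages_node(cpu_to_node(cpu),
                            GFP_KERNEL | __GFP_RETRY_MAYFAIL |
                            __GFP_NOWARN, *order);
    }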

Or perhaps the memory management system can provide a
get_available_mem() function that ftrace could call before it tries to
increase the ring buffer and take up all the memory of the system
before it realizes that it can't get all the memory it wants.

The main problem I have with Zhaoyang's patch is that
get_available_mem() does not belong in the tracing code. It should be
something that the mm subsystem provides.

-- Steve


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Joel Fernandes
Hi Steve,

On Fri, Mar 30, 2018 at 7:20 AM, Steven Rostedt  wrote:
>
> [ Adding memory management folks to discuss the issue ]
>
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang  wrote:
>
>> It is reported that some user app would like to echo a huge
>> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
>>  of the available memory, which will cause the coinstantaneous
>> page allocation failed and introduce OOM. The commit checking the
>> val against the available mem first to avoid the consequence allocation.
>>
>> Signed-off-by: Zhaoyang Huang 
>> ---
>>  kernel/trace/trace.c | 39 ++-
>>  1 file changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index 2d0ffcc..a4a4237 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -43,6 +43,8 @@
>>  #include 
>>  #include 
>>
>> +#include 
>> +#include 
>>  #include "trace.h"
>>  #include "trace_output.h"
>>
>> @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file 
>> *filp,
>>   return ret;
>>  }
>>
>> +static long get_available_mem(void)
>> +{
>> + struct sysinfo i;
>> + long available;
>> + unsigned long pagecache;
>> + unsigned long wmark_low = 0;
>> + unsigned long pages[NR_LRU_LISTS];
>> + struct zone *zone;
>> + int lru;
>> +
>> + si_meminfo(&i);
>> + si_swapinfo(&i);
>> +
>> + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
>> + pages[lru] = global_page_state(NR_LRU_BASE + lru);
>> +
>> + for_each_zone(zone)
>> + wmark_low += zone->watermark[WMARK_LOW];
>> +
>> + available = i.freeram - wmark_low;
>> +
>> + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
>> + pagecache -= min(pagecache / 2, wmark_low);
>> + available += pagecache;
>> +
>> + available += global_page_state(NR_SLAB_RECLAIMABLE) -
>> + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
>> +
>> + if (available < 0)
>> + available = 0;
>> + return available;
>> +}
>> +
>
> As I stated in my other reply, the above function does not belong in
> tracing.
>
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were

Yes, this was discussed with -mm folks. Basically the problem we were
seeing is that on devices with tonnes of free memory (free, but sitting
in the page cache) that memory was not being used, so ring buffer
allocation was unnecessarily failing on systems with otherwise lots of
memory.

> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still be allowed to fail, and isn't supposed
> to trigger an OOM. From my own tests, this is obviously not the case.
>

IIRC, the OOM that my patch was trying to avoid, was being triggered
in the path/context of the write to buffer_size_kb itself (when not
doing the NORETRY),  not by other processes.

> Perhaps this is because the ring buffer allocates one page at a time,
> and by doing so, it can get every last available page, and if anything
> in the mean time does an allocation without MAYFAIL, it will cause an
> OOM. For example, when I stressed this I triggered this:
>
>  pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), 
> nodemask=(null), order=0, oom_score_adj=0
>  pool cpuset=/ mems_allowed=0
>  CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 
> 07/14/2016
>  Call Trace:
>   dump_stack+0x8e/0xce
>   dump_header.isra.30+0x6e/0x28f
>   ? _raw_spin_unlock_irqrestore+0x30/0x60
>   oom_kill_process+0x218/0x400
>   ? has_capability_noaudit+0x17/0x20
>   out_of_memory+0xe3/0x5c0
>   __alloc_pages_slowpath+0xa8e/0xe50
>   __alloc_pages_nodemask+0x206/0x220
>   alloc_pages_current+0x6a/0xe0
>   __page_cache_alloc+0x6a/0xa0
>   filemap_fault+0x208/0x5f0
>   ? __might_sleep+0x4a/0x80
>   ext4_filemap_fault+0x31/0x44
>   __do_fault+0x20/0xd0
>   __handle_mm_fault+0xc08/0x1160
>   handle_mm_fault+0x76/0x110
>   __do_page_fault+0x299/0x580
>   do_page_fault+0x2d/0x110
>   ? page_fault+0x2f/0x50
>   page_fault+0x45/0x50

But this OOM is not in the path of the buffer_size_kb write, right? So
then what does it have to do with buffer_size_kb write failure?

I guess the original issue reported is that the buffer_size_kb write
causes *other* applications to fail allocation. So in that case,
capping the amount that ftrace writes makes sense. Basically my point
is I don't see how the patch you mentioned introduces the problem here
- in the sense the patch just makes ftrace allocate from memory it
couldn't before and to try harder.

Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt

[ Adding memory management folks to discuss the issue ]

On Thu, 29 Mar 2018 18:41:44 +0800
Zhaoyang Huang  wrote:

> It is reported that some user app would like to echo a huge
> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
>  of the available memory, which will cause the coinstantaneous
> page allocation failed and introduce OOM. The commit checking the
> val against the available mem first to avoid the consequence allocation.
> 
> Signed-off-by: Zhaoyang Huang 
> ---
>  kernel/trace/trace.c | 39 ++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2d0ffcc..a4a4237 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -43,6 +43,8 @@
>  #include 
>  #include 
>  
> +#include 
> +#include 
>  #include "trace.h"
>  #include "trace_output.h"
>  
> @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file 
> *filp,
>   return ret;
>  }
>  
> +static long get_available_mem(void)
> +{
> + struct sysinfo i;
> + long available;
> + unsigned long pagecache;
> + unsigned long wmark_low = 0;
> + unsigned long pages[NR_LRU_LISTS];
> + struct zone *zone;
> + int lru;
> +
> + si_meminfo(&i);
> + si_swapinfo(&i);
> +
> + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> + pages[lru] = global_page_state(NR_LRU_BASE + lru);
> +
> + for_each_zone(zone)
> + wmark_low += zone->watermark[WMARK_LOW];
> +
> + available = i.freeram - wmark_low;
> +
> + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
> + pagecache -= min(pagecache / 2, wmark_low);
> + available += pagecache;
> +
> + available += global_page_state(NR_SLAB_RECLAIMABLE) -
> + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
> +
> + if (available < 0)
> + available = 0;
> + return available;
> +}
> +

As I stated in my other reply, the above function does not belong in
tracing.

That said, it appears you are having issues that were caused by the
change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
NORETRY was to keep allocations of the tracing ring-buffer from causing
OOMs. But the RETRY was too strong in that case, because there were
those that wanted to allocate large ring buffers but it would fail due
to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
is to allocate with reclaim but still be allowed to fail, and isn't supposed
to trigger an OOM. From my own tests, this is obviously not the case.

Perhaps this is because the ring buffer allocates one page at a time,
and by doing so, it can get every last available page, and if anything
in the mean time does an allocation without MAYFAIL, it will cause an
OOM. For example, when I stressed this I triggered this:

 pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), 
nodemask=(null), order=0, oom_score_adj=0
 pool cpuset=/ mems_allowed=0
 CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 
07/14/2016
 Call Trace:
  dump_stack+0x8e/0xce
  dump_header.isra.30+0x6e/0x28f
  ? _raw_spin_unlock_irqrestore+0x30/0x60
  oom_kill_process+0x218/0x400
  ? has_capability_noaudit+0x17/0x20
  out_of_memory+0xe3/0x5c0
  __alloc_pages_slowpath+0xa8e/0xe50
  __alloc_pages_nodemask+0x206/0x220
  alloc_pages_current+0x6a/0xe0
  __page_cache_alloc+0x6a/0xa0
  filemap_fault+0x208/0x5f0
  ? __might_sleep+0x4a/0x80
  ext4_filemap_fault+0x31/0x44
  __do_fault+0x20/0xd0
  __handle_mm_fault+0xc08/0x1160
  handle_mm_fault+0x76/0x110
  __do_page_fault+0x299/0x580
  do_page_fault+0x2d/0x110
  ? page_fault+0x2f/0x50
  page_fault+0x45/0x50

I wonder if I should have the ring buffer allocate groups of pages, to
avoid this. Or try to allocate with NORETRY, one page at a time, and
when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
may keep it from causing an OOM?

Thoughts?

-- Steve



>  static ssize_t
>  tracing_entries_write(struct file *filp, const char __user *ubuf,
> size_t cnt, loff_t *ppos)
> @@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file 
> *filp,
>   struct trace_array *tr = inode->i_private;
>   unsigned long val;
>   int ret;
> + long available;
>  
> + available = get_available_mem();
>   ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
>   if (ret)
>   return ret;
>  
>   /* must have at least 1 entry */
> - if (!val)
> + if (!val || (val > available))
>   return -EINVAL;
>  
>   /* value is in KB */



Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-30 Thread Steven Rostedt
On Fri, 30 Mar 2018 11:32:22 +0800
Zhaoyang Huang  wrote:

> Steve, thanks for your quick feedback. The original purpose is to
> avoid this kind of OOM, as the kernel cannot define the behavior of the
> application. I think there is no need to do the alloc->free process if
> we already know the number of pages will definitely lead to failure.
> Furthermore, the app which screws things up usually escapes the OOM and
> causes an innocent process to be sacrificed.

A couple of things. Your patch goes way too deep into coupling with the
memory management subsystem. If I were to accept this patch, Linus
would have a fit. That get_available_mem() does not belong in the
tracing code. If you want that feature, you need to get the mm folks to
accept it, and then tracing could call it.

Next, this may be an issue with GFP_RETRY_MAYFAIL, which should not
trigger OOM (although, when it fails, other allocations that happen at
that time, before the ring buffer can clean up, will cause an OOM).
I'll send an email to folks about this.

-- Steve


Re: [Kernel-patch-test] [PATCH v1] kernel/trace:check the val against the available mem

2018-03-29 Thread kbuild test robot
Hi Zhaoyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917
config: x86_64-randconfig-x009-201812 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   kernel//trace/trace.c: In function 'get_available_mem':
>> kernel//trace/trace.c:5992:16: error: implicit declaration of function 
>> 'global_page_state'; did you mean 'global_numa_state'? 
>> [-Werror=implicit-function-declaration]
  pages[lru] = global_page_state(NR_LRU_BASE + lru);
   ^
   global_numa_state
   In file included from include/asm-generic/bug.h:18:0,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from include/linux/ring_buffer.h:5,
from kernel//trace/trace.c:14:
   include/linux/kernel.h:793:16: warning: comparison of distinct pointer types 
lacks a cast
 (void) (&min1 == &min2);   \
   ^
   include/linux/kernel.h:802:2: note: in expansion of macro '__min'
 __min(typeof(x), typeof(y),   \
 ^
   kernel//trace/trace.c:6004:3: note: in expansion of macro 'min'
  min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
  ^~~

Re: [Kernel-patch-test] [PATCH v1] kernel/trace:check the val against the available mem

2018-03-29 Thread kbuild test robot
Hi Zhaoyang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917
config: i386-randconfig-x073-201812 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   kernel/trace/trace.c: In function 'get_available_mem':
>> kernel/trace/trace.c:5992:16: error: implicit declaration of function 
>> 'global_page_state'; did you mean 'zone_page_state'? 
>> [-Werror=implicit-function-declaration]
  pages[lru] = global_page_state(NR_LRU_BASE + lru);
   ^
   zone_page_state
   In file included from include/asm-generic/bug.h:18:0,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from include/linux/ring_buffer.h:5,
from kernel/trace/trace.c:14:
   include/linux/kernel.h:793:16: warning: comparison of distinct pointer types 
lacks a cast
 (void) (&min1 == &min2);   \
   ^
   include/linux/kernel.h:802:2: note: in expansion of macro '__min'
 __min(typeof(x), typeof(y),   \
 ^
>> kernel/trace/trace.c:6004:3: note: in expansion of macro 'min'
  min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
  ^~~
   cc1: some warnings being treated as errors

vim +5992 kernel/trace/trace.c

  5977  
  5978  static long get_available_mem(void)
  5979  {
  5980  struct sysinfo i;
  5981  long available;
  5982  unsigned long pagecache;
  5983  unsigned long wmark_low = 0;
  5984  unsigned long pages[NR_LRU_LISTS];
  5985  struct zone *zone;
  5986  int lru;
  5987  
  5988  si_meminfo(&i);
  5989  si_swapinfo(&i);
  5990  
  5991  for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> 5992  pages[lru] = global_page_state(NR_LRU_BASE + lru);
  5993  
  5994  for_each_zone(zone)
  5995  wmark_low += zone->watermark[WMARK_LOW];
  5996  
  5997  available = i.freeram - wmark_low;
  5998  
  5999  pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
  6000  pagecache -= min(pagecache / 2, wmark_low);
  6001  available += pagecache;
  6002  
  6003  available += global_page_state(NR_SLAB_RECLAIMABLE) -
> 6004  min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
  6005  
  6006  if (available < 0)
  6007  available = 0;
  6008  return available;
  6009  }
  6010  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
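
The failure reported above is a rebase problem rather than a logic
problem: global_page_state() was renamed to global_zone_page_state()
around v4.14, and the LRU and reclaimable-slab counters the patch reads
are per-node statistics on those trees, read via
global_node_page_state(). On the trees the robot builds against, the two
flagged reads would presumably need to be spelled roughly as in this
untested sketch, which is not part of the posted patch:

	/*
	 * Untested sketch of the v4.14+ spelling of the two reads the
	 * robot flags; global_node_page_state() replaces
	 * global_page_state() for these node counters.
	 */
	for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);

	available += global_node_page_state(NR_SLAB_RECLAIMABLE) -
		     min(global_node_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);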


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-29 Thread Zhaoyang Huang
On Fri, Mar 30, 2018 at 12:05 AM, Steven Rostedt  wrote:
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang  wrote:
>
>> It is reported that some user applications echo a huge number to
>> "/sys/kernel/debug/tracing/buffer_size_kb" regardless of the available
>> memory, which makes the subsequent page allocations fail and triggers
>> an OOM. This commit checks the value against the available memory first
>> to avoid the doomed allocation.
>>
>
> One of my tests is to stress buffer_size_kb, and it fails nicely if you
> try to get too much. It may cause an OOM, but that's expected.
>
> The application should do the test (try "free" on the command line).
> This isn't something that the kernel should be responsible for. If
> someone wants to allocate all memory for tracing, that's their
> prerogative.
>
> -- Steve
Steve, thanks for your quick feedback. The original purpose is to avoid
this kind of OOM, since the kernel cannot control the behavior of the
application. I think there is no need to go through the alloc->free cycle
when we already know the number of pages will definitely lead to failure.
Furthermore, the app that screws things up usually escapes the OOM killer,
and an innocent process is sacrificed instead.


Re: [PATCH v1] kernel/trace:check the val against the available mem

2018-03-29 Thread Steven Rostedt
On Thu, 29 Mar 2018 18:41:44 +0800
Zhaoyang Huang  wrote:

> It is reported that some user applications echo a huge number to
> "/sys/kernel/debug/tracing/buffer_size_kb" regardless of the available
> memory, which makes the subsequent page allocations fail and triggers
> an OOM. This commit checks the value against the available memory first
> to avoid the doomed allocation.
> 

One of my tests is to stress buffer_size_kb, and it fails nicely if you
try to get too much. It may cause an OOM, but that's expected.

The application should do the test (try "free" on the command line).
This isn't something that the kernel should be responsible for. If
someone wants to allocate all memory for tracing, that's their
prerogative.

-- Steve
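
A minimal userspace sketch of the check being suggested here: read
MemAvailable from /proc/meminfo before requesting a large trace buffer.
The requested size below is only an example:

	/*
	 * Userspace sketch of the suggested check: read MemAvailable from
	 * /proc/meminfo and refuse to request more trace buffer than that.
	 */
	#include <stdio.h>

	static long mem_available_kb(void)
	{
		char line[256];
		long kb = -1;
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
				break;
		}
		fclose(f);
		return kb;
	}

	int main(void)
	{
		long want_kb = 8 * 1024 * 1024;	/* example: ask for 8 GB */
		long avail_kb = mem_available_kb();

		if (avail_kb < 0 || want_kb > avail_kb) {
			fprintf(stderr, "only %ld kB available, not requesting %ld kB\n",
				avail_kb, want_kb);
			return 1;
		}
		/* now safe-ish to: echo want_kb > /sys/kernel/debug/tracing/buffer_size_kb */
		return 0;
	}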


[PATCH v1] kernel/trace:check the val against the available mem

2018-03-29 Thread Zhaoyang Huang
It is reported that some user applications echo a huge number to
"/sys/kernel/debug/tracing/buffer_size_kb" regardless of the available
memory, which makes the subsequent page allocations fail and triggers
an OOM. This commit checks the value against the available memory first
to avoid the doomed allocation.

Signed-off-by: Zhaoyang Huang 
---
 kernel/trace/trace.c | 39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2d0ffcc..a4a4237 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -43,6 +43,8 @@
 #include 
 #include 
 
+#include 
+#include 
 #include "trace.h"
 #include "trace_output.h"
 
@@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
return ret;
 }
 
+static long get_available_mem(void)
+{
+   struct sysinfo i;
+   long available;
+   unsigned long pagecache;
+   unsigned long wmark_low = 0;
+   unsigned long pages[NR_LRU_LISTS];
+   struct zone *zone;
+   int lru;
+
+   si_meminfo(&i);
+   si_swapinfo(&i);
+
+   for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+   pages[lru] = global_page_state(NR_LRU_BASE + lru);
+
+   for_each_zone(zone)
+   wmark_low += zone->watermark[WMARK_LOW];
+
+   available = i.freeram - wmark_low;
+
+   pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+   pagecache -= min(pagecache / 2, wmark_low);
+   available += pagecache;
+
+   available += global_page_state(NR_SLAB_RECLAIMABLE) -
+   min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
+
+   if (available < 0)
+   available = 0;
+   return available;
+}
+
 static ssize_t
 tracing_entries_write(struct file *filp, const char __user *ubuf,
  size_t cnt, loff_t *ppos)
@@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
struct trace_array *tr = inode->i_private;
unsigned long val;
int ret;
+   long available;
 
+   available = get_available_mem();
ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
if (ret)
return ret;
 
/* must have at least 1 entry */
-   if (!val)
+   if (!val || (val > available))
return -EINVAL;
 
/* value is in KB */
-- 
1.9.1
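
As a footnote to the patch above: the heuristic that get_available_mem()
open-codes is essentially what the kernel already exports as
si_mem_available() (declared in include/linux/mm.h, returning an estimate
in pages), and buffer_size_kb sets the size of each per-CPU buffer, so any
comparison also needs a unit and CPU-count conversion. A rough, untested
sketch of that variant, with a hypothetical helper name and not the posted
code:

	/*
	 * Sketch, not the posted patch: reuse si_mem_available() instead of
	 * re-deriving the MemAvailable estimate by hand.  val_kb is the
	 * requested per-CPU buffer size in KB; si_mem_available() returns an
	 * estimate in pages, so convert before comparing.
	 */
	#include <linux/mm.h>
	#include <linux/cpumask.h>

	static bool tracing_resize_plausible(unsigned long val_kb)
	{
		unsigned long long avail_kb;

		avail_kb = (unsigned long long)si_mem_available() << (PAGE_SHIFT - 10);

		return (unsigned long long)val_kb * num_online_cpus() <= avail_kb;
	}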