Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu 05-04-18 09:15:01, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > > Argh.  The comment confused me.  OK, now I've read the source and
> > > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > > as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> > > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > > cause an OOM.  (And that's why we're having a conversation)
> >
> > Well, I can understand how this can be confusing. All the confusion
> > boils down to the small-allocations-never-fail semantic we have. So all
> > the reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
> > only modify the _default_ behavior.
>
> Now that I understand the flag, I'll try to write a clearer
> explanation.

Good luck with that. It took me several iterations to land at the
current state. It is quite hard to be understandable without being
misleading.

> > > That's a problem because we have places in the kernel that call
> > > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > > which will do the exact same thing, only it will trigger OOM all by itself
> > > (assuming the largest free chunk of address space in the vmalloc area
> > > is larger than the amount of free memory).
> >
> > Well, hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
> > heritage that you are not so proud of.
>
> Certainly not, but that's not what I'm concerned about; I'm concerned
> about the allocation of the pages, not the allocation of the array
> containing the page pointers.

Those pages will use the gfp flags you give to vmalloc IIRC. It is the
page tables that are GFP_KERNEL unconditionally.

> > > We could also have a GFP flag that says to only succeed if we're further
> > > above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> > > if you like.  That would give us the desired behaviour of trying all of
> > > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > > all the memory that GFP_KERNEL allocations would take.
> >
> > Well, I would be really careful with yet another gfp mask. They are so
> > incredibly hard to define properly and then people kinda tend to screw
> > your best intentions with their usecases ;)
> > Failing on the low wmark is very close to __GFP_NORETRY or even
> > GFP_NOWAIT, btw. So let's try to not overthink this...
>
> Oh, indeed.  We must be able to clearly communicate to users when they
> should use this flag.  I have in mind something like this:
>
>  * __GFP_HIGH indicates that the caller is high-priority and that granting
>  * the request is necessary before the system can make forward progress.
>  * For example, creating an IO context to clean pages.
>  *
>  * __GFP_LOW indicates that the caller is low-priority and that it should
>  * not be allocated pages that would cause the system to get into an
>  * out-of-memory situation.  For example, allocating multiple individual
>  * pages in order to satisfy a larger request.

So how exactly does __GFP_LOW fit in with GFP_NOWAIT, __GFP_NORETRY and
__GFP_RETRY_MAYFAIL? We _do_ have several levels of how hard to try, and
we have users relying on that. And do not forget about costly vs.
non-costly sizes.

That being said, we should not hijack this thread any further; such
enhancements should be discussed separately. I am all for making the
whole allocation API less obscure, but keep in mind that we have some
really hard historical restrictions.
--
Michal Hocko
SUSE Labs
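As a reference point for the exchange above, the escalation levels being discussed can be summarized in a gfp.h-style comment. This is a paraphrase of the thread, not the actual kernel documentation text:

	/*
	 * Roughly how hard a GFP_KERNEL request tries, in increasing order.
	 * Remember the default: small ("non-costly") orders never fail.
	 *
	 *   GFP_NOWAIT                       - no direct reclaim, fail fast
	 *   GFP_KERNEL | __GFP_NORETRY       - at most one reclaim cycle,
	 *                                      then fail; never OOM-kills
	 *   GFP_KERNEL | __GFP_RETRY_MAYFAIL - reclaims as hard as GFP_KERNEL,
	 *                                      but fails instead of invoking
	 *                                      the OOM killer itself
	 *   GFP_KERNEL                       - non-costly orders retry
	 *                                      indefinitely and may OOM-kill
	 *   GFP_KERNEL | __GFP_NOFAIL        - loops until success
	 */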
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > Argh.  The comment confused me.  OK, now I've read the source and
> > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > cause an OOM.  (And that's why we're having a conversation)
>
> Well, I can understand how this can be confusing. All the confusion
> boils down to the small-allocations-never-fail semantic we have. So all
> the reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
> only modify the _default_ behavior.

Now that I understand the flag, I'll try to write a clearer
explanation.

> > That's a problem because we have places in the kernel that call
> > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > which will do the exact same thing, only it will trigger OOM all by itself
> > (assuming the largest free chunk of address space in the vmalloc area
> > is larger than the amount of free memory).
>
> Well, hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
> heritage that you are not so proud of.

Certainly not, but that's not what I'm concerned about; I'm concerned
about the allocation of the pages, not the allocation of the array
containing the page pointers.

> > We could also have a GFP flag that says to only succeed if we're further
> > above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> > if you like.  That would give us the desired behaviour of trying all of
> > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > all the memory that GFP_KERNEL allocations would take.
>
> Well, I would be really careful with yet another gfp mask. They are so
> incredibly hard to define properly and then people kinda tend to screw
> your best intentions with their usecases ;)
> Failing on the low wmark is very close to __GFP_NORETRY or even
> GFP_NOWAIT, btw. So let's try to not overthink this...

Oh, indeed.  We must be able to clearly communicate to users when they
should use this flag.  I have in mind something like this:

 * __GFP_HIGH indicates that the caller is high-priority and that granting
 * the request is necessary before the system can make forward progress.
 * For example, creating an IO context to clean pages.
 *
 * __GFP_LOW indicates that the caller is low-priority and that it should
 * not be allocated pages that would cause the system to get into an
 * out-of-memory situation.  For example, allocating multiple individual
 * pages in order to satisfy a larger request.

I think this should actually replace __GFP_RETRY_MAYFAIL.  It makes sense
to a user: "This is a low priority GFP_KERNEL allocation".  I doubt
there's one kernel hacker in a hundred who could explain what
__GFP_RETRY_MAYFAIL does, exactly, and I'm not just saying that because
I got it wrong ;-)
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox wrote:
> > > > > I still don't get why you want RETRY_MAYFAIL.  You know that tries
> > > > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > > > that seems like the exact opposite of what you want.
>
> Argh.  The comment confused me.  OK, now I've read the source and
> understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
> simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> cause an OOM.  (And that's why we're having a conversation)

Well, I can understand how this can be confusing. All the confusion
boils down to the small-allocations-never-fail semantic we have. So all
the reclaim modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY)
only modify the _default_ behavior.

> That's a problem because we have places in the kernel that call
> kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> which will do the exact same thing, only it will trigger OOM all by itself
> (assuming the largest free chunk of address space in the vmalloc area
> is larger than the amount of free memory).

Well, hardcoded GFP_KERNEL in the vmalloc guts is yet another, ehm,
heritage that you are not so proud of.

> I considered an alloc_page_array(), but that doesn't fit well with the
> design of the ring buffer code.  We could have:
>
> struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);
>
> and link the allocated pages together through page->lru.
>
> We could also have a GFP flag that says to only succeed if we're further
> above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
> if you like.  That would give us the desired behaviour of trying all of
> the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> all the memory that GFP_KERNEL allocations would take.

Well, I would be really careful with yet another gfp mask. They are so
incredibly hard to define properly and then people kinda tend to screw
your best intentions with their usecases ;)
Failing on the low wmark is very close to __GFP_NORETRY or even
GFP_NOWAIT, btw. So let's try to not overthink this...
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox wrote:
> > > > I still don't get why you want RETRY_MAYFAIL.  You know that tries
> > > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > > that seems like the exact opposite of what you want.

Argh.  The comment confused me.  OK, now I've read the source and
understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
as GFP_KERNEL *except* that it won't cause OOM itself.  But any other
simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
cause an OOM.  (And that's why we're having a conversation)

That's a problem because we have places in the kernel that call
kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
which will do the exact same thing, only it will trigger OOM all by itself
(assuming the largest free chunk of address space in the vmalloc area
is larger than the amount of free memory).

I considered an alloc_page_array(), but that doesn't fit well with the
design of the ring buffer code.  We could have:

	struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);

and link the allocated pages together through page->lru.

We could also have a GFP flag that says to only succeed if we're further
above the existing watermark than normal.  __GFP_LOW (==ALLOC_LOW),
if you like.  That would give us the desired behaviour of trying all of
the reclaim methods that GFP_KERNEL would, but not being able to exhaust
all the memory that GFP_KERNEL allocations would take.
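A minimal sketch of the helper proposed above, under the assumption that the caller supplies the list head (the email's version returns a struct page * instead; alloc_page_list_node() is not an existing kernel API):

	/* Allocate nr order-0 pages on a node, linked through page->lru. */
	static int alloc_page_list_node(int nid, gfp_t gfp_mask,
					unsigned long nr, struct list_head *pages)
	{
		struct page *page, *tmp;
		unsigned long i;

		for (i = 0; i < nr; i++) {
			page = alloc_pages_node(nid, gfp_mask, 0);
			if (!page)
				goto free;
			list_add_tail(&page->lru, pages);
		}
		return 0;
	free:
		/* Undo partial progress so a failure leaves no trace. */
		list_for_each_entry_safe(page, tmp, pages, lru) {
			list_del(&page->lru);
			__free_page(page);
		}
		return -ENOMEM;
	}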
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu, 5 Apr 2018 16:27:49 +0200
Michal Hocko wrote:

> > I understand you don't want GFP_NORETRY.  But why is it more important for
> > this allocation to succeed than other normal GFP_KERNEL allocations?
>
> I guess they simply want a failure rather than OOM even when they can
> shoot themselves in the head by using oom_origin. It is still quite ugly
> to see an OOM report...

Exactly!

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu, 5 Apr 2018 07:22:58 -0700
Matthew Wilcox wrote:

> I understand you don't want GFP_NORETRY.  But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?

Not sure what you mean by "more important"? Does saying "RETRY_MAYFAIL"
make it more important?

The difference is, if GFP_KERNEL fails, we don't want to trigger an OOM;
we simply clean up and report -ENOMEM to the user. It has nothing to do
with being more important than other allocations. If there's 100 megs of
memory available, and the user requests a gig of memory, it's going to
fail. Ideally, it doesn't trigger an OOM, but instead simply reports
-ENOMEM to the user.

-- Steve
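The "clean up and report -ENOMEM" path looks roughly like the following, modeled on the free_pages label in kernel/trace/ring_buffer.c (free_buffer_page() is the ring buffer's own helper; the exact details are approximate):

	free_pages:
		/* Undo every buffer page allocated so far, then fail nicely. */
		list_for_each_entry_safe(bpage, tmp, pages, list) {
			list_del_init(&bpage->list);
			free_buffer_page(bpage);
		}
		return -ENOMEM;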
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox wrote:
> > > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> > >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> > >> check (at the end of the loop though) appears to have OOM consistently
> > >> kill this task.
> > >>
> > >> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> > >> nothing comes in and tries to do an allocation, but instead will fail
> > >> nicely with -ENOMEM.
> > >
> > > I still don't get why you want RETRY_MAYFAIL.  You know that tries
> > > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > > that seems like the exact opposite of what you want.
> >
> > No. We do want it to try harder, but not if it's already set up for failure.
>
> I understand you don't want GFP_NORETRY.  But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?

I guess they simply want a failure rather than OOM even when they can
shoot themselves in the head by using oom_origin. It is still quite ugly
to see an OOM report...
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox wrote:
> > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> >> check (at the end of the loop though) appears to have OOM consistently
> >> kill this task.
> >>
> >> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> >> nothing comes in and tries to do an allocation, but instead will fail
> >> nicely with -ENOMEM.
> >
> > I still don't get why you want RETRY_MAYFAIL.  You know that tries
> > *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> > that seems like the exact opposite of what you want.
>
> No. We do want it to try harder, but not if it's already set up for failure.

I understand you don't want GFP_NORETRY.  But why is it more important for
this allocation to succeed than other normal GFP_KERNEL allocations?
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox wrote:
> On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
>> I originally was going to remove the RETRY_MAYFAIL, but adding this
>> check (at the end of the loop though) appears to have OOM consistently
>> kill this task.
>>
>> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
>> nothing comes in and tries to do an allocation, but instead will fail
>> nicely with -ENOMEM.
>
> I still don't get why you want RETRY_MAYFAIL.  You know that tries
> *harder* to allocate memory than plain GFP_KERNEL does, right?  And
> that seems like the exact opposite of what you want.

No. We do want it to try harder, but not if it's already set up for failure.
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> I originally was going to remove the RETRY_MAYFAIL, but adding this
> check (at the end of the loop though) appears to have OOM consistently
> kill this task.
>
> I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
> nothing comes in and tries to do an allocation, but instead will fail
> nicely with -ENOMEM.

I still don't get why you want RETRY_MAYFAIL.  You know that tries
*harder* to allocate memory than plain GFP_KERNEL does, right?  And
that seems like the exact opposite of what you want.
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko wrote:

> > I tried it out, I did the following:
> >
> > 	set_current_oom_origin();
> > 	for (i = 0; i < nr_pages; i++) {
> > 		struct page *page;
> > 		/*
> > 		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > 		 * gracefully without invoking the oom-killer and the system is not
> > 		 * destabilized.
> > 		 */
> > 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > 				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > 				     cpu_to_node(cpu));
> > 		if (!bpage)
> > 			goto free_pages;
> >
> > 		list_add(&bpage->list, pages);
> >
> > 		page = alloc_pages_node(cpu_to_node(cpu),
> > 					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > 		if (!page)
> > 			goto free_pages;
>
> 	if (fatal_signal_pending(current))
> 		goto free_pages;

I originally was going to remove the RETRY_MAYFAIL, but adding this
check (at the end of the loop though) appears to have OOM consistently
kill this task.

I'd still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
nothing comes in and tries to do an allocation, but instead will fail
nicely with -ENOMEM.

-- Steve

> > 		bpage->page = page_address(page);
> > 		rb_init_page(bpage->page);
> > 	}
> > 	clear_current_oom_origin();
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 17:27:13 +0200
Michal Hocko wrote:

> I am afraid I cannot help you much more though.

No, you have actually been a great help. I'm finishing up a patch on top
of this one. I'll be posting it soon.

Thanks for your help and your time!

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 11:04:42, Steven Rostedt wrote:
[...]
> I'm not looking for perfect. In fact, I love what si_mem_available()
> gives me now! Sure, it can say "there's enough memory" even if I can't
> use it. Because most of the OOM allocations that happen with increasing
> the size of the ring buffer aren't due to "just enough memory
> allocated", but to "trying to allocate crazy amounts of memory".
> That's because it does the allocation one page at a time, and
> if you try to allocate crazy amounts of memory, it will allocate all
> memory before it fails. I don't want that. I want crazy allocations to
> fail from the start. A "maybe this will allocate" is fine even if it
> will end up causing an OOM.

OK, fair enough. It's your code ;) I would recommend using the
oom_origin thingy to reduce the immediate damage and to have a clear
culprit, so that I do not have to scratch my head over why we see an OOM
report with a lot of unaccounted memory... I am afraid I cannot help you
much more though.
--
Michal Hocko
SUSE Labs
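The oom_origin bracketing Michal recommends is only a pair of calls around the allocation. A minimal sketch, where do_the_large_allocation() is a placeholder for the real page loop:

	#include <linux/oom.h>

	set_current_oom_origin();	/* if the OOM killer fires, pick us */
	ret = do_the_large_allocation();
	clear_current_oom_origin();
	if (ret)
		return -ENOMEM;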
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 16:42:55 +0200
Michal Hocko wrote:

> On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 16:10:52 +0200
> > Michal Hocko wrote:
> >
> > > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > > [...]
> > > > +	/*
> > > > +	 * Check if the available memory is there first.
> > > > +	 * Note, si_mem_available() only gives us a rough estimate of available
> > > > +	 * memory. It may not be accurate. But we don't care, we just want
> > > > +	 * to prevent doing any allocation when it is obvious that it is
> > > > +	 * not going to succeed.
> > > > +	 */
> > > > +	i = si_mem_available();
> > > > +	if (i < nr_pages)
> > > > +		return -ENOMEM;
> > > > +
> > > >
> > > > Better?
> > >
> > > I must be really missing something here. How can that work at all for
> > > e.g. zone_{highmem/movable}? You will get false on the above tests
> > > even when you will have a hard time allocating anything from your
> > > destination zones.
> >
> > You mean we will get true on the above tests? Again, the current
> > method is to just say screw it and try to allocate.
>
> No, you will get false on that test. Say that you have a system with

Ah, I was thinking backwards: I read false as meaning "not enough
memory", where if the test (i < nr_pages) is true we fail, so false
means there is enough memory. OK, we are in agreement.

> a large ZONE_MOVABLE. Now your kernel allocations can fit only into the
> !movable zones (say we have 1G for !movable and 3G for movable). Now say
> that the !movable zones are getting close to the edge while the movable
> zones are full of reclaimable pages. si_mem_available will tell you there
> is a _lot_ of memory available while your GFP_KERNEL request will happily
> consume the rest of the !movable zones and trigger OOM. See?

Which is still better than what we have today. I'm fine with it.
Really, I am.

> [...]
> > I'm looking for something where "yes" means "there may be enough, but
> > there may not be, buyer beware", and "no" means "forget it, don't even
> > start, because you just asked for more than possible".
>
> We do not have _that_ something other than trying to opportunistically
> allocate and seeing what happens. Sucks? Maybe yes, but I really cannot
> think of an interface with sane semantics that would catch all the
> different scenarios.

And I'm fine with that too. I don't want to catch all the different
scenarios. I want to catch just the crazy ones, like trying to allocate
gigs of memory when there's only a few megs left. Those can easily
happen with the current interface, which can't change.

I'm not looking for perfect. In fact, I love what si_mem_available()
gives me now! Sure, it can say "there's enough memory" even if I can't
use it. Because most of the OOM allocations that happen with increasing
the size of the ring buffer aren't due to "just enough memory
allocated", but to "trying to allocate crazy amounts of memory". That's
because it does the allocation one page at a time, and if you try to
allocate crazy amounts of memory, it will allocate all memory before it
fails. I don't want that. I want crazy allocations to fail from the
start. A "maybe this will allocate" is fine even if it will end up
causing an OOM.

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 10:31:11, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:23:29 +0200
> Michal Hocko wrote:
>
> > On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > > On Wed, 4 Apr 2018 08:23:40 +0200
> > > Michal Hocko wrote:
> > >
> > > > If you are afraid of that then you can have a look at
> > > > {set,clear}_current_oom_origin()
> > > > which will automatically select the current process as an oom victim and
> > > > kill it.
> > >
> > > Would it even receive the signal? Does alloc_pages_node() even respond
> > > to signals? Because the OOM happens while the allocation loop is
> > > running.
> >
> > Well, you would need to do something like:
> >
> > > I tried it out, I did the following:
> > >
> > > 	set_current_oom_origin();
> > > 	for (i = 0; i < nr_pages; i++) {
> > > 		struct page *page;
> > > 		/*
> > > 		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > > 		 * gracefully without invoking the oom-killer and the system is not
> > > 		 * destabilized.
> > > 		 */
> > > 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > > 				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > > 				     cpu_to_node(cpu));
> > > 		if (!bpage)
> > > 			goto free_pages;
> > >
> > > 		list_add(&bpage->list, pages);
> > >
> > > 		page = alloc_pages_node(cpu_to_node(cpu),
> > > 					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > > 		if (!page)
> > > 			goto free_pages;
> >
> > 	if (fatal_signal_pending(current))
> > 		goto free_pages;
>
> But wouldn't page be NULL in this case?

__GFP_RETRY_MAYFAIL itself fails rather than triggering the OOM killer.
You still might get killed from another allocation context which can
trigger the OOM killer though. In any case you would back off and fail,
no?

> > > 		bpage->page = page_address(page);
> > > 		rb_init_page(bpage->page);
> > > 	}
> > > 	clear_current_oom_origin();
> >
> > If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
> > triggers the OOM killer, and this user context would get killed. If you
> > drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
> > but it would still be the selected victim.
>
> Then we guarantee to kill the process instead of just sending
> -ENOMEM, which would change the user space ABI, and is a NO NO.

I see. Although I would expect it would be echo writing to a file most
of the time. But I am not really familiar with what tracers usually do,
so I will not speculate.

> Ideally, we want to avoid an OOM. I could add the above as well, when
> si_mem_available() returns something that is greater than what is
> available, and at least this is the process that will get the OOM if it
> fails to allocate.
>
> Would that work for you?

I have responded wrt si_mem_available in another email, but yes, using
oom_origin would reduce the immediate damage at least.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:10:52 +0200
> Michal Hocko wrote:
>
> > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > [...]
> > > +	/*
> > > +	 * Check if the available memory is there first.
> > > +	 * Note, si_mem_available() only gives us a rough estimate of available
> > > +	 * memory. It may not be accurate. But we don't care, we just want
> > > +	 * to prevent doing any allocation when it is obvious that it is
> > > +	 * not going to succeed.
> > > +	 */
> > > +	i = si_mem_available();
> > > +	if (i < nr_pages)
> > > +		return -ENOMEM;
> > > +
> > >
> > > Better?
> >
> > I must be really missing something here. How can that work at all for
> > e.g. zone_{highmem/movable}? You will get false on the above tests
> > even when you will have a hard time allocating anything from your
> > destination zones.
>
> You mean we will get true on the above tests? Again, the current
> method is to just say screw it and try to allocate.

No, you will get false on that test. Say that you have a system with a
large ZONE_MOVABLE. Now your kernel allocations can fit only into the
!movable zones (say we have 1G for !movable and 3G for movable). Now say
that the !movable zones are getting close to the edge while the movable
zones are full of reclaimable pages. si_mem_available will tell you there
is a _lot_ of memory available while your GFP_KERNEL request will happily
consume the rest of the !movable zones and trigger OOM. See?

[...]
> I'm looking for something where "yes" means "there may be enough, but
> there may not be, buyer beware", and "no" means "forget it, don't even
> start, because you just asked for more than possible".

We do not have _that_ something other than trying to opportunistically
allocate and seeing what happens. Sucks? Maybe yes, but I really cannot
think of an interface with sane semantics that would catch all the
different scenarios.
--
Michal Hocko
SUSE Labs
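Michal's 1G/3G scenario as a toy calculation, in 4K pages (all numbers are illustrative; this models only the accounting, not the allocator):

	#include <stdio.h>

	int main(void)
	{
		const long PAGES_PER_MB = 256;			/* 4K pages */
		long unmovable_free = 200 * PAGES_PER_MB;	/* !movable nearly full */
		long movable_avail  = 3000 * PAGES_PER_MB;	/* reclaimable page cache */
		long mem_available  = unmovable_free + movable_avail;
		long nr_pages       = 1024 * PAGES_PER_MB;	/* 1G GFP_KERNEL request */

		if (nr_pages < mem_available)
			printf("gate passes (%ld < %ld pages), but GFP_KERNEL can only\n"
			       "draw on %ld pages -> OOM\n",
			       nr_pages, mem_available, unmovable_free);
		return 0;
	}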
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko wrote:

> On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 08:23:40 +0200
> > Michal Hocko wrote:
> >
> > > If you are afraid of that then you can have a look at
> > > {set,clear}_current_oom_origin()
> > > which will automatically select the current process as an oom victim and
> > > kill it.
> >
> > Would it even receive the signal? Does alloc_pages_node() even respond
> > to signals? Because the OOM happens while the allocation loop is
> > running.
>
> Well, you would need to do something like:
>
> > I tried it out, I did the following:
> >
> > 	set_current_oom_origin();
> > 	for (i = 0; i < nr_pages; i++) {
> > 		struct page *page;
> > 		/*
> > 		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > 		 * gracefully without invoking the oom-killer and the system is not
> > 		 * destabilized.
> > 		 */
> > 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > 				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > 				     cpu_to_node(cpu));
> > 		if (!bpage)
> > 			goto free_pages;
> >
> > 		list_add(&bpage->list, pages);
> >
> > 		page = alloc_pages_node(cpu_to_node(cpu),
> > 					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > 		if (!page)
> > 			goto free_pages;
>
> 	if (fatal_signal_pending(current))
> 		goto free_pages;

But wouldn't page be NULL in this case?

> > 		bpage->page = page_address(page);
> > 		rb_init_page(bpage->page);
> > 	}
> > 	clear_current_oom_origin();
>
> If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
> triggers the OOM killer, and this user context would get killed. If you
> drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
> but it would still be the selected victim.

Then we guarantee to kill the process instead of just sending
-ENOMEM, which would change the user space ABI, and is a NO NO.

Ideally, we want to avoid an OOM. I could add the above as well, when
si_mem_available() returns something that is greater than what is
available, and at least this is the process that will get the OOM if it
fails to allocate.

Would that work for you?

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 16:10:52 +0200
Michal Hocko wrote:

> On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> [...]
> > +	/*
> > +	 * Check if the available memory is there first.
> > +	 * Note, si_mem_available() only gives us a rough estimate of available
> > +	 * memory. It may not be accurate. But we don't care, we just want
> > +	 * to prevent doing any allocation when it is obvious that it is
> > +	 * not going to succeed.
> > +	 */
> > +	i = si_mem_available();
> > +	if (i < nr_pages)
> > +		return -ENOMEM;
> > +
> >
> > Better?
>
> I must be really missing something here. How can that work at all for
> e.g. zone_{highmem/movable}? You will get false on the above tests
> even when you will have a hard time allocating anything from your
> destination zones.

You mean we will get true on the above tests? Again, the current
method is to just say screw it and try to allocate.

I originally just used NORETRY, which would only allocate memory that is
currently available and not try to reclaim anything. But people like
Joel at Google, who needed to increase the buffer when memory was mostly
taken up by page cache, changed it from NORETRY to RETRY_MAYFAIL.

But this now causes the issue that a large allocation can take up all
memory even when the allocation requested is guaranteed to fail, because
there is not enough memory to pull it off.

We just want a way to say "hey, is there enough memory in the system to
allocate all these pages before we try? We don't need specifics, we just
want to make sure we are not allocating way too much". The answer I want
is "yes, there may be enough (but you may not be able to use it)" or
"no, there is definitely not enough for that". Currently
si_mem_available() is the closest thing we have to answering that
question. I'm fine if the answer is "yes" even if I can't allocate that
memory.

I'm looking for something where "yes" means "there may be enough, but
there may not be, buyer beware", and "no" means "forget it, don't even
start, because you just asked for more than possible".

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 08:23:40 +0200
> Michal Hocko wrote:
>
> > If you are afraid of that then you can have a look at
> > {set,clear}_current_oom_origin()
> > which will automatically select the current process as an oom victim and
> > kill it.
>
> Would it even receive the signal? Does alloc_pages_node() even respond
> to signals? Because the OOM happens while the allocation loop is
> running.

Well, you would need to do something like:

> I tried it out, I did the following:
>
> 	set_current_oom_origin();
> 	for (i = 0; i < nr_pages; i++) {
> 		struct page *page;
> 		/*
> 		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> 		 * gracefully without invoking the oom-killer and the system is not
> 		 * destabilized.
> 		 */
> 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> 				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> 				     cpu_to_node(cpu));
> 		if (!bpage)
> 			goto free_pages;
>
> 		list_add(&bpage->list, pages);
>
> 		page = alloc_pages_node(cpu_to_node(cpu),
> 					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> 		if (!page)
> 			goto free_pages;

	if (fatal_signal_pending(current))
		goto free_pages;

> 		bpage->page = page_address(page);
> 		rb_init_page(bpage->page);
> 	}
> 	clear_current_oom_origin();

If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
triggers the OOM killer, and this user context would get killed. If you
drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM,
but it would still be the selected victim.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 08:23:40 +0200
Michal Hocko wrote:

> If you are afraid of that then you can have a look at
> {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.

Would it even receive the signal? Does alloc_pages_node() even respond
to signals? Because the OOM happens while the allocation loop is
running.

I tried it out, I did the following:

	set_current_oom_origin();
	for (i = 0; i < nr_pages; i++) {
		struct page *page;
		/*
		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
		 * gracefully without invoking the oom-killer and the system is not
		 * destabilized.
		 */
		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
				     cpu_to_node(cpu));
		if (!bpage)
			goto free_pages;

		list_add(&bpage->list, pages);

		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
		if (!page)
			goto free_pages;

		bpage->page = page_address(page);
		rb_init_page(bpage->page);
	}
	clear_current_oom_origin();

The first time I ran my ring buffer memory stress test, it killed the
stress test. The second time I ran it, it killed polkitd. It still
doesn't help as much as the original patch.

You haven't convinced me that using si_mem_available() is a bad idea.
If anything, you've solidified my confidence in it.

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
[...]
> +	/*
> +	 * Check if the available memory is there first.
> +	 * Note, si_mem_available() only gives us a rough estimate of available
> +	 * memory. It may not be accurate. But we don't care, we just want
> +	 * to prevent doing any allocation when it is obvious that it is
> +	 * not going to succeed.
> +	 */
> +	i = si_mem_available();
> +	if (i < nr_pages)
> +		return -ENOMEM;
> +
>
> Better?

I must be really missing something here. How can that work at all for
e.g. zone_{highmem/movable}? You will get false on the above tests even
when you will have a hard time allocating anything from your destination
zones.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, 4 Apr 2018 08:20:39 +0200
Michal Hocko wrote:

> > Now can you please explain to me why si_mem_available is not suitable
> > for my purpose.
>
> Several problems. It is overly optimistic especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be
> pinned long enough that your allocation based on that estimation will
> just make the situation worse and result in OOM. More importantly
> though, your allocations are GFP_KERNEL, right? That means that such an
> allocation will not reach into ZONE_MOVABLE or ZONE_HIGHMEM (32b
> systems) while the pagecache will. So you will get an overestimate of
> how much you can allocate.
>
> Really, si_mem_available is for proc/meminfo and a rough estimate of the
> free memory, because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some
> skepticism about how useful it is in practice apart from showing it in
> top or similar tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.

What you are telling me is that this is perfect for my use case. I'm not
looking for a "if this tells me I have enough memory, then I have enough
memory". I'm looking for a "if I screwed up and asked for a magnitude
more than I really need, don't OOM the system".

Really, I don't care if the number is not truly accurate. In fact, what
you tell me above is exactly what I wanted. I'm more worried it will
return a smaller number than what is available. I much prefer an
overestimate.

This is not about trying to get as much memory for tracing as possible,
where we slowly increase the buffer size till we have pretty much every
page for tracing. If someone does that, then the system should OOM and
become unstable.

This is about doing what I (and others) have done several times, which
is to put in one or two more zeros than I really wanted. Or to forget
that the number written to buffer_size_kb is the buffer size for each
CPU. Yes, the number you write in there is multiplied by every CPU on
the system. It is easy to over-allocate by mistake.

I'm looking to protect against gross mistakes where it is obvious that
the allocation isn't going to succeed before the allocating begins. I'm
not looking to be perfect here. As I've stated before, the current
method is to say F*ck You to the rest of the system and OOM anything
else.

If you want, I can change the comment above the code to be:

+	/*
+	 * Check if the available memory is there first.
+	 * Note, si_mem_available() only gives us a rough estimate of available
+	 * memory. It may not be accurate. But we don't care, we just want
+	 * to prevent doing any allocation when it is obvious that it is
+	 * not going to succeed.
+	 */
+	i = si_mem_available();
+	if (i < nr_pages)
+		return -ENOMEM;
+

Better?

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, Apr 3, 2018 at 11:20 PM, Michal Hocko wrote:
> On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
> [...]
>> From your earlier email:
>>
>> > Except that it doesn't work. si_mem_available is not really suitable for
>> > any allocation estimations. Its only purpose is to provide a very rough
>> > estimation for userspace. Any other use is basically abuse. The
>> > situation can change really quickly. Really it is really hard to be
>> > clever here with the volatility the memory allocations can cause.
>>
>> Now can you please explain to me why si_mem_available is not suitable
>> for my purpose.
>
> Several problems. It is overly optimistic especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be
> pinned long enough that your allocation based on that estimation will
> just make the situation worse and result in OOM. More importantly
> though, your allocations are GFP_KERNEL, right? That means that such an
> allocation will not reach into ZONE_MOVABLE or ZONE_HIGHMEM (32b
> systems) while the pagecache will. So you will get an overestimate of
> how much you can allocate.

Yes, but right now it is assumed that there is all the memory in the
world to allocate. Clearly an overestimate is better than that. Right
now ftrace will just allocate memory until it actually fails to
allocate, at which point it may have caused other processes to be likely
to OOM. As Steve pointed out, it doesn't need to be accurate, but it
does solve the problem here (or some other similar method seems to be
needed to solve it).

> Really, si_mem_available is for proc/meminfo and a rough estimate of the
> free memory, because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some

But why is a rough estimate not good in this case? That's what I don't
get.

> skepticism about how useful it is in practice apart from showing it in
> top or similar tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.

Sure, there can be false positives, but it does reduce the problem for
the most part, I feel. Maybe we can use this as an interim solution
(better than leaving the issue hanging)? Or you could propose a solution
for how to get an estimate and prevent other tasks from experiencing
memory pressure for no reason.

Another thing we could do is check how much total memory there is on the
system and cap by that (which will prevent impossibly large allocations),
but that still doesn't address the problem completely.

thanks,

- Joel
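Joel's "cap by total memory" fallback is a one-liner with si_meminfo(). A sketch (in-kernel, si.totalram is reported in pages, i.e. si.mem_unit == PAGE_SIZE):

	struct sysinfo si;

	si_meminfo(&si);
	/* Reject requests that would not fit even on an idle machine. */
	if (nr_pages > si.totalram)
		return -ENOMEM;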
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed, Apr 4, 2018 at 2:23 PM, Michal Hocko wrote:
> On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
>> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko wrote:
>> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> >> On Tue, 3 Apr 2018 14:35:14 +0200
>> >> Michal Hocko wrote:
>> > [...]
>> >> > Being clever is OK if it doesn't add tricky code. And relying on
>> >> > si_mem_available is definitely tricky and obscure.
>> >>
>> >> Can we get the mm subsystem to provide a better method to know if an
>> >> allocation will possibly succeed or not before trying it? It doesn't
>> >> have to be free of races. Just "if I allocate this many pages right
>> >> now, will it work?" If that changes from the time it asks to the time
>> >> it allocates, that's fine. I'm not trying to prevent OOM from ever
>> >> triggering. I just don't want it to trigger consistently.
>> >
>> > How do you do that without an actual allocation request? And more
>> > fundamentally, what if your _particular_ request is just fine but it
>> > will get us so close to the OOM edge that the next legit allocation
>> > request simply goes OOM? There is simply no sane interface I can think
>> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
>> >
>> The point is that an app which tries to allocate beyond the line will
>> escape the OOM killer and let other innocent processes be sacrificed.
>> However, the one which you mentioned above will possibly be selected by
>> an OOM triggered by the consequent failed allocation.
>
> If you are afraid of that then you can have a look at
> {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.

But we cannot call the function on behalf of the current process, which
may not want to be killed for memory reasons. It is more proper to tell
it ENOMEM and let it make a further decision.

> --
> Michal Hocko
> SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko wrote:
> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> >> On Tue, 3 Apr 2018 14:35:14 +0200
> >> Michal Hocko wrote:
> > [...]
> >> > Being clever is OK if it doesn't add tricky code. And relying on
> >> > si_mem_available is definitely tricky and obscure.
> >>
> >> Can we get the mm subsystem to provide a better method to know if an
> >> allocation will possibly succeed or not before trying it? It doesn't
> >> have to be free of races. Just "if I allocate this many pages right
> >> now, will it work?" If that changes from the time it asks to the time
> >> it allocates, that's fine. I'm not trying to prevent OOM from ever
> >> triggering. I just don't want it to trigger consistently.
> >
> > How do you do that without an actual allocation request? And more
> > fundamentally, what if your _particular_ request is just fine but it
> > will get us so close to the OOM edge that the next legit allocation
> > request simply goes OOM? There is simply no sane interface I can think
> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
> >
> The point is that an app which tries to allocate beyond the line will
> escape the OOM killer and let other innocent processes be sacrificed.
> However, the one which you mentioned above will possibly be selected by
> an OOM triggered by the consequent failed allocation.

If you are afraid of that then you can have a look at
{set,clear}_current_oom_origin()
which will automatically select the current process as an oom victim and
kill it.
--
Michal Hocko
SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
[...]
> From your earlier email:
>
> > Except that it doesn't work. si_mem_available is not really suitable for
> > any allocation estimations. Its only purpose is to provide a very rough
> > estimation for userspace. Any other use is basically abuse. The
> > situation can change really quickly. Really it is really hard to be
> > clever here with the volatility the memory allocations can cause.
>
> Now can you please explain to me why si_mem_available is not suitable
> for my purpose.

Several problems. It is overly optimistic especially when we are close
to OOM. The available pagecache or slab reclaimable objects might be
pinned long enough that your allocation based on that estimation will
just make the situation worse and result in OOM. More importantly
though, your allocations are GFP_KERNEL, right? That means that such an
allocation will not reach into ZONE_MOVABLE or ZONE_HIGHMEM (32b
systems) while the pagecache will. So you will get an overestimate of
how much you can allocate.

Really, si_mem_available is for proc/meminfo and a rough estimate of the
free memory, because users tend to be confused by seeing MemFree too low
and complaining that the system has eaten all their memory. I have some
skepticism about how useful it is in practice apart from showing it in
top or similar tools. The memory is simply not usable immediately or
without an overall and visible effect on the whole system.

HTH
--
Michal Hocko
SUSE Labs
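For context, the estimate Michal describes boils down to roughly the following. This is a condensed paraphrase of si_mem_available() from mm/page_alloc.c around this era, not the verbatim source; note that nothing in it distinguishes movable from !movable zones, which is exactly the overestimate being pointed out:

	long si_mem_available(void)		/* paraphrased */
	{
		long available;
		unsigned long pagecache, reclaimable, wmark_low = 0;
		struct zone *zone;

		for_each_zone(zone)
			wmark_low += zone->watermark[WMARK_LOW];

		/* Free pages minus the reserves the kernel keeps back. */
		available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;

		/* Plus page cache, assuming roughly half can be dropped. */
		pagecache = global_node_page_state(NR_ACTIVE_FILE) +
			    global_node_page_state(NR_INACTIVE_FILE);
		pagecache -= min(pagecache / 2, wmark_low);
		available += pagecache;

		/* Plus a similar guess for reclaimable slab. */
		reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
		available += reclaimable - min(reclaimable / 2, wmark_low);

		return available < 0 ? 0 : available;
	}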
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko wrote:
> On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> On Tue, 3 Apr 2018 14:35:14 +0200
>> Michal Hocko wrote:
> [...]
>> > Being clever is OK if it doesn't add tricky code. And relying on
>> > si_mem_available is definitely tricky and obscure.
>>
>> Can we get the mm subsystem to provide a better method to know if an
>> allocation will possibly succeed or not before trying it? It doesn't
>> have to be free of races. Just "if I allocate this many pages right
>> now, will it work?" If that changes from the time it asks to the time
>> it allocates, that's fine. I'm not trying to prevent OOM from ever
>> triggering. I just don't want it to trigger consistently.
>
> How do you do that without an actual allocation request? And more
> fundamentally, what if your _particular_ request is just fine but it
> will get us so close to the OOM edge that the next legit allocation
> request simply goes OOM? There is simply no sane interface I can think
> of that would satisfy a safe/sensible "will it cause OOM" semantic.
>
The point is that an app which tries to allocate beyond the line will
escape the OOM killer and let other innocent processes be sacrificed.
However, the one which you mentioned above will possibly be selected by
an OOM triggered by the consequent failed allocation.

>> > > Perhaps I should try to allocate a large group of pages with
>> > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
>> > > that the large allocation may reclaim some memory that would allow the
>> > > NORETRY to succeed with smaller allocations (one page at a time)?
>> >
>> > That again relies on subtle dependencies of the current
>> > implementation. So I would rather ask whether this is something that
>> > really deserves special treatment. If the admin asks for a buffer of a
>> > certain size then try to do so. If we get OOM then bad luck, you cannot
>> > get large memory buffers for free...
>>
>> That is not acceptable to me nor to the people asking for this.
>>
>> The problem is known. The ring buffer allocates memory page by page,
>> and this can allow it to easily take all memory in the system before it
>> fails to allocate and free everything it had done.
>
> Then do not allow buffers that are too large. How often do you need
> buffers that are larger than a few megs or a small % of the available
> memory? Consuming excessive amounts of memory just to trace a workload
> which will need some memory on its own sounds just dubious to me.
>
>> If you don't like the use of si_mem_available() I'll do the larger
>> pages method. Yes it depends on the current implementation of memory
>> allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
>> number of pages, and fail if it can't (leaving memory for other
>> allocations to succeed).
>>
>> The allocation of the ring buffer isn't critical. It can fail to
>> expand, and we can tell the user -ENOMEM. I originally had NORETRY
>> because I'd rather have it fail than cause an OOM. But there are folks
>> (like Joel) who want it to succeed when there's available memory in
>> page caches.
>
> Then implement a retry logic on top of NORETRY. You can control how hard
> to retry to satisfy the request yourself. You still risk that your
> allocation will get us close to OOM for _somebody_ else though.
>
>> I'm fine if the admin shoots herself in the foot if the ring buffer
>> gets big enough to start causing OOMs, but I don't want it to cause
>> OOMs if there's not even enough memory to fulfill the ring buffer size
>> itself.
>
> I simply do not see the difference between the two. Both have the same
> deadly effect in the end. The direct OOM has an arguable advantage that
> the effect is immediate rather than subtle, with potential performance
> side effects until the machine OOMs after crawling for quite some time.
>
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko wrote:

> You can do so, of course. In fact it would have some advantages over
> single pages because you would fragment the memory less, but this is not
> a reliable prevention from OOM killing and the complete memory
> depletion if you allow arbitrary trace buffer sizes.

You are right that this doesn't prevent OOM killing. I tried various
methods, and the only thing that currently "works" the way I'm happy
with is this original patch.

From your earlier email:

> Except that it doesn't work. si_mem_available is not really suitable for
> any allocation estimations. Its only purpose is to provide a very rough
> estimation for userspace. Any other use is basically abuse. The
> situation can change really quickly. Really it is really hard to be
> clever here with the volatility the memory allocations can cause.

Now can you please explain to me why si_mem_available is not suitable
for my purpose?

If it's wrong and states there is less memory than actually exists, we
simply fail to increase the buffer. If it is wrong and states that there
is more memory than actually exists, then we do nothing different than
what we do today, and trigger an OOM. But for the ring buffer use case,
it is "good enough".

Can you please explain to me what issues you see that can happen if we
apply this patch?

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko wrote:

> Yes, a fallback is questionable. Whether to make the batch size
> configurable is a matter of how much internal detail you want to
> expose to userspace.

Well, it is tracing the guts of the kernel, so internal details are
generally exposed to userspace ;-)

-- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue 03-04-18 10:17:53, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 15:56:07 +0200
> Michal Hocko wrote:
[...]
> > I simply do not see the difference between the two. Both have the same
> > deadly effect in the end. The direct OOM has an arguable advantage that
> > the effect is immediate rather than subtle, with potential performance
> > side effects until the machine OOMs after crawling for quite some time.
>
> The difference is if the allocation succeeds or not. If it doesn't
> succeed, we free all the memory that we tried to allocate. If it
> succeeds and causes issues, then yes, that's the admin's fault.

What I am trying to say is that this is so extremely time and workload
sensitive that you can hardly get stable behavior. It becomes pure luck
whether the failure happens.

> I'm worried about
> accidentally putting in too big a number, either by an admin by
> mistake, or by some stupid script that just thinks the current machine
> has terabytes of memory.

I would argue that stupid scripts should have no business calling
root-only interfaces which can allocate a lot of memory and cause OOMs.

> I'm under the assumption that if I attempt an allocation of 32 pages
> with RETRY_MAYFAIL, and there's 2 pages available, but not 32, and
> while my allocation is reclaiming memory, and another task comes in and
> asks for a single page, it can still succeed. This would be why I would
> be using RETRY_MAYFAIL with higher orders of pages, so that it doesn't
> take all memory in the system if it fails. Is this assumption incorrect?

Yes. There is no guarantee that the allocation will get the memory it
reclaimed in the direct reclaim. Pages are simply freed back into the
pool and it is a matter of timing who gets them.

> The current approach of allocating 1 page at a time with RETRY_MAYFAIL
> is that it will succeed in getting any pages that are available, until
> there are none, and if some unlucky task asks for memory during that
> time, it is guaranteed to fail its allocation, triggering an OOM.
>
> I was thinking of doing something like:
>
> 	large_pages = nr_pages / 32;
> 	if (large_pages) {
> 		pages = alloc_pages_node(cpu_to_node(cpu),
> 					 GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
> 		if (pages)
> 			/* break up pages */
> 		else
> 			/* try to allocate with NORETRY */
> 	}

You can do so, of course. In fact it would have some advantages over
single pages because you would fragment the memory less, but this is not
a reliable prevention from OOM killing and the complete memory depletion
if you allow arbitrary trace buffer sizes.

> Now it will allocate memory in 32 page chunks using reclaim. If it
> fails to allocate them, it would not have taken up any smaller chunks
> that were available, leaving them for other users. It would then go
> back to single pages, allocating with NORETRY. Or I could just say screw
> it, and make the allocation of the ring buffer always be 32 page chunks
> (or at least make it user defined).

Yes, a fallback is questionable. Whether to make the batch size
configurable is a matter of how much internal detail you want to expose
to userspace.
--
Michal Hocko
SUSE Labs
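Fleshing out the quoted chunk idea: order 5 is 32 pages, and split_page() is the standard way to break a non-compound high-order allocation into independent order-0 pages. A sketch (the fallback branch is only indicated):

	struct page *pages;
	int i;

	/* Reclaim hard for a 32-page chunk, but fail rather than OOM-kill.
	 * __GFP_COMP must not be set or split_page() cannot be used. */
	pages = alloc_pages_node(cpu_to_node(cpu),
				 GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
	if (pages) {
		split_page(pages, 5);	/* now 32 independent order-0 pages */
		for (i = 0; i < 32; i++)
			; /* hand pages + i to the ring buffer */
	} else {
		/* fall back to single pages with __GFP_NORETRY */
	}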
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 15:56:07 +0200 Michal Hocko wrote: > On Tue 03-04-18 09:32:45, Steven Rostedt wrote: > > On Tue, 3 Apr 2018 14:35:14 +0200 > > Michal Hocko wrote: > [...] > > > Being clever is OK if it doesn't add a tricky code. And relying on > > > si_mem_available is definitely tricky and obscure. > > > > Can we get the mm subsystem to provide a better method to know if an > > allocation will possibly succeed or not before trying it? It doesn't > > have to be free of races. Just "if I allocate this many pages right > > now, will it work?" If that changes from the time it asks to the time > > it allocates, that's fine. I'm not trying to prevent OOM to never > > trigger. I just don't want to to trigger consistently. > > How do you do that without an actuall allocation request? And more > fundamentally, what if your _particular_ request is just fine but it > will get us so close to the OOM edge that the next legit allocation > request simply goes OOM? There is simply no sane interface I can think > of that would satisfy a safe/sensible "will it cause OOM" semantic. That's where I'm fine with the admin shooting herself in the foot. If they ask for almost all memory, and then the system needs more, that's not our problem. I'm more worried about putting in a couple of extra zeros by mistake, which will pretty much guarantee an OOM on an active system. > > > > > Perhaps I should try to allocate a large group of pages with > > > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking > > > > that the large allocation may reclaim some memory that would allow the > > > > NORETRY to succeed with smaller allocations (one page at a time)? > > > > > > That again relies on a subtle dependencies of the current > > > implementation. So I would rather ask whether this is something that > > > really deserves special treatment. If admin asks for a buffer of a > > > certain size then try to do so. If we get OOM then bad luck you cannot > > > get large memory buffers for free... > > > > That is not acceptable to me nor to the people asking for this. > > > > The problem is known. The ring buffer allocates memory page by page, > > and this can allow it to easily take all memory in the system before it > > fails to allocate and free everything it had done. > > Then do not allow buffers that are too large. How often do you need > buffers that are larger than few megs or small % of the available > memory? Consuming excessive amount of memory just to trace workload > which will need some memory on its own sounds just dubious to me. For recording 100s of million events per second, it requires hundreds of megs of memory. Large buffers are required for tracing. That's extremely common. > > > If you don't like the use of si_mem_available() I'll do the larger > > pages method. Yes it depends on the current implementation of memory > > allocation. It will depend on RETRY_MAYFAIL trying to allocate a large > > number of pages, and fail if it can't (leaving memory for other > > allocations to succeed). > > > > The allocation of the ring buffer isn't critical. It can fail to > > expand, and we can tell the user -ENOMEM. I original had NORETRY > > because I rather have it fail than cause an OOM. But there's folks > > (like Joel) that want it to succeed when there's available memory in > > page caches. > > Then implement a retry logic on top of NORETRY. You can control how hard > to retry to satisfy the request yourself. 
> You still risk that your allocation will get us close to OOM for _somebody_ else though. > > I'm fine if the admin shoots herself in the foot if the ring buffer > > gets big enough to start causing OOMs, but I don't want it to cause > > OOMs if there's not even enough memory to fulfill the ring buffer size > > itself. > I simply do not see the difference between the two. Both have the same > deadly effect in the end. The direct OOM has an arguable advantage that > the effect is immediate rather than subtle with potential performance > side effects until the machine OOMs after crawling for quite some time. The difference is if the allocation succeeds or not. If it doesn't succeed, we free all memory that we tried to allocate. If it succeeds and causes issues, then yes, that's the admin's fault. I'm worried about accidentally putting in too big a number, either by an admin by mistake, or some stupid script that just thinks the current machine has terabytes of memory. I'm under the assumption that if I make an allocation of 32 pages with RETRY_MAYFAIL, and there are 2 pages available, but not 32, then while my allocation is reclaiming memory, another task that comes in and asks for a single page can still succeed. This would be why I would be using RETRY_MAYFAIL with higher orders of pages, so that it doesn't take all memory in the system if it fails. Is this assumption incorrect? The problem with the current approach of allocating 1 page at a time with RETRY_MAYFAIL is that it will succeed in getting any pages that are available, until there are none, and if some unlucky task asks for memory during that time, it is guaranteed to fail its allocation, triggering an OOM. I was thinking of doing something like: large_pages = nr_pages / 32; if (large_pages) { pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5); if (pages) /* break up pages */ else /* try to allocate with NORETRY */ } Now it will allocate memory in 32 page chunks using reclaim. If it fails to allocate them, it would not have taken up any smaller chunks that were available, leaving them for other users. It would then go back to single pages, allocating with NORETRY. Or I could just say screw it, and make the allocation of the ring buffer always be 32 page chunks (or at least make it user defined). -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue 03-04-18 09:32:45, Steven Rostedt wrote: > On Tue, 3 Apr 2018 14:35:14 +0200 > Michal Hocko wrote: [...] > > Being clever is OK if it doesn't add tricky code. And relying on > > si_mem_available is definitely tricky and obscure. > > Can we get the mm subsystem to provide a better method to know if an > allocation will possibly succeed or not before trying it? It doesn't > have to be free of races. Just "if I allocate this many pages right > now, will it work?" If that changes from the time it asks to the time > it allocates, that's fine. I'm not trying to prevent OOM from ever > triggering. I just don't want it to trigger consistently. How do you do that without an actual allocation request? And more fundamentally, what if your _particular_ request is just fine but it will get us so close to the OOM edge that the next legit allocation request simply goes OOM? There is simply no sane interface I can think of that would satisfy a safe/sensible "will it cause OOM" semantic. > > > Perhaps I should try to allocate a large group of pages with > > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking > > > that the large allocation may reclaim some memory that would allow the > > > NORETRY to succeed with smaller allocations (one page at a time)? > > > > That again relies on subtle dependencies of the current > > implementation. So I would rather ask whether this is something that > > really deserves special treatment. If the admin asks for a buffer of a > > certain size then try to do so. If we get OOM then bad luck, you cannot > > get large memory buffers for free... > > That is not acceptable to me nor to the people asking for this. > > The problem is known. The ring buffer allocates memory page by page, > and this can allow it to easily take all memory in the system before it > fails to allocate and has to free everything it had done. Then do not allow buffers that are too large. How often do you need buffers that are larger than a few megs or a small % of the available memory? Consuming an excessive amount of memory just to trace a workload which will need some memory on its own sounds dubious to me. > If you don't like the use of si_mem_available() I'll do the larger > pages method. Yes, it depends on the current implementation of memory > allocation. It will depend on RETRY_MAYFAIL trying to allocate a large > number of pages, and failing if it can't (leaving memory for other > allocations to succeed). > > The allocation of the ring buffer isn't critical. It can fail to > expand, and we can tell the user -ENOMEM. I originally had NORETRY > because I'd rather have it fail than cause an OOM. But there are folks > (like Joel) who want it to succeed when there's available memory in > page caches. Then implement retry logic on top of NORETRY. You can control how hard to retry to satisfy the request yourself. You still risk that your allocation will get us close to OOM for _somebody_ else though. > I'm fine if the admin shoots herself in the foot if the ring buffer > gets big enough to start causing OOMs, but I don't want it to cause > OOMs if there's not even enough memory to fulfill the ring buffer size > itself. I simply do not see the difference between the two. Both have the same deadly effect in the end. The direct OOM has an arguable advantage that the effect is immediate rather than subtle with potential performance side effects until the machine OOMs after crawling for quite some time. -- Michal Hocko SUSE Labs
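For a rough idea of what "retry logic on top of NORETRY" could mean here, consider the sketch below; the helper name and the bounded retry count are assumptions for the example, not code from any posted patch.

#include <linux/gfp.h>
#include <linux/sched.h>

/*
 * Sketch: bound how hard we try ourselves, instead of letting the
 * allocator decide. Each attempt uses __GFP_NORETRY so it backs off
 * quickly and never OOM-kills on its own.
 */
static struct page *rb_alloc_page_bounded(int nid, int max_tries)
{
	struct page *page;
	int i;

	for (i = 0; i < max_tries; i++) {
		page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
		if (page)
			return page;
		cond_resched();	/* let reclaim elsewhere make progress */
	}
	return NULL;		/* caller turns this into -ENOMEM */
}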
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 14:35:14 +0200 Michal Hocko wrote: > > If we use NORETRY, then we have those that complain that we do not try > > hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this > > issue of taking up all memory before we get what we want. > > Just try to do what the admin asks for and trust them not to shoot > themselves in the foot? I mean there are other ways an admin can shoot the machine down. Allowing the admin to just shoot herself in the foot is not an option. Yes there are many ways to bring down a machine, but this shouldn't be one of them. All one needs to do is echo too big a number into /sys/kernel/tracing/buffer_size_kb and OOM may kill a critical program on a production machine. Tracing is made for production, and should not allow an easy way to trigger OOM. > Being clever is OK if it doesn't add tricky code. And relying on > si_mem_available is definitely tricky and obscure. Can we get the mm subsystem to provide a better method to know if an allocation will possibly succeed or not before trying it? It doesn't have to be free of races. Just "if I allocate this many pages right now, will it work?" If that changes from the time it asks to the time it allocates, that's fine. I'm not trying to prevent OOM from ever triggering. I just don't want it to trigger consistently. > > > Perhaps I should try to allocate a large group of pages with > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking > > that the large allocation may reclaim some memory that would allow the > > NORETRY to succeed with smaller allocations (one page at a time)? > > That again relies on subtle dependencies of the current > implementation. So I would rather ask whether this is something that > really deserves special treatment. If the admin asks for a buffer of a > certain size then try to do so. If we get OOM then bad luck, you cannot > get large memory buffers for free... That is not acceptable to me nor to the people asking for this. The problem is known. The ring buffer allocates memory page by page, and this can allow it to easily take all memory in the system before it fails to allocate and has to free everything it had done. If you don't like the use of si_mem_available() I'll do the larger pages method. Yes, it depends on the current implementation of memory allocation. It will depend on RETRY_MAYFAIL trying to allocate a large number of pages, and failing if it can't (leaving memory for other allocations to succeed). The allocation of the ring buffer isn't critical. It can fail to expand, and we can tell the user -ENOMEM. I originally had NORETRY because I'd rather have it fail than cause an OOM. But there are folks (like Joel) who want it to succeed when there's available memory in page caches. I'm fine if the admin shoots herself in the foot if the ring buffer gets big enough to start causing OOMs, but I don't want it to cause OOMs if there's not even enough memory to fulfill the ring buffer size itself. -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue 03-04-18 08:23:48, Steven Rostedt wrote: > On Tue, 3 Apr 2018 14:16:14 +0200 > Michal Hocko wrote: > > > > This came up because there are scripts or programs that set the size of > > > the ring buffer. The complaint was that the application would just set > > > the size to something bigger than what was available and cause an OOM > > > killing other applications. The final solution is to simply check the > > > available memory before allocating the ring buffer: > > > > > > /* Check if the available memory is there first */ > > > i = si_mem_available(); > > > if (i < nr_pages) > > > return -ENOMEM; > > > > > > And it works well. > > > > Except that it doesn't work. si_mem_available is not really suitable for > > any allocation estimations. Its only purpose is to provide a very rough > > estimation for userspace. Any other use is basically abuse. The > > situation can change really quickly, and it is really hard to be > > clever here given the volatility that memory allocations can cause. > > OK, then what do you suggest? Because currently, it appears to work. A > rough estimate may be good enough. > > If we use NORETRY, then we have those that complain that we do not try > hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this > issue of taking up all memory before we get what we want. Just try to do what the admin asks for and trust them not to shoot themselves in the foot? I mean there are other ways an admin can shoot the machine down. Being clever is OK if it doesn't add tricky code. And relying on si_mem_available is definitely tricky and obscure. > Perhaps I should try to allocate a large group of pages with > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking > that the large allocation may reclaim some memory that would allow the > NORETRY to succeed with smaller allocations (one page at a time)? That again relies on subtle dependencies of the current implementation. So I would rather ask whether this is something that really deserves special treatment. If the admin asks for a buffer of a certain size then try to do so. If we get OOM then bad luck, you cannot get large memory buffers for free... -- Michal Hocko SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 14:16:14 +0200 Michal Hocko wrote: > > This came up because there are scripts or programs that set the size of > > the ring buffer. The complaint was that the application would just set > > the size to something bigger than what was available and cause an OOM > > killing other applications. The final solution is to simply check the > > available memory before allocating the ring buffer: > > > > /* Check if the available memory is there first */ > > i = si_mem_available(); > > if (i < nr_pages) > > return -ENOMEM; > > > > And it works well. > > Except that it doesn't work. si_mem_available is not really suitable for > any allocation estimations. Its only purpose is to provide a very rough > estimation for userspace. Any other use is basically abuse. The > situation can change really quickly, and it is really hard to be > clever here given the volatility that memory allocations can cause. OK, then what do you suggest? Because currently, it appears to work. A rough estimate may be good enough. If we use NORETRY, then we have those that complain that we do not try hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this issue of taking up all memory before we get what we want. Perhaps I should try to allocate a large group of pages with RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking that the large allocation may reclaim some memory that would allow the NORETRY to succeed with smaller allocations (one page at a time)? -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue 03-04-18 07:51:58, Steven Rostedt wrote: > On Tue, 3 Apr 2018 13:06:12 +0200 > Michal Hocko wrote: > > > > I wonder if I should have the ring buffer allocate groups of pages, to > > > avoid this. Or try to allocate with NORETRY, one page at a time, and > > > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that > > > may keep it from causing an OOM? > > > > I wonder why it really matters. The interface is root only and we expect > > some sanity from an admin, right? So allocating such a large ring buffer > > that it sends the system to the OOM is a sign that the admin should be > > more careful. Balancing on the OOM edge is always a risk and the result > > will highly depend on the workload running in parallel. > > This came up because there are scripts or programs that set the size of > the ring buffer. The complaint was that the application would just set > the size to something bigger than what was available and cause an OOM > killing other applications. The final solution is to simply check the > available memory before allocating the ring buffer: > > /* Check if the available memory is there first */ > i = si_mem_available(); > if (i < nr_pages) > return -ENOMEM; > > And it works well. Except that it doesn't work. si_mem_available is not really suitable for any allocation estimations. Its only purpose is to provide a very rough estimation for userspace. Any other use is basically abuse. The situation can change really quickly, and it is really hard to be clever here given the volatility that memory allocations can cause. -- Michal Hocko SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Tue, 3 Apr 2018 13:06:12 +0200 Michal Hocko wrote: > > I wonder if I should have the ring buffer allocate groups of pages, to > > avoid this. Or try to allocate with NORETRY, one page at a time, and > > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that > > may keep it from causing an OOM? > > I wonder why it really matters. The interface is root only and we expect > some sanity from an admin, right? So allocating such a large ring buffer > that it sends the system to the OOM is a sign that the admin should be > more careful. Balancing on the OOM edge is always a risk and the result > will highly depend on the workload running in parallel. This came up because there are scripts or programs that set the size of the ring buffer. The complaint was that the application would just set the size to something bigger than what was available and cause an OOM killing other applications. The final solution is to simply check the available memory before allocating the ring buffer: /* Check if the available memory is there first */ i = si_mem_available(); if (i < nr_pages) return -ENOMEM; And it works well. -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri 30-03-18 10:20:38, Steven Rostedt wrote: > > [ Adding memory management folks to discuss the issue ] > > On Thu, 29 Mar 2018 18:41:44 +0800 > Zhaoyang Huang wrote: > > > It is reported that some user apps would like to echo a huge > > number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless > > of the available memory, which will cause the concurrent > > page allocation to fail and introduce OOM. This commit checks the > > value against the available memory first, to avoid the doomed allocation. > > > > Signed-off-by: Zhaoyang Huang > > --- > > kernel/trace/trace.c | 39 ++- > > 1 file changed, 38 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index 2d0ffcc..a4a4237 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -43,6 +43,8 @@ > > #include > > #include > > > > +#include > > +#include > > #include "trace.h" > > #include "trace_output.h" > > > > @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file > > *filp, > > return ret; > > } > > > > +static long get_available_mem(void) > > +{ > > + struct sysinfo i; > > + long available; > > + unsigned long pagecache; > > + unsigned long wmark_low = 0; > > + unsigned long pages[NR_LRU_LISTS]; > > + struct zone *zone; > > + int lru; > > + > > + si_meminfo(&i); > > + si_swapinfo(&i); > > + > > + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) > > + pages[lru] = global_page_state(NR_LRU_BASE + lru); > > + > > + for_each_zone(zone) > > + wmark_low += zone->watermark[WMARK_LOW]; > > + > > + available = i.freeram - wmark_low; > > + > > + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; > > + pagecache -= min(pagecache / 2, wmark_low); > > + available += pagecache; > > + > > + available += global_page_state(NR_SLAB_RECLAIMABLE) - > > + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); > > + > > + if (available < 0) > > + available = 0; > > + return available; > > +} > > + > > As I stated in my other reply, the above function does not belong in > tracing. > > That said, it appears you are having issues that were caused by > commit 848618857d2 ("tracing/ring_buffer: Try harder to > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of > NORETRY was to keep allocations of the tracing ring-buffer from causing > OOMs. But the RETRY was too strong in that case, because there were > those who wanted to allocate large ring buffers but it would fail due > to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL > is to allocate with reclaim but still be allowed to fail, and isn't supposed > to trigger an OOM. From my own tests, this is obviously not the case. > > Perhaps this is because the ring buffer allocates one page at a time, > and by doing so, it can get every last available page, and if anything > in the meantime does an allocation without MAYFAIL, it will cause an > OOM. For example, when I stressed this I triggered this: Yes, this is indeed the case. > pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), > nodemask=(null), order=0, oom_score_adj=0 > pool cpuset=/ mems_allowed=0 > CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663 > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 > 07/14/2016 > Call Trace: > dump_stack+0x8e/0xce > dump_header.isra.30+0x6e/0x28f > ? _raw_spin_unlock_irqrestore+0x30/0x60 > oom_kill_process+0x218/0x400 > ? 
has_capability_noaudit+0x17/0x20 > out_of_memory+0xe3/0x5c0 > __alloc_pages_slowpath+0xa8e/0xe50 > __alloc_pages_nodemask+0x206/0x220 > alloc_pages_current+0x6a/0xe0 > __page_cache_alloc+0x6a/0xa0 > filemap_fault+0x208/0x5f0 > ? __might_sleep+0x4a/0x80 > ext4_filemap_fault+0x31/0x44 > __do_fault+0x20/0xd0 > __handle_mm_fault+0xc08/0x1160 > handle_mm_fault+0x76/0x110 > __do_page_fault+0x299/0x580 > do_page_fault+0x2d/0x110 > ? page_fault+0x2f/0x50 > page_fault+0x45/0x50 > > I wonder if I should have the ring buffer allocate groups of pages, to > avoid this. Or try to allocate with NORETRY, one page at a time, and > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that > may keep it from causing an OOM? I wonder why it really matters. The interface is root only and we expect some sanity from an admin, right? So allocating such a large ring buffer that it sends the system to the OOM is a sign that the admin should be more careful. Balancing on the OOM edge is always a risk and the result will highly depend on the workload running in parallel. -- Michal Hocko SUSE Labs
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Sat, Mar 31, 2018 at 5:42 AM, Steven Rostedt wrote: > On Fri, 30 Mar 2018 17:30:31 -0400 > Steven Rostedt wrote: > >> I'll take a look at si_mem_available() that Joel suggested and see if >> we can make that work. > > Wow, this appears to work great! Joel and Zhaoyang, can you test this? > > -- Steve > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index a2fd3893cc02..32a803626ee2 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct > list_head *pages, int cpu) > struct buffer_page *bpage, *tmp; > long i; > > + /* Check if the available memory is there first */ > + i = si_mem_available(); > + if (i < nr_pages) > + return -ENOMEM; > + > for (i = 0; i < nr_pages; i++) { > struct page *page; > /* Hi Steve, It works as my previous patch does.
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, Mar 30, 2018 at 8:07 PM, Steven Rostedt wrote: > On Fri, 30 Mar 2018 19:18:57 -0700 > Matthew Wilcox wrote: > >> Again though, this is the same pattern as vmalloc. There are any number >> of places where userspace can cause an arbitrarily large vmalloc to be >> attempted (grep for kvmalloc_array for a list of promising candidates). >> I'm pretty sure that just changing your GFP flags to GFP_KERNEL | >> __GFP_NOWARN will give you the exact behaviour that you want with no >> need to grub around in the VM to find out if your huge allocation is >> likely to succeed. > > Not sure how this helps. Note, I don't care about consecutive pages, so > this is not an array. It's a linked list of thousands of pages. How do > you suggest allocating them? The ring buffer is a linked list of pages. Yeah I didn't understand the suggestion either. If I remember correctly, not using either NORETRY or RETRY_MAYFAIL, and just plain GFP_KERNEL, was precisely what caused the buffer_size_kb write to trigger an OOM in my testing. So I think Steven's patch does the right thing in checking in advance. thanks, - Joel
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 19:18:57 -0700 Matthew Wilcox wrote: > Again though, this is the same pattern as vmalloc. There are any number > of places where userspace can cause an arbitrarily large vmalloc to be > attempted (grep for kvmalloc_array for a list of promising candidates). > I'm pretty sure that just changing your GFP flags to GFP_KERNEL | > __GFP_NOWARN will give you the exact behaviour that you want with no > need to grub around in the VM to find out if your huge allocation is > likely to succeed. Not sure how this helps. Note, I don't care about consecutive pages, so this is not an array. It's a linked list of thousands of pages. How do you suggest allocating them? The ring buffer is a linked list of pages. What I currently do is see how many more pages I need, allocate them one at a time, and put them in a temporary list; if that succeeds I add them to the ring buffer, and if not, I free the entire list (it's an all-or-nothing operation). The allocation I'm making doesn't warn. The problem is GFP_RETRY_MAYFAIL, which will try to allocate any possible memory in the system. When it succeeds, the ring buffer allocation logic will then try to allocate another page. If we add too many pages, we will allocate all possible pages and then try to allocate more. This allocation will fail without causing an OOM. That's not the problem. The problem is if the system is active during this time, and something else tries to do any allocation after all memory has been consumed: that allocation will fail. Then it will trigger an OOM. I showed this in my Call Trace, that the allocation that failed during my test was something completely unrelated, and that failure caused an OOM. What this last patch does is see if there's space available before it even starts the process. Maybe I'm missing something, but I don't see how NOWARN can help. My allocations are not what is giving the warning. -- Steve
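The all-or-nothing pattern described above might look like the following sketch; the function name is hypothetical and the real __rb_allocate_pages() differs in detail.

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm_types.h>

/* Sketch: allocate into a private list first, commit only on success. */
static int alloc_all_or_nothing(long nr_pages, struct list_head *buffer)
{
	struct page *page, *next;
	LIST_HEAD(tmp);
	long i;

	for (i = 0; i < nr_pages; i++) {
		page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
		if (!page)
			goto free_all;
		list_add(&page->lru, &tmp);
	}
	list_splice(&tmp, buffer);	/* success: hand everything over */
	return 0;

free_all:
	/* Failure: give back every page taken so far. */
	list_for_each_entry_safe(page, next, &tmp, lru) {
		list_del(&page->lru);
		__free_page(page);
	}
	return -ENOMEM;
}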
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, Mar 30, 2018 at 09:41:51PM -0400, Steven Rostedt wrote: > On Fri, 30 Mar 2018 16:38:52 -0700 > Joel Fernandes wrote: > > > > --- a/kernel/trace/ring_buffer.c > > > +++ b/kernel/trace/ring_buffer.c > > > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, > > > struct list_head *pages, int cpu) > > > struct buffer_page *bpage, *tmp; > > > long i; > > > > > > + /* Check if the available memory is there first */ > > > + i = si_mem_available(); > > > + if (i < nr_pages) > > > > Does it make sense to add a small margin here so that after ftrace > > finishes allocating, we still have some memory left for the system? > > But then we have to define a magic number :-| > > I don't think so. The memory is allocated by user defined numbers. They > can do "free" to see what is available. The original patch from > Zhaoyang was due to a script that would just try a very large number > and cause issues. > > If the memory is available, I just say let them have it. This is > a borderline user space issue and not a kernel one. Again though, this is the same pattern as vmalloc. There are any number of places where userspace can cause an arbitrarily large vmalloc to be attempted (grep for kvmalloc_array for a list of promising candidates). I'm pretty sure that just changing your GFP flags to GFP_KERNEL | __GFP_NOWARN will give you the exact behaviour that you want with no need to grub around in the VM to find out if your huge allocation is likely to succeed.
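For comparison, the vmalloc-style pattern Matthew points at would look roughly like the sketch below; the function and its parameters are made-up names for the example.

#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Sketch: a user-sized allocation that fails quietly with NULL
 * instead of printing an allocation-failure splat.
 */
static void *alloc_user_sized_table(size_t nr_entries, size_t entry_size)
{
	return kvmalloc_array(nr_entries, entry_size,
			      GFP_KERNEL | __GFP_NOWARN);
}

Note that __GFP_NOWARN only silences the failure message; whether the allocation can still drive the rest of the system into an OOM is exactly what the following messages in the thread dispute.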
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 16:38:52 -0700 Joel Fernandes wrote: > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct > > list_head *pages, int cpu) > > struct buffer_page *bpage, *tmp; > > long i; > > > > + /* Check if the available memory is there first */ > > + i = si_mem_available(); > > + if (i < nr_pages) > > Does it make sense to add a small margin here so that after ftrace > finishes allocating, we still have some memory left for the system? > But then we have to define a magic number :-| I don't think so. The memory is allocated by user defined numbers. They can do "free" to see what is available. The original patch from Zhaoyang was due to a script that would just try a very large number and cause issues. If the memory is available, I just say let them have it. This is a borderline user space issue and not a kernel one. > > + > > I tested in Qemu with 1GB memory; I am always able to get allocation to fail > even without this patch, without causing an OOM. Maybe I am > not running enough allocations in parallel or something :) Try just echoing in "100" into buffer_size_kb and see what happens. > > The patch you shared using si_mem_available is working since I'm able > to allocate till the end without a page allocation failure: > > bash-4.3# echo 237800 > /d/tracing/buffer_size_kb > bash: echo: write error: Cannot allocate memory > bash-4.3# echo 237700 > /d/tracing/buffer_size_kb > bash-4.3# free -m > total used free shared buffers > Mem: 985 9777 100 > -/+ buffers:9777 > Swap:000 > bash-4.3# > > I think this patch is still good to have, since IMO we should not go > and get a page allocation failure (even if it's a non-OOM) and subsequent > stack dump from mm's allocator, if we can avoid it. > > Tested-by: Joel Fernandes Great, thanks! I'll make it into a formal patch. -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
Hi Steven, On Fri, Mar 30, 2018 at 2:42 PM, Steven Rostedt wrote: > On Fri, 30 Mar 2018 17:30:31 -0400 > Steven Rostedt wrote: > >> I'll take a look at si_mem_available() that Joel suggested and see if >> we can make that work. > > Wow, this appears to work great! Joel and Zhaoyang, can you test this? > > -- Steve > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index a2fd3893cc02..32a803626ee2 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct > list_head *pages, int cpu) > struct buffer_page *bpage, *tmp; > long i; > > + /* Check if the available memory is there first */ > + i = si_mem_available(); > + if (i < nr_pages) Does it make sense to add a small margin here so that after ftrace finishes allocating, we still have some memory left for the system? But then we have to define a magic number :-| > + return -ENOMEM; > + I tested in Qemu with 1GB memory; I am always able to get allocation to fail even without this patch, without causing an OOM. Maybe I am not running enough allocations in parallel or something :) The patch you shared using si_mem_available is working since I'm able to allocate till the end without a page allocation failure: bash-4.3# echo 237800 > /d/tracing/buffer_size_kb bash: echo: write error: Cannot allocate memory bash-4.3# echo 237700 > /d/tracing/buffer_size_kb bash-4.3# free -m total used free shared buffers Mem: 985 9777 100 -/+ buffers:9777 Swap:000 bash-4.3# I think this patch is still good to have, since IMO we should not go and get a page allocation failure (even if it's a non-OOM) and subsequent stack dump from mm's allocator, if we can avoid it. Tested-by: Joel Fernandes thanks, - Joel
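For concreteness, the margin idea might look like the sketch below; the helper name and the constant are hypothetical, and the constant is exactly the kind of magic number Joel is wary of.

#include <linux/mm.h>
#include <linux/types.h>

/* Hypothetical margin: 4096 pages (16MB with 4K pages) is arbitrary. */
#define RB_MEM_MARGIN_PAGES 4096

/* Sketch: refuse the resize unless it would leave some slack behind. */
static bool rb_resize_would_fit(long nr_pages)
{
	return si_mem_available() >= nr_pages + RB_MEM_MARGIN_PAGES;
}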
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 17:30:31 -0400 Steven Rostedt wrote: > I'll take a look at si_mem_available() that Joel suggested and see if > we can make that work. Wow, this appears to work great! Joel and Zhaoyang, can you test this? -- Steve diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a2fd3893cc02..32a803626ee2 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu) struct buffer_page *bpage, *tmp; long i; + /* Check if the available memory is there first */ + i = si_mem_available(); + if (i < nr_pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { struct page *page; /*
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 13:53:56 -0700 Matthew Wilcox wrote: > It seems to me that what you're asking for at the moment is > lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely > clear to me why your allocation is so much more important than other > allocations in the kernel. The ring buffer is for fast tracing and is allocated when a user requests it. Usually there's plenty of memory, but when a user wants a lot of tracing events stored, they may increase it themselves. > > Also, the pattern you have is very close to that of vmalloc. You're > allocating one page at a time to satisfy a multi-page request. In lieu > of actually thinking about what you should do, I might recommend using the > same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN > (possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages > when accessed from within the kernel). When the ring buffer was first created, we couldn't use vmalloc because vmalloc access wasn't working in NMIs (that has recently changed with lots of work to handle faults). But the ring buffer is broken up into pages (that are sent to the user or to the network), and allocating one page at a time makes everything work fine. The issue that happens when someone allocates a large ring buffer is that it will allocate all memory in the system before it fails. This means that there's a short time during which any allocation will cause an OOM (which is what is happening). I think I agree with Joel and Zhaoyang, that we shouldn't allocate a ring buffer if there's not going to be enough memory to do it. If we can see the available memory before we start allocating one page at a time, and the available memory isn't going to be sufficient, there's no reason to try to do the allocation; we can simply send the user -ENOMEM, and let them try something smaller. I'll take a look at si_mem_available() that Joel suggested and see if we can make that work. -- Steve
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, Mar 30, 2018 at 10:20:38AM -0400, Steven Rostedt wrote: > That said, it appears you are having issues that were caused by the > change by commit 848618857d2 ("tracing/ring_buffer: Try harder to > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of > NORETRY was to keep allocations of the tracing ring-buffer from causing > OOMs. But the RETRY was too strong in that case, because there were > those who wanted to allocate large ring buffers but it would fail due > to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL > is to allocate with reclaim but still be allowed to fail, and isn't supposed > to trigger an OOM. From my own tests, this is obviously not the case. That's not exactly what the comment says in gfp.h: * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim * procedures that have previously failed if there is some indication * that progress has been made else where. It can wait for other * tasks to attempt high level approaches to freeing memory such as * compaction (which removes fragmentation) and page-out. * There is still a definite limit to the number of retries, but it is * a larger limit than with __GFP_NORETRY. * Allocations with this flag may fail, but only when there is * genuinely little unused memory. While these allocations do not * directly trigger the OOM killer, their failure indicates that * the system is likely to need to use the OOM killer soon. The * caller must handle failure, but can reasonably do so by failing * a higher-level request, or completing it only in a much less * efficient manner. * If the allocation does fail, and the caller is in a position to * free some non-essential memory, doing so could benefit the system * as a whole. It seems to me that what you're asking for at the moment is lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely clear to me why your allocation is so much more important than other allocations in the kernel. Also, the pattern you have is very close to that of vmalloc. You're allocating one page at a time to satisfy a multi-page request. In lieu of actually thinking about what you should do, I might recommend using the same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN (possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages when accessed from within the kernel).
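Restated as code, the levels of "how hard to try" being compared in this thread look like this; the comments paraphrase the gfp.h documentation quoted above, and the snippet is a sketch only.

#include <linux/gfp.h>

/* Sketch: the same order-0 allocation at three levels of effort. */
static void gfp_effort_levels(void)
{
	struct page *page;

	/* Tries reclaim at most lightly, fails fast, and never invokes
	 * the OOM killer on its own. */
	page = alloc_page(GFP_KERNEL | __GFP_NORETRY);
	if (page)
		__free_page(page);

	/* Retries reclaim and compaction harder; may still fail, but
	 * does not directly trigger the OOM killer. */
	page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
	if (page)
		__free_page(page);

	/* Plain GFP_KERNEL: small orders effectively never fail, and
	 * the OOM killer may be invoked to make that true. */
	page = alloc_page(GFP_KERNEL);
	if (page)
		__free_page(page);
}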
Re: [PATCH v1] kernel/trace:check the val against the available mem
Hi Steve, On Fri, Mar 30, 2018 at 12:10 PM, Steven Rostedt wrote: [..] >> > I wonder if I should have the ring buffer allocate groups of pages, to >> > avoid this. Or try to allocate with NORETRY, one page at a time, and >> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that >> > may keep it from causing an OOM? >> > >> >> I don't see immediately how that can prevent an OOM in other >> applications here? If ftrace allocates lots of memory with >> RETRY_MAYFAIL, then we would still OOM in other applications if memory >> isn't available. Sorry if I missed something. > > Here's the idea. > > Allocate one page at a time with NORETRY. If that fails, then allocate > larger amounts (higher order of pages) with RETRY_MAYFAIL. Then if it > can't get all the memory it needs, it won't take up all memory in the > system before it finds out that it can't have any more. > > Or perhaps the memory management system can provide a > get_available_mem() function that ftrace could call before it tries to > increase the ring buffer and take up all the memory of the system > before it realizes that it can't get all the memory it wants. > > The main problem I have with Zhaoyang's patch is that > get_available_mem() does not belong in the tracing code. It should be > something that the mm subsystem provides. > Cool. Personally I like the get-available-memory solution, and using that, since it's simpler. MM already provides it through si_mem_available since the commit "mm/page_alloc.c: calculate 'available' memory in a separate function" (sha d02bd27b). Maybe we could just use that? MemAvailable was initially added in commit "/proc/meminfo: provide estimated available memory" (sha 34e431b0ae39) thanks, - Joel
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 09:37:58 -0700 Joel Fernandes wrote: > > That said, it appears you are having issues that were caused by > > commit 848618857d2 ("tracing/ring_buffer: Try harder to > > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of > > NORETRY was to keep allocations of the tracing ring-buffer from causing > > OOMs. But the RETRY was too strong in that case, because there were > > Yes this was discussed with -mm folks. Basically the problem we were > seeing is that on devices with tonnes of free memory (free, but used by > the page cache) that memory was not being used, so it was unnecessarily > failing to allocate the ring buffer on a system with otherwise lots of > memory. Right. > > IIRC, the OOM that my patch was trying to avoid was being triggered > in the path/context of the write to buffer_size_kb itself (when not > doing the NORETRY), not by other processes. Yes, that is correct. > > > Perhaps this is because the ring buffer allocates one page at a time, > > and by doing so, it can get every last available page, and if anything > > in the meantime does an allocation without MAYFAIL, it will cause an > > OOM. For example, when I stressed this I triggered this: > > > > pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), > > nodemask=(null), order=0, oom_score_adj=0 > > pool cpuset=/ mems_allowed=0 > > CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663 > > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 > > v03.03 07/14/2016 > > Call Trace: > > dump_stack+0x8e/0xce > > dump_header.isra.30+0x6e/0x28f > > ? _raw_spin_unlock_irqrestore+0x30/0x60 > > oom_kill_process+0x218/0x400 > > ? has_capability_noaudit+0x17/0x20 > > out_of_memory+0xe3/0x5c0 > > __alloc_pages_slowpath+0xa8e/0xe50 > > __alloc_pages_nodemask+0x206/0x220 > > alloc_pages_current+0x6a/0xe0 > > __page_cache_alloc+0x6a/0xa0 > > filemap_fault+0x208/0x5f0 > > ? __might_sleep+0x4a/0x80 > > ext4_filemap_fault+0x31/0x44 > > __do_fault+0x20/0xd0 > > __handle_mm_fault+0xc08/0x1160 > > handle_mm_fault+0x76/0x110 > > __do_page_fault+0x299/0x580 > > do_page_fault+0x2d/0x110 > > ? page_fault+0x2f/0x50 > > page_fault+0x45/0x50 > > But this OOM is not in the path of the buffer_size_kb write, right? So > then what does it have to do with buffer_size_kb write failure? Yep. I'll explain below. > > I guess the original issue reported is that the buffer_size_kb write > causes *other* applications to fail allocation. So in that case, > capping the amount that ftrace writes makes sense. Basically my point > is I don't see how the patch you mentioned introduces the problem here > - in the sense the patch just makes ftrace allocate from memory it > couldn't before and to try harder. The issue is that ftrace allocates its ring buffer one page at a time. Thus, when a RETRY_MAYFAIL succeeds, that memory is allocated. Since it does it one page at a time, even if ftrace does not get all the memory it needs at the end, it will take all memory from the system before it finds that out. Then, if something else (like the above splat) tries to allocate anything, it will fail and trigger an OOM. > > > > > I wonder if I should have the ring buffer allocate groups of pages, to > > avoid this. Or try to allocate with NORETRY, one page at a time, and > > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that > > may keep it from causing an OOM? > > > > I don't see immediately how that can prevent an OOM in other > applications here? 
If ftrace allocates lots of memory with > RETRY_MAYFAIL, then we would still OOM in other applications if memory > isn't available. Sorry if I missed something. Here's the idea. Allocate one page at a time with NORETRY. If that fails, then allocate larger amounts (higher order of pages) with RETRY_MAYFAIL. Then if it can't get all the memory it needs, it won't take up all memory in the system before it finds out that it can't have any more. Or perhaps the memory management system can provide a get_available_mem() function that ftrace could call before it tries to increase the ring buffer and take up all the memory of the system before it realizes that it can't get all the memory it wants. The main problem I have with Zhaoyang's patch is that get_available_mem() does not belong in the tracing code. It should be something that the mm subsystem provides. -- Steve
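A sketch of the two-stage idea Steven describes here; the helper name and the order-5 batch size are assumptions for the example, not code from any posted patch.

#include <linux/gfp.h>

/*
 * Sketch: single page with __GFP_NORETRY first; if that fails, reclaim
 * in larger units with __GFP_RETRY_MAYFAIL so the attempt fails before
 * the very last free pages of the system have been consumed.
 */
static struct page *rb_alloc_page_or_batch(int nid, unsigned int *order)
{
	struct page *page;

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
	if (page) {
		*order = 0;
		return page;
	}

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
	if (page)
		*order = 5;	/* caller splits the 32-page block */
	return page;
}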
Re: [PATCH v1] kernel/trace:check the val against the available mem
Hi Steve, On Fri, Mar 30, 2018 at 7:20 AM, Steven Rostedt wrote: > > [ Adding memory management folks to discuss the issue ] > > On Thu, 29 Mar 2018 18:41:44 +0800 > Zhaoyang Huang wrote: > >> It is reported that some user apps would like to echo a huge >> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless >> of the available memory, which will cause the concurrent >> page allocation to fail and introduce OOM. This commit checks the >> value against the available memory first, to avoid the doomed allocation. >> >> Signed-off-by: Zhaoyang Huang >> --- >> kernel/trace/trace.c | 39 ++- >> 1 file changed, 38 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c >> index 2d0ffcc..a4a4237 100644 >> --- a/kernel/trace/trace.c >> +++ b/kernel/trace/trace.c >> @@ -43,6 +43,8 @@ >> #include >> #include >> >> +#include >> +#include >> #include "trace.h" >> #include "trace_output.h" >> >> @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file >> *filp, >> return ret; >> } >> >> +static long get_available_mem(void) >> +{ >> + struct sysinfo i; >> + long available; >> + unsigned long pagecache; >> + unsigned long wmark_low = 0; >> + unsigned long pages[NR_LRU_LISTS]; >> + struct zone *zone; >> + int lru; >> + >> + si_meminfo(&i); >> + si_swapinfo(&i); >> + >> + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) >> + pages[lru] = global_page_state(NR_LRU_BASE + lru); >> + >> + for_each_zone(zone) >> + wmark_low += zone->watermark[WMARK_LOW]; >> + >> + available = i.freeram - wmark_low; >> + >> + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; >> + pagecache -= min(pagecache / 2, wmark_low); >> + available += pagecache; >> + >> + available += global_page_state(NR_SLAB_RECLAIMABLE) - >> + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); >> + >> + if (available < 0) >> + available = 0; >> + return available; >> +} >> + > > As I stated in my other reply, the above function does not belong in > tracing. > > That said, it appears you are having issues that were caused by > commit 848618857d2 ("tracing/ring_buffer: Try harder to > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of > NORETRY was to keep allocations of the tracing ring-buffer from causing > OOMs. But the RETRY was too strong in that case, because there were Yes this was discussed with -mm folks. Basically the problem we were seeing is that on devices with tonnes of free memory (free, but used by the page cache) that memory was not being used, so it was unnecessarily failing to allocate the ring buffer on a system with otherwise lots of memory. > those who wanted to allocate large ring buffers but it would fail due > to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL > is to allocate with reclaim but still be allowed to fail, and isn't supposed > to trigger an OOM. From my own tests, this is obviously not the case. > IIRC, the OOM that my patch was trying to avoid was being triggered in the path/context of the write to buffer_size_kb itself (when not doing the NORETRY), not by other processes. > Perhaps this is because the ring buffer allocates one page at a time, > and by doing so, it can get every last available page, and if anything > in the meantime does an allocation without MAYFAIL, it will cause an > OOM. 
For example, when I stressed this I triggered this: > > pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), > nodemask=(null), order=0, oom_score_adj=0 > pool cpuset=/ mems_allowed=0 > CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663 > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 > 07/14/2016 > Call Trace: > dump_stack+0x8e/0xce > dump_header.isra.30+0x6e/0x28f > ? _raw_spin_unlock_irqrestore+0x30/0x60 > oom_kill_process+0x218/0x400 > ? has_capability_noaudit+0x17/0x20 > out_of_memory+0xe3/0x5c0 > __alloc_pages_slowpath+0xa8e/0xe50 > __alloc_pages_nodemask+0x206/0x220 > alloc_pages_current+0x6a/0xe0 > __page_cache_alloc+0x6a/0xa0 > filemap_fault+0x208/0x5f0 > ? __might_sleep+0x4a/0x80 > ext4_filemap_fault+0x31/0x44 > __do_fault+0x20/0xd0 > __handle_mm_fault+0xc08/0x1160 > handle_mm_fault+0x76/0x110 > __do_page_fault+0x299/0x580 > do_page_fault+0x2d/0x110 > ? page_fault+0x2f/0x50 > page_fault+0x45/0x50 But this OOM is not in the path of the buffer_size_kb write, right? So then what does it have to do with buffer_size_kb write failure? I guess the original issue reported is that the buffer_size_kb write causes *other* applications to fail allocation. So in that case, capping the amount that ftrace writes makes sense. Basically my point is I don't see how the patch you mentioned introduces the problem here - in the sense the patch just makes ftrace allocate from memory it couldn't before and to try harder. > I wonder if I should have the ring buffer allocate groups of pages, to > avoid this. Or try to allocate with NORETRY, one page at a time, and > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that > may keep it from causing an OOM? I don't see immediately how that can prevent an OOM in other applications here? If ftrace allocates lots of memory with RETRY_MAYFAIL, then we would still OOM in other applications if memory isn't available. Sorry if I missed something. thanks, - Joel
Re: [PATCH v1] kernel/trace:check the val against the available mem
[ Adding memory management folks to discuss the issue ] On Thu, 29 Mar 2018 18:41:44 +0800 Zhaoyang Huang wrote: > It is reported that some user apps would like to echo a huge > number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless > of the available memory, which will cause the concurrent > page allocation to fail and introduce OOM. This commit checks the > value against the available memory first, to avoid the doomed allocation. > > Signed-off-by: Zhaoyang Huang > --- > kernel/trace/trace.c | 39 ++- > 1 file changed, 38 insertions(+), 1 deletion(-) > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index 2d0ffcc..a4a4237 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -43,6 +43,8 @@ > #include > #include > > +#include > +#include > #include "trace.h" > #include "trace_output.h" > > @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file > *filp, > return ret; > } > > +static long get_available_mem(void) > +{ > + struct sysinfo i; > + long available; > + unsigned long pagecache; > + unsigned long wmark_low = 0; > + unsigned long pages[NR_LRU_LISTS]; > + struct zone *zone; > + int lru; > + > + si_meminfo(&i); > + si_swapinfo(&i); > + > + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) > + pages[lru] = global_page_state(NR_LRU_BASE + lru); > + > + for_each_zone(zone) > + wmark_low += zone->watermark[WMARK_LOW]; > + > + available = i.freeram - wmark_low; > + > + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; > + pagecache -= min(pagecache / 2, wmark_low); > + available += pagecache; > + > + available += global_page_state(NR_SLAB_RECLAIMABLE) - > + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); > + > + if (available < 0) > + available = 0; > + return available; > +} > + As I stated in my other reply, the above function does not belong in tracing. That said, it appears you are having issues that were caused by commit 848618857d2 ("tracing/ring_buffer: Try harder to allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of NORETRY was to keep allocations of the tracing ring-buffer from causing OOMs. But the RETRY was too strong in that case, because there were those who wanted to allocate large ring buffers but it would fail due to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL is to allocate with reclaim but still be allowed to fail, and isn't supposed to trigger an OOM. From my own tests, this is obviously not the case. Perhaps this is because the ring buffer allocates one page at a time, and by doing so, it can get every last available page, and if anything in the meantime does an allocation without MAYFAIL, it will cause an OOM. For example, when I stressed this I triggered this: pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 pool cpuset=/ mems_allowed=0 CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 Call Trace: dump_stack+0x8e/0xce dump_header.isra.30+0x6e/0x28f ? _raw_spin_unlock_irqrestore+0x30/0x60 oom_kill_process+0x218/0x400 ? has_capability_noaudit+0x17/0x20 out_of_memory+0xe3/0x5c0 __alloc_pages_slowpath+0xa8e/0xe50 __alloc_pages_nodemask+0x206/0x220 alloc_pages_current+0x6a/0xe0 __page_cache_alloc+0x6a/0xa0 filemap_fault+0x208/0x5f0 ? 
__might_sleep+0x4a/0x80 ext4_filemap_fault+0x31/0x44 __do_fault+0x20/0xd0 __handle_mm_fault+0xc08/0x1160 handle_mm_fault+0x76/0x110 __do_page_fault+0x299/0x580 do_page_fault+0x2d/0x110 ? page_fault+0x2f/0x50 page_fault+0x45/0x50 I wonder if I should have the ring buffer allocate groups of pages, to avoid this. Or try to allocate with NORETRY, one page at a time, and when that fails, allocate groups of pages with RETRY_MAYFAIL, and that may keep it from causing an OOM? Thoughts? -- Steve > static ssize_t > tracing_entries_write(struct file *filp, const char __user *ubuf, > size_t cnt, loff_t *ppos) > @@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file > *filp, > struct trace_array *tr = inode->i_private; > unsigned long val; > int ret; > + long available; > > + available = get_available_mem(); > ret = kstrtoul_from_user(ubuf, cnt, 10, &val); > if (ret) > return ret; > > /* must have at least 1 entry */ > - if (!val) > + if (!val || (val > available)) > return -EINVAL; > > /* value is in KB */
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, 30 Mar 2018 11:32:22 +0800 Zhaoyang Huang wrote: > Steve, thanks for your quick feedback. The original purpose is to > avoid this kind of OOM, as the kernel cannot define the behavior of the > application. I think there is no need > to do the alloc->free process if we already know the number of pages > will definitely lead to failure. Furthermore, the app which screws things > up usually escapes the OOM and causes > an innocent process to be sacrificed. A couple of things. Your patch goes way too deep into coupling with the memory management subsystem. If I were to accept this patch, Linus would have a fit. That get_available_mem() does not belong in the tracing code. If you want that feature, you need to get the mm folks to accept it, and then tracing could call it. Next, there may be an issue with GFP_RETRY_MAYFAIL: it should not trigger OOM (although, when it fails, other allocations that happen at that time, before the ring buffer can clean up, will cause an OOM). I'll send an email to folks about this. -- Steve
Re: [Kernel-patch-test] [PATCH v1] kernel/trace:check the val against the available mem
Hi Zhaoyang, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/perf/core] [also build test ERROR on v4.16-rc7 next-20180329] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917 config: x86_64-randconfig-x009-201812 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): kernel//trace/trace.c: In function 'get_available_mem': >> kernel//trace/trace.c:5992:16: error: implicit declaration of function >> 'global_page_state'; did you mean 'global_numa_state'? >> [-Werror=implicit-function-declaration] pages[lru] = global_page_state(NR_LRU_BASE + lru); ^ global_numa_state In file included from include/asm-generic/bug.h:18:0, from arch/x86/include/asm/bug.h:83, from include/linux/bug.h:5, from include/linux/mmdebug.h:5, from include/linux/mm.h:9, from include/linux/ring_buffer.h:5, from kernel//trace/trace.c:14: include/linux/kernel.h:793:16: warning: comparison of distinct pointer types lacks a cast (void) (&min1 == &min2); \ ^ include/linux/kernel.h:802:2: note: in expansion of macro '__min' __min(typeof(x), typeof(y), \ ^ kernel//trace/trace.c:6004:3: note: in expansion of macro 'min' min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); ^~~
Re: [Kernel-patch-test] [PATCH v1] kernel/trace:check the val against the available mem
Hi Zhaoyang, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/perf/core] [also build test ERROR on v4.16-rc7 next-20180329] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917 config: i386-randconfig-x073-201812 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All error/warnings (new ones prefixed by >>): kernel/trace/trace.c: In function 'get_available_mem': >> kernel/trace/trace.c:5992:16: error: implicit declaration of function >> 'global_page_state'; did you mean 'zone_page_state'? >> [-Werror=implicit-function-declaration] pages[lru] = global_page_state(NR_LRU_BASE + lru); ^ zone_page_state In file included from include/asm-generic/bug.h:18:0, from arch/x86/include/asm/bug.h:83, from include/linux/bug.h:5, from include/linux/mmdebug.h:5, from include/linux/mm.h:9, from include/linux/ring_buffer.h:5, from kernel/trace/trace.c:14: include/linux/kernel.h:793:16: warning: comparison of distinct pointer types lacks a cast (void) (&min1 == &min2); \ ^ include/linux/kernel.h:802:2: note: in expansion of macro '__min' __min(typeof(x), typeof(y), \ ^ >> kernel/trace/trace.c:6004:3: note: in expansion of macro 'min' min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low); ^~~ cc1: some warnings being treated as errors vim +5992 kernel/trace/trace.c 5977 5978 static long get_available_mem(void) 5979 { 5980 struct sysinfo i; 5981 long available; 5982 unsigned long pagecache; 5983 unsigned long wmark_low = 0; 5984 unsigned long pages[NR_LRU_LISTS]; 5985 struct zone *zone; 5986 int lru; 5987 5988 si_meminfo(&i); 5989 si_swapinfo(&i); 5990 5991 for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++) > 5992 pages[lru] = global_page_state(NR_LRU_BASE + lru); 5993 5994 for_each_zone(zone) 5995 wmark_low += zone->watermark[WMARK_LOW]; 5996 5997 available = i.freeram - wmark_low; 5998 5999 pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; 6000 pagecache -= min(pagecache / 2, wmark_low); 6001 available += pagecache; 6002 6003 available += global_page_state(NR_SLAB_RECLAIMABLE) - > 6004 min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, > wmark_low); 6005 6006 if (available < 0) 6007 available = 0; 6008 return available; 6009 } 6010 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Fri, Mar 30, 2018 at 12:05 AM, Steven Rostedt wrote:
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang wrote:
>
>> It is reported that some user apps echo a huge number into
>> "/sys/kernel/debug/tracing/buffer_size_kb" regardless of the
>> available memory, which makes the subsequent page allocation fail
>> and triggers an OOM. This commit checks the value against the
>> available memory first, to avoid the doomed allocation.
>>
>
> One of my tests is to stress buffer_size_kb, and it fails nicely if you
> try to get too much. It may cause an OOM, but that's expected.
>
> The application should do the test (try "free" on the command line).
> This isn't something that the kernel should be responsible for. If
> someone wants to allocate all memory for tracing, that's their
> prerogative.
>
> -- Steve

Steve, thanks for your quick feedback. The original purpose is to avoid
this kind of OOM, as the kernel cannot dictate the behavior of the
application. I think there is no need to go through the alloc->free
process when we already know the number of pages will definitely lead
to failure. Furthermore, the app that screws things up usually escapes
the OOM killer, and an innocent process gets sacrificed instead.
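As an aside, the estimate get_available_mem() computes is the MemAvailable heuristic, which the kernel already provides as si_mem_available() (declared in linux/mm.h since v4.6). A minimal sketch of the check-before-allocate idea without duplicating that arithmetic; tracing_check_size() and requested_kb are hypothetical names introduced here for illustration:

	#include <linux/mm.h>	/* si_mem_available() */

	/*
	 * Hypothetical helper: reject a request larger than the current
	 * MemAvailable estimate.  si_mem_available() returns pages, so
	 * convert to KB before comparing.
	 */
	static int tracing_check_size(unsigned long requested_kb)
	{
		unsigned long avail_kb = si_mem_available() << (PAGE_SHIFT - 10);

		return requested_kb > avail_kb ? -ENOMEM : 0;
	}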
Re: [PATCH v1] kernel/trace:check the val against the available mem
On Thu, 29 Mar 2018 18:41:44 +0800
Zhaoyang Huang wrote:

> It is reported that some user apps echo a huge number into
> "/sys/kernel/debug/tracing/buffer_size_kb" regardless of the
> available memory, which makes the subsequent page allocation fail
> and triggers an OOM. This commit checks the value against the
> available memory first, to avoid the doomed allocation.
>

One of my tests is to stress buffer_size_kb, and it fails nicely if you
try to get too much. It may cause an OOM, but that's expected.

The application should do the test (try "free" on the command line).
This isn't something that the kernel should be responsible for. If
someone wants to allocate all memory for tracing, that's their
prerogative.

-- Steve
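The userspace check Steve suggests is small; a sketch in C of reading the same MemAvailable estimate that "free" prints (the field has been in /proc/meminfo since v3.14; error handling kept minimal):

	#include <stdio.h>

	/* Return MemAvailable in KB, or -1 on failure. */
	static long mem_available_kb(void)
	{
		char line[128];
		long kb = -1;
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
				break;
		}
		fclose(f);
		return kb;
	}

The application would compare this against the size it is about to echo into buffer_size_kb before making the request.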
[PATCH v1] kernel/trace:check the val against the available mem
It is reported that some user apps echo a huge number into
"/sys/kernel/debug/tracing/buffer_size_kb" regardless of the available
memory, which makes the subsequent page allocation fail and triggers
an OOM. This commit checks the value against the available memory
first, to avoid the doomed allocation.

Signed-off-by: Zhaoyang Huang
---
 kernel/trace/trace.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2d0ffcc..a4a4237 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -43,6 +43,8 @@
 #include <linux/trace.h>
 #include <linux/sched/rt.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
 	return ret;
 }
 
+static long get_available_mem(void)
+{
+	struct sysinfo i;
+	long available;
+	unsigned long pagecache;
+	unsigned long wmark_low = 0;
+	unsigned long pages[NR_LRU_LISTS];
+	struct zone *zone;
+	int lru;
+
+	si_meminfo(&i);
+	si_swapinfo(&i);
+
+	for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+		pages[lru] = global_page_state(NR_LRU_BASE + lru);
+
+	for_each_zone(zone)
+		wmark_low += zone->watermark[WMARK_LOW];
+
+	available = i.freeram - wmark_low;
+
+	pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+	pagecache -= min(pagecache / 2, wmark_low);
+	available += pagecache;
+
+	available += global_page_state(NR_SLAB_RECLAIMABLE) -
+		min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
+
+	if (available < 0)
+		available = 0;
+	return available;
+}
+
 static ssize_t
 tracing_entries_write(struct file *filp, const char __user *ubuf,
 		      size_t cnt, loff_t *ppos)
@@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
 	struct trace_array *tr = inode->i_private;
 	unsigned long val;
 	int ret;
+	long available;
 
+	available = get_available_mem();
 	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
 	if (ret)
 		return ret;
 
 	/* must have at least 1 entry */
-	if (!val)
+	if (!val || (val > available))
 		return -EINVAL;
 
 	/* value is in KB */
-- 
1.9.1
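Two details of the final hunk are worth flagging. First, val is in kilobytes (the handler shifts it left by 10 just after the quoted context), while get_available_mem() counts pages, so "val > available" compares mismatched units. Second, buffer_size_kb is applied to each CPU's ring buffer, so the real cost is roughly val times the number of CPUs. A hedged sketch of a unit-consistent check, keeping the helper as posted:

	/* sketch: normalise to KB and account for the per-CPU buffers */
	unsigned long avail_kb = get_available_mem() << (PAGE_SHIFT - 10);

	if (!val || val > avail_kb / nr_cpu_ids)
		return -EINVAL;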