Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
On Fri 28-12-18 14:23:29, Konstantin Khorenko wrote:
> On 12/27/2018 07:46 PM, Michal Hocko wrote:
> > On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
> >> Hi Michal,
> >>
> >> thank you very much for your questions, please see my notes below.
> >>
> >> On 12/26/2018 11:35 AM, Michal Hocko wrote:
> >>> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> >>>> Q: Why do we need to bother at all?
> >>>> A: If a node is highly loaded and its memory is significantly
> >>>> fragmented (unfortunately almost any node with serious load has
> >>>> highly fragmented memory) then any high order memory allocation can
> >>>> trigger massive memory shrink and result in quite a big allocation
> >>>> latency. And the node becomes less responsive and users don't like it.
> >>>> The ultimate solution here is to get rid of large allocations, but we
> >>>> need an instrument to detect them.
> >>>
> >>> Can you point to an example of the problem you are referring here? At
> >>> least for costly orders we do bail out early and try to not cause
> >>> massive reclaim. So what is the order that you are concerned about?
> >>
> >> Well, this is the most difficult question to answer.
> >> Unfortunately i don't have a reproducer for that, usually we get into
> >> situation when someone experiences significant node slowdown, nodes most
> >> often have a lot of RAM, we check what is going on there and see the node
> >> is busy with reclaim.
> >> And almost every time the reason was - fragmented memory and high order
> >> allocations.
> >> Mostly of 2nd and 3rd (which is still considered not costly) order.
> >>
> >> Recent related issues we faced were about FUSE dev pipe:
> >> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer
> >> structs.")
> >>
> >> and about bnx driver + mtu 9000 which for each packet required page of
> >> 2nd order
> >> (and it even failed sometimes, though it was not the root cause):
> >> kswapd0: page allocation failure: order:2, mode:0x4020
> >> Call Trace:
> >>     dump_stack+0x19/0x1b
> >>     warn_alloc_failed+0x110/0x180
> >>     __alloc_pages_nodemask+0x7bf/0xc60
> >>     alloc_pages_current+0x98/0x110
> >>     kmalloc_order+0x18/0x40
> >>     kmalloc_order_trace+0x26/0xa0
> >>     __kmalloc+0x279/0x290
> >>     bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
> >>     bnx2x_rx_int+0x227/0x17c0 [bnx2x]
> >>     bnx2x_poll+0x1dd/0x260 [bnx2x]
> >>     net_rx_action+0x179/0x390
> >>     __do_softirq+0x10f/0x2aa
> >>     call_softirq+0x1c/0x30
> >>     do_softirq+0x65/0xa0
> >>     irq_exit+0x105/0x110
> >>     do_IRQ+0x56/0xe0
> >>     common_interrupt+0x6d/0x6d
> >>
> >> And as both places were called very often - the system latency was high.
> >>
> >> This warning can be also used to catch allocation of 4th order and higher
> >> which may easily fail. Those places which are ready to get allocation
> >> errors and have fallbacks are marked with __GFP_NOWARN.
> >
> > This is not true in general, though.
>
> Right now - yep, this is not true. Sorry, i was not clear enough here.
> i meant after we catch all high order allocations, we review them and either
> switch to kvmalloc() or mark with NOWARN if we are ready to handle allocation
> errors in that particular place. So this is an ideal situation when we
> reviewed all the things.

I do not think that making all high order allocations NOWARN is
reasonable or even correct. There are allocations which really require
a physically contiguous memory range of a larger size. And we might want
to warn about those and they might be perfectly fine.

[...]
> > I believe that
> > for your particular use case it is much better to simply enable reclaim
> > and page allocator tracepoints which will give you not only the source
> > of the allocation but also a much better picture
>
> Tracepoints are much better for issues investigation, right. And we do so.
>
> And warnings are intended not for investigation but for prevention of
> possible future issues.
> If we spot a big allocation, we can review it in advance, before we face any
> problem, and in most cases just switch it to use kvmalloc() in 90% cases and
> we never ever have a problem with unexpected reclaim due to this allocation.
> Ever.
> With any reclaim algorithm - the compaction just won't be triggered.

I do not think this is realistic, to be honest. As mentioned before
there are simply valid high order allocations.

[...]
> > But the warning alone will not give us useful information I am afraid.
> > It will only give us, there are warnings but not whether those are
> > actually a problem or not.
>
> Yes. And even more - a lot of high order allocations which cannot be
> switched to kvmalloc() are in drivers - for DMA zones - so they are very
> rare and most probably won't ever cause a problem.
>
> But some of them can potentially cause a problem
Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
On 12/27/2018 07:46 PM, Michal Hocko wrote:
> On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
>> Hi Michal,
>>
>> thank you very much for your questions, please see my notes below.
>>
>> On 12/26/2018 11:35 AM, Michal Hocko wrote:
>>> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
>>>> Q: Why do we need to bother at all?
>>>> A: If a node is highly loaded and its memory is significantly fragmented
>>>> (unfortunately almost any node with serious load has highly fragmented
>>>> memory) then any high order memory allocation can trigger massive memory
>>>> shrink and result in quite a big allocation latency. And the node becomes
>>>> less responsive and users don't like it.
>>>> The ultimate solution here is to get rid of large allocations, but we
>>>> need an instrument to detect them.
>>>
>>> Can you point to an example of the problem you are referring here? At
>>> least for costly orders we do bail out early and try to not cause
>>> massive reclaim. So what is the order that you are concerned about?
>>
>> Well, this is the most difficult question to answer.
>> Unfortunately i don't have a reproducer for that, usually we get into
>> situation when someone experiences significant node slowdown, nodes most
>> often have a lot of RAM, we check what is going on there and see the node
>> is busy with reclaim.
>> And almost every time the reason was - fragmented memory and high order
>> allocations.
>> Mostly of 2nd and 3rd (which is still considered not costly) order.
>>
>> Recent related issues we faced were about FUSE dev pipe:
>> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
>>
>> and about bnx driver + mtu 9000 which for each packet required page of 2nd
>> order
>> (and it even failed sometimes, though it was not the root cause):
>> kswapd0: page allocation failure: order:2, mode:0x4020
>> Call Trace:
>>     dump_stack+0x19/0x1b
>>     warn_alloc_failed+0x110/0x180
>>     __alloc_pages_nodemask+0x7bf/0xc60
>>     alloc_pages_current+0x98/0x110
>>     kmalloc_order+0x18/0x40
>>     kmalloc_order_trace+0x26/0xa0
>>     __kmalloc+0x279/0x290
>>     bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
>>     bnx2x_rx_int+0x227/0x17c0 [bnx2x]
>>     bnx2x_poll+0x1dd/0x260 [bnx2x]
>>     net_rx_action+0x179/0x390
>>     __do_softirq+0x10f/0x2aa
>>     call_softirq+0x1c/0x30
>>     do_softirq+0x65/0xa0
>>     irq_exit+0x105/0x110
>>     do_IRQ+0x56/0xe0
>>     common_interrupt+0x6d/0x6d
>>
>> And as both places were called very often - the system latency was high.
>>
>> This warning can be also used to catch allocation of 4th order and higher
>> which may easily fail. Those places which are ready to get allocation
>> errors and have fallbacks are marked with __GFP_NOWARN.
>
> This is not true in general, though.

Right now - yep, this is not true. Sorry, i was not clear enough here.
i meant after we catch all high order allocations, we review them and either
switch to kvmalloc() or mark with NOWARN if we are ready to handle allocation
errors in that particular place. So this is an ideal situation when we
reviewed all the things.

> [...]
>> But after it's done and there are no (almost) unmarked high order
>> allocations - why not? This will reveal new cases of high order
>> allocations soon.
>
> There will always be legitimate high order allocations.

Sure. But after we review them we either switch them to kvmalloc() or mark
them with NOWARN. In both cases we won't get new warnings about those places.

> I believe that
> for your particular use case it is much better to simply enable reclaim
> and page allocator tracepoints which will give you not only the source
> of the allocation but also a much better picture

Tracepoints are much better for issues investigation, right. And we do so.

And warnings are intended not for investigation but for prevention of
possible future issues.
If we spot a big allocation, we can review it in advance, before we face any
problem, and in most cases just switch it to use kvmalloc() in 90% cases and
we never ever have a problem with unexpected reclaim due to this allocation.
Ever.
With any reclaim algorithm - the compaction just won't be triggered.

>> i think people who run systems with "kernel.panic_on_warn" enabled do care
>> about reporting issues.
>
> You surely do not want to put the system down just because of the high
> order allocation though, right?

Right, i do not. (And i also don't want to run a node with
"kernel.panic_on_warn" enabled in production :) )

But people who do run nodes with "kernel.panic_on_warn" enabled in production
may disable the high order allocation warning by increasing the warning order
level higher than MAX_ORDER. Or just not enable the kernel config option.

i do understand the warning will be noisy at the beginning thus i surely
don't even suggest to make it enabled by default now.

>>>> Q: Why
Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
> Hi Michal,
>
> thank you very much for your questions, please see my notes below.
>
> On 12/26/2018 11:35 AM, Michal Hocko wrote:
> > On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> >> Q: Why do we need to bother at all?
> >> A: If a node is highly loaded and its memory is significantly fragmented
> >> (unfortunately almost any node with serious load has highly fragmented
> >> memory) then any high order memory allocation can trigger massive memory
> >> shrink and result in quite a big allocation latency. And the node becomes
> >> less responsive and users don't like it.
> >> The ultimate solution here is to get rid of large allocations, but we
> >> need an instrument to detect them.
> >
> > Can you point to an example of the problem you are referring here? At
> > least for costly orders we do bail out early and try to not cause
> > massive reclaim. So what is the order that you are concerned about?
>
> Well, this is the most difficult question to answer.
> Unfortunately i don't have a reproducer for that, usually we get into
> situation when someone experiences significant node slowdown, nodes most
> often have a lot of RAM, we check what is going on there and see the node
> is busy with reclaim.
> And almost every time the reason was - fragmented memory and high order
> allocations.
> Mostly of 2nd and 3rd (which is still considered not costly) order.
>
> Recent related issues we faced were about FUSE dev pipe:
> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
>
> and about bnx driver + mtu 9000 which for each packet required page of 2nd
> order
> (and it even failed sometimes, though it was not the root cause):
> kswapd0: page allocation failure: order:2, mode:0x4020
> Call Trace:
>     dump_stack+0x19/0x1b
>     warn_alloc_failed+0x110/0x180
>     __alloc_pages_nodemask+0x7bf/0xc60
>     alloc_pages_current+0x98/0x110
>     kmalloc_order+0x18/0x40
>     kmalloc_order_trace+0x26/0xa0
>     __kmalloc+0x279/0x290
>     bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
>     bnx2x_rx_int+0x227/0x17c0 [bnx2x]
>     bnx2x_poll+0x1dd/0x260 [bnx2x]
>     net_rx_action+0x179/0x390
>     __do_softirq+0x10f/0x2aa
>     call_softirq+0x1c/0x30
>     do_softirq+0x65/0xa0
>     irq_exit+0x105/0x110
>     do_IRQ+0x56/0xe0
>     common_interrupt+0x6d/0x6d
>
> And as both places were called very often - the system latency was high.
>
> This warning can be also used to catch allocation of 4th order and higher
> which may easily fail. Those places which are ready to get allocation
> errors and have fallbacks are marked with __GFP_NOWARN.

This is not true in general, though.

[...]
> But after it's done and there are no (almost) unmarked high order
> allocations - why not? This will reveal new cases of high order
> allocations soon.

There will always be legitimate high order allocations. I believe that
for your particular use case it is much better to simply enable reclaim
and page allocator tracepoints which will give you not only the source
of the allocation but also a much better picture

> i think people who run systems with "kernel.panic_on_warn" enabled do care
> about reporting issues.

You surely do not want to put the system down just because of the high
order allocation though, right?

> >> Q: Why compile time config option?
> >> A: In order not to decrease the performance even a bit in case someone
> >> does not want to hunt for large allocations.
> >> In an ideal life i'd prefer this check/warning is enabled by default and
> >> may be even without a config option so it works on every node. Once we
> >> find and rework or mark all large allocations that would be good by
> >> default. Until that though it will be noisy.
> >
> > So who is going to enable this option?
>
> At the beginning - people who want to debug kernel and verify their
> fallbacks on memory allocations failures in the code or just speed up their
> code on nodes with fragmented memory - for 2nd and 3rd orders.
>
> mm performance issues are tough, you know, and this is just another way to
> gain more performance. It won't avoid the necessity of digging mm for sure,
> but might decrease the pressure level.

But the warning alone will not give us useful information I am afraid.
It will only tell us that there are warnings, but not whether those are
actually a problem or not. So I really believe that using existing
tracepoints, or adding some that fill the missing gaps, will be much
better long term. And we do not have to add another config and touch
the code as a bonus.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
Hi Michal,

thank you very much for your questions, please see my notes below.

On 12/26/2018 11:35 AM, Michal Hocko wrote:
> On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
>> Q: Why do we need to bother at all?
>> A: If a node is highly loaded and its memory is significantly fragmented
>> (unfortunately almost any node with serious load has highly fragmented
>> memory) then any high order memory allocation can trigger massive memory
>> shrink and result in quite a big allocation latency. And the node becomes
>> less responsive and users don't like it.
>> The ultimate solution here is to get rid of large allocations, but we need
>> an instrument to detect them.
>
> Can you point to an example of the problem you are referring here? At
> least for costly orders we do bail out early and try to not cause
> massive reclaim. So what is the order that you are concerned about?

Well, this is the most difficult question to answer.
Unfortunately i don't have a reproducer for that, usually we get into
situation when someone experiences significant node slowdown, nodes most
often have a lot of RAM, we check what is going on there and see the node is
busy with reclaim.
And almost every time the reason was - fragmented memory and high order
allocations.
Mostly of 2nd and 3rd (which is still considered not costly) order.

Recent related issues we faced were about FUSE dev pipe:
d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")

and about bnx driver + mtu 9000 which for each packet required page of 2nd
order
(and it even failed sometimes, though it was not the root cause):
kswapd0: page allocation failure: order:2, mode:0x4020
Call Trace:
    dump_stack+0x19/0x1b
    warn_alloc_failed+0x110/0x180
    __alloc_pages_nodemask+0x7bf/0xc60
    alloc_pages_current+0x98/0x110
    kmalloc_order+0x18/0x40
    kmalloc_order_trace+0x26/0xa0
    __kmalloc+0x279/0x290
    bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
    bnx2x_rx_int+0x227/0x17c0 [bnx2x]
    bnx2x_poll+0x1dd/0x260 [bnx2x]
    net_rx_action+0x179/0x390
    __do_softirq+0x10f/0x2aa
    call_softirq+0x1c/0x30
    do_softirq+0x65/0xa0
    irq_exit+0x105/0x110
    do_IRQ+0x56/0xe0
    common_interrupt+0x6d/0x6d

And as both places were called very often - the system latency was high.

This warning can be also used to catch allocation of 4th order and higher
which may easily fail. Those places which are ready to get allocation errors
and have fallbacks are marked with __GFP_NOWARN.

>> Q: Why warning? Use tracepoints!
>> A: Well, this is a matter of magic defaults.
>> Yes, you can use tracepoints to catch large allocations, but you need to do
>> this on purpose and regularly and this is to be done by every developer
>> which is quite unreal.
>> On the other hand if you develop something and get a warning, you'll have to
>> think about the reason and either succeed with reworking the code to use
>> smaller allocation sizes (and thus decrease allocation latency!) or just use
>> kvmalloc() if you don't really need physically contiguous chunk or come to
>> the conclusion you definitely need physically contiguous memory and shut up
>> the warning.
>
> Well, not really. For one thing, there are systems to panic on warning
> and you really do not want to blow up just because somebody is doing a
> large order allocation.

Well, on one hand - yes, i agree with you.
That's why i don't suggest to enable the warning by default right now - until
all (most) of large allocations are marked properly.

But after it's done and there are no (almost) unmarked high order
allocations - why not? This will reveal new cases of high order allocations
soon.

i think people who run systems with "kernel.panic_on_warn" enabled do care
about reporting issues.

>> Q: Why compile time config option?
>> A: In order not to decrease the performance even a bit in case someone does
>> not want to hunt for large allocations.
>> In an ideal life i'd prefer this check/warning is enabled by default and may
>> be even without a config option so it works on every node. Once we find and
>> rework or mark all large allocations that would be good by default. Until
>> that though it will be noisy.
>
> So who is going to enable this option?

At the beginning - people who want to debug kernel and verify their fallbacks
on memory allocations failures in the code or just speed up their code on
nodes with fragmented memory - for 2nd and 3rd orders.

mm performance issues are tough, you know, and this is just another way to
gain more performance. It won't avoid the necessity of digging mm for sure,
but might decrease the pressure level.

Later (i hope) it could be enabled by default so all big new allocations are
verified sooner and either reworked or marked with __GFP_NOWARN if the code
is ready.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team
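
For readers following the "2nd and 3rd order" discussion above, the arithmetic can be sketched in plain userspace C. This is only an illustration, not kernel code: it assumes 4 KiB pages, `order_for_size()` is a hypothetical helper that mimics what the kernel's get_order() computes, and `SKETCH_COSTLY_ORDER` mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER (3), above which an allocation is treated as costly.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE    4096UL /* assumption: 4 KiB pages */
#define SKETCH_COSTLY_ORDER 3      /* mirrors PAGE_ALLOC_COSTLY_ORDER */

/*
 * Hypothetical helper: smallest order such that
 * (SKETCH_PAGE_SIZE << order) >= size. An order-N allocation
 * is 2^N physically contiguous pages.
 */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Orders above the costly threshold are the ones that bail out early. */
static int is_costly_order(unsigned int order)
{
	return order > SKETCH_COSTLY_ORDER;
}
```

Under these assumptions a buffer sized for a 9000-byte MTU does not fit in two pages (8192 bytes) and so lands on order 2, which matches the order:2 failure in the trace above. Orders 2 and 3 are not "costly" by this definition, which is why they do not get the early bail-out.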
Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> Q: Why do we need to bother at all?
> A: If a node is highly loaded and its memory is significantly fragmented
> (unfortunately almost any node with serious load has highly fragmented memory)
> then any high order memory allocation can trigger massive memory shrink and
> result in quite a big allocation latency. And the node becomes less responsive
> and users don't like it.
> The ultimate solution here is to get rid of large allocations, but we need an
> instrument to detect them.

Can you point to an example of the problem you are referring here? At
least for costly orders we do bail out early and try to not cause
massive reclaim. So what is the order that you are concerned about?

> Q: Why warning? Use tracepoints!
> A: Well, this is a matter of magic defaults.
> Yes, you can use tracepoints to catch large allocations, but you need to do
> this on purpose and regularly and this is to be done by every developer
> which is quite unreal.
> On the other hand if you develop something and get a warning, you'll have to
> think about the reason and either succeed with reworking the code to use
> smaller allocation sizes (and thus decrease allocation latency!) or just use
> kvmalloc() if you don't really need physically contiguous chunk or come to
> the conclusion you definitely need physically contiguous memory and shut up
> the warning.

Well, not really. For one thing, there are systems to panic on warning
and you really do not want to blow up just because somebody is doing a
large order allocation.

> Q: Why compile time config option?
> A: In order not to decrease the performance even a bit in case someone does
> not want to hunt for large allocations.
> In an ideal life i'd prefer this check/warning is enabled by default and may
> be even without a config option so it works on every node. Once we find and
> rework or mark all large allocations that would be good by default. Until
> that though it will be noisy.

So who is going to enable this option?

> Another option is to rework the patch via static keys (having the warning
> disabled by default surely). That makes it possible to turn on the feature
> without recompiling the kernel - during testing period for example.
>
> If you prefer this way, i would be happy to rework the patch via static keys.

I would rather go and chase the underlying issue. So can we get actual
data please?
-- 
Michal Hocko
SUSE Labs
[RFC PATCH 0/1] mm: add a warning about high order allocations
Q: Why do we need to bother at all?
A: If a node is highly loaded and its memory is significantly fragmented
(unfortunately almost any node with serious load has highly fragmented memory)
then any high order memory allocation can trigger massive memory shrink and
result in quite a big allocation latency. And the node becomes less responsive
and users don't like it.
The ultimate solution here is to get rid of large allocations, but we need an
instrument to detect them.

Q: Why warning? Use tracepoints!
A: Well, this is a matter of magic defaults.
Yes, you can use tracepoints to catch large allocations, but you need to do
this on purpose and regularly and this is to be done by every developer which
is quite unreal.
On the other hand if you develop something and get a warning, you'll have to
think about the reason and either succeed with reworking the code to use
smaller allocation sizes (and thus decrease allocation latency!) or just use
kvmalloc() if you don't really need physically contiguous chunk or come to the
conclusion you definitely need physically contiguous memory and shut up the
warning.

Q: Why compile time config option?
A: In order not to decrease the performance even a bit in case someone does
not want to hunt for large allocations.
In an ideal life i'd prefer this check/warning is enabled by default and may
be even without a config option so it works on every node. Once we find and
rework or mark all large allocations that would be good by default. Until
that though it will be noisy.

Another option is to rework the patch via static keys (having the warning
disabled by default surely). That makes it possible to turn on the feature
without recompiling the kernel - during testing period for example.

If you prefer this way, i would be happy to rework the patch via static keys.

Konstantin Khorenko (1):
  mm/page_alloc: add warning about high order allocations

 kernel/sysctl.c | 15 +++
 mm/Kconfig      | 18 ++
 mm/page_alloc.c | 25 +
 3 files changed, 58 insertions(+)

-- 
2.15.1
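
The patch body is not part of this cover letter, so the following is only a userspace sketch of the kind of check the thread describes: warn when the order reaches a tunable threshold, unless the caller opted out. Every name here is hypothetical (`SKETCH_GFP_NOWARN` stands in for __GFP_NOWARN, `warn_order` for the sysctl-controlled level), the MAX_ORDER value of 11 is an assumption, and so is the choice of `>=` for the comparison.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_GFP_NOWARN 0x1u /* hypothetical stand-in for __GFP_NOWARN */
#define SKETCH_MAX_ORDER  11   /* assumption: typical MAX_ORDER value */

/*
 * Hypothetical tunable. Starting above MAX_ORDER means no allocation
 * can ever reach it, so the warning is effectively off by default.
 */
static unsigned int warn_order = SKETCH_MAX_ORDER + 1;

/* Should an allocation of this order, with these flags, be reported? */
static bool should_warn_high_order(unsigned int order, unsigned int gfp_flags)
{
	/* Callers with a fallback have marked themselves; stay silent. */
	if (gfp_flags & SKETCH_GFP_NOWARN)
		return false;

	return order >= warn_order;
}
```

With the threshold left above MAX_ORDER nothing is reported, which is the escape hatch mentioned in the discussion for systems running with kernel.panic_on_warn. Lowering it to 2 would flag the order-2 and order-3 allocations the thread is concerned with, while callers that pass the NOWARN flag remain silent.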