Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
From: Eric Dumazet
Date: Tue, 17 Jan 2017 20:14:10 -0800

> From: Eric Dumazet
>
> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
>
> As using memory reserves is a must in some situations (swap over NFS or
> iSCSI), this patch adds this flag.
>
> Note that this driver does not reuse pages (yet) so we do not have to
> add anything else.
>
> Signed-off-by: Eric Dumazet

Applied, thanks Eric.
Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
On Wed, 2017-01-18 at 18:11 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 17:23, Eric Dumazet wrote:
> >
> > Take a look at sk_filter_trim_cap(), where the RX packets received on a
> > socket which does not have SOCK_MEMALLOC are dropped.
> >
> >	/*
> >	 * If the skb was allocated from pfmemalloc reserves, only
> >	 * allow SOCK_MEMALLOC sockets to use it as this socket is
> >	 * helping free memory
> >	 */
> >	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >		return -ENOMEM;
>
> I suppose this happens in BH context right after receiving packet?
>
> Potentially any ACK could free memory in TCP send queue,
> so using reserves here makes sense.

Yes, but only sockets with SOCK_MEMALLOC have this contract with the mm
layer.

For 'other' sockets, one possible trick would be that if only the page
fragment attached to the skb had the pfmemalloc bit, and not the sk_buff
itself, we could attempt a skb_condense() operation [1], but it is not
really easy to properly recompute skb->pfmemalloc.

Pure TCP ACK packets can usually be trimmed by skb_condense(). Since
they have no payload, we have a guarantee they won't sit in a queue and
hold memory.
Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
On 18.01.2017 17:23, Eric Dumazet wrote:
> On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:
> > On 18.01.2017 07:14, Eric Dumazet wrote:
> > > From: Eric Dumazet
> > >
> > > Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> > > ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> > > that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> > >
> > > As using memory reserves is a must in some situations (swap over NFS or
> > > iSCSI), this patch adds this flag.
> >
> > AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
> > are required by the memory reclaimer to free some pages.
> >
> > Allocating RX buffers with __GFP_MEMALLOC is a straight way to
> > depleting all reserves by a flood from the network.
>
> You are mistaken.
>
> How do you think a TCP flow can make progress sending data if no ACK
> packet can go back in RX ?

Well. Ok. I was mistaken.

> Take a look at sk_filter_trim_cap(), where the RX packets received on a
> socket which does not have SOCK_MEMALLOC are dropped.
>
>	/*
>	 * If the skb was allocated from pfmemalloc reserves, only
>	 * allow SOCK_MEMALLOC sockets to use it as this socket is
>	 * helping free memory
>	 */
>	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>		return -ENOMEM;

I suppose this happens in BH context right after receiving packet?

Potentially any ACK could free memory in TCP send queue,
so using reserves here makes sense.

> Also take a look at __dev_alloc_pages()
>
> static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
>					      unsigned int order)
> {
>	/* This piece of code contains several assumptions.
>	 * 1.  This is for device Rx, therefor a cold page is preferred.
>	 * 2.  The expectation is the user wants a compound page.
>	 * 3.  If requesting a order 0 page it will not be compound
>	 *     due to the check to see if order has a value in prep_new_page
>	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
>	 *     code in gfp_to_alloc_flags that should be enforcing this.
>	 */
>	gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;
>
>	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> }
>
> So __GFP_MEMALLOC in RX is absolutely supported.
>
> But drivers have to opt-in, either using __dev_alloc_pages() or
> dev_alloc_pages(), or explicitly ORing __GFP_MEMALLOC when using
> alloc_page[s]().

-- 
Konstantin
Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 07:14, Eric Dumazet wrote:
> > From: Eric Dumazet
> >
> > Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> > ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> > that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> >
> > As using memory reserves is a must in some situations (swap over NFS or
> > iSCSI), this patch adds this flag.
>
> AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
> are required by the memory reclaimer to free some pages.
>
> Allocating RX buffers with __GFP_MEMALLOC is a straight way to
> depleting all reserves by a flood from the network.

You are mistaken.

How do you think a TCP flow can make progress sending data if no ACK
packet can go back in RX ?

Take a look at sk_filter_trim_cap(), where the RX packets received on a
socket which does not have SOCK_MEMALLOC are dropped.

	/*
	 * If the skb was allocated from pfmemalloc reserves, only
	 * allow SOCK_MEMALLOC sockets to use it as this socket is
	 * helping free memory
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;

Also take a look at __dev_alloc_pages()

static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
					     unsigned int order)
{
	/* This piece of code contains several assumptions.
	 * 1.  This is for device Rx, therefor a cold page is preferred.
	 * 2.  The expectation is the user wants a compound page.
	 * 3.  If requesting a order 0 page it will not be compound
	 *     due to the check to see if order has a value in prep_new_page
	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
	 *     code in gfp_to_alloc_flags that should be enforcing this.
	 */
	gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
}

So __GFP_MEMALLOC in RX is absolutely supported.

But drivers have to opt-in, either using __dev_alloc_pages() or
dev_alloc_pages(), or explicitly ORing __GFP_MEMALLOC when using
alloc_page[s]().
Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
On 18.01.2017 07:14, Eric Dumazet wrote:
> From: Eric Dumazet
>
> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
>
> As using memory reserves is a must in some situations (swap over NFS or
> iSCSI), this patch adds this flag.

AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
are required by the memory reclaimer to free some pages.

Allocating RX buffers with __GFP_MEMALLOC is a straight way to
depleting all reserves by a flood from the network.

> Note that this driver does not reuse pages (yet) so we do not have to
> add anything else.
>
> Signed-off-by: Eric Dumazet
> Cc: Konstantin Khlebnikov
> Cc: Tariq Toukan
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -706,7 +706,8 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
>  	do {
>  		if (mlx4_en_prepare_rx_desc(priv, ring,
>  					    ring->prod & ring->size_mask,
> -					    GFP_ATOMIC | __GFP_COLD))
> +					    GFP_ATOMIC | __GFP_COLD |
> +					    __GFP_MEMALLOC))
>  			break;
>  		ring->prod++;
>  	} while (--missing);

-- 
Konstantin
[PATCH net-next] mlx4: support __GFP_MEMALLOC for rx
From: Eric Dumazet

Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

As using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds this flag.

Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet
Cc: Konstantin Khlebnikov
Cc: Tariq Toukan
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -706,7 +706,8 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
 	do {
 		if (mlx4_en_prepare_rx_desc(priv, ring,
 					    ring->prod & ring->size_mask,
-					    GFP_ATOMIC | __GFP_COLD))
+					    GFP_ATOMIC | __GFP_COLD |
+					    __GFP_MEMALLOC))
 			break;
 		ring->prod++;
 	} while (--missing);