Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 17 Jan 2017 20:14:10 -0800

> From: Eric Dumazet 
> 
> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> ring with __GFP_NOMEMALLOC") added code that appears not to have been
> needed at the time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> 
> Since using memory reserves is a must in some situations (swap over NFS or
> iSCSI), this patch adds the __GFP_MEMALLOC flag.
> 
> Note that this driver does not reuse pages (yet) so we do not have to
> add anything else.
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.


Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-18 Thread Eric Dumazet
On Wed, 2017-01-18 at 18:11 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 17:23, Eric Dumazet wrote:

> >
> > Take a look at sk_filter_trim_cap(), where RX packets received on a
> > socket which does not have SOCK_MEMALLOC are dropped.
> >
> > /*
> >  * If the skb was allocated from pfmemalloc reserves, only
> >  * allow SOCK_MEMALLOC sockets to use it as this socket is
> >  * helping free memory
> >  */
> > if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> > return -ENOMEM;
> 
> I suppose this happens in BH context right after receiving a packet?
> 
> Potentially any ACK could free memory in the TCP send queue,
> so using reserves here makes sense.

Yes, but only sockets with SOCK_MEMALLOC have this contract with the mm
layer.
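
For reference, a minimal sketch of that contract (the wrapper function below
is made up for illustration; sk_set_memalloc() is the real hook used by
swap-over-NFS/iSCSI transports):

static void swap_transport_mark_socket(struct socket *sock)
{
	struct sock *sk = sock->sk;

	/* Sets SOCK_MEMALLOC and ORs __GFP_MEMALLOC into sk->sk_allocation,
	 * so pfmemalloc skbs pass the sk_filter_trim_cap() check above.
	 */
	sk_set_memalloc(sk);
}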

For 'other' sockets, one possible trick would be: if only the page
fragment attached to the skb had the pfmemalloc bit, and not the sk_buff
itself, we could attempt a skb_condense() operation [1], but it is not
really easy to properly recompute skb->pfmemalloc.

Pure TCP ACK packets can usually be trimmed by skb_condense().
Since they have no payload, we have a guarantee they won't sit in a queue
and hold memory.
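
A rough sketch of what such a recompute could look like, assuming the state
only has to be derived from the head page and the page fragments still
attached after skb_condense(); the helper name is made up and frag_list
skbs are ignored for brevity:

static void skb_recompute_pfmemalloc(struct sk_buff *skb)
{
	bool pfmemalloc = false;
	int i;

	/* linear data may itself live in a page fragment */
	if (skb->head_frag &&
	    page_is_pfmemalloc(virt_to_head_page(skb->head)))
		pfmemalloc = true;

	/* any remaining paged fragment taken from reserves taints the skb */
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		if (page_is_pfmemalloc(skb_frag_page(&skb_shinfo(skb)->frags[i])))
			pfmemalloc = true;

	skb->pfmemalloc = pfmemalloc;
}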




Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-18 Thread Konstantin Khlebnikov

On 18.01.2017 17:23, Eric Dumazet wrote:

On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:

On 18.01.2017 07:14, Eric Dumazet wrote:

From: Eric Dumazet 

Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears not to have been
needed at the time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

Since using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds the __GFP_MEMALLOC flag.


AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
are required by the memory reclaimer to free some pages.

Allocating RX buffers with __GFP_MEMALLOC is a straight way to
deplete all reserves under a flood of traffic from the network.


You are mistaken.

How do you think a TCP flow can make progress sending data if no ACK
packet can go back in RX?


Well. Ok. I was mistaken.



Take a look at sk_filter_trim_cap(), where RX packets received on a
socket which does not have SOCK_MEMALLOC are dropped.

/*
 * If the skb was allocated from pfmemalloc reserves, only
 * allow SOCK_MEMALLOC sockets to use it as this socket is
 * helping free memory
 */
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;


I suppose this happens in BH context right after receiving a packet?

Potentially any ACK could free memory in the TCP send queue,
so using reserves here makes sense.



Also take a look at __dev_alloc_pages()

static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 unsigned int order)
{
/* This piece of code contains several assumptions.
 * 1.  This is for device Rx, therefor a cold page is preferred.
 * 2.  The expectation is the user wants a compound page.
 * 3.  If requesting a order 0 page it will not be compound
 * due to the check to see if order has a value in prep_new_page
 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 * code in gfp_to_alloc_flags that should be enforcing this.
 */
gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
}


So __GFP_MEMALLOC in RX is absolutely supported.

But drivers have to opt in, either by using __dev_alloc_pages() or
dev_alloc_pages(), or by explicitly ORing in __GFP_MEMALLOC when using
alloc_page[s]().






--
Konstantin


Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-18 Thread Eric Dumazet
On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 07:14, Eric Dumazet wrote:
> > From: Eric Dumazet 
> >
> > Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> > ring with __GFP_NOMEMALLOC") added code that appears not to have been
> > needed at the time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> >
> > Since using memory reserves is a must in some situations (swap over NFS or
> > iSCSI), this patch adds the __GFP_MEMALLOC flag.
> 
> AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
> are required by the memory reclaimer to free some pages.
> 
> Allocating RX buffers with __GFP_MEMALLOC is a straight way to
> deplete all reserves under a flood of traffic from the network.

You are mistaken.

How do you think a TCP flow can make progress sending data if no ACK
packet can go back in RX?

Take a look at sk_filter_trim_cap(), where RX packets received on a
socket which does not have SOCK_MEMALLOC are dropped.

/*
 * If the skb was allocated from pfmemalloc reserves, only
 * allow SOCK_MEMALLOC sockets to use it as this socket is
 * helping free memory
 */
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;

Also take a look at __dev_alloc_pages()

static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 unsigned int order)
{
/* This piece of code contains several assumptions.
 * 1.  This is for device Rx, therefor a cold page is preferred.
 * 2.  The expectation is the user wants a compound page.
 * 3.  If requesting a order 0 page it will not be compound
 * due to the check to see if order has a value in prep_new_page
 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 * code in gfp_to_alloc_flags that should be enforcing this.
 */
gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
}


So __GFP_MEMALLOC in RX is absolutely supported.

But drivers have to opt in, either by using __dev_alloc_pages() or
dev_alloc_pages(), or by explicitly ORing in __GFP_MEMALLOC when using
alloc_page[s]().
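
A minimal sketch of those two opt-in styles, assuming a 4.10-era kernel;
the wrapper function names are made up for illustration only:

/* Style 1: let the net helper supply the flags; __dev_alloc_pages()
 * (shown above) ORs in __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC.
 */
static struct page *rx_alloc_page_via_helper(unsigned int order)
{
	return dev_alloc_pages(order);
}

/* Style 2: call the page allocator directly and OR the flag in
 * explicitly, which is what the mlx4 patch below does in its RX
 * refill path.
 */
static struct page *rx_alloc_page_explicit(unsigned int order)
{
	return alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC, order);
}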





Re: [PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-18 Thread Konstantin Khlebnikov

On 18.01.2017 07:14, Eric Dumazet wrote:

From: Eric Dumazet 

Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears not to have been
needed at the time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

Since using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds the __GFP_MEMALLOC flag.


AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which are
required by the memory reclaimer to free some pages.

Allocating RX buffers with __GFP_MEMALLOC is a straight way to deplete all
reserves under a flood of traffic from the network.



Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet 
Cc: Konstantin Khlebnikov 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -706,7 +706,8 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
do {
if (mlx4_en_prepare_rx_desc(priv, ring,
ring->prod & ring->size_mask,
-   GFP_ATOMIC | __GFP_COLD))
+   GFP_ATOMIC | __GFP_COLD |
+   __GFP_MEMALLOC))
break;
ring->prod++;
} while (--missing);





--
Konstantin


[PATCH net-next] mlx4: support __GFP_MEMALLOC for rx

2017-01-17 Thread Eric Dumazet
From: Eric Dumazet 

Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears not to have been
needed at the time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

Since using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds the __GFP_MEMALLOC flag.

Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet 
Cc: Konstantin Khlebnikov 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -706,7 +706,8 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
do {
if (mlx4_en_prepare_rx_desc(priv, ring,
ring->prod & ring->size_mask,
-   GFP_ATOMIC | __GFP_COLD))
+   GFP_ATOMIC | __GFP_COLD |
+   __GFP_MEMALLOC))
break;
ring->prod++;
} while (--missing);
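

On the "does not reuse pages (yet)" note above: a driver that does recycle
its RX pages would additionally need to keep pfmemalloc pages out of the
recycle path, so reserve memory is handed back to the mm as soon as the
packet is consumed. A purely hypothetical sketch (the cache structure and
names are made up; only page_is_pfmemalloc() is a real helper):

#define RX_PAGE_CACHE_SIZE 128

struct rx_page_cache {
	unsigned int index;
	struct page *pages[RX_PAGE_CACHE_SIZE];
};

static bool rx_page_cache_put(struct rx_page_cache *cache, struct page *page)
{
	/* Never recycle a page taken from memory reserves. */
	if (page_is_pfmemalloc(page))
		return false;
	if (cache->index >= RX_PAGE_CACHE_SIZE)
		return false;
	cache->pages[cache->index++] = page;
	return true;
}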