Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Tue, Apr 19, 2016 at 8:39 PM, Mel Gorman wrote:
> On Tue, Apr 19, 2016 at 06:25:32PM +0200, Jesper Dangaard Brouer wrote:
>> On Mon, 18 Apr 2016 07:17:13 -0700
>> Eric Dumazet wrote:
>> > > alloc_pages_exact()

We want to allocate 32 order-0 physically contiguous pages and to free each
one of them individually. The documentation states that "Memory allocated by
this function must be released by free_pages_exact()". Also, it returns a
pointer to the memory, and we need pointers to pages.

>> > > allocates many physically contiguous pages with order-0! so we assume
>> > > it is ok to use split_page.
>> >
>> > Note: I have no idea of split_page() performance :
>>
>> Maybe Mel knows?
>
> Irrelevant in comparison to the cost of allocating an order-5 page if
> one is not already available.

We still allocate order-5 pages, but now we split them into 32 order-0 pages.
The split adds a few extra CPU cycles, but it is lockless and straightforward,
and it does the job in terms of better memory utilization. Now, in scenarios
where small packets can hold a ref on pages for too long, they would hold a
ref on order-0 pages rather than order-5 pages.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Tue, Apr 19, 2016 at 7:25 PM, Jesper Dangaard Brouer wrote:
> On Mon, 18 Apr 2016 07:17:13 -0700
> Eric Dumazet wrote:
>
>> Another idea would be to have a way to control the max number of order-5
>> pages that a port would be using.
>>
>> Since the driver always owns a ref on an order-5 page, the idea would be
>> to maintain a circular ring of up to XXX such pages, so that we can detect
>> abnormal use and fall back to order-0 immediately.
>
> That is part of my idea with my page-pool proposal. In the page-pool I
> want to have some watermark counter that can block/stop the OOM issue at
> this RX ring level.
>
> See slide 12 of the presentation:
> http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

Cool idea, guys, and we already tested our own version of it. We tried to
recycle our own driver pages, but we saw that the stack took too long to
release them: we had to work with a 2X, and sometimes 4X, page pool per ring
to be able to reuse recycled pages on every RX packet at 50Gb line rate. We
dropped the idea since 2X is too much. But definitely, this is the best way
to go for all drivers; reusing already-DMA-mapped pages and significantly
reducing DMA operations for the driver is a big win! We are still considering
such an option as a future optimization.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Tue, Apr 19, 2016 at 06:25:32PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 18 Apr 2016 07:17:13 -0700
> Eric Dumazet wrote:
>
> > On Mon, 2016-04-18 at 16:05 +0300, Saeed Mahameed wrote:
> > > On Mon, Apr 18, 2016 at 3:48 PM, Eric Dumazet wrote:
> > > > On Sun, 2016-04-17 at 17:29 -0700, Eric Dumazet wrote:
> > > >
> > > >> If you really need to allocate physically contiguous memory, have you
> > > >> considered converting the order-5 pages into 32 order-0 ones ?
> > > >
> > > > Search for split_page() call sites for examples.
> > >
> > > Thanks Eric, we are already evaluating split_page as we speak.
> > >
> > > We did look but could not find any specific alloc_pages API that

alloc_pages_exact()

> > > allocates many physically contiguous pages with order-0! so we assume
> > > it is ok to use split_page.
> >
> > Note: I have no idea of split_page() performance :
>
> Maybe Mel knows?

Irrelevant in comparison to the cost of allocating an order-5 page if one is
not already available.

> And maybe Mel has an opinion about whether this is a good
> or bad approach, e.g. will this approach stress the page allocator in a
> bad way?

It'll contend on the zone lock minimally, but again, that is irrelevant in
comparison to having to reclaim/compact an order-5 page if one is not already
free. It'll appear to work well in benchmarks and then fall apart once the
system has been running for long enough.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Mon, 18 Apr 2016 07:17:13 -0700
Eric Dumazet wrote:

> On Mon, 2016-04-18 at 16:05 +0300, Saeed Mahameed wrote:
> > On Mon, Apr 18, 2016 at 3:48 PM, Eric Dumazet wrote:
> > > On Sun, 2016-04-17 at 17:29 -0700, Eric Dumazet wrote:
> > >
> > >> If you really need to allocate physically contiguous memory, have you
> > >> considered converting the order-5 pages into 32 order-0 ones ?
> > >
> > > Search for split_page() call sites for examples.
> >
> > Thanks Eric, we are already evaluating split_page as we speak.
> >
> > We did look but could not find any specific alloc_pages API that
> > allocates many physically contiguous pages with order-0! so we assume
> > it is ok to use split_page.
>
> Note: I have no idea of split_page() performance :

Maybe Mel knows? And maybe Mel has an opinion about whether this is a good or
bad approach, e.g. will this approach stress the page allocator in a bad way?

> The buddy page allocator has to aggregate pages into order-5; then we would
> undo that work, touching 32 cache lines.
>
> You might first benchmark a simple loop doing:
>
> loop 10,000,000 times
>   order-5 allocation
>   split into 32 order-0
>   free 32 pages
>
> Another idea would be to have a way to control the max number of order-5
> pages that a port would be using.
>
> Since the driver always owns a ref on an order-5 page, the idea would be
> to maintain a circular ring of up to XXX such pages, so that we can detect
> abnormal use and fall back to order-0 immediately.

That is part of my idea with my page-pool proposal. In the page-pool I want
to have some watermark counter that can block/stop the OOM issue at this RX
ring level.

See slide 12 of the presentation:
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Mon, 2016-04-18 at 16:05 +0300, Saeed Mahameed wrote:
> On Mon, Apr 18, 2016 at 3:48 PM, Eric Dumazet wrote:
> > On Sun, 2016-04-17 at 17:29 -0700, Eric Dumazet wrote:
> >
> >> If you really need to allocate physically contiguous memory, have you
> >> considered converting the order-5 pages into 32 order-0 ones ?
> >
> > Search for split_page() call sites for examples.
>
> Thanks Eric, we are already evaluating split_page as we speak.
>
> We did look but could not find any specific alloc_pages API that
> allocates many physically contiguous pages with order-0! so we assume
> it is ok to use split_page.

Note: I have no idea of split_page() performance.

The buddy page allocator has to aggregate pages into order-5; then we would
undo that work, touching 32 cache lines.

You might first benchmark a simple loop doing:

loop 10,000,000 times
  order-5 allocation
  split into 32 order-0
  free 32 pages

Another idea would be to have a way to control the max number of order-5
pages that a port would be using.

Since the driver always owns a ref on an order-5 page, the idea would be to
maintain a circular ring of up to XXX such pages, so that we can detect
abnormal use and fall back to order-0 immediately.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Mon, Apr 18, 2016 at 3:48 PM, Eric Dumazet wrote:
> On Sun, 2016-04-17 at 17:29 -0700, Eric Dumazet wrote:
>
>> If you really need to allocate physically contiguous memory, have you
>> considered converting the order-5 pages into 32 order-0 ones ?
>
> Search for split_page() call sites for examples.

Thanks Eric, we are already evaluating split_page as we speak.

We did look but could not find any specific alloc_pages API that allocates
many physically contiguous order-0 pages, so we assume it is OK to use
split_page.

BTW, our MPWQE solution doesn't totally rely on huge physically contiguous
memory; as you can see in the next two patches, we introduce a fragmented
MPWQE approach as a fallback. But we do understand your concern for the
normal flow.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Sun, 2016-04-17 at 17:29 -0700, Eric Dumazet wrote:
>
> If you really need to allocate physically contiguous memory, have you
> considered converting the order-5 pages into 32 order-0 ones ?

Search for split_page() call sites for examples.
Re: [PATCH net-next V2 05/11] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
On Mon, 2016-04-18 at 00:31 +0300, Saeed Mahameed wrote:
> Performance tested on ConnectX4-Lx 50G.
> To isolate the feature under test, the numbers below were measured with
> HW LRO turned off. We verified that the performance just improves when
> LRO is turned back on.
>
> * Netperf single TCP stream:
> - BW raised by 10-15% for representative packet sizes:
>   default, 64B, 1024B, 1478B, 65536B.
>
> * Netperf multi TCP stream:
> - No degradation, line rate reached.
>
> * Pktgen: packet rate raised by 5-10% for traffic of different message
>   sizes: 64B, 128B, 256B, 1024B, and 1500B.
>
> * Pktgen: packet loss in bursts of small messages (64byte),
>   single stream:
> - | num packets | packet loss before | packet loss after
>   | 2K          | ~ 1K               | 0
>   | 8K          | ~ 6K               | 0
>   | 16K         | ~13K               | 0
>   | 32K         | ~28K               | 0
>   | 64K         | ~57K               | ~24K

As I already mentioned, allocating order-5 pages and hoping the host only
receives friendly traffic is very optimistic.

A 192-byte frame claims to consume a 192-byte frag with your new allocation
strategy (skb->truesize is kind of minimal). In reality, it can prevent a
whole 131072 bytes of memory from being reclaimed/freed. The TCP stack will
not consider such an skb a candidate for collapsing in case of memory
pressure or a hostile peer.

Your tests are obviously run on a freshly booted host, where all physical
memory can be consumed for networking buffers. Even with order-3 pages, we
have problems (at Facebook and Google) on hosts that we do not reboot every
day. By the time order-5 allocations fail, it is already too late: maybe
thousands of out-of-order TCP packets have already consumed all the memory,
and the host will die.

/proc/sys/net/ipv4/tcp_mem by default allows TCP to use up to 10% of physical
memory, assuming skb->truesize is accurate. In your scheme, TCP might never
notice it is using 100% of the RAM for packets stored in out-of-order queues,
since a frag will hold 32 times more pages than really announced.
If you really need to allocate physically contiguous memory, have you
considered converting the order-5 pages into 32 order-0 ones? This way, a
192-byte frame sitting in one socket would hold one order-0 page in the worst
case, and TCP won't be allowed to use all physical memory.