Re: [PATCH net-next v2 0/3] introduce skb_for_each_frag()
On Tue, Apr 13, 2021 at 9:53 AM David Laight wrote: > > From: Matteo Croce > > Sent: 12 April 2021 01:38 > > > > Introduce skb_for_each_frag, an helper macro to iterate over the SKB frags. > > The real question is why, the change is: > > - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { > + skb_for_each_frag(skb, i) { > > The existing code isn't complicated or obscure and 'does what it > says on the tin'. > The 'helper' requires you go and look up its definition to see > what it is really doing. > > Unless you have a cunning plan to change the definition > there is zero point. > > A more interesting change would be something that generated: > unsigned int nr_frags = skb_shinfo(skb)->nr_frags; > for (i = 0; i < nr_frags; i++) { > since that will run faster for most loops. > But that is ~impossible to do since you can't declare > variables inside the (...) that are scoped to the loop. > I don't know how to do it with C90. It would be nice to have a switch to just allow declaration of variables inside the (...) instead of enabling the full C99 language which, as Linus said[1], allows the insane mixing of variables and code. [1] https://lore.kernel.org/lkml/CA+55aFzs=duyibwymufiu_r1ajhar-8hpqhwlew8r5q4ncd...@mail.gmail.com/ -- per aspera ad upstream
[PATCH net-next v2 3/3] net: use skb_for_each_frag() in illegal_highdma()
From: Matteo Croce Coccinelle failed with the following error: EXN: Failure("no position information") in net/core/dev.c Apply it by hand as it's trivial. Signed-off-by: Matteo Croce --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index cc5df273f766..605103582aac 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3505,7 +3505,7 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb) int i; if (!(dev->features & NETIF_F_HIGHDMA)) { - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_for_each_frag(skb, i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; if (PageHighMem(skb_frag_page(frag))) -- 2.30.2
[PATCH net-next v2 2/3] net: use skb_for_each_frag() helper where possible
From: Matteo Croce use the new helper macro skb_for_each_frag() which allows to iterate through all the SKB fragments. The patch was created with Coccinelle, this was the semantic patch: @@ struct sk_buff *skb; identifier i; statement S; iterator name skb_for_each_frag; @@ -for (i = 0; i < skb_shinfo(skb)->nr_frags; \(++i\|i++\)) +skb_for_each_frag(skb, i) S @@ struct skb_shared_info *sinfo; struct sk_buff *skb; identifier i; statement S; iterator name skb_for_each_frag; @@ sinfo = skb_shinfo(skb) ... -for (i = 0; i < sinfo->nr_frags; \(++i\|i++\)) +skb_for_each_frag(skb, i) S Tested with an allmodconfig W=1 build and a test run. Signed-off-by: Matteo Croce --- drivers/atm/he.c | 2 +- drivers/hsi/clients/ssi_protocol.c| 2 +- drivers/infiniband/hw/hfi1/ipoib_tx.c | 2 +- drivers/infiniband/hw/hfi1/vnic_sdma.c| 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 +-- drivers/net/ethernet/3com/3c59x.c | 2 +- drivers/net/ethernet/3com/typhoon.c | 2 +- drivers/net/ethernet/adaptec/starfire.c | 2 +- drivers/net/ethernet/aeroflex/greth.c | 2 +- drivers/net/ethernet/alteon/acenic.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +- .../net/ethernet/apm/xgene/xgene_enet_main.c | 2 +- drivers/net/ethernet/atheros/alx/main.c | 2 +- .../net/ethernet/atheros/atl1e/atl1e_main.c | 2 +- .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +- drivers/net/ethernet/broadcom/tg3.c | 2 +- .../ethernet/cavium/thunder/nicvf_queues.c| 2 +- drivers/net/ethernet/chelsio/cxgb3/sge.c | 2 +- drivers/net/ethernet/emulex/benet/be_main.c | 2 +- .../net/ethernet/freescale/dpaa/dpaa_eth.c| 2 +- drivers/net/ethernet/freescale/gianfar.c | 3 +- drivers/net/ethernet/google/gve/gve_tx.c | 2 +- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 4 +-- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 4 +-- drivers/net/ethernet/huawei/hinic/hinic_rx.c | 2 +- drivers/net/ethernet/huawei/hinic/hinic_tx.c | 4 +-- drivers/net/ethernet/ibm/ibmveth.c| 2 +- drivers/net/ethernet/ibm/ibmvnic.c| 2 +- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- drivers/net/ethernet/marvell/mv643xx_eth.c| 2 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/marvell/skge.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 8 ++--- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +- drivers/net/ethernet/realtek/8139cp.c | 2 +- drivers/net/ethernet/realtek/r8169_main.c | 2 +- drivers/net/ethernet/rocker/rocker_main.c | 2 +- drivers/net/ethernet/sfc/tx.c | 2 +- drivers/net/ethernet/sun/niu.c| 4 +-- drivers/net/ethernet/sun/sungem.c | 2 +- drivers/net/ethernet/sun/sunhme.c | 2 +- drivers/net/ethernet/sun/sunvnet_common.c | 4 +-- .../net/ethernet/synopsys/dwc-xlgmac-desc.c | 2 +- .../net/ethernet/synopsys/dwc-xlgmac-net.c| 2 +- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +- drivers/net/ethernet/ti/netcp_core.c | 2 +- drivers/net/ethernet/via/via-velocity.c | 2 +- drivers/net/usb/usbnet.c | 2 +- drivers/net/vmxnet3/vmxnet3_drv.c | 4 +-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 2 +- drivers/net/wireless/intel/iwlwifi/queue/tx.c | 2 +- drivers/net/xen-netback/netback.c | 2 +- drivers/net/xen-netfront.c| 2 +- drivers/s390/net/qeth_core_main.c | 4 +-- drivers/scsi/fcoe/fcoe_transport.c| 2 +- drivers/staging/octeon/ethernet-tx.c | 2 +- drivers/target/iscsi/cxgbit/cxgbit_target.c | 4 +-- net/appletalk/ddp.c | 2 +- net/core/datagram.c | 4 +-- net/core/skbuff.c | 32 +-- net/ipv4/inet_fragment.c | 2 +- net/ipv4/tcp.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/iucv/af_iucv.c| 4 +-- net/kcm/kcmsock.c | 3 +- net/tls/tls_sw.c | 2 +- 74 files changed, 103 insertions(+), 105 deletions(-) diff --git a/drivers/atm/he.c b/drivers/atm/he.c index 17f44abc9418..2e606e255f7c 100644 --- a/drivers/atm/he.c +++ b/drivers/atm/he.c @@ -25
[PATCH net-next v2 0/3] introduce skb_for_each_frag()
From: Matteo Croce Introduce skb_for_each_frag, an helper macro to iterate over the SKB frags. First patch introduces the helper, the second one is generated with coccinelle and uses the macro where possible. Last one is a chunk which have to be applied by hand. The second patch raises some checkpatch.pl warnings because part of net/tls/tls_sw.c is indented with spaces. Build tested with an allmodconfig and a test run. v1 -> v2: - don't replace code where a local variable holds a cached value for skb_shinfo(skb)->nr_frags Matteo Croce (3): skbuff: add helper to walk over the fraglist net: use skb_for_each_frag() helper where possible net: use skb_for_each_frag() in illegal_highdma() drivers/atm/he.c | 2 +- drivers/hsi/clients/ssi_protocol.c| 2 +- drivers/infiniband/hw/hfi1/ipoib_tx.c | 2 +- drivers/infiniband/hw/hfi1/vnic_sdma.c| 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 +-- drivers/net/ethernet/3com/3c59x.c | 2 +- drivers/net/ethernet/3com/typhoon.c | 2 +- drivers/net/ethernet/adaptec/starfire.c | 2 +- drivers/net/ethernet/aeroflex/greth.c | 2 +- drivers/net/ethernet/alteon/acenic.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +- .../net/ethernet/apm/xgene/xgene_enet_main.c | 2 +- drivers/net/ethernet/atheros/alx/main.c | 2 +- .../net/ethernet/atheros/atl1e/atl1e_main.c | 2 +- .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +- drivers/net/ethernet/broadcom/tg3.c | 2 +- .../ethernet/cavium/thunder/nicvf_queues.c| 2 +- drivers/net/ethernet/chelsio/cxgb3/sge.c | 2 +- drivers/net/ethernet/emulex/benet/be_main.c | 2 +- .../net/ethernet/freescale/dpaa/dpaa_eth.c| 2 +- drivers/net/ethernet/freescale/gianfar.c | 3 +- drivers/net/ethernet/google/gve/gve_tx.c | 2 +- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 4 +-- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 4 +-- drivers/net/ethernet/huawei/hinic/hinic_rx.c | 2 +- drivers/net/ethernet/huawei/hinic/hinic_tx.c | 4 +-- drivers/net/ethernet/ibm/ibmveth.c| 2 +- drivers/net/ethernet/ibm/ibmvnic.c| 2 +- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- drivers/net/ethernet/marvell/mv643xx_eth.c| 2 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/marvell/skge.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 8 ++--- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +- drivers/net/ethernet/realtek/8139cp.c | 2 +- drivers/net/ethernet/realtek/r8169_main.c | 2 +- drivers/net/ethernet/rocker/rocker_main.c | 2 +- drivers/net/ethernet/sfc/tx.c | 2 +- drivers/net/ethernet/sun/niu.c| 4 +-- drivers/net/ethernet/sun/sungem.c | 2 +- drivers/net/ethernet/sun/sunhme.c | 2 +- drivers/net/ethernet/sun/sunvnet_common.c | 4 +-- .../net/ethernet/synopsys/dwc-xlgmac-desc.c | 2 +- .../net/ethernet/synopsys/dwc-xlgmac-net.c| 2 +- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +- drivers/net/ethernet/ti/netcp_core.c | 2 +- drivers/net/ethernet/via/via-velocity.c | 2 +- drivers/net/usb/usbnet.c | 2 +- drivers/net/vmxnet3/vmxnet3_drv.c | 4 +-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 2 +- drivers/net/wireless/intel/iwlwifi/queue/tx.c | 2 +- drivers/net/xen-netback/netback.c | 2 +- drivers/net/xen-netfront.c| 2 +- drivers/s390/net/qeth_core_main.c | 4 +-- drivers/scsi/fcoe/fcoe_transport.c| 2 +- drivers/staging/octeon/ethernet-tx.c | 2 +- drivers/target/iscsi/cxgbit/cxgbit_target.c | 4 +-- include/linux/skbuff.h| 4 +++ net/appletalk/ddp.c | 2 +- net/core/datagram.c | 4 +-- net/core/dev.c| 2 +- net/core/skbuff.c | 32 +-- net/ipv4/inet_fragment.c | 2 +- net/ipv4/tcp.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/iucv/af_iucv.c| 4 +-- net/kcm/kcmsock.c | 3 +- net/tls/tls_sw.c | 2 +- 76 files changed, 108 insertions(+), 106 deletions(-) -- 2.30.2
[PATCH net-next v2 1/3] skbuff: add helper to walk over the fraglist
From: Matteo Croce Add an skb_for_each_frag() macro to iterate on SKB fragments. Signed-off-by: Matteo Croce --- include/linux/skbuff.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index dbf820a50a39..a8d4ccacdda5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1528,6 +1528,10 @@ static inline void skb_mark_not_on_list(struct sk_buff *skb) for ((skb) = (first), (next_skb) = (skb) ? (skb)->next : NULL; (skb); \ (skb) = (next_skb), (next_skb) = (skb) ? (skb)->next : NULL) +/* Iterate through skb fragments. */ +#define skb_for_each_frag(skb, __i) \ + for (__i = 0; __i < skb_shinfo(skb)->nr_frags; __i++) + static inline void skb_list_del_init(struct sk_buff *skb) { __list_del_entry(&skb->list); -- 2.30.2
Re: [PATCH net-next 2/3] net: use skb_for_each_frag() helper where possible
On Fri, Apr 9, 2021 at 11:28 PM Jakub Kicinski wrote: > > On Fri, 9 Apr 2021 22:44:50 +0200 Matteo Croce wrote: > > > What pops to mind (although quite nit picky) is the question if the > > > assembly changes much between driver which used to cache nr_frags and > > > now always going skb_shinfo(skb)->nr_frags? It's a relatively common > > > pattern. > > > > Since skb_shinfo() is a macro and skb_end_pointer() a static inline, > > it should be the same, but I was curious to check so, this is a diff > > between the following snippet before and afer the macro: > > > > int frags = skb_shinfo(skb)->nr_frags; > > int i; > > for (i = 0; i < frags; i++) > > kfree(skb->frags[i]); > > > > 1 file changed, 8 insertions(+), 7 deletions(-) > > > > --- ins1.s 2021-04-09 22:35:59.384523865 +0200 > > +++ ins2.s 2021-04-09 22:36:08.132594737 +0200 > > @@ -1,26 +1,27 @@ > > iter: > > movsx rax, DWORD PTR [rdi+16] > > mov rdx, QWORD PTR [rdi+8] > > mov eax, DWORD PTR [rdx+rax] > > testeax, eax > > jle .L6 > > pushrbp > > -sub eax, 1 > > +mov rbp, rdi > > pushrbx > > -lea rbp, [rdi+32+rax*8] > > -lea rbx, [rdi+24] > > +xor ebx, ebx > > sub rsp, 8 > > .L3: > > -mov rdi, QWORD PTR [rbx] > > -add rbx, 8 > > +mov rdi, QWORD PTR [rbp+24+rbx*8] > > +add rbx, 1 > > callkfree > > -cmp rbx, rbp > > -jne .L3 > > +movsx rax, DWORD PTR [rbp+16] > > +mov rdx, QWORD PTR [rbp+8] > > +cmp DWORD PTR [rdx+rax], ebx > > +jg .L3 > > add rsp, 8 > > xor eax, eax > > pop rbx > > pop rbp > > ret > > .L6: > > xor eax, eax > > for (i = 0; i < frags; i++)ret > > > > So looks like before compiler generated: > > end = &frags[nfrags] > for (ptr = &frag[0]; ptr < end; ptr++) > > and now it has to use the actual value of i, read nfrags in the loop > each time and compare it to i. > > That makes sense, since it can't prove kfree() doesn't change nr_frags. > > IDK if we care, but at least commit message should mention this. Anyway, the chunks using a local nr_frags are too few and not worth it. I think you're right and that's better to use the cached value, I see the instructions here being ligther. Drop the series, I will make a new one which only acts where skb_shinfo(skb) is accessed. Thanks, -- per aspera ad upstream
Re: [PATCH net-next v3 3/5] page_pool: Allow drivers to hint on SKB recycling
On Sat, Apr 10, 2021 at 2:11 AM Ilias Apalodimas wrote: > > Hi Matteo, > > [...] > > +bool page_pool_return_skb_page(void *data); > > + > > struct page_pool *page_pool_create(const struct page_pool_params *params); > > > > #ifdef CONFIG_PAGE_POOL > > @@ -243,4 +247,13 @@ static inline void page_pool_ring_unlock(struct > > page_pool *pool) > > spin_unlock_bh(&pool->ring.producer_lock); > > } > > > > +/* Store mem_info on struct page and use it while recycling skb frags */ > > +static inline > > +void page_pool_store_mem_info(struct page *page, struct xdp_mem_info *mem) > > +{ > > + u32 *xmi = (u32 *)mem; > > + > > I just noticed this changed from the original patchset I was carrying. > On the original, I had a union containing a u32 member to explicitly avoid > this casting. Let's wait for comments on the rest of the series, but i'd like > to change that back in a v4. Aplogies, I completely missed this on the > previous postings ... > Hi, I had to change this because including net/xdp.h here caused a circular dependency. I think that the safest thing we can do is to use memcpy(), which will handle the alignments correctly, depending on the architecture. Cheers, -- per aspera ad upstream
[PATCH net-next v3 5/5] mvneta: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate increased di 10%, from 269 Kpps to 296 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 21.78% [kernel] [k] __pi___inval_dcache_area 21.66% [mvneta] [k] mvneta_rx_swbm 7.00% [kernel] [k] kmem_cache_alloc 6.05% [kernel] [k] eth_type_trans 4.44% [kernel] [k] kmem_cache_free.part.0 3.80% [kernel] [k] __netif_receive_skb_core 3.68% [kernel] [k] dev_gro_receive 3.65% [kernel] [k] get_page_from_freelist 3.43% [kernel] [k] page_pool_release_page 3.35% [kernel] [k] free_unref_page And this is the same output with recycling enabled: Overhead Shared Object Symbol 24.10% [kernel] [k] __pi___inval_dcache_area 23.02% [mvneta] [k] mvneta_rx_swbm 7.19% [kernel] [k] kmem_cache_alloc 6.50% [kernel] [k] eth_type_trans 4.93% [kernel] [k] __netif_receive_skb_core 4.77% [kernel] [k] kmem_cache_free.part.0 3.93% [kernel] [k] dev_gro_receive 3.03% [kernel] [k] build_skb 2.91% [kernel] [k] page_pool_put_page 2.85% [kernel] [k] __xdp_return The test was done with mausezahn on the TX side with 64 byte raw ethernet frames. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvneta.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index f20dfd1d7a6b..f88c189b60a4 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2331,7 +2331,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, if (!skb) return ERR_PTR(-ENOMEM); - page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data)); + skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem); skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); @@ -2343,7 +2343,10 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, skb_frag_page(frag), skb_frag_off(frag), skb_frag_size(frag), PAGE_SIZE); - page_pool_release_page(rxq->page_pool, skb_frag_page(frag)); + /* We don't need to reset pp_recycle here. It's already set, so +* just mark fragments for recycling. +*/ + page_pool_store_mem_info(skb_frag_page(frag), &xdp->rxq->mem); } return skb; -- 2.30.2
[PATCH net-next v3 3/5] page_pool: Allow drivers to hint on SKB recycling
From: Ilias Apalodimas Up to now several high speed NICs have custom mechanisms of recycling the allocated memory they use for their payloads. Our page_pool API already has recycling capabilities that are always used when we are running in 'XDP mode'. So let's tweak the API and the kernel network stack slightly and allow the recycling to happen even during the standard operation. The API doesn't take into account 'split page' policies used by those drivers currently, but can be extended once we have users for that. The idea is to be able to intercept the packet on skb_release_data(). If it's a buffer coming from our page_pool API recycle it back to the pool for further usage or just release the packet entirely. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and store the xdp_mem_info in page->private. Storing the information in page->private allows us to recycle both SKBs and their fragments. The SKB bit is needed for a couple of reasons. First of all in an effort to affect the free path as less as possible, reading a single bit, is better that trying to derive identical information for the page stored data. Moreover page->private is used by skb_copy_ubufs. We do have a special mark in the page, that won't allow this to happen, but again deciding without having to read the entire page is preferable. The driver has to take care of the sync operations on it's own during the buffer recycling since the buffer is, after opting-in to the recycling, never unmapped. Since the gain on the drivers depends on the architecture, we are not enabling recycling by default if the page_pool API is used on a driver. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->private and enabling the recycling bit, or page_pool_store_mem_info() for a fragment Since we added an extra argument on __skb_frag_unref() to handle recycling, update the current users of the function with that. Co-developed-by: Jesper Dangaard Brouer Co-developed-by: Matteo Croce Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Matteo Croce Signed-off-by: Ilias Apalodimas --- .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/skbuff.h| 35 --- include/net/page_pool.h | 13 ++ include/net/xdp.h | 1 + net/core/page_pool.c | 43 +++ net/core/skbuff.c | 20 - net/core/xdp.c| 6 +++ net/tls/tls_device.c | 2 +- 10 files changed, 115 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c index 1115b8f9ea4e..8f815ebb59ae 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c @@ -2125,7 +2125,7 @@ static int chcr_ktls_xmit(struct sk_buff *skb, struct net_device *dev) /* clear the frag ref count which increased locally before */ for (i = 0; i < record->num_frags; i++) { /* clear the frag ref count */ - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); } /* if any failure, come out from the loop. */ if (ret) { diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index 68c154d715d6..9dc25c4fb359 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, if (length == 0) { /* don't need this page */ - __skb_frag_unref(frag); + __skb_frag_unref(frag, false); --skb_shinfo(skb)->nr_frags; } else { size = min(length, (unsigned) PAGE_SIZE); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index e35e4d7ef4d1..cea62b8f554c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, fail: while (nr > 0) { nr--; - __skb_frag_unref(skb_shinfo(skb)->frags + nr); + __skb_frag_unref(skb_shinfo(skb)->frags + nr, false); } return 0; } diff --git a/include/linux/skbuff.h b/include/linux/skb
[PATCH net-next v3 4/5] mvpp2: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate is more than doubled, from 962 Kpps to 2047 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 30.67% [kernel] [k] page_pool_release_page 8.37% [kernel] [k] get_page_from_freelist 7.34% [kernel] [k] free_unref_page 6.47% [mvpp2] [k] mvpp2_rx 4.69% [kernel] [k] eth_type_trans 4.55% [kernel] [k] __netif_receive_skb_core 4.40% [kernel] [k] build_skb 4.29% [kernel] [k] kmem_cache_free 4.00% [kernel] [k] kmem_cache_alloc 3.81% [kernel] [k] dev_gro_receive With packet rate stable at 962 Kpps: tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps And this is the same output with recycling enabled: Overhead Shared Object Symbol 12.75% [mvpp2] [k] mvpp2_rx 9.56% [kernel] [k] __netif_receive_skb_core 9.29% [kernel] [k] build_skb 9.27% [kernel] [k] eth_type_trans 8.39% [kernel] [k] kmem_cache_alloc 7.85% [kernel] [k] kmem_cache_free 7.36% [kernel] [k] page_pool_put_page 6.45% [kernel] [k] dev_gro_receive 4.72% [kernel] [k] __xdp_return 3.06% [kernel] [k] page_pool_refill_alloc_cache With packet rate above 2000 Kpps: tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps The major performance increase is explained by the fact that the most CPU consuming functions (page_pool_release_page, get_page_from_freelist and free_unref_page) are no longer called on a per packet basis. The test was done by sending to the macchiatobin 64 byte ethernet frames with an invalid ethertype, so the packets are dropped early in the RX path. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index ec706d614cac..633b39cfef65 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3847,6 +3847,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, struct mvpp2_pcpu_stats ps = {}; enum dma_data_direction dma_dir; struct bpf_prog *xdp_prog; + struct xdp_rxq_info *rxqi; struct xdp_buff xdp; int rx_received; int rx_done = 0; @@ -3912,15 +3913,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, else frag_size = bm_pool->frag_size; - if (xdp_prog) { - struct xdp_rxq_info *xdp_rxq; + if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) + rxqi = &rxq->xdp_rxq_short; + else + rxqi = &rxq->xdp_rxq_long; - if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) - xdp_rxq = &rxq->xdp_rxq_short; - else - xdp_rxq = &rxq->xdp_rxq_long; + if (xdp_prog) { + xdp.rxq = rxqi; - xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq); + xdp_init_buff(&xdp, PAGE_SIZE, rxqi); xdp_prepare_buff(&xdp, data, MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM, rx_bytes, false); @@ -3964,7 +3965,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, } if (pp) - page_pool_release_page(pp, virt_to_page(data)); + skb_mark_for_recycle(skb, virt_to_page(data), &rxqi->mem); else dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, -- 2.30.2
[PATCH net-next v3 2/5] mm: add a signature in struct page
From: Matteo Croce This is needed by the page_pool to avoid recycling a page not allocated via page_pool. Signed-off-by: Matteo Croce --- include/linux/mm_types.h | 1 + include/net/page_pool.h | 2 ++ net/core/page_pool.c | 4 3 files changed, 7 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6613b26a8894..ef2d0d5f62e4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -101,6 +101,7 @@ struct page { * 32-bit architectures. */ dma_addr_t dma_addr; + unsigned long signature; }; struct {/* slab, slob and slub */ union { diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b5b195305346..b30405e84b5e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -63,6 +63,8 @@ */ #define PP_ALLOC_CACHE_SIZE128 #define PP_ALLOC_CACHE_REFILL 64 +#define PP_SIGNATURE 0x20210303 + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ad8b0707af04..2ae9b554ef98 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, page_pool_dma_sync_for_device(pool, page, pool->p.max_len); skip_dma_map: + page->signature = PP_SIGNATURE; + /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) DMA_ATTR_SKIP_CPU_SYNC); page->dma_addr = 0; skip_dma_unmap: + page->signature = 0; + /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ -- 2.30.2
[PATCH net-next v3 1/5] xdp: reduce size of struct xdp_mem_info
From: Jesper Dangaard Brouer It is possible to compress/reduce the size of struct xdp_mem_info. This change reduce struct xdp_mem_info from 8 bytes to 4 bytes. The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE which can safely fit in u16. The member xdp_mem_info.type could be reduced more than u16, as it stores the enum xdp_mem_type, but due to alignment it is only reduced to u16. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 4 ++-- net/core/xdp.c| 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index a5bc214a49d9..c35864d59113 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -48,8 +48,8 @@ enum xdp_mem_type { #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH struct xdp_mem_info { - u32 type; /* enum xdp_mem_type, but known size type */ - u32 id; + u16 type; /* enum xdp_mem_type, but known size type */ + u16 id; }; struct page_pool; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..3dd47ed83778 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -35,11 +35,11 @@ static struct rhashtable *mem_id_ht; static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) { - const u32 *k = data; - const u32 key = *k; + const u16 *k = data; + const u16 key = *k; BUILD_BUG_ON(sizeof_field(struct xdp_mem_allocator, mem.id) -!= sizeof(u32)); +!= sizeof(u16)); /* Use cyclic increasing ID as direct hash key */ return key; @@ -49,7 +49,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct xdp_mem_allocator *xa = ptr; - u32 mem_id = *(u32 *)arg->key; + u16 mem_id = *(u16 *)arg->key; return xa->mem.id != mem_id; } -- 2.30.2
[PATCH net-next v3 0/5] page_pool: recycle buffers
From: Matteo Croce This is a respin of [1] This patchset shows the plans for allowing page_pool to handle and maintain DMA map/unmap of the pages it serves to the driver. For this to work a return hook in the network core is introduced. The overall purpose is to simplify drivers, by providing a page allocation API that does recycling, such that each driver doesn't have to reinvent its own recycling scheme. Using page_pool in a driver does not require implementing XDP support, but it makes it trivially easy to do so. Instead of allocating buffers specifically for SKBs we now allocate a generic buffer and either wrap it on an SKB (via build_skb) or create an XDP frame. The recycling code leverages the XDP recycle APIs. The Marvell mvpp2 and mvneta drivers are used in this patchset to demonstrate how to use the API, and tested on a MacchiatoBIN and EspressoBIN boards respectively. Please let this going in on a future -rc1 so to allow enough time to have wider tests. [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ v2 -> v3: - added missing SOBs - CCed the MM people v1 -> v2: - fix a commit message - avoid setting pp_recycle multiple times on mvneta - squash two patches to avoid breaking bisect Ilias Apalodimas (1): page_pool: Allow drivers to hint on SKB recycling Jesper Dangaard Brouer (1): xdp: reduce size of struct xdp_mem_info Matteo Croce (3): mm: add a signature in struct page mvpp2: recycle buffers mvneta: recycle buffers .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/mvneta.c | 7 ++- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++ drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/mm_types.h | 1 + include/linux/skbuff.h| 35 -- include/net/page_pool.h | 15 ++ include/net/xdp.h | 5 +- net/core/page_pool.c | 47 +++ net/core/skbuff.c | 20 +++- net/core/xdp.c| 14 -- net/tls/tls_device.c | 2 +- 13 files changed, 142 insertions(+), 27 deletions(-) -- 2.30.2
Re: [PATCH net-next 2/3] net: use skb_for_each_frag() helper where possible
On Fri, Apr 9, 2021 at 8:54 PM Jakub Kicinski wrote: > > On Fri, 9 Apr 2021 20:06:04 +0200 Matteo Croce wrote: > > From: Matteo Croce > > > > use the new helper macro skb_for_each_frag() which allows to iterate > > through all the SKB fragments. > > > > The patch was created with Coccinelle, this was the semantic patch: > > Bunch of set but not used warnings here. Please make sure the code > builds cleanly allmodconfig, W=1 C=1 before posting. > Will do. > What pops to mind (although quite nit picky) is the question if the > assembly changes much between driver which used to cache nr_frags and > now always going skb_shinfo(skb)->nr_frags? It's a relatively common > pattern. Since skb_shinfo() is a macro and skb_end_pointer() a static inline, it should be the same, but I was curious to check so, this is a diff between the following snippet before and afer the macro: int frags = skb_shinfo(skb)->nr_frags; int i; for (i = 0; i < frags; i++) kfree(skb->frags[i]); 1 file changed, 8 insertions(+), 7 deletions(-) --- ins1.s 2021-04-09 22:35:59.384523865 +0200 +++ ins2.s 2021-04-09 22:36:08.132594737 +0200 @@ -1,26 +1,27 @@ iter: movsx rax, DWORD PTR [rdi+16] mov rdx, QWORD PTR [rdi+8] mov eax, DWORD PTR [rdx+rax] testeax, eax jle .L6 pushrbp -sub eax, 1 +mov rbp, rdi pushrbx -lea rbp, [rdi+32+rax*8] -lea rbx, [rdi+24] +xor ebx, ebx sub rsp, 8 .L3: -mov rdi, QWORD PTR [rbx] -add rbx, 8 +mov rdi, QWORD PTR [rbp+24+rbx*8] +add rbx, 1 callkfree -cmp rbx, rbp -jne .L3 +movsx rax, DWORD PTR [rbp+16] +mov rdx, QWORD PTR [rbp+8] +cmp DWORD PTR [rdx+rax], ebx +jg .L3 add rsp, 8 xor eax, eax pop rbx pop rbp ret .L6: xor eax, eax ret -- per aspera ad upstream
[PATCH net-next 3/3] net: use skb_for_each_frag() in illegal_highdma()
From: Matteo Croce Coccinelle failed with the following error: EXN: Failure("no position information") in net/core/dev.c Apply it by hand as it's trivial. Signed-off-by: Matteo Croce --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 33ff4a944109..98deb4852151 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3505,7 +3505,7 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb) int i; if (!(dev->features & NETIF_F_HIGHDMA)) { - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_for_each_frag(skb, i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; if (PageHighMem(skb_frag_page(frag))) -- 2.30.2
[PATCH net-next 2/3] net: use skb_for_each_frag() helper where possible
From: Matteo Croce use the new helper macro skb_for_each_frag() which allows to iterate through all the SKB fragments. The patch was created with Coccinelle, this was the semantic patch: @@ struct sk_buff *skb; identifier i; statement S; iterator name skb_for_each_frag; @@ -for (i = 0; i < skb_shinfo(skb)->nr_frags; \(++i\|i++\)) +skb_for_each_frag(skb, i) S @@ struct skb_shared_info *sinfo; struct sk_buff *skb; identifier i; statement S; iterator name skb_for_each_frag; @@ sinfo = skb_shinfo(skb); ... -for (i = 0; i < sinfo->nr_frags; \(++i\|i++\)) +skb_for_each_frag(skb, i) S @@ struct sk_buff *skb; identifier i, num_frags; statement S; iterator name skb_for_each_frag; @@ num_frags = skb_shinfo(skb)->nr_frags; ... -for (i = 0; i < num_frags; \(++i\|i++\)) +skb_for_each_frag(skb, i) S Tested with an allmodconfig build. Signed-off-by: Matteo Croce --- arch/um/drivers/vector_kern.c | 4 +-- drivers/atm/he.c | 2 +- drivers/hsi/clients/ssi_protocol.c| 2 +- drivers/infiniband/hw/hfi1/ipoib_tx.c | 2 +- drivers/infiniband/hw/hfi1/vnic_sdma.c| 2 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 +-- drivers/net/ethernet/3com/3c59x.c | 2 +- drivers/net/ethernet/3com/typhoon.c | 2 +- drivers/net/ethernet/adaptec/starfire.c | 2 +- drivers/net/ethernet/aeroflex/greth.c | 6 ++-- drivers/net/ethernet/alteon/acenic.c | 2 +- drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +- .../net/ethernet/apm/xgene/xgene_enet_main.c | 2 +- drivers/net/ethernet/atheros/alx/main.c | 2 +- .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +- .../net/ethernet/atheros/atl1e/atl1e_main.c | 4 +-- drivers/net/ethernet/atheros/atlx/atl1.c | 4 +-- drivers/net/ethernet/broadcom/bgmac.c | 2 +- drivers/net/ethernet/broadcom/bnx2.c | 2 +- .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++-- drivers/net/ethernet/broadcom/tg3.c | 2 +- drivers/net/ethernet/cadence/macb_main.c | 4 +-- .../ethernet/cavium/thunder/nicvf_queues.c| 2 +- drivers/net/ethernet/chelsio/cxgb3/sge.c | 4 +-- drivers/net/ethernet/emulex/benet/be_main.c | 2 +- drivers/net/ethernet/faraday/ftgmac100.c | 2 +- .../net/ethernet/freescale/dpaa/dpaa_eth.c| 2 +- drivers/net/ethernet/freescale/gianfar.c | 5 ++-- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 4 +-- drivers/net/ethernet/hisilicon/hns/hns_enet.c | 2 +- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 4 +-- drivers/net/ethernet/huawei/hinic/hinic_rx.c | 2 +- drivers/net/ethernet/huawei/hinic/hinic_tx.c | 4 +-- drivers/net/ethernet/ibm/ibmveth.c| 2 +- drivers/net/ethernet/ibm/ibmvnic.c| 2 +- drivers/net/ethernet/intel/e1000/e1000_main.c | 2 +- drivers/net/ethernet/intel/e1000e/netdev.c| 2 +- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- drivers/net/ethernet/marvell/mv643xx_eth.c| 2 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/marvell/skge.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 10 +++ drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 +-- .../net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +- drivers/net/ethernet/microchip/lan743x_main.c | 2 +- drivers/net/ethernet/neterion/s2io.c | 2 +- .../net/ethernet/neterion/vxge/vxge-main.c| 6 ++-- drivers/net/ethernet/ni/nixge.c | 2 +- drivers/net/ethernet/pasemi/pasemi_mac.c | 2 +- .../ethernet/qlogic/netxen/netxen_nic_main.c | 2 +- drivers/net/ethernet/qlogic/qed/qed_ll2.c | 2 +- .../net/ethernet/qlogic/qlcnic/qlcnic_io.c| 2 +- drivers/net/ethernet/realtek/8139cp.c | 2 +- drivers/net/ethernet/rocker/rocker_main.c | 2 +- drivers/net/ethernet/sfc/tx.c | 2 +- drivers/net/ethernet/sun/cassini.c| 2 +- drivers/net/ethernet/sun/niu.c| 4 +-- drivers/net/ethernet/sun/sungem.c | 2 +- drivers/net/ethernet/sun/sunhme.c | 2 +- drivers/net/ethernet/sun/sunvnet_common.c | 4 +-- .../net/ethernet/synopsys/dwc-xlgmac-desc.c | 2 +- .../net/ethernet/synopsys/dwc-xlgmac-net.c| 2 +- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +- drivers/net/ethernet/ti/netcp_core.c | 2 +- driver
[PATCH net-next 1/3] skbuff: add helper to walk over the fraglist
From: Matteo Croce Add an skb_for_each_frag() macro to iterate on SKB fragments. Signed-off-by: Matteo Croce --- include/linux/skbuff.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index dbf820a50a39..a8d4ccacdda5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1528,6 +1528,10 @@ static inline void skb_mark_not_on_list(struct sk_buff *skb) for ((skb) = (first), (next_skb) = (skb) ? (skb)->next : NULL; (skb); \ (skb) = (next_skb), (next_skb) = (skb) ? (skb)->next : NULL) +/* Iterate through skb fragments. */ +#define skb_for_each_frag(skb, __i) \ + for (__i = 0; __i < skb_shinfo(skb)->nr_frags; __i++) + static inline void skb_list_del_init(struct sk_buff *skb) { __list_del_entry(&skb->list); -- 2.30.2
[PATCH net-next 0/3] introduce skb_for_each_frag()
From: Matteo Croce Introduce skb_for_each_frag, an helper macro to iterate over the SKB frags. First patch introduces the helper, the second one is generated with coccinelle and uses the macro where possible. Last one is a chunk which have to be applied by hand. The second patch raises some checkpatch.pl warnings because part of net/tls/tls_sw.c is indented with spaces. Build tested with an allmodconfig and a test run. Matteo Croce (3): skbuff: add helper to walk over the fraglist net: use skb_for_each_frag() helper where possible net: use skb_for_each_frag() in illegal_highdma() arch/um/drivers/vector_kern.c | 4 +-- drivers/atm/he.c | 2 +- drivers/hsi/clients/ssi_protocol.c| 2 +- drivers/infiniband/hw/hfi1/ipoib_tx.c | 2 +- drivers/infiniband/hw/hfi1/vnic_sdma.c| 2 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 +-- drivers/net/ethernet/3com/3c59x.c | 2 +- drivers/net/ethernet/3com/typhoon.c | 2 +- drivers/net/ethernet/adaptec/starfire.c | 2 +- drivers/net/ethernet/aeroflex/greth.c | 6 ++-- drivers/net/ethernet/alteon/acenic.c | 2 +- drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +- drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 2 +- .../net/ethernet/apm/xgene/xgene_enet_main.c | 2 +- drivers/net/ethernet/atheros/alx/main.c | 2 +- .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +- .../net/ethernet/atheros/atl1e/atl1e_main.c | 4 +-- drivers/net/ethernet/atheros/atlx/atl1.c | 4 +-- drivers/net/ethernet/broadcom/bgmac.c | 2 +- drivers/net/ethernet/broadcom/bnx2.c | 2 +- .../net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++-- drivers/net/ethernet/broadcom/tg3.c | 2 +- drivers/net/ethernet/cadence/macb_main.c | 4 +-- .../ethernet/cavium/thunder/nicvf_queues.c| 2 +- drivers/net/ethernet/chelsio/cxgb3/sge.c | 4 +-- drivers/net/ethernet/emulex/benet/be_main.c | 2 +- drivers/net/ethernet/faraday/ftgmac100.c | 2 +- .../net/ethernet/freescale/dpaa/dpaa_eth.c| 2 +- drivers/net/ethernet/freescale/gianfar.c | 5 ++-- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 4 +-- drivers/net/ethernet/hisilicon/hns/hns_enet.c | 2 +- .../net/ethernet/hisilicon/hns3/hns3_enet.c | 4 +-- drivers/net/ethernet/huawei/hinic/hinic_rx.c | 2 +- drivers/net/ethernet/huawei/hinic/hinic_tx.c | 4 +-- drivers/net/ethernet/ibm/ibmveth.c| 2 +- drivers/net/ethernet/ibm/ibmvnic.c| 2 +- drivers/net/ethernet/intel/e1000/e1000_main.c | 2 +- drivers/net/ethernet/intel/e1000e/netdev.c| 2 +- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- drivers/net/ethernet/marvell/mv643xx_eth.c| 2 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/marvell/skge.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 10 +++ drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 +-- .../net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +- drivers/net/ethernet/microchip/lan743x_main.c | 2 +- drivers/net/ethernet/neterion/s2io.c | 2 +- .../net/ethernet/neterion/vxge/vxge-main.c| 6 ++-- drivers/net/ethernet/ni/nixge.c | 2 +- drivers/net/ethernet/pasemi/pasemi_mac.c | 2 +- .../ethernet/qlogic/netxen/netxen_nic_main.c | 2 +- drivers/net/ethernet/qlogic/qed/qed_ll2.c | 2 +- .../net/ethernet/qlogic/qlcnic/qlcnic_io.c| 2 +- drivers/net/ethernet/realtek/8139cp.c | 2 +- drivers/net/ethernet/rocker/rocker_main.c | 2 +- drivers/net/ethernet/sfc/tx.c | 2 +- drivers/net/ethernet/sun/cassini.c| 2 +- drivers/net/ethernet/sun/niu.c| 4 +-- drivers/net/ethernet/sun/sungem.c | 2 +- drivers/net/ethernet/sun/sunhme.c | 2 +- drivers/net/ethernet/sun/sunvnet_common.c | 4 +-- .../net/ethernet/synopsys/dwc-xlgmac-desc.c | 2 +- .../net/ethernet/synopsys/dwc-xlgmac-net.c| 2 +- drivers/net/ethernet/ti/am65-cpsw-nuss.c | 2 +- drivers/net/ethernet/ti/netcp_core.c | 2 +- drivers/net/ethernet/via/via-velocity.c | 2 +- drivers/net/ethernet/xilinx/ll_temac_main.c | 2 +- .../net/ethernet/xilinx/xilinx_axienet_main.c | 2 +- drivers/net/usb/usbnet.c | 2 +- drivers/net/vmxnet3/vmxnet3_drv.c | 4 +-- drivers/net/wireless/intel
[PATCH net-next v2 5/5] mvneta: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate increased di 10%, from 269 Kpps to 296 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 21.78% [kernel] [k] __pi___inval_dcache_area 21.66% [mvneta] [k] mvneta_rx_swbm 7.00% [kernel] [k] kmem_cache_alloc 6.05% [kernel] [k] eth_type_trans 4.44% [kernel] [k] kmem_cache_free.part.0 3.80% [kernel] [k] __netif_receive_skb_core 3.68% [kernel] [k] dev_gro_receive 3.65% [kernel] [k] get_page_from_freelist 3.43% [kernel] [k] page_pool_release_page 3.35% [kernel] [k] free_unref_page And this is the same output with recycling enabled: Overhead Shared Object Symbol 24.10% [kernel] [k] __pi___inval_dcache_area 23.02% [mvneta] [k] mvneta_rx_swbm 7.19% [kernel] [k] kmem_cache_alloc 6.50% [kernel] [k] eth_type_trans 4.93% [kernel] [k] __netif_receive_skb_core 4.77% [kernel] [k] kmem_cache_free.part.0 3.93% [kernel] [k] dev_gro_receive 3.03% [kernel] [k] build_skb 2.91% [kernel] [k] page_pool_put_page 2.85% [kernel] [k] __xdp_return The test was done with mausezahn on the TX side with 64 byte raw ethernet frames. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvneta.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index f20dfd1d7a6b..f88c189b60a4 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2331,7 +2331,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, if (!skb) return ERR_PTR(-ENOMEM); - page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data)); + skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem); skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); @@ -2343,7 +2343,10 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, skb_frag_page(frag), skb_frag_off(frag), skb_frag_size(frag), PAGE_SIZE); - page_pool_release_page(rxq->page_pool, skb_frag_page(frag)); + /* We don't need to reset pp_recycle here. It's already set, so +* just mark fragments for recycling. +*/ + page_pool_store_mem_info(skb_frag_page(frag), &xdp->rxq->mem); } return skb; -- 2.30.2
[PATCH net-next v2 4/5] mvpp2: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate is more than doubled, from 962 Kpps to 2047 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 30.67% [kernel] [k] page_pool_release_page 8.37% [kernel] [k] get_page_from_freelist 7.34% [kernel] [k] free_unref_page 6.47% [mvpp2] [k] mvpp2_rx 4.69% [kernel] [k] eth_type_trans 4.55% [kernel] [k] __netif_receive_skb_core 4.40% [kernel] [k] build_skb 4.29% [kernel] [k] kmem_cache_free 4.00% [kernel] [k] kmem_cache_alloc 3.81% [kernel] [k] dev_gro_receive With packet rate stable at 962 Kpps: tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps And this is the same output with recycling enabled: Overhead Shared Object Symbol 12.75% [mvpp2] [k] mvpp2_rx 9.56% [kernel] [k] __netif_receive_skb_core 9.29% [kernel] [k] build_skb 9.27% [kernel] [k] eth_type_trans 8.39% [kernel] [k] kmem_cache_alloc 7.85% [kernel] [k] kmem_cache_free 7.36% [kernel] [k] page_pool_put_page 6.45% [kernel] [k] dev_gro_receive 4.72% [kernel] [k] __xdp_return 3.06% [kernel] [k] page_pool_refill_alloc_cache With packet rate above 2000 Kpps: tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps The major performance increase is explained by the fact that the most CPU consuming functions (page_pool_release_page, get_page_from_freelist and free_unref_page) are no longer called on a per packet basis. The test was done by sending to the macchiatobin 64 byte ethernet frames with an invalid ethertype, so the packets are dropped early in the RX path. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index ec706d614cac..633b39cfef65 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3847,6 +3847,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, struct mvpp2_pcpu_stats ps = {}; enum dma_data_direction dma_dir; struct bpf_prog *xdp_prog; + struct xdp_rxq_info *rxqi; struct xdp_buff xdp; int rx_received; int rx_done = 0; @@ -3912,15 +3913,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, else frag_size = bm_pool->frag_size; - if (xdp_prog) { - struct xdp_rxq_info *xdp_rxq; + if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) + rxqi = &rxq->xdp_rxq_short; + else + rxqi = &rxq->xdp_rxq_long; - if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) - xdp_rxq = &rxq->xdp_rxq_short; - else - xdp_rxq = &rxq->xdp_rxq_long; + if (xdp_prog) { + xdp.rxq = rxqi; - xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq); + xdp_init_buff(&xdp, PAGE_SIZE, rxqi); xdp_prepare_buff(&xdp, data, MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM, rx_bytes, false); @@ -3964,7 +3965,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, } if (pp) - page_pool_release_page(pp, virt_to_page(data)); + skb_mark_for_recycle(skb, virt_to_page(data), &rxqi->mem); else dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, -- 2.30.2
[PATCH net-next v2 3/5] page_pool: Allow drivers to hint on SKB recycling
From: Ilias Apalodimas Up to now several high speed NICs have custom mechanisms of recycling the allocated memory they use for their payloads. Our page_pool API already has recycling capabilities that are always used when we are running in 'XDP mode'. So let's tweak the API and the kernel network stack slightly and allow the recycling to happen even during the standard operation. The API doesn't take into account 'split page' policies used by those drivers currently, but can be extended once we have users for that. The idea is to be able to intercept the packet on skb_release_data(). If it's a buffer coming from our page_pool API recycle it back to the pool for further usage or just release the packet entirely. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and store the xdp_mem_info in page->private. Storing the information in page->private allows us to recycle both SKBs and their fragments. The SKB bit is needed for a couple of reasons. First of all in an effort to affect the free path as less as possible, reading a single bit, is better that trying to derive identical information for the page stored data. Moreover page->private is used by skb_copy_ubufs. We do have a special mark in the page, that won't allow this to happen, but again deciding without having to read the entire page is preferable. The driver has to take care of the sync operations on it's own during the buffer recycling since the buffer is, after opting-in to the recycling, never unmapped. Since the gain on the drivers depends on the architecture, we are not enabling recycling by default if the page_pool API is used on a driver. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->private and enabling the recycling bit, or page_pool_store_mem_info() for a fragment Since we added an extra argument on __skb_frag_unref() to handle recycling, update the current users of the function with that. Co-developed-by: Jesper Dangaard Brouer Co-developed-by: Matteo Croce Signed-off-by: Ilias Apalodimas --- .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/skbuff.h| 35 --- include/net/page_pool.h | 13 ++ include/net/xdp.h | 1 + net/core/page_pool.c | 43 +++ net/core/skbuff.c | 20 - net/core/xdp.c| 6 +++ net/tls/tls_device.c | 2 +- 10 files changed, 115 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c index 1115b8f9ea4e..8f815ebb59ae 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c @@ -2125,7 +2125,7 @@ static int chcr_ktls_xmit(struct sk_buff *skb, struct net_device *dev) /* clear the frag ref count which increased locally before */ for (i = 0; i < record->num_frags; i++) { /* clear the frag ref count */ - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); } /* if any failure, come out from the loop. */ if (ret) { diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index 68c154d715d6..9dc25c4fb359 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, if (length == 0) { /* don't need this page */ - __skb_frag_unref(frag); + __skb_frag_unref(frag, false); --skb_shinfo(skb)->nr_frags; } else { size = min(length, (unsigned) PAGE_SIZE); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index e35e4d7ef4d1..cea62b8f554c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, fail: while (nr > 0) { nr--; - __skb_frag_unref(skb_shinfo(skb)->frags + nr); + __skb_frag_unref(skb_shinfo(skb)->frags + nr, false); } return 0; } diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c8def85fcc22..fffa6b35 100644 --- a/include/lin
[PATCH net-next v2 2/5] mm: add a signature in struct page
From: Matteo Croce This is needed by the page_pool to avoid recycling a page not allocated via page_pool. Signed-off-by: Matteo Croce --- include/linux/mm_types.h | 1 + include/net/page_pool.h | 2 ++ net/core/page_pool.c | 4 3 files changed, 7 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6613b26a8894..ef2d0d5f62e4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -101,6 +101,7 @@ struct page { * 32-bit architectures. */ dma_addr_t dma_addr; + unsigned long signature; }; struct {/* slab, slob and slub */ union { diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b5b195305346..b30405e84b5e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -63,6 +63,8 @@ */ #define PP_ALLOC_CACHE_SIZE128 #define PP_ALLOC_CACHE_REFILL 64 +#define PP_SIGNATURE 0x20210303 + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ad8b0707af04..2ae9b554ef98 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, page_pool_dma_sync_for_device(pool, page, pool->p.max_len); skip_dma_map: + page->signature = PP_SIGNATURE; + /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) DMA_ATTR_SKIP_CPU_SYNC); page->dma_addr = 0; skip_dma_unmap: + page->signature = 0; + /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ -- 2.30.2
[PATCH net-next v2 1/5] xdp: reduce size of struct xdp_mem_info
From: Jesper Dangaard Brouer It is possible to compress/reduce the size of struct xdp_mem_info. This change reduce struct xdp_mem_info from 8 bytes to 4 bytes. The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE which can safely fit in u16. The member xdp_mem_info.type could be reduced more than u16, as it stores the enum xdp_mem_type, but due to alignment it is only reduced to u16. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 4 ++-- net/core/xdp.c| 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index a5bc214a49d9..c35864d59113 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -48,8 +48,8 @@ enum xdp_mem_type { #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH struct xdp_mem_info { - u32 type; /* enum xdp_mem_type, but known size type */ - u32 id; + u16 type; /* enum xdp_mem_type, but known size type */ + u16 id; }; struct page_pool; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..3dd47ed83778 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -35,11 +35,11 @@ static struct rhashtable *mem_id_ht; static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) { - const u32 *k = data; - const u32 key = *k; + const u16 *k = data; + const u16 key = *k; BUILD_BUG_ON(sizeof_field(struct xdp_mem_allocator, mem.id) -!= sizeof(u32)); +!= sizeof(u16)); /* Use cyclic increasing ID as direct hash key */ return key; @@ -49,7 +49,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct xdp_mem_allocator *xa = ptr; - u32 mem_id = *(u32 *)arg->key; + u16 mem_id = *(u16 *)arg->key; return xa->mem.id != mem_id; } -- 2.30.2
[PATCH net-next v2 0/5] page_pool: recycle buffers
From: Matteo Croce This is a respin of [1] This patchset shows the plans for allowing page_pool to handle and maintain DMA map/unmap of the pages it serves to the driver. For this to work a return hook in the network core is introduced. The overall purpose is to simplify drivers, by providing a page allocation API that does recycling, such that each driver doesn't have to reinvent its own recycling scheme. Using page_pool in a driver does not require implementing XDP support, but it makes it trivially easy to do so. Instead of allocating buffers specifically for SKBs we now allocate a generic buffer and either wrap it on an SKB (via build_skb) or create an XDP frame. The recycling code leverages the XDP recycle APIs. The Marvell mvpp2 and mvneta drivers are used in this patchset to demonstrate how to use the API, and tested on a MacchiatoBIN and EspressoBIN boards respectively. [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ v1 -> v2: - fix a commit message - avoid setting pp_recycle multiple times on mvneta - squash two patches to avoid breaking bisect Ilias Apalodimas (1): page_pool: Allow drivers to hint on SKB recycling Jesper Dangaard Brouer (1): xdp: reduce size of struct xdp_mem_info Matteo Croce (3): mm: add a signature in struct page mvpp2: recycle buffers mvneta: recycle buffers .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/mvneta.c | 7 ++- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++ drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/mm_types.h | 1 + include/linux/skbuff.h| 35 -- include/net/page_pool.h | 15 ++ include/net/xdp.h | 5 +- net/core/page_pool.c | 47 +++ net/core/skbuff.c | 20 +++- net/core/xdp.c| 14 -- net/tls/tls_device.c | 2 +- 13 files changed, 142 insertions(+), 27 deletions(-) -- 2.30.2
Re: [PATCH net-next 0/6] page_pool: recycle buffers
On Tue, Mar 23, 2021 at 5:10 PM Ilias Apalodimas wrote: > > On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote: > > On Tue, 23 Mar 2021 17:47:46 +0200 > > Ilias Apalodimas wrote: > > > > > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote: > > > > From: Matteo Croce > > > > Date: Mon, 22 Mar 2021 18:02:55 +0100 > > > > > > > > > From: Matteo Croce > > > > > > > > > > This series enables recycling of the buffers allocated with the > > > > > page_pool API. > > > > > The first two patches are just prerequisite to save space in a struct > > > > > and > > > > > avoid recycling pages allocated with other API. > > > > > Patch 2 was based on a previous idea from Jonathan Lemon. > > > > > > > > > > The third one is the real recycling, 4 fixes the compilation of > > > > > __skb_frag_unref > > > > > users, and 5,6 enable the recycling on two drivers. > > > > > > > > > > In the last two patches I reported the improvement I have with the > > > > > series. > > > > > > > > > > The recycling as is can't be used with drivers like mlx5 which do > > > > > page split, > > > > > but this is documented in a comment. > > > > > In the future, a refcount can be used so to support mlx5 with no > > > > > changes. > > > > > > > > > > Ilias Apalodimas (2): > > > > > page_pool: DMA handling and allow to recycles frames via SKB > > > > > net: change users of __skb_frag_unref() and add an extra argument > > > > > > > > > > Jesper Dangaard Brouer (1): > > > > > xdp: reduce size of struct xdp_mem_info > > > > > > > > > > Matteo Croce (3): > > > > > mm: add a signature in struct page > > > > > mvpp2: recycle buffers > > > > > mvneta: recycle buffers > > > > > > > > > > .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- > > > > > drivers/net/ethernet/marvell/mvneta.c | 4 +- > > > > > .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++ > > > > > drivers/net/ethernet/marvell/sky2.c | 2 +- > > > > > drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- > > > > > include/linux/mm_types.h | 1 + > > > > > include/linux/skbuff.h| 33 +++-- > > > > > include/net/page_pool.h | 15 ++ > > > > > include/net/xdp.h | 5 +- > > > > > net/core/page_pool.c | 47 > > > > > +++ > > > > > net/core/skbuff.c | 20 +++- > > > > > net/core/xdp.c| 14 -- > > > > > net/tls/tls_device.c | 2 +- > > > > > 13 files changed, 138 insertions(+), 26 deletions(-) > > > > > > > > Just for the reference, I've performed some tests on 1G SoC NIC with > > > > this patchset on, here's direct link: [0] > > > > > > > > > > Thanks for the testing! > > > Any chance you can get a perf measurement on this? > > > > I guess you mean perf-report (--stdio) output, right? > > > > Yea, > As hinted below, I am just trying to figure out if on Alexander's platform the > cost of syncing, is bigger that free-allocate. I remember one armv7 were that > was the case. > > > > Is DMA syncing taking a substantial amount of your cpu usage? > > > > (+1 this is an important question) > > > > > > > > > > [0] > > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me > > > > > > That would be the same as for mvneta: Overhead Shared Object Symbol 24.10% [kernel] [k] __pi___inval_dcache_area 23.02% [mvneta] [k] mvneta_rx_swbm 7.19% [kernel] [k] kmem_cache_alloc Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2, and I get lower packet rate than recycling alone. I don't know why, we should investigate it. Regards, -- per aspera ad upstream
Re: [PATCH net-next 3/6] page_pool: DMA handling and allow to recycles frames via SKB
On Mon, Mar 22, 2021 at 6:03 PM Matteo Croce wrote: > > From: Ilias Apalodimas > > During skb_release_data() intercept the packet and if it's a buffer > coming from our page_pool API recycle it back to the pool for further > usage. > To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and > store the xdp_mem_info in page->private. The SKB bit is needed since > page->private is used by skb_copy_ubufs, so we can't rely solely on > page->private to trigger recycling. > > The driver has to take care of the sync operations on it's own > during the buffer recycling since the buffer is never unmapped. > > In order to enable recycling the driver must call skb_mark_for_recycle() > to store the information we need for recycling in page->private and > enabling the recycling bit > > Storing the information in page->private allows us to recycle both SKBs > and their fragments > > Signed-off-by: Ilias Apalodimas > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Matteo Croce > --- Hi, the patch title really should be: page_pool: DMA handling and frame recycling via SKBs As in the previous RFC. Sorry, -- per aspera ad upstream
[PATCH net-next 6/6] mvneta: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate increased di 10%, from 269 Kpps to 296 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 21.78% [kernel] [k] __pi___inval_dcache_area 21.66% [mvneta] [k] mvneta_rx_swbm 7.00% [kernel] [k] kmem_cache_alloc 6.05% [kernel] [k] eth_type_trans 4.44% [kernel] [k] kmem_cache_free.part.0 3.80% [kernel] [k] __netif_receive_skb_core 3.68% [kernel] [k] dev_gro_receive 3.65% [kernel] [k] get_page_from_freelist 3.43% [kernel] [k] page_pool_release_page 3.35% [kernel] [k] free_unref_page And this is the same output with recycling enabled: Overhead Shared Object Symbol 24.10% [kernel] [k] __pi___inval_dcache_area 23.02% [mvneta] [k] mvneta_rx_swbm 7.19% [kernel] [k] kmem_cache_alloc 6.50% [kernel] [k] eth_type_trans 4.93% [kernel] [k] __netif_receive_skb_core 4.77% [kernel] [k] kmem_cache_free.part.0 3.93% [kernel] [k] dev_gro_receive 3.03% [kernel] [k] build_skb 2.91% [kernel] [k] page_pool_put_page 2.85% [kernel] [k] __xdp_return The test was done with mausezahn on the TX side with 64 byte raw ethernet frames. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvneta.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index a635cf84608a..8b3250394703 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2332,7 +2332,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, if (!skb) return ERR_PTR(-ENOMEM); - page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data)); + skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem); skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); @@ -2344,7 +2344,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, skb_frag_page(frag), skb_frag_off(frag), skb_frag_size(frag), PAGE_SIZE); - page_pool_release_page(rxq->page_pool, skb_frag_page(frag)); + skb_mark_for_recycle(skb, skb_frag_page(frag), &xdp->rxq->mem); } return skb; -- 2.30.2
[PATCH net-next 5/6] mvpp2: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate is more than doubled, from 962 Kpps to 2047 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 30.67% [kernel] [k] page_pool_release_page 8.37% [kernel] [k] get_page_from_freelist 7.34% [kernel] [k] free_unref_page 6.47% [mvpp2] [k] mvpp2_rx 4.69% [kernel] [k] eth_type_trans 4.55% [kernel] [k] __netif_receive_skb_core 4.40% [kernel] [k] build_skb 4.29% [kernel] [k] kmem_cache_free 4.00% [kernel] [k] kmem_cache_alloc 3.81% [kernel] [k] dev_gro_receive With packet rate stable at 962 Kpps: tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps And this is the same output with recycling enabled: Overhead Shared Object Symbol 12.75% [mvpp2] [k] mvpp2_rx 9.56% [kernel] [k] __netif_receive_skb_core 9.29% [kernel] [k] build_skb 9.27% [kernel] [k] eth_type_trans 8.39% [kernel] [k] kmem_cache_alloc 7.85% [kernel] [k] kmem_cache_free 7.36% [kernel] [k] page_pool_put_page 6.45% [kernel] [k] dev_gro_receive 4.72% [kernel] [k] __xdp_return 3.06% [kernel] [k] page_pool_refill_alloc_cache With packet rate above 2000 Kpps: tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps The major performance increase is explained by the fact that the most CPU consuming functions (page_pool_release_page, get_page_from_freelist and free_unref_page) are no longer called on a per packet basis. The test was done by sending to the macchiatobin 64 byte ethernet frames with an invalid ethertype, so the packets are dropped early in the RX path. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 1767c60056c5..8f03bbc763bc 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3848,6 +3848,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, struct mvpp2_pcpu_stats ps = {}; enum dma_data_direction dma_dir; struct bpf_prog *xdp_prog; + struct xdp_rxq_info *rxqi; struct xdp_buff xdp; int rx_received; int rx_done = 0; @@ -3913,15 +3914,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, else frag_size = bm_pool->frag_size; - if (xdp_prog) { - struct xdp_rxq_info *xdp_rxq; + if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) + rxqi = &rxq->xdp_rxq_short; + else + rxqi = &rxq->xdp_rxq_long; - if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) - xdp_rxq = &rxq->xdp_rxq_short; - else - xdp_rxq = &rxq->xdp_rxq_long; + if (xdp_prog) { + xdp.rxq = rxqi; - xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq); + xdp_init_buff(&xdp, PAGE_SIZE, rxqi); xdp_prepare_buff(&xdp, data, MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM, rx_bytes, false); @@ -3965,7 +3966,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, } if (pp) - page_pool_release_page(pp, virt_to_page(data)); + skb_mark_for_recycle(skb, virt_to_page(data), &rxqi->mem); else dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, -- 2.30.2
[PATCH net-next 3/6] page_pool: DMA handling and allow to recycles frames via SKB
From: Ilias Apalodimas During skb_release_data() intercept the packet and if it's a buffer coming from our page_pool API recycle it back to the pool for further usage. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and store the xdp_mem_info in page->private. The SKB bit is needed since page->private is used by skb_copy_ubufs, so we can't rely solely on page->private to trigger recycling. The driver has to take care of the sync operations on it's own during the buffer recycling since the buffer is never unmapped. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->private and enabling the recycling bit Storing the information in page->private allows us to recycle both SKBs and their fragments Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Matteo Croce --- include/linux/skbuff.h | 33 +++ include/net/page_pool.h | 13 + include/net/xdp.h | 1 + net/core/page_pool.c| 43 + net/core/skbuff.c | 20 +-- net/core/xdp.c | 6 ++ 6 files changed, 110 insertions(+), 6 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index ecc029674ae4..3e09a070136f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -40,6 +40,9 @@ #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include #endif +#if IS_BUILTIN(CONFIG_PAGE_POOL) +#include +#endif /* The interface for checksum offload between the stack and networking drivers * is as follows... @@ -247,6 +250,7 @@ struct napi_struct; struct bpf_prog; union bpf_attr; struct skb_ext; +struct xdp_mem_info; #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) struct nf_bridge_info { @@ -666,6 +670,8 @@ typedef unsigned char *sk_buff_data_t; * @head_frag: skb was allocated from page fragments, * not allocated by kmalloc() or vmalloc(). * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves + * @pp_recycle: mark the packet for recycling instead of freeing (implies + * page_pool support on driver) * @active_extensions: active extensions (skb_ext_id types) * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed @@ -790,10 +796,12 @@ struct sk_buff { fclone:2, peeked:1, head_frag:1, - pfmemalloc:1; + pfmemalloc:1, + pp_recycle:1; /* page_pool recycle indicator */ #ifdef CONFIG_SKB_EXTENSIONS __u8active_extensions; #endif + /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() */ @@ -3080,12 +3088,20 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) /** * __skb_frag_unref - release a reference on a paged fragment. * @frag: the paged fragment + * @recycle: recycle the page if allocated via page_pool * * Releases a reference on the paged fragment @frag. + * or recycles the page via the page_pool API */ -static inline void __skb_frag_unref(skb_frag_t *frag) +static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle) { - put_page(skb_frag_page(frag)); + struct page *page = skb_frag_page(frag); + +#if IS_BUILTIN(CONFIG_PAGE_POOL) + if (recycle && page_pool_return_skb_page(page_address(page))) + return; +#endif + put_page(page); } /** @@ -3097,7 +3113,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag) */ static inline void skb_frag_unref(struct sk_buff *skb, int f) { - __skb_frag_unref(&skb_shinfo(skb)->frags[f]); + __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle); } /** @@ -4695,5 +4711,14 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb) #endif } +#if IS_BUILTIN(CONFIG_PAGE_POOL) +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page, + struct xdp_mem_info *mem) +{ + skb->pp_recycle = 1; + page_pool_store_mem_info(page, mem); +} +#endif + #endif /* __KERNEL__ */ #endif /* _LINUX_SKBUFF_H */ diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b30405e84b5e..75fffc15788b 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -65,6 +65,8 @@ #define PP_ALLOC_CACHE_REFILL 64 #define PP_SIGNATURE 0x20210303 +struct xdp_mem_info; + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; @@ -148,6 +150,8 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) return pool->p.dma_dir; } +bool page_pool_return_skb_
[PATCH net-next 4/6] net: change users of __skb_frag_unref() and add an extra argument
From: Ilias Apalodimas On a previous patch we added an extra argument on __skb_frag_unref() to handle recycling. Update the current users of the function with that. Signed-off-by: Ilias Apalodimas Signed-off-by: Matteo Croce --- drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/sky2.c| 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- net/tls/tls_device.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c index 169e10c91378..cbcff5518965 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c @@ -2125,7 +2125,7 @@ static int chcr_ktls_xmit(struct sk_buff *skb, struct net_device *dev) /* clear the frag ref count which increased locally before */ for (i = 0; i < record->num_frags; i++) { /* clear the frag ref count */ - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); } /* if any failure, come out from the loop. */ if (ret) { diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index 2a752fb6b758..1cc646fb4fe4 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2501,7 +2501,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, if (length == 0) { /* don't need this page */ - __skb_frag_unref(frag); + __skb_frag_unref(frag, false); --skb_shinfo(skb)->nr_frags; } else { size = min(length, (unsigned) PAGE_SIZE); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index e35e4d7ef4d1..cea62b8f554c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, fail: while (nr > 0) { nr--; - __skb_frag_unref(skb_shinfo(skb)->frags + nr); + __skb_frag_unref(skb_shinfo(skb)->frags + nr, false); } return 0; } diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index d9cd229aa111..2a32a547e51a 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -127,7 +127,7 @@ static void destroy_record(struct tls_record_info *record) int i; for (i = 0; i < record->num_frags; i++) - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); kfree(record); } -- 2.30.2
[PATCH net-next 2/6] mm: add a signature in struct page
From: Matteo Croce This is needed by the page_pool to avoid recycling a page not allocated via page_pool. Signed-off-by: Matteo Croce --- include/linux/mm_types.h | 1 + include/net/page_pool.h | 2 ++ net/core/page_pool.c | 4 3 files changed, 7 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 0974ad501a47..67caade433e4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -100,6 +100,7 @@ struct page { * 32-bit architectures. */ dma_addr_t dma_addr; + unsigned long signature; }; struct {/* slab, slob and slub */ union { diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b5b195305346..b30405e84b5e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -63,6 +63,8 @@ */ #define PP_ALLOC_CACHE_SIZE128 #define PP_ALLOC_CACHE_REFILL 64 +#define PP_SIGNATURE 0x20210303 + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ad8b0707af04..2ae9b554ef98 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, page_pool_dma_sync_for_device(pool, page, pool->p.max_len); skip_dma_map: + page->signature = PP_SIGNATURE; + /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) DMA_ATTR_SKIP_CPU_SYNC); page->dma_addr = 0; skip_dma_unmap: + page->signature = 0; + /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ -- 2.30.2
[PATCH net-next 1/6] xdp: reduce size of struct xdp_mem_info
From: Jesper Dangaard Brouer It is possible to compress/reduce the size of struct xdp_mem_info. This change reduce struct xdp_mem_info from 8 bytes to 4 bytes. The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE which can safely fit in u16. The member xdp_mem_info.type could be reduced more than u16, as it stores the enum xdp_mem_type, but due to alignment it is only reduced to u16. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 4 ++-- net/core/xdp.c| 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index a5bc214a49d9..c35864d59113 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -48,8 +48,8 @@ enum xdp_mem_type { #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH struct xdp_mem_info { - u32 type; /* enum xdp_mem_type, but known size type */ - u32 id; + u16 type; /* enum xdp_mem_type, but known size type */ + u16 id; }; struct page_pool; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..3dd47ed83778 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -35,11 +35,11 @@ static struct rhashtable *mem_id_ht; static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) { - const u32 *k = data; - const u32 key = *k; + const u16 *k = data; + const u16 key = *k; BUILD_BUG_ON(sizeof_field(struct xdp_mem_allocator, mem.id) -!= sizeof(u32)); +!= sizeof(u16)); /* Use cyclic increasing ID as direct hash key */ return key; @@ -49,7 +49,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct xdp_mem_allocator *xa = ptr; - u32 mem_id = *(u32 *)arg->key; + u16 mem_id = *(u16 *)arg->key; return xa->mem.id != mem_id; } -- 2.30.2
[PATCH net-next 0/6] page_pool: recycle buffers
From: Matteo Croce This series enables recycling of the buffers allocated with the page_pool API. The first two patches are just prerequisite to save space in a struct and avoid recycling pages allocated with other API. Patch 2 was based on a previous idea from Jonathan Lemon. The third one is the real recycling, 4 fixes the compilation of __skb_frag_unref users, and 5,6 enable the recycling on two drivers. In the last two patches I reported the improvement I have with the series. The recycling as is can't be used with drivers like mlx5 which do page split, but this is documented in a comment. In the future, a refcount can be used so to support mlx5 with no changes. Ilias Apalodimas (2): page_pool: DMA handling and allow to recycles frames via SKB net: change users of __skb_frag_unref() and add an extra argument Jesper Dangaard Brouer (1): xdp: reduce size of struct xdp_mem_info Matteo Croce (3): mm: add a signature in struct page mvpp2: recycle buffers mvneta: recycle buffers .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/mvneta.c | 4 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++ drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/mm_types.h | 1 + include/linux/skbuff.h| 33 +++-- include/net/page_pool.h | 15 ++ include/net/xdp.h | 5 +- net/core/page_pool.c | 47 +++ net/core/skbuff.c | 20 +++- net/core/xdp.c| 14 -- net/tls/tls_device.c | 2 +- 13 files changed, 138 insertions(+), 26 deletions(-) -- 2.30.2
[RFC net-next 6/6] mvneta: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate increased di 10%, from 269 Kpps to 296 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 21.78% [kernel] [k] __pi___inval_dcache_area 21.66% [mvneta] [k] mvneta_rx_swbm 7.00% [kernel] [k] kmem_cache_alloc 6.05% [kernel] [k] eth_type_trans 4.44% [kernel] [k] kmem_cache_free.part.0 3.80% [kernel] [k] __netif_receive_skb_core 3.68% [kernel] [k] dev_gro_receive 3.65% [kernel] [k] get_page_from_freelist 3.43% [kernel] [k] page_pool_release_page 3.35% [kernel] [k] free_unref_page And this is the same output with recycling enabled: Overhead Shared Object Symbol 24.10% [kernel] [k] __pi___inval_dcache_area 23.02% [mvneta] [k] mvneta_rx_swbm 7.19% [kernel] [k] kmem_cache_alloc 6.50% [kernel] [k] eth_type_trans 4.93% [kernel] [k] __netif_receive_skb_core 4.77% [kernel] [k] kmem_cache_free.part.0 3.93% [kernel] [k] dev_gro_receive 3.03% [kernel] [k] build_skb 2.91% [kernel] [k] page_pool_put_page 2.85% [kernel] [k] __xdp_return The test was done with mausezahn on the TX side with 64 byte raw ethernet frames. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvneta.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index a635cf84608a..8b3250394703 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2332,7 +2332,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, if (!skb) return ERR_PTR(-ENOMEM); - page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data)); + skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem); skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); @@ -2344,7 +2344,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, skb_frag_page(frag), skb_frag_off(frag), skb_frag_size(frag), PAGE_SIZE); - page_pool_release_page(rxq->page_pool, skb_frag_page(frag)); + skb_mark_for_recycle(skb, skb_frag_page(frag), &xdp->rxq->mem); } return skb; -- 2.29.2
[RFC net-next 5/6] mvpp2: recycle buffers
From: Matteo Croce Use the new recycling API for page_pool. In a drop rate test, the packet rate is more than doubled, from 962 Kpps to 2047 Kpps. perf top on a stock system shows: Overhead Shared Object Symbol 30.67% [kernel] [k] page_pool_release_page 8.37% [kernel] [k] get_page_from_freelist 7.34% [kernel] [k] free_unref_page 6.47% [mvpp2] [k] mvpp2_rx 4.69% [kernel] [k] eth_type_trans 4.55% [kernel] [k] __netif_receive_skb_core 4.40% [kernel] [k] build_skb 4.29% [kernel] [k] kmem_cache_free 4.00% [kernel] [k] kmem_cache_alloc 3.81% [kernel] [k] dev_gro_receive With packet rate stable at 962 Kpps: tx: 0 bps 0 pps rx: 477.4 Mbps 962.6 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.8 Kpps tx: 0 bps 0 pps rx: 477.6 Mbps 962.9 Kpps tx: 0 bps 0 pps rx: 477.2 Mbps 962.1 Kpps tx: 0 bps 0 pps rx: 477.5 Mbps 962.7 Kpps And this is the same output with recycling enabled: Overhead Shared Object Symbol 12.75% [mvpp2] [k] mvpp2_rx 9.56% [kernel] [k] __netif_receive_skb_core 9.29% [kernel] [k] build_skb 9.27% [kernel] [k] eth_type_trans 8.39% [kernel] [k] kmem_cache_alloc 7.85% [kernel] [k] kmem_cache_free 7.36% [kernel] [k] page_pool_put_page 6.45% [kernel] [k] dev_gro_receive 4.72% [kernel] [k] __xdp_return 3.06% [kernel] [k] page_pool_refill_alloc_cache With packet rate above 2000 Kpps: tx: 0 bps 0 pps rx: 1015 Mbps 2046 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps tx: 0 bps 0 pps rx: 1015 Mbps 2047 Kpps The major performance increase is explained by the fact that the most CPU consuming functions (page_pool_release_page, get_page_from_freelist and free_unref_page) are no longer called on a per packet basis. The test was done by sending to the macchiatobin 64 byte ethernet frames with an invalid ethertype, so the packets are dropped early in the RX path. Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 1767c60056c5..8f03bbc763bc 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3848,6 +3848,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, struct mvpp2_pcpu_stats ps = {}; enum dma_data_direction dma_dir; struct bpf_prog *xdp_prog; + struct xdp_rxq_info *rxqi; struct xdp_buff xdp; int rx_received; int rx_done = 0; @@ -3913,15 +3914,15 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, else frag_size = bm_pool->frag_size; - if (xdp_prog) { - struct xdp_rxq_info *xdp_rxq; + if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) + rxqi = &rxq->xdp_rxq_short; + else + rxqi = &rxq->xdp_rxq_long; - if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE) - xdp_rxq = &rxq->xdp_rxq_short; - else - xdp_rxq = &rxq->xdp_rxq_long; + if (xdp_prog) { + xdp.rxq = rxqi; - xdp_init_buff(&xdp, PAGE_SIZE, xdp_rxq); + xdp_init_buff(&xdp, PAGE_SIZE, rxqi); xdp_prepare_buff(&xdp, data, MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM, rx_bytes, false); @@ -3965,7 +3966,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, } if (pp) - page_pool_release_page(pp, virt_to_page(data)); + skb_mark_for_recycle(skb, virt_to_page(data), &rxqi->mem); else dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, -- 2.29.2
[RFC net-next 4/6] net: change users of __skb_frag_unref() and add an extra argument
From: Ilias Apalodimas On a previous patch we added an extra argument on __skb_frag_unref() to handle recycling. Update the current users of the function with that. Signed-off-by: Ilias Apalodimas Signed-off-by: Matteo Croce --- drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/sky2.c| 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- net/tls/tls_device.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c index 169e10c91378..cbcff5518965 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c @@ -2125,7 +2125,7 @@ static int chcr_ktls_xmit(struct sk_buff *skb, struct net_device *dev) /* clear the frag ref count which increased locally before */ for (i = 0; i < record->num_frags; i++) { /* clear the frag ref count */ - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); } /* if any failure, come out from the loop. */ if (ret) { diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index dbec8e187a68..5dc4aee56fa1 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2501,7 +2501,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, if (length == 0) { /* don't need this page */ - __skb_frag_unref(frag); + __skb_frag_unref(frag, false); --skb_shinfo(skb)->nr_frags; } else { size = min(length, (unsigned) PAGE_SIZE); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index e35e4d7ef4d1..cea62b8f554c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, fail: while (nr > 0) { nr--; - __skb_frag_unref(skb_shinfo(skb)->frags + nr); + __skb_frag_unref(skb_shinfo(skb)->frags + nr, false); } return 0; } diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index d9cd229aa111..2a32a547e51a 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -127,7 +127,7 @@ static void destroy_record(struct tls_record_info *record) int i; for (i = 0; i < record->num_frags; i++) - __skb_frag_unref(&record->frags[i]); + __skb_frag_unref(&record->frags[i], false); kfree(record); } -- 2.29.2
[RFC net-next 3/6] page_pool: DMA handling and frame recycling via SKBs
From: Ilias Apalodimas During skb_release_data() intercept the packet and if it's a buffer coming from our page_pool API recycle it back to the pool for further usage. To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and store the xdp_mem_info in page->private. The SKB bit is needed since page->private is used by skb_copy_ubufs, so we can't rely solely on page->private to trigger recycling. The driver has to take care of the sync operations on it's own during the buffer recycling since the buffer is never unmapped. In order to enable recycling the driver must call skb_mark_for_recycle() to store the information we need for recycling in page->private and enabling the recycling bit Storing the information in page->private allows us to recycle both SKBs and their fragments Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Matteo Croce --- include/linux/skbuff.h | 33 +++ include/net/page_pool.h | 13 + include/net/xdp.h | 1 + net/core/page_pool.c| 43 + net/core/skbuff.c | 20 +-- net/core/xdp.c | 6 ++ 6 files changed, 110 insertions(+), 6 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0503c917d773..d7183fae00aa 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -40,6 +40,9 @@ #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include #endif +#if IS_BUILTIN(CONFIG_PAGE_POOL) +#include +#endif /* The interface for checksum offload between the stack and networking drivers * is as follows... @@ -247,6 +250,7 @@ struct napi_struct; struct bpf_prog; union bpf_attr; struct skb_ext; +struct xdp_mem_info; #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) struct nf_bridge_info { @@ -666,6 +670,8 @@ typedef unsigned char *sk_buff_data_t; * @head_frag: skb was allocated from page fragments, * not allocated by kmalloc() or vmalloc(). * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves + * @pp_recycle: mark the packet for recycling instead of freeing (implies + * page_pool support on driver) * @active_extensions: active extensions (skb_ext_id types) * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed @@ -790,10 +796,12 @@ struct sk_buff { fclone:2, peeked:1, head_frag:1, - pfmemalloc:1; + pfmemalloc:1, + pp_recycle:1; /* page_pool recycle indicator */ #ifdef CONFIG_SKB_EXTENSIONS __u8active_extensions; #endif + /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() */ @@ -3081,12 +3089,20 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) /** * __skb_frag_unref - release a reference on a paged fragment. * @frag: the paged fragment + * @recycle: recycle the page if allocated via page_pool * * Releases a reference on the paged fragment @frag. + * or recycles the page via the page_pool API */ -static inline void __skb_frag_unref(skb_frag_t *frag) +static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle) { - put_page(skb_frag_page(frag)); + struct page *page = skb_frag_page(frag); + +#if IS_BUILTIN(CONFIG_PAGE_POOL) + if (recycle && page_pool_return_skb_page(page_address(page))) + return; +#endif + put_page(page); } /** @@ -3098,7 +3114,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag) */ static inline void skb_frag_unref(struct sk_buff *skb, int f) { - __skb_frag_unref(&skb_shinfo(skb)->frags[f]); + __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle); } /** @@ -4697,5 +4713,14 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb) #endif } +#if IS_BUILTIN(CONFIG_PAGE_POOL) +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page, + struct xdp_mem_info *mem) +{ + skb->pp_recycle = 1; + page_pool_store_mem_info(page, mem); +} +#endif + #endif /* __KERNEL__ */ #endif /* _LINUX_SKBUFF_H */ diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b30405e84b5e..75fffc15788b 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -65,6 +65,8 @@ #define PP_ALLOC_CACHE_REFILL 64 #define PP_SIGNATURE 0x20210303 +struct xdp_mem_info; + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; @@ -148,6 +150,8 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) return pool->p.dma_dir; } +bool page_pool_return_skb_
[RFC net-next 2/6] mm: add a signature in struct page
From: Matteo Croce This is needed by the page_pool to avoid recycling a page not allocated via page_pool. Signed-off-by: Matteo Croce --- include/linux/mm_types.h | 1 + include/net/page_pool.h | 2 ++ net/core/page_pool.c | 4 3 files changed, 7 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 0974ad501a47..67caade433e4 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -100,6 +100,7 @@ struct page { * 32-bit architectures. */ dma_addr_t dma_addr; + unsigned long signature; }; struct {/* slab, slob and slub */ union { diff --git a/include/net/page_pool.h b/include/net/page_pool.h index b5b195305346..b30405e84b5e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -63,6 +63,8 @@ */ #define PP_ALLOC_CACHE_SIZE128 #define PP_ALLOC_CACHE_REFILL 64 +#define PP_SIGNATURE 0x20210303 + struct pp_alloc_cache { u32 count; void *cache[PP_ALLOC_CACHE_SIZE]; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index ad8b0707af04..2ae9b554ef98 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, page_pool_dma_sync_for_device(pool, page, pool->p.max_len); skip_dma_map: + page->signature = PP_SIGNATURE; + /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) DMA_ATTR_SKIP_CPU_SYNC); page->dma_addr = 0; skip_dma_unmap: + page->signature = 0; + /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ -- 2.29.2
[RFC net-next 1/6] xdp: reduce size of struct xdp_mem_info
From: Jesper Dangaard Brouer It is possible to compress/reduce the size of struct xdp_mem_info. This change reduce struct xdp_mem_info from 8 bytes to 4 bytes. The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE which can safely fit in u16. The member xdp_mem_info.type could be reduced more than u16, as it stores the enum xdp_mem_type, but due to alignment it is only reduced to u16. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 4 ++-- net/core/xdp.c| 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index a5bc214a49d9..c35864d59113 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -48,8 +48,8 @@ enum xdp_mem_type { #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH struct xdp_mem_info { - u32 type; /* enum xdp_mem_type, but known size type */ - u32 id; + u16 type; /* enum xdp_mem_type, but known size type */ + u16 id; }; struct page_pool; diff --git a/net/core/xdp.c b/net/core/xdp.c index 05354976c1fc..3dd47ed83778 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -35,11 +35,11 @@ static struct rhashtable *mem_id_ht; static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) { - const u32 *k = data; - const u32 key = *k; + const u16 *k = data; + const u16 key = *k; BUILD_BUG_ON(sizeof_field(struct xdp_mem_allocator, mem.id) -!= sizeof(u32)); +!= sizeof(u16)); /* Use cyclic increasing ID as direct hash key */ return key; @@ -49,7 +49,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct xdp_mem_allocator *xa = ptr; - u32 mem_id = *(u32 *)arg->key; + u16 mem_id = *(u16 *)arg->key; return xa->mem.id != mem_id; } -- 2.29.2
[RFC net-next 0/6] page_pool: recycle buffers
From: Matteo Croce This series enables recycling of the buffers allocated with the page_pool API. The first two patches are just prerequisite to save space in a struct and avoid recycling pages allocated with other API. Patch 2 was based on a previous idea from Jonathan Lemon. The third one is the real recycling, 4 fixes the compilation of __skb_frag_unref users, and 5,6 enable the recycling on two drivers. In the last two patches I reported the improvement I have with the series. The recycling as is can't be used with drivers like mlx5 which do page split, but this is documented in a comment. In the future, a refcount can be used so to support mlx5 with no changes. Ilias Apalodimas (2): page_pool: DMA handling and frame recycling via SKBs net: change users of __skb_frag_unref() and add an extra argument Jesper Dangaard Brouer (1): xdp: reduce size of struct xdp_mem_info Matteo Croce (3): mm: add a signature in struct page mvpp2: recycle buffers mvneta: recycle buffers .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/mvneta.c | 4 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++ drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c| 2 +- include/linux/mm_types.h | 1 + include/linux/skbuff.h| 33 +++-- include/net/page_pool.h | 15 ++ include/net/xdp.h | 5 +- net/core/page_pool.c | 47 +++ net/core/skbuff.c | 20 +++- net/core/xdp.c| 14 -- net/tls/tls_device.c | 2 +- 13 files changed, 138 insertions(+), 26 deletions(-) -- 2.29.2
[PATCH RESEND net-next] cfg80211: remove unused callback
From: Matteo Croce The ieee80211 class registers a callback which actually does nothing. Given that the callback is optional, and all its accesses are protected by a NULL check, remove it entirely. Signed-off-by: Matteo Croce --- net/wireless/sysfs.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/net/wireless/sysfs.c b/net/wireless/sysfs.c index 3ac1f48195d2..41cb4d565149 100644 --- a/net/wireless/sysfs.c +++ b/net/wireless/sysfs.c @@ -81,12 +81,6 @@ static void wiphy_dev_release(struct device *dev) cfg80211_dev_free(rdev); } -static int wiphy_uevent(struct device *dev, struct kobj_uevent_env *env) -{ - /* TODO, we probably need stuff here */ - return 0; -} - #ifdef CONFIG_PM_SLEEP static void cfg80211_leave_all(struct cfg80211_registered_device *rdev) { @@ -157,7 +151,6 @@ struct class ieee80211_class = { .owner = THIS_MODULE, .dev_release = wiphy_dev_release, .dev_groups = ieee80211_groups, - .dev_uevent = wiphy_uevent, .pm = WIPHY_PM_OPS, .ns_type = &net_ns_type_operations, .namespace = wiphy_namespace, -- 2.29.2
Re: [PATCH net-next] cfg80211: remove unused callback
On Sat, Feb 6, 2021 at 1:29 PM Matteo Croce wrote: > > From: Matteo Croce > > The ieee80211 class registers a callback which actually does nothing. > Given that the callback is optional, and all its accesses are protected > by a NULL check, remove it entirely. > > Signed-off-by: Matteo Croce > --- I just noticed that I forgot to cc linux-wireless, I will resend it -- per aspera ad upstream
[PATCH net-next] cfg80211: remove unused callback
From: Matteo Croce The ieee80211 class registers a callback which actually does nothing. Given that the callback is optional, and all its accesses are protected by a NULL check, remove it entirely. Signed-off-by: Matteo Croce --- net/wireless/sysfs.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/net/wireless/sysfs.c b/net/wireless/sysfs.c index 3ac1f48195d2..41cb4d565149 100644 --- a/net/wireless/sysfs.c +++ b/net/wireless/sysfs.c @@ -81,12 +81,6 @@ static void wiphy_dev_release(struct device *dev) cfg80211_dev_free(rdev); } -static int wiphy_uevent(struct device *dev, struct kobj_uevent_env *env) -{ - /* TODO, we probably need stuff here */ - return 0; -} - #ifdef CONFIG_PM_SLEEP static void cfg80211_leave_all(struct cfg80211_registered_device *rdev) { @@ -157,7 +151,6 @@ struct class ieee80211_class = { .owner = THIS_MODULE, .dev_release = wiphy_dev_release, .dev_groups = ieee80211_groups, - .dev_uevent = wiphy_uevent, .pm = WIPHY_PM_OPS, .ns_type = &net_ns_type_operations, .namespace = wiphy_namespace, -- 2.29.2
Re: [PATCH net-next v3] mptcp: fix length of MP_PRIO suboption
On Mon, Feb 1, 2021 at 2:08 PM Davide Caratti wrote: > > With version 0 of the protocol it was legal to encode the 'Subflow Id' in > the MP_PRIO suboption, to specify which subflow would change its 'Backup' > flag. This has been removed from v1 specification: thus, according to RFC > 8684 §3.3.8, the resulting 'Length' for MP_PRIO changed from 4 to 3 byte. > > Current Linux generates / parses MP_PRIO according to the old spec, using > 'Length' equal to 4, and hardcoding 1 as 'Subflow Id'; RFC compliance can > improve if we change 'Length' in other to become 3, leaving a 'Nop' after > the MP_PRIO suboption. In this way the kernel will emit and accept *only* > MP_PRIO suboptions that are compliant to version 1 of the MPTCP protocol. > > unpatched 5.11-rc kernel: > [root@bottarga ~]# tcpdump -tnnr unpatched.pcap | grep prio > reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1) > dropped privs to tcpdump > IP 10.0.3.2.48433 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options > [nop,nop,TS val 4032325513 ecr 1876514270,mptcp prio non-backup id 1,mptcp > dss ack 14084896651682217737], length 0 > > patched 5.11-rc kernel: > [root@bottarga ~]# tcpdump -tnnr patched.pcap | grep prio > reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1) > dropped privs to tcpdump > IP 10.0.3.2.49735 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options > [nop,nop,TS val 1276737699 ecr 2686399734,mptcp prio non-backup,nop,mptcp dss > ack 18433038869082491686], length 0 > > Changes since v2: > - when accounting for option space, don't increment 'TCPOLEN_MPTCP_PRIO' >and use 'TCPOLEN_MPTCP_PRIO_ALIGN' instead, thanks to Matthieu Baerts. > Changes since v1: > - refactor patch to avoid using 'TCPOLEN_MPTCP_PRIO' with its old value, >thanks to Geliang Tang. > > Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") > Reviewed-by: Mat Martineau > Reviewed-by: Matthieu Baerts > Signed-off-by: Davide Caratti Reviewed-by: Matteo Croce -- per aspera ad upstream
Re: [PATCH net 0/2] ipv6: fixes for the multicast routes
On Sat, Jan 16, 2021 at 5:41 AM David Ahern wrote: > > On 1/15/21 4:12 PM, Matteo Croce wrote: > > On Fri, Jan 15, 2021 at 11:50 PM Jakub Kicinski wrote: > >> > >> On Fri, 15 Jan 2021 19:42:07 +0100 Matteo Croce wrote: > >>> From: Matteo Croce > >>> > >>> Fix two wrong flags in the IPv6 multicast routes created > >>> by the autoconf code. > >> > >> Any chance for Fixes tags here? > > > > Right. > > For 1/2 I don't know exactly, that code was touched last time in > > 86872cb57925 ("[IPv6] route: FIB6 configuration using struct > > fib6_config"), but it was only refactored. Before 86872cb57925, it was > > pushed in the git import commit by Linus: 1da177e4c3f4 > > ("Linux-2.6.12-rc2"). > > BTW, according the history repo, it entered the tree in the 2.4.0 > > import, so I'd say it's here since the beginning. > > > > While for 2/2 I'd say: > > > > Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") > > > > As I recall (memory jogging from commit description) my patch only moved > the setting from ip6_route_info_create default to here. > > The change is correct, just thinking it goes back beyond 4.16. If > someone has a system running a 4.14 or earlier kernel it should be easy > to know if this was the default prior. Indeed, it was the same long before 4.14: # uname -a Linux ubuntu 4.4.0-142-generic #168~14.04.1-Ubuntu SMP Sat Jan 19 11:26:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux # ip -6 -d route show table local dev eth0 unicast ff00::/8 proto boot scope global metric 256 pref medium Regards, -- per aspera ad upstream
Re: [PATCH net 1/2] ipv6: create multicast route with RTPROT_KERNEL
On Sat, Jan 16, 2021 at 5:36 AM David Ahern wrote: > > On 1/15/21 11:42 AM, Matteo Croce wrote: > > From: Matteo Croce > > > > The ff00::/8 multicast route is created without specifying the fc_protocol > > field, so the default RTPROT_BOOT value is used: > > > > $ ip -6 -d route > > unicast ::1 dev lo proto kernel scope global metric 256 pref medium > > unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref > > medium > > unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium > > > > As the documentation says, this value identifies routes installed during > > boot, but the route is created when interface is set up. > > Change the value to RTPROT_KERNEL which is a better value. > > > > Signed-off-by: Matteo Croce > > --- > > net/ipv6/addrconf.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > > index eff2cacd5209..19bf6822911c 100644 > > --- a/net/ipv6/addrconf.c > > +++ b/net/ipv6/addrconf.c > > @@ -2469,6 +2469,7 @@ static void addrconf_add_mroute(struct net_device > > *dev) > > .fc_flags = RTF_UP, > > .fc_type = RTN_UNICAST, > > .fc_nlinfo.nl_net = dev_net(dev), > > + .fc_protocol = RTPROT_KERNEL, > > }; > > > > ipv6_addr_set(&cfg.fc_dst, htonl(0xFF00), 0, 0, 0); > > > > > What's the motivation for changing this? ie., what s/w cares that it is > kernel vs boot? A practical example: systemd-networkd explicitly ignores routes with kernel flag[1]. Having a different flag leads the daemon to remove the route sooner or later. [1] https://github.com/systemd/systemd/blob/0b81225e5791f660506f7db0ab88078cf296b771/src/network/networkd-routing-policy-rule.c#L675-L677 -- per aspera ad upstream
Re: [PATCH net 0/2] ipv6: fixes for the multicast routes
On Fri, Jan 15, 2021 at 11:50 PM Jakub Kicinski wrote: > > On Fri, 15 Jan 2021 19:42:07 +0100 Matteo Croce wrote: > > From: Matteo Croce > > > > Fix two wrong flags in the IPv6 multicast routes created > > by the autoconf code. > > Any chance for Fixes tags here? Right. For 1/2 I don't know exactly, that code was touched last time in 86872cb57925 ("[IPv6] route: FIB6 configuration using struct fib6_config"), but it was only refactored. Before 86872cb57925, it was pushed in the git import commit by Linus: 1da177e4c3f4 ("Linux-2.6.12-rc2"). BTW, according the history repo, it entered the tree in the 2.4.0 import, so I'd say it's here since the beginning. While for 2/2 I'd say: Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") -- per aspera ad upstream
[PATCH net 2/2] ipv6: set multicast flag on the multicast route
From: Matteo Croce The multicast route ff00::/8 is created with type RTN_UNICAST: $ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium Set the type to RTN_MULTICAST which is more appropriate. Signed-off-by: Matteo Croce --- net/ipv6/addrconf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 19bf6822911c..9edc5bb2d531 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2467,7 +2467,7 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_ifindex = dev->ifindex, .fc_dst_len = 8, .fc_flags = RTF_UP, - .fc_type = RTN_UNICAST, + .fc_type = RTN_MULTICAST, .fc_nlinfo.nl_net = dev_net(dev), .fc_protocol = RTPROT_KERNEL, }; -- 2.29.2
[PATCH net 0/2] ipv6: fixes for the multicast routes
From: Matteo Croce Fix two wrong flags in the IPv6 multicast routes created by the autoconf code. Matteo Croce (2): ipv6: create multicast route with RTPROT_KERNEL ipv6: set multicast flag on the multicast route net/ipv6/addrconf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- 2.29.2
[PATCH net 1/2] ipv6: create multicast route with RTPROT_KERNEL
From: Matteo Croce The ff00::/8 multicast route is created without specifying the fc_protocol field, so the default RTPROT_BOOT value is used: $ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium As the documentation says, this value identifies routes installed during boot, but the route is created when interface is set up. Change the value to RTPROT_KERNEL which is a better value. Signed-off-by: Matteo Croce --- net/ipv6/addrconf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index eff2cacd5209..19bf6822911c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2469,6 +2469,7 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_flags = RTF_UP, .fc_type = RTN_UNICAST, .fc_nlinfo.nl_net = dev_net(dev), + .fc_protocol = RTPROT_KERNEL, }; ipv6_addr_set(&cfg.fc_dst, htonl(0xFF00), 0, 0, 0); -- 2.29.2
[PATCH iproute2-next] ip: don't use program name to select command
From: Matteo Croce ip has an ancient behaviour of looking at its program name to determine the command to run. If the name is longer than 2 characters, the first two letters are stripped and the others are interpreted as the command name: $ ln -s /sbin/ip iproute $ ln -s /sbin/ip ipa $ ./iproute default via 192.168.55.1 dev wlp0s20f3 proto dhcp metric 600 192.168.55.0/24 dev wlp0s20f3 proto kernel scope link src 192.168.55.26 metric 600 192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown $ ./ipa show dev lo 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever This creates problems when the ip binary is renamed. For example, Yocto renames it to 'ip.iproute2' when also the busybox implementation is present, giving the following error: $ ip.iproute2 Object ".iproute2" is unknown, try "ip help". Since noone is using it, remove this undocumented feature. Signed-off-by: Matteo Croce --- ip/ip.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/ip/ip.c b/ip/ip.c index 40d2998a..9b772307 100644 --- a/ip/ip.c +++ b/ip/ip.c @@ -311,9 +311,6 @@ int main(int argc, char **argv) rtnl_set_strict_dump(&rth); - if (strlen(basename) > 2) - return do_cmd(basename+2, argc, argv); - if (argc > 1) return do_cmd(argv[1], argc-1, argv+1); -- 2.29.2
Re: [PATCH net] net: openvswitch: fix TTL decrement action netlink message format
On Fri, Nov 20, 2020 at 10:12 PM Jakub Kicinski wrote: > > On Thu, 19 Nov 2020 04:04:04 -0500 Eelco Chaudron wrote: > > Currently, the openvswitch module is not accepting the correctly formated > > netlink message for the TTL decrement action. For both setting and getting > > the dec_ttl action, the actions should be nested in the > > OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi. > > IOW this change will not break any known user space, correct? > > But existing OvS user space already expects it to work like you > make it work now? > > What's the harm in leaving it as is? > > > Fixes: 744676e77720 ("openvswitch: add TTL decrement action") > > Signed-off-by: Eelco Chaudron > > Can we get a review from OvS folks? Matteo looks good to you (as the > original author)? > Hi, I think that the userspace still has to implement the dec_ttl action; by now dec_ttl is implemented with set_ttl(). So there is no breakage yet. Eelco, with this fix we will encode the netlink attribute in the same way for the kernel and netdev datapath? If so, go for it. > > - err = __ovs_nla_copy_actions(net, attr, key, sfa, eth_type, > > + err = __ovs_nla_copy_actions(net, actions, key, sfa, eth_type, > >vlan_tci, mpls_label_count, log); > > if (err) > > return err; > > You're not canceling any nests on error, I assume this is normal. > > > + add_nested_action_end(*sfa, action_start); > > add_nested_action_end(*sfa, start); > > return 0; > > } > -- per aspera ad upstream
Re: [PATCH net-next] net: marvell: mvpp2: Fix W=1 warning with !CONFIG_ACPI
> From: Andrew Lunn > Subject: [PATCH net-next] net: marvell: mvpp2: Fix W=1 warning with > !CONFIG_ACPI > > Wrap the definition inside #ifdef/#endif. > > Compile tested only. > > Signed-off-by: Andrew Lunn Looks good, ACPI_PTR() should be NULL if !CONFIG_ACPI -- per aspera ad upstream
Re: [PATCH net] net: mvpp2: fix memory leak in mvpp2_rx
> Release skb memory in mvpp2_rx() if mvpp2_rx_refill routine fails > > Fixes: b5015854674b ("net: mvpp2: fix refilling BM pools in RX path") > Signed-off-by: Lorenzo Bianconi I think that mvpp2_rx() has changed a bit in net-next due to the XDP support, but it should apply too. Acked-by: Matteo Croce
[PATCH net-next] mvpp2: fix pointer check
From: Matteo Croce priv->page_pool is an array, so comparing against it will always return true. Do a meaningful check by checking priv->page_pool[0] instead. While at it, clear the page_pool pointers on deallocation, or when an allocation error happens during init. Reported-by: Colin Ian King Fixes: c2d6fe6163de ("mvpp2: XDP TX support") Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index d8e238ed533e..6a3f356640a0 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -548,8 +548,10 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, val |= MVPP2_BM_STOP_MASK; mvpp2_write(priv, MVPP2_BM_POOL_CTRL_REG(bm_pool->id), val); - if (priv->percpu_pools) + if (priv->percpu_pools) { page_pool_destroy(priv->page_pool[bm_pool->id]); + priv->page_pool[bm_pool->id] = NULL; + } dma_free_coherent(dev, bm_pool->size_bytes, bm_pool->virt_addr, @@ -609,8 +611,15 @@ static int mvpp2_bm_init(struct device *dev, struct mvpp2 *priv) mvpp2_pools[pn].buf_num, mvpp2_pools[pn].pkt_size, dma_dir); - if (IS_ERR(priv->page_pool[i])) + if (IS_ERR(priv->page_pool[i])) { + int j; + + for (j = 0; j < i; j++) { + page_pool_destroy(priv->page_pool[j]); + priv->page_pool[j] = NULL; + } return PTR_ERR(priv->page_pool[i]); + } } } @@ -4486,7 +4495,7 @@ static int mvpp2_check_pagepool_dma(struct mvpp2_port *port) if (!priv->percpu_pools) return err; - if (!priv->page_pool) + if (!priv->page_pool[0]) return -ENOMEM; for (i = 0; i < priv->port_count; i++) { -- 2.26.2
Re: mvpp2: XDP TX support
On Mon, 6 Jul 2020 14:59:22 +0100 Colin Ian King wrote: > Hi, > > Static analysis with Coverity has found a potential issue in the > following commit: > > commit c2d6fe6163de80d7f7cf400ee351f56d6cdb7a5a > Author: Matteo Croce > Date: Thu Jul 2 16:12:43 2020 +0200 > > mvpp2: XDP TX support > > > In source drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c in function > mvpp2_check_pagepool_dma, analysis is as follows: > > > 4486if (!priv->percpu_pools) > 4487return err; > 4488 > CID (#1 of 1): Array compared against 0 (NO_EFFECT) > array_null: Comparing an array to null is not useful: priv->page_pool, > since the test will always evaluate as true. > > Was priv->page_pool formerly declared as a pointer? > > 4489if (!priv->page_pool) > 4490return -ENOMEM; > 4491 > > > page_pool is declared as: > > struct page_pool *page_pool[MVPP2_PORT_MAX_RXQ]; > > ..it is an array and hence cannot be null, so the null check is > redundant. Later on there is a reference of priv->page_pool[0], so > was the check meant to be: > > if (!priv->page_pool[0]) > > Colin Hi, yes, the check was meant to be 'if (!priv->page_pool[0])'. Maybe it's a copy/paste error from other points where 'page_pool' is a local variable. While at it, I've found that in case a page_pool allocation fails, I don't cleanup the previously allocated pools, and upon deallocation the pointer isn't set back to NULL. I should add something like: @@ -548,8 +548,10 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, val |= MVPP2_BM_STOP_MASK; mvpp2_write(priv, MVPP2_BM_POOL_CTRL_REG(bm_pool->id), val); - if (priv->percpu_pools) + if (priv->percpu_pools) { page_pool_destroy(priv->page_pool[bm_pool->id]); + priv->page_pool[bm_pool->id] = NULL; + } dma_free_coherent(dev, bm_pool->size_bytes, bm_pool->virt_addr, @@ -609,8 +611,15 @@ static int mvpp2_bm_init(struct device *dev, struct mvpp2 *priv) mvpp2_pools[pn].buf_num, mvpp2_pools[pn].pkt_size, dma_dir); - if (IS_ERR(priv->page_pool[i])) - return PTR_ERR(priv->page_pool[i]); + if (IS_ERR(priv->page_pool[i])) { + err = PTR_ERR(priv->page_pool[i]); + + for (i--; i >=0; i--) { + page_pool_destroy(priv->page_pool[i]); + priv->page_pool[i] = NULL; + } + return err; + } } } Looks sane to you? Regards, -- per aspera ad upstream
Re: [PATCH net-next v2 0/5] mvpp2: XDP support
On Thu, Jul 2, 2020 at 4:12 PM Matteo Croce wrote: > > From: Matteo Croce > > From: Matteo Croce > DUP! Sorry. -- per aspera ad upstream
[PATCH net-next v2 5/5] mvpp2: xdp ethtool stats
From: Sven Auhagen Add ethtool statistics for XDP. Signed-off-by: Sven Auhagen --- drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 8 + .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 158 -- 2 files changed, 148 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index c52955b33fab..ebec47087b27 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -846,6 +846,14 @@ struct mvpp2_pcpu_stats { u64 rx_bytes; u64 tx_packets; u64 tx_bytes; + /* XDP */ + u64 xdp_redirect; + u64 xdp_pass; + u64 xdp_drop; + u64 xdp_xmit; + u64 xdp_xmit_err; + u64 xdp_tx; + u64 xdp_tx_err; }; /* Per-CPU port control */ diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index d49a814311be..d8e238ed533e 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -1485,6 +1485,16 @@ static void mvpp2_port_loopback_set(struct mvpp2_port *port, writel(val, port->base + MVPP2_GMAC_CTRL_1_REG); } +enum { + ETHTOOL_XDP_REDIRECT, + ETHTOOL_XDP_PASS, + ETHTOOL_XDP_DROP, + ETHTOOL_XDP_TX, + ETHTOOL_XDP_TX_ERR, + ETHTOOL_XDP_XMIT, + ETHTOOL_XDP_XMIT_ERR, +}; + struct mvpp2_ethtool_counter { unsigned int offset; const char string[ETH_GSTRING_LEN]; @@ -1577,10 +1587,21 @@ static const struct mvpp2_ethtool_counter mvpp2_ethtool_rxq_regs[] = { { MVPP2_RX_PKTS_BM_DROP_CTR, "rxq_%d_packets_bm_drops" }, }; +static const struct mvpp2_ethtool_counter mvpp2_ethtool_xdp[] = { + { ETHTOOL_XDP_REDIRECT, "rx_xdp_redirect", }, + { ETHTOOL_XDP_PASS, "rx_xdp_pass", }, + { ETHTOOL_XDP_DROP, "rx_xdp_drop", }, + { ETHTOOL_XDP_TX, "rx_xdp_tx", }, + { ETHTOOL_XDP_TX_ERR, "rx_xdp_tx_errors", }, + { ETHTOOL_XDP_XMIT, "tx_xdp_xmit", }, + { ETHTOOL_XDP_XMIT_ERR, "tx_xdp_xmit_errors", }, +}; + #define MVPP2_N_ETHTOOL_STATS(ntxqs, nrxqs) (ARRAY_SIZE(mvpp2_ethtool_mib_regs) + \ ARRAY_SIZE(mvpp2_ethtool_port_regs) + \ (ARRAY_SIZE(mvpp2_ethtool_txq_regs) * (ntxqs)) + \ - (ARRAY_SIZE(mvpp2_ethtool_rxq_regs) * (nrxqs))) + (ARRAY_SIZE(mvpp2_ethtool_rxq_regs) * (nrxqs)) + \ +ARRAY_SIZE(mvpp2_ethtool_xdp)) static void mvpp2_ethtool_get_strings(struct net_device *netdev, u32 sset, u8 *data) @@ -1619,10 +1640,57 @@ static void mvpp2_ethtool_get_strings(struct net_device *netdev, u32 sset, data += ETH_GSTRING_LEN; } } + + for (i = 0; i < ARRAY_SIZE(mvpp2_ethtool_xdp); i++) { + strscpy(data, mvpp2_ethtool_xdp[i].string, + ETH_GSTRING_LEN); + data += ETH_GSTRING_LEN; + } +} + +static void +mvpp2_get_xdp_stats(struct mvpp2_port *port, struct mvpp2_pcpu_stats *xdp_stats) +{ + unsigned int start; + unsigned int cpu; + + /* Gather XDP Statistics */ + for_each_possible_cpu(cpu) { + struct mvpp2_pcpu_stats *cpu_stats; + u64 xdp_redirect; + u64 xdp_pass; + u64 xdp_drop; + u64 xdp_xmit; + u64 xdp_xmit_err; + u64 xdp_tx; + u64 xdp_tx_err; + + cpu_stats = per_cpu_ptr(port->stats, cpu); + do { + start = u64_stats_fetch_begin_irq(&cpu_stats->syncp); + xdp_redirect = cpu_stats->xdp_redirect; + xdp_pass = cpu_stats->xdp_pass; + xdp_drop = cpu_stats->xdp_drop; + xdp_xmit = cpu_stats->xdp_xmit; + xdp_xmit_err = cpu_stats->xdp_xmit_err; + xdp_tx = cpu_stats->xdp_tx; + xdp_tx_err = cpu_stats->xdp_tx_err; + } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, start)); + + xdp_stats->xdp_redirect += xdp_redirect; + xdp_stats->xdp_pass += xdp_pass; + xdp_stats->xdp_drop += xdp_drop; + xdp_stats->xdp_xmit += xdp_xmit; + xdp_stats->xdp_xmit_err += xdp_xmit_err; + xdp_stats->xdp_tx += xdp_tx; + xdp_stats->xdp_tx_err += xdp_tx_err; + } } static void mvpp2_read_stats(struct mvpp2_port *port) { + struct mvpp2_pcpu_stats xdp_stats = {}; + const struct mvpp2_ethtool_counter *s; u64 *pst
[PATCH net-next v2 4/5] mvpp2: XDP TX support
From: Matteo Croce Add the transmit part of XDP support, which includes: - support for XDP_TX in mvpp2_xdp() - .ndo_xdp_xmit hook for AF_XDP and XDP_REDIRECT with mvpp2 as destination mvpp2_xdp_submit_frame() is a generic function which is called by mvpp2_xdp_xmit_back() when doing XDP_TX, and by mvpp2_xdp_xmit when doing AF_XDP or XDP_REDIRECT target. The buffer allocation has been reworked to be able to map the buffers as DMA_FROM_DEVICE or DMA_BIDIRECTIONAL depending if native XDP is in use or not. Co-developed-by: Sven Auhagen Signed-off-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 13 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 304 +++--- 2 files changed, 272 insertions(+), 45 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index f351e41c9da6..c52955b33fab 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -1082,9 +1082,20 @@ struct mvpp2_rx_desc { }; }; +enum mvpp2_tx_buf_type { + MVPP2_TYPE_SKB, + MVPP2_TYPE_XDP_TX, + MVPP2_TYPE_XDP_NDO, +}; + struct mvpp2_txq_pcpu_buf { + enum mvpp2_tx_buf_type type; + /* Transmitted SKB */ - struct sk_buff *skb; + union { + struct xdp_frame *xdpf; + struct sk_buff *skb; + }; /* Physical address of transmitted buffer */ dma_addr_t dma; diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 5f740c96a02c..d49a814311be 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -97,7 +97,8 @@ static inline u32 mvpp2_cpu_to_thread(struct mvpp2 *priv, int cpu) } static struct page_pool * -mvpp2_create_page_pool(struct device *dev, int num, int len) +mvpp2_create_page_pool(struct device *dev, int num, int len, + enum dma_data_direction dma_dir) { struct page_pool_params pp_params = { /* internal DMA mapping in page_pool */ @@ -105,7 +106,7 @@ mvpp2_create_page_pool(struct device *dev, int num, int len) .pool_size = num, .nid = NUMA_NO_NODE, .dev = dev, - .dma_dir = DMA_FROM_DEVICE, + .dma_dir = dma_dir, .offset = MVPP2_SKB_HEADROOM, .max_len = len, }; @@ -299,12 +300,17 @@ static void mvpp2_txq_inc_get(struct mvpp2_txq_pcpu *txq_pcpu) static void mvpp2_txq_inc_put(struct mvpp2_port *port, struct mvpp2_txq_pcpu *txq_pcpu, - struct sk_buff *skb, - struct mvpp2_tx_desc *tx_desc) + void *data, + struct mvpp2_tx_desc *tx_desc, + enum mvpp2_tx_buf_type buf_type) { struct mvpp2_txq_pcpu_buf *tx_buf = txq_pcpu->buffs + txq_pcpu->txq_put_index; - tx_buf->skb = skb; + tx_buf->type = buf_type; + if (buf_type == MVPP2_TYPE_SKB) + tx_buf->skb = data; + else + tx_buf->xdpf = data; tx_buf->size = mvpp2_txdesc_size_get(port, tx_desc); tx_buf->dma = mvpp2_txdesc_dma_addr_get(port, tx_desc) + mvpp2_txdesc_offset_get(port, tx_desc); @@ -527,9 +533,6 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, int buf_num; u32 val; - if (priv->percpu_pools) - page_pool_destroy(priv->page_pool[bm_pool->id]); - buf_num = mvpp2_check_hw_buf_num(priv, bm_pool); mvpp2_bm_bufs_free(dev, priv, bm_pool, buf_num); @@ -545,6 +548,9 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, val |= MVPP2_BM_STOP_MASK; mvpp2_write(priv, MVPP2_BM_POOL_CTRL_REG(bm_pool->id), val); + if (priv->percpu_pools) + page_pool_destroy(priv->page_pool[bm_pool->id]); + dma_free_coherent(dev, bm_pool->size_bytes, bm_pool->virt_addr, bm_pool->dma_addr); @@ -580,9 +586,19 @@ static int mvpp2_bm_pools_init(struct device *dev, struct mvpp2 *priv) static int mvpp2_bm_init(struct device *dev, struct mvpp2 *priv) { + enum dma_data_direction dma_dir = DMA_FROM_DEVICE; int i, err, poolnum = MVPP2_BM_POOLS_NUM; + struct mvpp2_port *port; if (priv->percpu_pools) { + for (i = 0; i < priv->port_count; i++) { + port = priv->port_list[i]; + if (port->xdp_prog) { + dma_dir = DMA_BIDIRECTIONAL; + break; + } + } +
[PATCH net-next v2 2/5] mvpp2: use page_pool allocator
From: Matteo Croce Use the page_pool API for memory management. This is a prerequisite for native XDP support. Tested-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/Kconfig | 1 + drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 8 + .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 156 +++--- 3 files changed, 140 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig index cd8ddd1ef6f2..ef4f35ba077d 100644 --- a/drivers/net/ethernet/marvell/Kconfig +++ b/drivers/net/ethernet/marvell/Kconfig @@ -87,6 +87,7 @@ config MVPP2 depends on ARCH_MVEBU || COMPILE_TEST select MVMDIO select PHYLINK + select PAGE_POOL help This driver supports the network interface units in the Marvell ARMADA 375, 7K and 8K SoCs. diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index 543a310ec102..4c16c9e9c1e5 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -15,6 +15,7 @@ #include #include #include +#include /* Fifo Registers */ #define MVPP2_RX_DATA_FIFO_SIZE_REG(port) (0x00 + 4 * (port)) @@ -820,6 +821,9 @@ struct mvpp2 { /* RSS Indirection tables */ struct mvpp2_rss_table *rss_tables[MVPP22_N_RSS_TABLES]; + + /* page_pool allocator */ + struct page_pool *page_pool[MVPP2_PORT_MAX_RXQ]; }; struct mvpp2_pcpu_stats { @@ -1161,6 +1165,10 @@ struct mvpp2_rx_queue { /* Port's logic RXQ number to which physical RXQ is mapped */ int logic_rxq; + + /* XDP memory accounting */ + struct xdp_rxq_info xdp_rxq_short; + struct xdp_rxq_info xdp_rxq_long; }; struct mvpp2_bm_pool { diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 027de7291f92..5d0e02c161a6 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -95,6 +95,22 @@ static inline u32 mvpp2_cpu_to_thread(struct mvpp2 *priv, int cpu) return cpu % priv->nthreads; } +static struct page_pool * +mvpp2_create_page_pool(struct device *dev, int num, int len) +{ + struct page_pool_params pp_params = { + /* internal DMA mapping in page_pool */ + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, + .pool_size = num, + .nid = NUMA_NO_NODE, + .dev = dev, + .dma_dir = DMA_FROM_DEVICE, + .max_len = len, + }; + + return page_pool_create(&pp_params); +} + /* These accessors should be used to access: * * - per-thread registers, where each thread has its own copy of the @@ -327,17 +343,25 @@ static inline int mvpp2_txq_phys(int port, int txq) return (MVPP2_MAX_TCONT + port) * MVPP2_MAX_TXQ + txq; } -static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool) +/* Returns a struct page if page_pool is set, otherwise a buffer */ +static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool, + struct page_pool *page_pool) { + if (page_pool) + return page_pool_dev_alloc_pages(page_pool); + if (likely(pool->frag_size <= PAGE_SIZE)) return netdev_alloc_frag(pool->frag_size); - else - return kmalloc(pool->frag_size, GFP_ATOMIC); + + return kmalloc(pool->frag_size, GFP_ATOMIC); } -static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, void *data) +static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, + struct page_pool *page_pool, void *data) { - if (likely(pool->frag_size <= PAGE_SIZE)) + if (page_pool) + page_pool_put_full_page(page_pool, virt_to_head_page(data), false); + else if (likely(pool->frag_size <= PAGE_SIZE)) skb_free_frag(data); else kfree(data); @@ -442,6 +466,7 @@ static void mvpp2_bm_bufs_get_addrs(struct device *dev, struct mvpp2 *priv, static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv, struct mvpp2_bm_pool *bm_pool, int buf_num) { + struct page_pool *pp = NULL; int i; if (buf_num > bm_pool->buf_num) { @@ -450,6 +475,9 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv, buf_num = bm_pool->buf_num; } + if (priv->percpu_pools) + pp = priv->page_pool[bm_pool->id]; + for (i = 0; i < buf_num; i++) { dma_addr_t buf_dma_addr; phys_addr_t buf_phys_addr; @@ -458,14 +486,15 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv, mvpp2_bm_bufs_get_addrs(dev, priv, bm
[PATCH net-next v2 3/5] mvpp2: add basic XDP support
From: Matteo Croce Add XDP native support. By now only XDP_DROP, XDP_PASS and XDP_REDIRECT verdicts are supported. Co-developed-by: Sven Auhagen Signed-off-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 28 ++- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 165 +- 2 files changed, 185 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index 4c16c9e9c1e5..f351e41c9da6 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -16,6 +16,18 @@ #include #include #include +#include +#include + +/* The PacketOffset field is measured in units of 32 bytes and is 3 bits wide, + * so the maximum offset is 7 * 32 = 224 + */ +#define MVPP2_SKB_HEADROOM min(max(XDP_PACKET_HEADROOM, NET_SKB_PAD), 224) + +#define MVPP2_XDP_PASS 0 +#define MVPP2_XDP_DROPPED BIT(0) +#define MVPP2_XDP_TX BIT(1) +#define MVPP2_XDP_REDIRBIT(2) /* Fifo Registers */ #define MVPP2_RX_DATA_FIFO_SIZE_REG(port) (0x00 + 4 * (port)) @@ -629,10 +641,12 @@ ALIGN((mtu) + MVPP2_MH_SIZE + MVPP2_VLAN_TAG_LEN + \ ETH_HLEN + ETH_FCS_LEN, cache_line_size()) -#define MVPP2_RX_BUF_SIZE(pkt_size)((pkt_size) + NET_SKB_PAD) +#define MVPP2_RX_BUF_SIZE(pkt_size)((pkt_size) + MVPP2_SKB_HEADROOM) #define MVPP2_RX_TOTAL_SIZE(buf_size) ((buf_size) + MVPP2_SKB_SHINFO_SIZE) #define MVPP2_RX_MAX_PKT_SIZE(total_size) \ - ((total_size) - NET_SKB_PAD - MVPP2_SKB_SHINFO_SIZE) + ((total_size) - MVPP2_SKB_HEADROOM - MVPP2_SKB_SHINFO_SIZE) + +#define MVPP2_MAX_RX_BUF_SIZE (PAGE_SIZE - MVPP2_SKB_SHINFO_SIZE - MVPP2_SKB_HEADROOM) #define MVPP2_BIT_TO_BYTE(bit) ((bit) / 8) #define MVPP2_BIT_TO_WORD(bit) ((bit) / 32) @@ -690,9 +704,9 @@ enum mvpp2_prs_l3_cast { #define MVPP2_BM_COOKIE_POOL_OFFS 8 #define MVPP2_BM_COOKIE_CPU_OFFS 24 -#define MVPP2_BM_SHORT_FRAME_SIZE 512 -#define MVPP2_BM_LONG_FRAME_SIZE 2048 -#define MVPP2_BM_JUMBO_FRAME_SIZE 10240 +#define MVPP2_BM_SHORT_FRAME_SIZE 704 /* frame size 128 */ +#define MVPP2_BM_LONG_FRAME_SIZE 2240/* frame size 1664 */ +#define MVPP2_BM_JUMBO_FRAME_SIZE 10432 /* frame size 9856 */ /* BM short pool packet size * These value assure that for SWF the total number * of bytes allocated for each buffer will be 512 @@ -913,6 +927,8 @@ struct mvpp2_port { unsigned int ntxqs; struct net_device *dev; + struct bpf_prog *xdp_prog; + int pkt_size; /* Per-CPU port control */ @@ -932,6 +948,8 @@ struct mvpp2_port { struct mvpp2_pcpu_stats __percpu *stats; u64 *ethtool_stats; + unsigned long state; + /* Per-port work and its lock to gather hardware statistics */ struct mutex gather_stats_lock; struct delayed_work stats_work; diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 5d0e02c161a6..5f740c96a02c 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -36,6 +36,7 @@ #include #include #include +#include #include "mvpp2.h" #include "mvpp2_prs.h" @@ -105,6 +106,7 @@ mvpp2_create_page_pool(struct device *dev, int num, int len) .nid = NUMA_NO_NODE, .dev = dev, .dma_dir = DMA_FROM_DEVICE, + .offset = MVPP2_SKB_HEADROOM, .max_len = len, }; @@ -2462,7 +2464,7 @@ static int mvpp2_rxq_init(struct mvpp2_port *port, put_cpu(); /* Set Offset */ - mvpp2_rxq_offset_set(port, rxq->id, NET_SKB_PAD); + mvpp2_rxq_offset_set(port, rxq->id, MVPP2_SKB_HEADROOM); /* Set coalescing pkts and time */ mvpp2_rx_pkts_coal_set(port, rxq); @@ -3037,16 +3039,69 @@ static u32 mvpp2_skb_tx_csum(struct mvpp2_port *port, struct sk_buff *skb) return MVPP2_TXD_L4_CSUM_NOT | MVPP2_TXD_IP_CSUM_DISABLE; } +static int +mvpp2_run_xdp(struct mvpp2_port *port, struct mvpp2_rx_queue *rxq, + struct bpf_prog *prog, struct xdp_buff *xdp, + struct page_pool *pp) +{ + unsigned int len, sync, err; + struct page *page; + u32 ret, act; + + len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM; + act = bpf_prog_run_xdp(prog, xdp); + + /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */ + sync = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM; + sync = max(sync, len); + + switch (act) { + case XDP_PASS: + ret = MVPP2_XDP_PASS; + break; + case XDP_REDIRECT: + err = xdp_do_redirect(port->dev, xdp
[PATCH net-next v2 0/5] mvpp2: XDP support
From: Matteo Croce From: Matteo Croce Add XDP support to mvpp2. This series converts the driver to the page_pool API for RX buffer management, and adds native XDP support. XDP support comes with extack error reporting and statistics as well. These are the performance numbers, as measured by Sven: SKB fwd page pool: Rx bps 390.38 Mbps Rx pps 762.46 Kpps XDP fwd: Rx bps 1.39 Gbps Rx pps 2.72 Mpps XDP Drop: eth0: 12.9 Mpps eth1: 4.1 Mpps Matteo Croce (4): mvpp2: refactor BM pool init percpu code mvpp2: use page_pool allocator mvpp2: add basic XDP support mvpp2: XDP TX support Sven Auhagen (1): mvpp2: xdp ethtool stats drivers/net/ethernet/marvell/Kconfig | 1 + drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 57 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 730 -- 3 files changed, 718 insertions(+), 70 deletions(-) -- 2.26.2
[PATCH net-next v2 1/5] mvpp2: refactor BM pool init percpu code
From: Matteo Croce In mvpp2_swf_bm_pool_init_percpu(), a reference to a struct mvpp2_bm_pool is obtained traversing multiple structs, when a local variable already points to the same object. Fix it and, while at it, give the variable a meaningful name. Signed-off-by: Matteo Croce --- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 23 +-- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 212fc3b54310..027de7291f92 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -907,28 +907,27 @@ static int mvpp2_swf_bm_pool_init_shared(struct mvpp2_port *port) /* Initialize pools for swf, percpu buffers variant */ static int mvpp2_swf_bm_pool_init_percpu(struct mvpp2_port *port) { - struct mvpp2_bm_pool *p; + struct mvpp2_bm_pool *bm_pool; int i; for (i = 0; i < port->nrxqs; i++) { - p = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_SHORT, i, - mvpp2_pools[MVPP2_BM_SHORT].pkt_size); - if (!p) + bm_pool = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_SHORT, i, + mvpp2_pools[MVPP2_BM_SHORT].pkt_size); + if (!bm_pool) return -ENOMEM; - port->priv->bm_pools[i].port_map |= BIT(port->id); - mvpp2_rxq_short_pool_set(port, i, port->priv->bm_pools[i].id); + bm_pool->port_map |= BIT(port->id); + mvpp2_rxq_short_pool_set(port, i, bm_pool->id); } for (i = 0; i < port->nrxqs; i++) { - p = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_LONG, i + port->nrxqs, - mvpp2_pools[MVPP2_BM_LONG].pkt_size); - if (!p) + bm_pool = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_LONG, i + port->nrxqs, + mvpp2_pools[MVPP2_BM_LONG].pkt_size); + if (!bm_pool) return -ENOMEM; - port->priv->bm_pools[i + port->nrxqs].port_map |= BIT(port->id); - mvpp2_rxq_long_pool_set(port, i, - port->priv->bm_pools[i + port->nrxqs].id); + bm_pool->port_map |= BIT(port->id); + mvpp2_rxq_long_pool_set(port, i, bm_pool->id); } port->pool_long = NULL; -- 2.26.2
Re: [PATCH 1/1] mvpp2: xdp ethtool stats
Hi all, Sven agreed to add the v2 of this patch to the series it depends on, please drop this one. I'm sending a v2 of the series soon. Thanks, On Wed, Jul 1, 2020 at 9:48 PM David Miller wrote: > > From: Sven Auhagen > Date: Wed, 1 Jul 2020 17:30:44 +0200 > > > static void mvpp2_read_stats(struct mvpp2_port *port) > > { > > u64 *pstats; > > + const struct mvpp2_ethtool_counter *s; > > + struct mvpp2_pcpu_stats xdp_stats = {}; > > int i, q; > > Reverse christmas tree ordering here, please. > > @@ -3166,6 +3271,7 @@ > > struct xdp_frame **frames, u32 flags) > > { > > struct mvpp2_port *port = netdev_priv(dev); > > + struct mvpp2_pcpu_stats *stats = this_cpu_ptr(port->stats); > > int i, nxmit_byte = 0, nxmit = num_frame; > > u32 ret; > > u16 txq_id; > > Likewise. > > > @@ -3258,11 +3374,10 @@ > > enum dma_data_direction dma_dir; > > struct bpf_prog *xdp_prog; > > struct xdp_buff xdp; > > + struct mvpp2_pcpu_stats ps = {}; > > int rx_received; > > int rx_done = 0; > > u32 xdp_ret = 0; > > - u32 rcvd_pkts = 0; > > - u32 rcvd_bytes = 0; > > Likewise. > -- per aspera ad upstream
Re: [PATCH net-next 3/4] mvpp2: add basic XDP support
On Thu, Jul 2, 2020 at 11:14 AM Maciej Fijalkowski wrote: > > > + if (port->dev->mtu > ETH_DATA_LEN) { > > > + netdev_err(port->dev, "Jumbo frames are not supported by XDP, > > > current MTU %d.\n", > > > + port->dev->mtu); > > > > ditto > > Here I agree and for every other netdev_err within mvpp2_xdp_setup(). > Nice idea, I'll add extack error reporting where possible. -- per aspera ad upstream
Re: [PATCH net-next 2/4] mvpp2: use page_pool allocator
On Thu, Jul 2, 2020 at 9:31 AM wrote: > > Hi Matteo, > > Thanks for working on this! > :) > On Tue, Jun 30, 2020 at 08:09:28PM +0200, Matteo Croce wrote: > > From: Matteo Croce > > -static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool) > > +/* Returns a struct page if page_pool is set, otherwise a buffer */ > > +static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool, > > + struct page_pool *page_pool) > > { > > + if (page_pool) > > + return page_pool_alloc_pages(page_pool, > > + GFP_ATOMIC | __GFP_NOWARN); > > page_pool_dev_alloc_pages() can set these flags for you, instead of explicitly > calling them > Ok > > > > + if (priv->percpu_pools) { > > + err = xdp_rxq_info_reg(&rxq->xdp_rxq_short, port->dev, > > rxq->id); > > + if (err < 0) > > + goto err_free_dma; > > + > > + err = xdp_rxq_info_reg(&rxq->xdp_rxq_long, port->dev, > > rxq->id); > > + if (err < 0) > > + goto err_unregister_rxq_short; > > + > > + /* Every RXQ has a pool for short and another for long > > packets */ > > + err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq_short, > > + MEM_TYPE_PAGE_POOL, > > + > > priv->page_pool[rxq->logic_rxq]); > > + if (err < 0) > > + goto err_unregister_rxq_short; > > + > > + err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq_long, > > + MEM_TYPE_PAGE_POOL, > > + > > priv->page_pool[rxq->logic_rxq + > > + > > port->nrxqs]); > > + if (err < 0) > > + goto err_unregister_rxq_long; > > Since mvpp2_rxq_init() will return an error shouldn't we unregister the short > memory pool as well? > Ok, I'll add another label like: err_unregister_mem_rxq_short: xdp_rxq_info_unreg_mem_model(&rxq->xdp_rxq_short); -- per aspera ad upstream
[PATCH net-next 0/4] mvpp2: XDP support
From: Matteo Croce Add XDP support to mvpp2. This series converts the driver to the page_pool API for RX buffer management, and adds native XDP support. These are the performance numbers, as measured by Sven: SKB fwd page pool: Rx bps 390.38 Mbps Rx pps 762.46 Kpps XDP fwd: Rx bps 1.39 Gbps Rx pps 2.72 Mpps XDP Drop: eth0: 12.9 Mpps eth1: 4.1 Mpps Matteo Croce (4): mvpp2: refactor BM pool init percpu code mvpp2: use page_pool allocator mvpp2: add basic XDP support mvpp2: XDP TX support drivers/net/ethernet/marvell/Kconfig | 1 + drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 49 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 600 -- 3 files changed, 588 insertions(+), 62 deletions(-) -- 2.26.2
[PATCH net-next 4/4] mvpp2: XDP TX support
From: Matteo Croce Add the transmit part of XDP support, which includes: - support for XDP_TX in mvpp2_xdp() - .ndo_xdp_xmit hook for AF_XDP and XDP_REDIRECT with mvpp2 as destination mvpp2_xdp_submit_frame() is a generic function which is called by mvpp2_xdp_xmit_back() when doing XDP_TX, and by mvpp2_xdp_xmit when doing AF_XDP or XDP_REDIRECT target. The buffer allocation has been reworked to be able to map the buffers as DMA_FROM_DEVICE or DMA_BIDIRECTIONAL depending if a BPF program is loaded or not. Co-developed-by: Sven Auhagen Signed-off-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 13 +- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 312 +++--- 2 files changed, 280 insertions(+), 45 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index f351e41c9da6..c52955b33fab 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -1082,9 +1082,20 @@ struct mvpp2_rx_desc { }; }; +enum mvpp2_tx_buf_type { + MVPP2_TYPE_SKB, + MVPP2_TYPE_XDP_TX, + MVPP2_TYPE_XDP_NDO, +}; + struct mvpp2_txq_pcpu_buf { + enum mvpp2_tx_buf_type type; + /* Transmitted SKB */ - struct sk_buff *skb; + union { + struct xdp_frame *xdpf; + struct sk_buff *skb; + }; /* Physical address of transmitted buffer */ dma_addr_t dma; diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 864d4789a0b3..ffc2a220613d 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -97,7 +97,8 @@ static inline u32 mvpp2_cpu_to_thread(struct mvpp2 *priv, int cpu) } static struct page_pool * -mvpp2_create_page_pool(struct device *dev, int num, int len) +mvpp2_create_page_pool(struct device *dev, int num, int len, + enum dma_data_direction dma_dir) { struct page_pool_params pp_params = { /* internal DMA mapping in page_pool */ @@ -105,7 +106,7 @@ mvpp2_create_page_pool(struct device *dev, int num, int len) .pool_size = num, .nid = NUMA_NO_NODE, .dev = dev, - .dma_dir = DMA_FROM_DEVICE, + .dma_dir = dma_dir, .offset = MVPP2_SKB_HEADROOM, .max_len = len, }; @@ -299,12 +300,17 @@ static void mvpp2_txq_inc_get(struct mvpp2_txq_pcpu *txq_pcpu) static void mvpp2_txq_inc_put(struct mvpp2_port *port, struct mvpp2_txq_pcpu *txq_pcpu, - struct sk_buff *skb, - struct mvpp2_tx_desc *tx_desc) + void *data, + struct mvpp2_tx_desc *tx_desc, + enum mvpp2_tx_buf_type buf_type) { struct mvpp2_txq_pcpu_buf *tx_buf = txq_pcpu->buffs + txq_pcpu->txq_put_index; - tx_buf->skb = skb; + tx_buf->type = buf_type; + if (buf_type == MVPP2_TYPE_SKB) + tx_buf->skb = data; + else + tx_buf->xdpf = data; tx_buf->size = mvpp2_txdesc_size_get(port, tx_desc); tx_buf->dma = mvpp2_txdesc_dma_addr_get(port, tx_desc) + mvpp2_txdesc_offset_get(port, tx_desc); @@ -528,9 +534,6 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, int buf_num; u32 val; - if (priv->percpu_pools) - page_pool_destroy(priv->page_pool[bm_pool->id]); - buf_num = mvpp2_check_hw_buf_num(priv, bm_pool); mvpp2_bm_bufs_free(dev, priv, bm_pool, buf_num); @@ -546,6 +549,9 @@ static int mvpp2_bm_pool_destroy(struct device *dev, struct mvpp2 *priv, val |= MVPP2_BM_STOP_MASK; mvpp2_write(priv, MVPP2_BM_POOL_CTRL_REG(bm_pool->id), val); + if (priv->percpu_pools) + page_pool_destroy(priv->page_pool[bm_pool->id]); + dma_free_coherent(dev, bm_pool->size_bytes, bm_pool->virt_addr, bm_pool->dma_addr); @@ -581,9 +587,19 @@ static int mvpp2_bm_pools_init(struct device *dev, struct mvpp2 *priv) static int mvpp2_bm_init(struct device *dev, struct mvpp2 *priv) { + enum dma_data_direction dma_dir = DMA_FROM_DEVICE; int i, err, poolnum = MVPP2_BM_POOLS_NUM; + struct mvpp2_port *port; if (priv->percpu_pools) { + for (i = 0; i < priv->port_count; i++) { + port = priv->port_list[i]; + if (port->xdp_prog) { + dma_dir = DMA_BIDIRECTIONAL; + break; + } + } +
[PATCH net-next 3/4] mvpp2: add basic XDP support
From: Matteo Croce Add XDP native support. By now only XDP_DROP, XDP_PASS and XDP_REDIRECT verdicts are supported. Co-developed-by: Sven Auhagen Signed-off-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 28 ++- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 166 +- 2 files changed, 186 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index 4c16c9e9c1e5..f351e41c9da6 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -16,6 +16,18 @@ #include #include #include +#include +#include + +/* The PacketOffset field is measured in units of 32 bytes and is 3 bits wide, + * so the maximum offset is 7 * 32 = 224 + */ +#define MVPP2_SKB_HEADROOM min(max(XDP_PACKET_HEADROOM, NET_SKB_PAD), 224) + +#define MVPP2_XDP_PASS 0 +#define MVPP2_XDP_DROPPED BIT(0) +#define MVPP2_XDP_TX BIT(1) +#define MVPP2_XDP_REDIRBIT(2) /* Fifo Registers */ #define MVPP2_RX_DATA_FIFO_SIZE_REG(port) (0x00 + 4 * (port)) @@ -629,10 +641,12 @@ ALIGN((mtu) + MVPP2_MH_SIZE + MVPP2_VLAN_TAG_LEN + \ ETH_HLEN + ETH_FCS_LEN, cache_line_size()) -#define MVPP2_RX_BUF_SIZE(pkt_size)((pkt_size) + NET_SKB_PAD) +#define MVPP2_RX_BUF_SIZE(pkt_size)((pkt_size) + MVPP2_SKB_HEADROOM) #define MVPP2_RX_TOTAL_SIZE(buf_size) ((buf_size) + MVPP2_SKB_SHINFO_SIZE) #define MVPP2_RX_MAX_PKT_SIZE(total_size) \ - ((total_size) - NET_SKB_PAD - MVPP2_SKB_SHINFO_SIZE) + ((total_size) - MVPP2_SKB_HEADROOM - MVPP2_SKB_SHINFO_SIZE) + +#define MVPP2_MAX_RX_BUF_SIZE (PAGE_SIZE - MVPP2_SKB_SHINFO_SIZE - MVPP2_SKB_HEADROOM) #define MVPP2_BIT_TO_BYTE(bit) ((bit) / 8) #define MVPP2_BIT_TO_WORD(bit) ((bit) / 32) @@ -690,9 +704,9 @@ enum mvpp2_prs_l3_cast { #define MVPP2_BM_COOKIE_POOL_OFFS 8 #define MVPP2_BM_COOKIE_CPU_OFFS 24 -#define MVPP2_BM_SHORT_FRAME_SIZE 512 -#define MVPP2_BM_LONG_FRAME_SIZE 2048 -#define MVPP2_BM_JUMBO_FRAME_SIZE 10240 +#define MVPP2_BM_SHORT_FRAME_SIZE 704 /* frame size 128 */ +#define MVPP2_BM_LONG_FRAME_SIZE 2240/* frame size 1664 */ +#define MVPP2_BM_JUMBO_FRAME_SIZE 10432 /* frame size 9856 */ /* BM short pool packet size * These value assure that for SWF the total number * of bytes allocated for each buffer will be 512 @@ -913,6 +927,8 @@ struct mvpp2_port { unsigned int ntxqs; struct net_device *dev; + struct bpf_prog *xdp_prog; + int pkt_size; /* Per-CPU port control */ @@ -932,6 +948,8 @@ struct mvpp2_port { struct mvpp2_pcpu_stats __percpu *stats; u64 *ethtool_stats; + unsigned long state; + /* Per-port work and its lock to gather hardware statistics */ struct mutex gather_stats_lock; struct delayed_work stats_work; diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 9e2e8fb0a0b8..864d4789a0b3 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -36,6 +36,7 @@ #include #include #include +#include #include "mvpp2.h" #include "mvpp2_prs.h" @@ -105,6 +106,7 @@ mvpp2_create_page_pool(struct device *dev, int num, int len) .nid = NUMA_NO_NODE, .dev = dev, .dma_dir = DMA_FROM_DEVICE, + .offset = MVPP2_SKB_HEADROOM, .max_len = len, }; @@ -2463,7 +2465,7 @@ static int mvpp2_rxq_init(struct mvpp2_port *port, put_cpu(); /* Set Offset */ - mvpp2_rxq_offset_set(port, rxq->id, NET_SKB_PAD); + mvpp2_rxq_offset_set(port, rxq->id, MVPP2_SKB_HEADROOM); /* Set coalescing pkts and time */ mvpp2_rx_pkts_coal_set(port, rxq); @@ -3036,16 +3038,69 @@ static u32 mvpp2_skb_tx_csum(struct mvpp2_port *port, struct sk_buff *skb) return MVPP2_TXD_L4_CSUM_NOT | MVPP2_TXD_IP_CSUM_DISABLE; } +static int +mvpp2_run_xdp(struct mvpp2_port *port, struct mvpp2_rx_queue *rxq, + struct bpf_prog *prog, struct xdp_buff *xdp, + struct page_pool *pp) +{ + unsigned int len, sync, err; + struct page *page; + u32 ret, act; + + len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM; + act = bpf_prog_run_xdp(prog, xdp); + + /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */ + sync = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM; + sync = max(sync, len); + + switch (act) { + case XDP_PASS: + ret = MVPP2_XDP_PASS; + break; + case XDP_REDIRECT: + err = xdp_do_redirect(port->dev, xdp
[PATCH net-next 2/4] mvpp2: use page_pool allocator
From: Matteo Croce Use the page_pool API for memory management. This is a prerequisite for native XDP support. Tested-by: Sven Auhagen Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/Kconfig | 1 + drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 8 + .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 155 +++--- 3 files changed, 139 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig index cd8ddd1ef6f2..ef4f35ba077d 100644 --- a/drivers/net/ethernet/marvell/Kconfig +++ b/drivers/net/ethernet/marvell/Kconfig @@ -87,6 +87,7 @@ config MVPP2 depends on ARCH_MVEBU || COMPILE_TEST select MVMDIO select PHYLINK + select PAGE_POOL help This driver supports the network interface units in the Marvell ARMADA 375, 7K and 8K SoCs. diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h index 543a310ec102..4c16c9e9c1e5 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h @@ -15,6 +15,7 @@ #include #include #include +#include /* Fifo Registers */ #define MVPP2_RX_DATA_FIFO_SIZE_REG(port) (0x00 + 4 * (port)) @@ -820,6 +821,9 @@ struct mvpp2 { /* RSS Indirection tables */ struct mvpp2_rss_table *rss_tables[MVPP22_N_RSS_TABLES]; + + /* page_pool allocator */ + struct page_pool *page_pool[MVPP2_PORT_MAX_RXQ]; }; struct mvpp2_pcpu_stats { @@ -1161,6 +1165,10 @@ struct mvpp2_rx_queue { /* Port's logic RXQ number to which physical RXQ is mapped */ int logic_rxq; + + /* XDP memory accounting */ + struct xdp_rxq_info xdp_rxq_short; + struct xdp_rxq_info xdp_rxq_long; }; struct mvpp2_bm_pool { diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 027de7291f92..9e2e8fb0a0b8 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -95,6 +95,22 @@ static inline u32 mvpp2_cpu_to_thread(struct mvpp2 *priv, int cpu) return cpu % priv->nthreads; } +static struct page_pool * +mvpp2_create_page_pool(struct device *dev, int num, int len) +{ + struct page_pool_params pp_params = { + /* internal DMA mapping in page_pool */ + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, + .pool_size = num, + .nid = NUMA_NO_NODE, + .dev = dev, + .dma_dir = DMA_FROM_DEVICE, + .max_len = len, + }; + + return page_pool_create(&pp_params); +} + /* These accessors should be used to access: * * - per-thread registers, where each thread has its own copy of the @@ -327,17 +343,26 @@ static inline int mvpp2_txq_phys(int port, int txq) return (MVPP2_MAX_TCONT + port) * MVPP2_MAX_TXQ + txq; } -static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool) +/* Returns a struct page if page_pool is set, otherwise a buffer */ +static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool, + struct page_pool *page_pool) { + if (page_pool) + return page_pool_alloc_pages(page_pool, +GFP_ATOMIC | __GFP_NOWARN); + if (likely(pool->frag_size <= PAGE_SIZE)) return netdev_alloc_frag(pool->frag_size); - else - return kmalloc(pool->frag_size, GFP_ATOMIC); + + return kmalloc(pool->frag_size, GFP_ATOMIC); } -static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, void *data) +static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, + struct page_pool *page_pool, void *data) { - if (likely(pool->frag_size <= PAGE_SIZE)) + if (page_pool) + page_pool_put_full_page(page_pool, virt_to_head_page(data), false); + else if (likely(pool->frag_size <= PAGE_SIZE)) skb_free_frag(data); else kfree(data); @@ -442,6 +467,7 @@ static void mvpp2_bm_bufs_get_addrs(struct device *dev, struct mvpp2 *priv, static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv, struct mvpp2_bm_pool *bm_pool, int buf_num) { + struct page_pool *pp = NULL; int i; if (buf_num > bm_pool->buf_num) { @@ -450,6 +476,9 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv, buf_num = bm_pool->buf_num; } + if (priv->percpu_pools) + pp = priv->page_pool[bm_pool->id]; + for (i = 0; i < buf_num; i++) { dma_addr_t buf_dma_addr; phys_addr_t buf_phys_addr; @@ -458,14 +487,15 @@ static void mvpp2_bm_bufs_free(struct device *dev, stru
[PATCH net-next 1/4] mvpp2: refactor BM pool init percpu code
From: Matteo Croce In mvpp2_swf_bm_pool_init_percpu(), a reference to a struct mvpp2_bm_pool is obtained traversing multiple structs, when a local variable already points to the same object. Fix it and, while at it, give the variable a meaningful name. Signed-off-by: Matteo Croce --- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 23 +-- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 212fc3b54310..027de7291f92 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -907,28 +907,27 @@ static int mvpp2_swf_bm_pool_init_shared(struct mvpp2_port *port) /* Initialize pools for swf, percpu buffers variant */ static int mvpp2_swf_bm_pool_init_percpu(struct mvpp2_port *port) { - struct mvpp2_bm_pool *p; + struct mvpp2_bm_pool *bm_pool; int i; for (i = 0; i < port->nrxqs; i++) { - p = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_SHORT, i, - mvpp2_pools[MVPP2_BM_SHORT].pkt_size); - if (!p) + bm_pool = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_SHORT, i, + mvpp2_pools[MVPP2_BM_SHORT].pkt_size); + if (!bm_pool) return -ENOMEM; - port->priv->bm_pools[i].port_map |= BIT(port->id); - mvpp2_rxq_short_pool_set(port, i, port->priv->bm_pools[i].id); + bm_pool->port_map |= BIT(port->id); + mvpp2_rxq_short_pool_set(port, i, bm_pool->id); } for (i = 0; i < port->nrxqs; i++) { - p = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_LONG, i + port->nrxqs, - mvpp2_pools[MVPP2_BM_LONG].pkt_size); - if (!p) + bm_pool = mvpp2_bm_pool_use_percpu(port, MVPP2_BM_LONG, i + port->nrxqs, + mvpp2_pools[MVPP2_BM_LONG].pkt_size); + if (!bm_pool) return -ENOMEM; - port->priv->bm_pools[i + port->nrxqs].port_map |= BIT(port->id); - mvpp2_rxq_long_pool_set(port, i, - port->priv->bm_pools[i + port->nrxqs].id); + bm_pool->port_map |= BIT(port->id); + mvpp2_rxq_long_pool_set(port, i, bm_pool->id); } port->pool_long = NULL; -- 2.26.2
Re: [PATCH 1/1] mvpp2: ethtool rxtx stats fix
On Sun, Jun 14, 2020 at 7:19 AM Sven Auhagen wrote: > > The ethtool rx and tx queue statistics are reporting wrong values. > Fix reading out the correct ones. > > Signed-off-by: Sven Auhagen Hi Sven, seems to work as expected now: # ethtool -S eth2 |grep rxq rxq_0_desc_enqueue: 983 rxq_0_queue_full_drops: 0 rxq_0_packets_early_drops: 0 rxq_0_packets_bm_drops: 0 rxq_1_desc_enqueue: 14 rxq_1_queue_full_drops: 0 rxq_1_packets_early_drops: 0 rxq_1_packets_bm_drops: 0 rxq_2_desc_enqueue: 12 rxq_2_queue_full_drops: 0 rxq_2_packets_early_drops: 0 rxq_2_packets_bm_drops: 0 rxq_3_desc_enqueue: 4 rxq_3_queue_full_drops: 0 rxq_3_packets_early_drops: 0 rxq_3_packets_bm_drops: 0 If you manage to find the commit which introduced this tag, please add a Fixes tag. Thanks, -- Matteo Croce perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Re: [PATCH 1/1] mvpp2: remove module bugfix
On Sun, Jun 14, 2020 at 7:20 AM Sven Auhagen wrote: > > The remove function does not destroy all > BM Pools when per cpu pool is active. > > When reloading the mvpp2 as a module the BM Pools > are still active in hardware and due to the bug > have twice the size now old + new. > > This eventually leads to a kernel crash. > > Signed-off-by: Sven Auhagen Hi Sven, Nice fix, if you think that I introduced it in 7d04b0b13b1175ce0c4bdc77f1278c1f120f874f, please add a Fixes tag. -- Matteo Croce perl -e 'for($t=0;;$t++){print chr($t*($t>>8|$t>>13)&255)}' |aplay
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Wed, May 20, 2020 at 1:11 PM Russell King - ARM Linux admin wrote: > > On Tue, May 19, 2020 at 07:05:34PM +0200, Matteo Croce wrote: > > On Tue, 19 May 2020 12:05:20 +0200 > > Matteo Croce wrote: > > > > Hi, > > > > The patch seems to work. I'm generating traffic with random MAC and IP > > addresses, to have many flows: > > > > # tcpdump -tenni eth2 > > 9a:a9:b1:3a:b1:6b > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.4.0 > 192.168.0.4.0: UDP, length 12 > > 9e:92:fd:f8:7f:0a > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.4.0 > 192.168.0.4.0: UDP, length 12 > > 66:b7:11:8a:c2:1f > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.1.0 > 192.168.0.1.0: UDP, length 12 > > 7a:ba:58:bd:9a:62 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.1.0 > 192.168.0.1.0: UDP, length 12 > > 7e:78:a9:97:70:3a > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.2.0 > 192.168.0.2.0: UDP, length 12 > > b2:81:91:34:ce:42 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.2.0 > 192.168.0.2.0: UDP, length 12 > > 2a:05:52:d0:d9:3f > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.3.0 > 192.168.0.3.0: UDP, length 12 > > ee:ee:47:35:fa:81 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: > > 10.0.0.3.0 > 192.168.0.3.0: UDP, length 12 > > > > This is the default rate, with rxhash off: > > > > # utraf eth2 > > tx: 0 bps 0 pps rx: 397.4 Mbps 827.9 Kpps > > tx: 0 bps 0 pps rx: 396.3 Mbps 825.7 Kpps > > tx: 0 bps 0 pps rx: 396.6 Mbps 826.3 Kpps > > tx: 0 bps 0 pps rx: 396.5 Mbps 826.1 Kpps > > > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ > > COMMAND > > 9 root 20 0 0 0 0 R 99.7 0.0 7:02.58 > > ksoftirqd/0 > > 15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 > > ksoftirqd/1 > > 20 root 20 0 0 0 0 S 0.0 0.0 2:01.48 > > ksoftirqd/2 > > 25 root 20 0 0 0 0 S 0.0 0.0 0:32.86 > > ksoftirqd/3 > > > > and this with rx hashing enabled: > > > > # ethtool -K eth2 rxhash on > > # utraf eth2 > > tx: 0 bps 0 pps rx: 456.4 Mbps 950.8 Kpps > > tx: 0 bps 0 pps rx: 458.4 Mbps 955.0 Kpps > > tx: 0 bps 0 pps rx: 457.6 Mbps 953.3 Kpps > > tx: 0 bps 0 pps rx: 462.2 Mbps 962.9 Kpps > > > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ > > COMMAND > > 20 root 20 0 0 0 0 R 0.7 0.0 2:02.34 > > ksoftirqd/2 > > 25 root 20 0 0 0 0 S 0.3 0.0 0:33.25 > > ksoftirqd/3 > > 9 root 20 0 0 0 0 S 0.0 0.0 7:52.57 > > ksoftirqd/0 > > 15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 > > ksoftirqd/1 > > > > > > The throughput doesn't increase so much, maybe we hit an HW limit of > > the gigabit port. The interesting thing is how the global CPU usage > > drops from 25% to 1%. > > I can't explain this, it could be due to the reduced contention? > > Hi Matteo, > > Can I take that as a Tested-by ? > > Thanks. > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTC for 0.8m (est. 1762m) line in suburbia: sync at 13.1Mbps down 424kbps up > Tested-by: Matteo Croce probably also: Reported-by: Matteo Croce Thanks, -- Matteo Croce per aspera ad upstream
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Tue, 19 May 2020 12:05:20 +0200 Matteo Croce wrote: > On Tue, May 19, 2020 at 11:54 AM Russell King - ARM Linux admin > wrote: > > > > On Sat, May 09, 2020 at 09:20:50PM +0100, Russell King - ARM Linux > > admin wrote: > > > On Sat, May 09, 2020 at 08:52:46PM +0100, Russell King - ARM > > > Linux admin wrote: > > > > It is highly likely that 895586d5dc32 is responsible for this > > > > breakage. I've been investigating this afternoon, and what I've > > > > found, comparing a kernel without 895586d5dc32 and with > > > > 895586d5dc32 applied is: > > > > > > > > - The table programmed into the hardware via > > > > mvpp22_rss_fill_table() appears to be identical with or without > > > > the commit. > > > > > > > > - When rxhash is enabled on eth2, mvpp2_rss_port_c2_enable() > > > > reports that c2.attr[0] and c2.attr[2] are written back > > > > containing: > > > > > > > >- with 895586d5dc32, failing:0020 4000 > > > >- without 895586d5dc32, working: 0400 4000 > > > > > > > > - When disabling rxhash, c2.attr[0] and c2.attr[2] are written > > > > back as: > > > > > > > >0400 > > > > > > > > The second value represents the MVPP22_CLS_C2_ATTR2_RSS_EN bit, > > > > the first value is the queue number, which comprises two > > > > fields. The high 5 bits are 24:29 and the low three are 21:23 > > > > inclusive. This comes from: > > > > > > > >c2.attr[0] = MVPP22_CLS_C2_ATTR0_QHIGH(qh) | > > > > MVPP22_CLS_C2_ATTR0_QLOW(ql); > > > > #define MVPP22_CLS_C2_ATTR0_QHIGH(qh) (((qh) & 0x1f) > > > > << 24) #define MVPP22_CLS_C2_ATTR0_QLOW(ql)(((ql) & > > > > 0x7) << 21) > > > > > > > > So, the working case gives eth2 a queue id of 4.0, or 32 as per > > > > port->first_rxq, and the non-working case a queue id of 0.1, or > > > > 1. > > > > > > > > The allocation of queue IDs seems to be in mvpp2_port_probe(): > > > > > > > > if (priv->hw_version == MVPP21) > > > > port->first_rxq = port->id * port->nrxqs; > > > > else > > > > port->first_rxq = port->id * > > > > priv->max_port_rxqs; > > > > > > > > Where: > > > > > > > > if (priv->hw_version == MVPP21) > > > > priv->max_port_rxqs = 8; > > > > else > > > > priv->max_port_rxqs = 32; > > > > > > > > Making the port 0 (eth0 / eth1) have port->first_rxq = 0, and > > > > port 1 (eth2) be 32. It seems the idea is that the first 32 > > > > queues belong to port 0, the second 32 queues belong to port 1, > > > > etc. > > > > > > > > mvpp2_rss_port_c2_enable() gets the queue number from it's > > > > parameter, 'ctx', which comes from mvpp22_rss_ctx(port, 0). > > > > This returns port->rss_ctx[0]. > > > > > > > > mvpp22_rss_context_create() is responsible for allocating that, > > > > which it does by looking for an unallocated priv->rss_tables[] > > > > pointer. This table is shared amongst all ports on the CP > > > > silicon. > > > > > > > > When we write the tables in mvpp22_rss_fill_table(), the RSS > > > > table entry is defined by: > > > > > > > > u32 sel = MVPP22_RSS_INDEX_TABLE(rss_ctx) | > > > > MVPP22_RSS_INDEX_TABLE_ENTRY(i); > > > > > > > > where rss_ctx is the context ID (queue number) and i is the > > > > index in the table. > > > > > > > > #define MVPP22_RSS_INDEX_TABLE_ENTRY(idx) (idx) > > > > #define MVPP22_RSS_INDEX_TABLE(idx) ((idx) << 8) > > > > #define MVPP22_RSS_INDEX_QUEUE(idx) ((idx) << 16) > > > > > > > > If we look at what is written: > > > > > > > > - The first table to be written has "sel" values of > > > > ..001f, containing values 0..3. This appears to be > > > > for eth1. This is table 0, RX queue number 0. > > > > - The second table has
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
the MVPP22_RXQ2RSS_TABLE > > > register. Before 895586d5dc32, it was: > > > > > >mvpp2_write(priv, MVPP22_RSS_INDEX, > > >MVPP22_RSS_INDEX_QUEUE(port->first_rxq)); > > >mvpp2_write(priv, MVPP22_RXQ2RSS_TABLE, > > >MVPP22_RSS_TABLE_POINTER(port->id)); > > > > > > and after: > > > > > >mvpp2_write(priv, MVPP22_RSS_INDEX, MVPP22_RSS_INDEX_QUEUE(ctx)); > > >mvpp2_write(priv, MVPP22_RXQ2RSS_TABLE, > > > MVPP22_RSS_TABLE_POINTER(ctx)); > > > > > > So, before the commit, for eth2, that would've contained '32' for the > > > index and '1' for the table pointer - mapping queue 32 to table 1. > > > Remember that this is queue-high.queue-low of 4.0. > > > > > > After the commit, we appear to map queue 1 to table 1. That again > > > looks fine on the face of it. > > > > > > Section 9.3.1 of the A8040 manual seems indicate the reason that the > > > queue number is separated. queue-low seems to always come from the > > > classifier, whereas queue-high can be from the ingress physical port > > > number or the classifier depending on the MVPP2_CLS_SWFWD_PCTRL_REG. > > > > > > We set the port bit in MVPP2_CLS_SWFWD_PCTRL_REG, meaning that queue-high > > > comes from the MVPP2_CLS_SWFWD_P2HQ_REG() register... and this seems to > > > be where our bug comes from. > > > > > > mvpp2_cls_oversize_rxq_set() sets this up as: > > > > > > mvpp2_write(port->priv, MVPP2_CLS_SWFWD_P2HQ_REG(port->id), > > > (port->first_rxq >> MVPP2_CLS_OVERSIZE_RXQ_LOW_BITS)); > > > > > > val = mvpp2_read(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG); > > > val |= MVPP2_CLS_SWFWD_PCTRL_MASK(port->id); > > > mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val); > > > > > > so, the queue-high for eth2 is _always_ 4, meaning that only queues > > > 32 through 39 inclusive are available to eth2. Yet, we're trying to > > > tell the classifier to set queue-high, which will be ignored, to zero. > > > > > > So we end up directing traffic from eth2 not to queue 1, but to queue > > > 33, and then we tell it to look up queue 33 in the RSS table. However, > > > RSS table has not been programmed for queue 33, and so it ends up > > > (presumably) dropping the packets. > > > > > > It seems that mvpp22_rss_context_create() doesn't take account of the > > > fact that the upper 5 bits of the queue ID can't actually be changed > > > due to the settings in mvpp2_cls_oversize_rxq_set(), _or_ it seems > > > that mvpp2_cls_oversize_rxq_set() has been missed in this commit. > > > Either way, these two functions mutually disagree with what queue > > > number should be used. > > > > > > So, 895586d5dc32 is indeed the cause of this problem. > > > > Looking deeper into what mvpp2_cls_oversize_rxq_set() and the MTU > > validation is doing, it seems that MVPP2_CLS_SWFWD_P2HQ_REG() is > > used for at least a couple of things. > > > > So, with the classifier having had RSS enabled and directing eth2 > > traffic to queue 1, we can not ignore the fact that we may have > > packets appearing on queue 32 for this port. > > > > One of the things that queue 32 will be used for is if an over-sized > > packet attempts to egress through eth2 - it seems that the A8040 has > > the ability to forward frames between its ports. However, afaik we > > don't support that feature, and the kernel restricts the packet size, > > so we should never violate the MTU validator and end up with such a > > packet. In any case, _if_ we were to attempt to transmit an oversized > > packet, we have no support in the kernel to deal with that appearing > > in the port's receive queue. > > > > Maybe it would be safe to clear the MVPP2_CLS_SWFWD_PCTRL_MASK() bit? > > > > My testing seems to confirm my findings above - clearing this bit > > means that if I enable rxhash on eth2, the interface can then pass > > traffic, as we are now directing traffic to RX queue 1 rather than > > queue 33. Traffic still seems to work with rxhash off as well. > > > > So, I think it's clear where the problem lies, but not what the correct > > solution is; someone with more experience of packet classifiers (this > > one?) needs to look at this - this is my first venture into these > > things, and certainly the first time I've traced through how this is > > trying to work (or not)... > > This is what I was using here to work around the problem, and what I > mentioned above. > > diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c > b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c > index fd221d88811e..0dd3b65822dd 100644 > --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c > +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c > @@ -1058,7 +1058,7 @@ void mvpp2_cls_oversize_rxq_set(struct mvpp2_port *port) > (port->first_rxq >> MVPP2_CLS_OVERSIZE_RXQ_LOW_BITS)); > > val = mvpp2_read(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG); > - val |= MVPP2_CLS_SWFWD_PCTRL_MASK(port->id); > + val &= ~MVPP2_CLS_SWFWD_PCTRL_MASK(port->id); > mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val); > } > > Hi, I will try this change and let you know if it works. Thanks -- Matteo Croce per aspera ad upstream
[PATCH bpf] samples: bpf: fix build error
GCC 10 is very strict about symbol clash, and lwt_len_hist_user contains a symbol which clashes with libbpf: /usr/bin/ld: samples/bpf/lwt_len_hist_user.o:(.bss+0x0): multiple definition of `bpf_log_buf'; samples/bpf/bpf_load.o:(.bss+0x8c0): first defined here collect2: error: ld returned 1 exit status bpf_log_buf here seems to be a leftover, so removing it. Signed-off-by: Matteo Croce --- samples/bpf/lwt_len_hist_user.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c index 587b68b1f8dd..430a4b7e353e 100644 --- a/samples/bpf/lwt_len_hist_user.c +++ b/samples/bpf/lwt_len_hist_user.c @@ -15,8 +15,6 @@ #define MAX_INDEX 64 #define MAX_STARS 38 -char bpf_log_buf[BPF_LOG_BUF_SIZE]; - static void stars(char *str, long val, long max, int width) { int i; -- 2.26.2
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Sat, May 9, 2020 at 4:49 PM Russell King - ARM Linux admin wrote: > > On Sat, May 09, 2020 at 02:51:05PM +0100, Russell King - ARM Linux admin > wrote: > > On Sat, May 09, 2020 at 03:14:05PM +0200, Matteo Croce wrote: > > > On Sat, May 9, 2020 at 1:45 PM Russell King - ARM Linux admin > > > wrote: > > > > > > > > On Sat, May 09, 2020 at 11:15:58AM +, Stefan Chulski wrote: > > > > > > > > > > > > > > > > -Original Message- > > > > > > From: Matteo Croce > > > > > > Sent: Saturday, May 9, 2020 3:13 AM > > > > > > To: David S . Miller > > > > > > Cc: Maxime Chevallier ; netdev > > > > > > ; LKML ; > > > > > > Antoine > > > > > > Tenart ; Thomas Petazzoni > > > > > > ; gregory.clem...@bootlin.com; > > > > > > miquel.ray...@bootlin.com; Nadav Haklai ; Stefan > > > > > > Chulski ; Marcin Wojtas ; > > > > > > Linux > > > > > > ARM ; Russell King - ARM > > > > > > Linux admin > > > > > > > > > > > > Subject: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS > > > > > > contexts to > > > > > > handle RSS tables > > > > > > > > > > > > Hi, > > > > > > > > > > > > What do you think about temporarily disabling it like this? > > > > > > > > > > > > --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > > > > > +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > > > > > @@ -5775,7 +5775,8 @@ static int mvpp2_port_probe(struct > > > > > > platform_device > > > > > > *pdev, > > > > > > NETIF_F_HW_VLAN_CTAG_FILTER; > > > > > > > > > > > > if (mvpp22_rss_is_supported()) { > > > > > > - dev->hw_features |= NETIF_F_RXHASH; > > > > > > + if (port->phy_interface != PHY_INTERFACE_MODE_SGMII) > > > > > > + dev->hw_features |= NETIF_F_RXHASH; > > > > > > dev->features |= NETIF_F_NTUPLE; > > > > > > } > > > > > > > > > > > > > > > > > > David, is this "workaround" too bad to get accepted? > > > > > > > > > > Not sure that RSS related to physical interface(SGMII), better just > > > > > remove NETIF_F_RXHASH as "workaround". > > > > > > > > Hmm, I'm not sure this is the right way forward. This patch has the > > > > effect of disabling: > > > > > > > > d33ec4525007 ("net: mvpp2: add an RSS classification step for each > > > > flow") > > > > > > > > but the commit you're pointing at which caused the regression is: > > > > > > > > 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables") > > > > > > > > > > > > > > Hi, > > > > > > When git bisect pointed to 895586d5dc32 ("net: mvpp2: cls: Use RSS > > > contexts to handle RSS tables"), which was merged > > > almost an year after d33ec4525007 ("net: mvpp2: add an RSS > > > classification step for each flow"), so I assume that between these > > > two commits either the feature was working or it was disable and we > > > didn't notice > > > > > > Without knowing what was happening, which commit should my Fixes tag > > > point to? > > > > Let me make sure that I get this clear: > > > > - Prior to 895586d5dc32, you can turn on and off rxhash without issue > > on any port. > > - After 895586d5dc32, turning rxhash on eth2 prevents reception. > > > > Prior to 895586d5dc32, with rxhash on, it looks like hashing using > > CRC32 is supported but only one context. So, if it's possible to > > enable rxhash on any port on the mcbin without 895586d5dc32, and the > > port continues to work, I'd say the bug was introduced by > > 895586d5dc32. > > > > Of course, that would be reinforced if there was a measurable > > difference in performance due to rxhash on each port. > > I've just run this test, but I can detect no difference in performance > with or without 895586d5dc32 o
[PATCH net] mvpp2: enable rxhash only on the first port
Currently rxhash only works on the first port of the CP (Communication Processor). Enabling it on other ports completely blocks packet reception. This patch only adds rxhash as supported feature to the first port, so rxhash can't be enabled on other ports: # ethtool -K eth0 rxhash on # ethtool -K eth1 rxhash on # ethtool -K eth2 rxhash on Cannot change receive-hashing Could not change any device features # ethtool -K eth3 rxhash on Cannot change receive-hashing Could not change any device features Fixes: 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables") Signed-off-by: Matteo Croce --- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 2b5dad2ec650..ba71583c7ae3 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -5423,7 +5423,8 @@ static int mvpp2_port_probe(struct platform_device *pdev, NETIF_F_HW_VLAN_CTAG_FILTER; if (mvpp22_rss_is_supported()) { - dev->hw_features |= NETIF_F_RXHASH; + if (port->id == 0) + dev->hw_features |= NETIF_F_RXHASH; dev->features |= NETIF_F_NTUPLE; } -- 2.26.2
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Sat, May 9, 2020 at 1:45 PM Russell King - ARM Linux admin wrote: > > On Sat, May 09, 2020 at 11:15:58AM +, Stefan Chulski wrote: > > > > > > > -Original Message----- > > > From: Matteo Croce > > > Sent: Saturday, May 9, 2020 3:13 AM > > > To: David S . Miller > > > Cc: Maxime Chevallier ; netdev > > > ; LKML ; Antoine > > > Tenart ; Thomas Petazzoni > > > ; gregory.clem...@bootlin.com; > > > miquel.ray...@bootlin.com; Nadav Haklai ; Stefan > > > Chulski ; Marcin Wojtas ; Linux > > > ARM ; Russell King - ARM Linux admin > > > > > > Subject: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts > > > to > > > handle RSS tables > > > > > > Hi, > > > > > > What do you think about temporarily disabling it like this? > > > > > > --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > > +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > > @@ -5775,7 +5775,8 @@ static int mvpp2_port_probe(struct platform_device > > > *pdev, > > > NETIF_F_HW_VLAN_CTAG_FILTER; > > > > > > if (mvpp22_rss_is_supported()) { > > > - dev->hw_features |= NETIF_F_RXHASH; > > > + if (port->phy_interface != PHY_INTERFACE_MODE_SGMII) > > > + dev->hw_features |= NETIF_F_RXHASH; > > > dev->features |= NETIF_F_NTUPLE; > > > } > > > > > > > > > David, is this "workaround" too bad to get accepted? > > > > Not sure that RSS related to physical interface(SGMII), better just remove > > NETIF_F_RXHASH as "workaround". > > Hmm, I'm not sure this is the right way forward. This patch has the > effect of disabling: > > d33ec4525007 ("net: mvpp2: add an RSS classification step for each flow") > > but the commit you're pointing at which caused the regression is: > > 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables") > > Hi, When git bisect pointed to 895586d5dc32 ("net: mvpp2: cls: Use RSS contexts to handle RSS tables"), which was merged almost an year after d33ec4525007 ("net: mvpp2: add an RSS classification step for each flow"), so I assume that between these two commits either the feature was working or it was disable and we didn't notice Without knowing what was happening, which commit should my Fixes tag point to? Regards, -- Matteo Croce per aspera ad upstream
Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Sat, May 9, 2020 at 1:16 PM Stefan Chulski wrote: > > Hi, > > > > What do you think about temporarily disabling it like this? > > > > --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c > > @@ -5775,7 +5775,8 @@ static int mvpp2_port_probe(struct platform_device > > *pdev, > > NETIF_F_HW_VLAN_CTAG_FILTER; > > > > if (mvpp22_rss_is_supported()) { > > - dev->hw_features |= NETIF_F_RXHASH; > > + if (port->phy_interface != PHY_INTERFACE_MODE_SGMII) > > + dev->hw_features |= NETIF_F_RXHASH; > > dev->features |= NETIF_F_NTUPLE; > > } > > > > > > David, is this "workaround" too bad to get accepted? > > Not sure that RSS related to physical interface(SGMII), better just remove > NETIF_F_RXHASH as "workaround". > > Stefan. Hi, The point is that RXHASH works fine on all interfaces, but on the gigabit one (eth2 usually). And on the 10 gbit interface is very very effective, the throughput goes 4x when enabled, so it would be a big drawback to disable it on all interfaces. Honestly I don't have any 2.5 gbit hardware to test it on eth3, so I don't know if rxhash actually only works on the first interface of a unit (so eth0 and eth1), or if it just doesn't work on the gigabit one. If someone could test it on the 2.5 gbit port, this will be helpful. Regards, -- Matteo Croce per aspera ad upstream
Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS contexts to handle RSS tables
On Thu, Apr 23, 2020 at 7:00 PM Russell King - ARM Linux admin wrote: > > On Tue, Apr 14, 2020 at 01:43:02AM +0200, Matteo Croce wrote: > > On Tue, Apr 14, 2020 at 1:21 AM Maxime Chevallier > > wrote: > > > > > > The PPv2 controller has 8 RSS tables that are shared across all ports on > > > a given PPv2 instance. The previous implementation allocated one table > > > per port, leaving others unused. > > > > > > By using RSS contexts, we can make use of multiple RSS tables per > > > port, one being the default table (always id 0), the other ones being > > > used as destinations for flow steering, in the same way as rx rings. > > > > > > This commit introduces RSS contexts management in the PPv2 driver. We > > > always reserve one table per port, allocated when the port is probed. > > > > > > The global table list is stored in the struct mvpp2, as it's a global > > > resource. Each port then maintains a list of indices in that global > > > table, that way each port can have it's own numbering scheme starting > > > from 0. > > > > > > One limitation that seems unavoidable is that the hashing parameters are > > > shared across all RSS contexts for a given port. Hashing parameters for > > > ctx 0 will be applied to all contexts. > > > > > > Signed-off-by: Maxime Chevallier > > > > Hi all, > > > > I noticed that enabling rxhash blocks the RX on my Macchiatobin. It > > works fine with the 10G ports (the RX rate goes 4x up) but it > > completely kills the gigabit interface. > > > > # 10G port > > root@macchiatobin:~# iperf3 -c 192.168.0.2 > > Connecting to host 192.168.0.2, port 5201 > > [ 5] local 192.168.0.1 port 42394 connected to 192.168.0.2 port 5201 > > [ ID] Interval Transfer Bitrate Retr Cwnd > > [ 5] 0.00-1.00 sec 941 MBytes 7.89 Gbits/sec 4030250 KBytes > > [ 5] 1.00-2.00 sec 933 MBytes 7.82 Gbits/sec 4393240 KBytes > > root@macchiatobin:~# ethtool -K eth0 rxhash on > > root@macchiatobin:~# iperf3 -c 192.168.0.2 > > Connecting to host 192.168.0.2, port 5201 > > [ 5] local 192.168.0.1 port 42398 connected to 192.168.0.2 port 5201 > > [ ID] Interval Transfer Bitrate Retr Cwnd > > [ 5] 0.00-1.00 sec 860 MBytes 7.21 Gbits/sec 428410 KBytes > > [ 5] 1.00-2.00 sec 859 MBytes 7.20 Gbits/sec 185563 KBytes > > > > # gigabit port > > root@macchiatobin:~# iperf3 -c turbo > > Connecting to host turbo, port 5201 > > [ 5] local 192.168.85.42 port 45144 connected to 192.168.85.6 port 5201 > > [ ID] Interval Transfer Bitrate Retr Cwnd > > [ 5] 0.00-1.00 sec 113 MBytes 948 Mbits/sec0407 KBytes > > [ 5] 1.00-2.00 sec 112 MBytes 942 Mbits/sec0428 KBytes > > root@macchiatobin:~# ethtool -K eth2 rxhash on > > root@macchiatobin:~# iperf3 -c turbo > > iperf3: error - unable to connect to server: Resource temporarily > > unavailable > > > > I've bisected and it seems that this commit causes the issue. I tried > > to revert it on nex-next as a second test, but the code has changed a > > lot much since, generating too much conflicts. > > Can you have a look into this? > > This behaviour on eth2 is confirmed here on v5.6. Turning on rxhash > appears to prevent eth2 working. > > Maxime, please look into this regression, thanks. > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up > Hi, What do you think about temporarily disabling it like this? --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -5775,7 +5775,8 @@ static int mvpp2_port_probe(struct platform_device *pdev, NETIF_F_HW_VLAN_CTAG_FILTER; if (mvpp22_rss_is_supported()) { - dev->hw_features |= NETIF_F_RXHASH; + if (port->phy_interface != PHY_INTERFACE_MODE_SGMII) + dev->hw_features |= NETIF_F_RXHASH; dev->features |= NETIF_F_NTUPLE; } David, is this "workaround" too bad to get accepted? Bye, -- Matteo Croce per aspera ad upstream
Re: [PATCH net v2 0/2] Fix 88x3310 leaving power save mode
On Wed, Apr 15, 2020 at 1:48 AM David Miller wrote: > > From: Russell King - ARM Linux admin > Date: Tue, 14 Apr 2020 20:47:53 +0100 > > > This series fixes a problem with the 88x3310 PHY on Macchiatobin > > coming out of powersave mode noticed by Matteo Croce. It seems > > that certain PHY firmwares do not properly exit powersave mode, > > resulting in a fibre link not coming up. > > > > The solution appears to be to soft-reset the PHY after clearing > > the powersave bit. > > > > We add support for reporting the PHY firmware version to the kernel > > log, and use it to trigger this new behaviour if we have v0.3.x.x > > or more recent firmware on the PHY. This, however, is a guess as > > the firmware revision documentation does not mention this issue, > > and we know that v0.2.1.0 works without this fix but v0.3.3.0 and > > later does not. > > Series applied, thanks. > Hi, should we queue this to -stable? The 10 gbit ports don't work without this fix. Regards, -- Matteo Croce per aspera ad upstream
[PATCH net-next 1/4] flow_dissector: add meaningful comments
Documents two piece of code which can't be understood at a glance. Signed-off-by: Matteo Croce --- include/net/flow_dissector.h | 1 + net/core/flow_dissector.c| 6 ++ 2 files changed, 7 insertions(+) diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index 90bd210be060..7747af3cc500 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -282,6 +282,7 @@ struct flow_keys { struct flow_dissector_key_vlan cvlan; struct flow_dissector_key_keyid keyid; struct flow_dissector_key_ports ports; + /* 'addrs' must be the last member */ struct flow_dissector_key_addrs addrs; }; diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 7c09d87d3269..affde70dad47 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -1374,6 +1374,9 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow) { size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs); BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32)); + /* flow.addrs MUST be the last member in struct flow_keys because +* different L3 protocols have different address length +*/ BUILD_BUG_ON(offsetof(typeof(*flow), addrs) != sizeof(*flow) - sizeof(flow->addrs)); @@ -1421,6 +1424,9 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow) } EXPORT_SYMBOL(flow_get_u32_dst); +/* Sort the source and destination IP (and the ports if the IP are the same), + * to have consistent hash within the two directions + */ static inline void __flow_hash_consistentify(struct flow_keys *keys) { int addr_diff, i; -- 2.21.0
Re: [PATCH iproute2] testsuite: don't clobber /tmp
On Tue, Jun 25, 2019 at 4:39 PM Matteo Croce wrote: > > On Thu, Jun 13, 2019 at 7:15 PM Matteo Croce wrote: > > > > On Wed, Jun 12, 2019 at 8:20 PM Stephen Hemminger > > wrote: > > > > To me any path could work, both /tmp or in the current dir, I have no > > preference. > > The important thing is to remove them wherever they are, as clobbering > > the build dir is bad as messing /tmp. > > > > Anyway, I double checked, and the only target which uses that > > temporary file is 'alltests' so, if the path is ok, I think that the > > condition "ifeq ($(MAKECMDGOALS),alltests)" is the only one which > > fixes the issue and keeps the behaviour unaltered. > > I did some quick tests and it works for me. > > > > Bye, > > -- > > Matteo Croce > > per aspera ad upstream > > Hi, > > any more thoughts about this patch? > > -- > Matteo Croce > per aspera ad upstream Hi, almost forgot about this one. Should I resend it, or it was nacked? Regards, -- Matteo Croce per aspera ad upstream
Re: [RFC 3/4] net: mvneta: add basic XDP support
On Tue, 1 Oct 2019 11:24:43 +0200 Lorenzo Bianconi wrote: > +static int mvneta_xdp_setup(struct net_device *dev, struct bpf_prog > *prog, > + struct netlink_ext_ack *extack) > +{ > + struct mvneta_port *pp = netdev_priv(dev); > + struct bpf_prog *old_prog; > + > + if (prog && dev->mtu > MVNETA_MAX_RX_BUF_SIZE) { > + NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not > supported on XDP"); > + return -EOPNOTSUPP; -ENOTSUPP maybe? > + } > + > + mvneta_stop(dev); only stop and restart if already running > + > + old_prog = xchg(&pp->xdp_prog, prog); > + if (old_prog) > + bpf_prog_put(old_prog); > + > + mvneta_open(dev); ^^ -- Matteo Croce per aspera ad upstream
Re: [PATCH] iplink: document 'change' option to ip link
On Sat, Jul 27, 2019 at 12:00 AM Stephen Hemminger wrote: > > Add the command alias "change" to man page. > Don't show it on usage, since it is not commonly used. > > Reported-off-by: Matteo Croce > Signed-off-by: Stephen Hemminger > --- > man/man8/ip-link.8.in | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in > index 883d88077d66..a8ae72d2b097 100644 > --- a/man/man8/ip-link.8.in > +++ b/man/man8/ip-link.8.in > @@ -1815,6 +1815,11 @@ can move the system to an unpredictable state. The > solution > is to avoid changing several parameters with one > .B ip link set > call. > +The modifier > +.B change > +is equivalent to > +.BR "set" . > + > > .TP > .BI dev " DEVICE " > -- > 2.20.1 > Acked-by: Matteo Croce -- Matteo Croce per aspera ad upstream
Re: [PATCH iproute2] iplink: document the 'link change' subcommand
On July 26, 2019 8:25:14 PM GMT+02:00, Stephen Hemminger wrote: > On Wed, 24 Jul 2019 21:12:18 +0200 > Matteo Croce wrote: > > > ip link can set parameters both via the 'set' and 'change' keyword. > > In fact, 'change' is an alias for 'set'. > > Document this in the help and manpage. > > > > Fixes: 1d93483985f0 ("iplink: use netlink for link configuration") > > Signed-off-by: Matteo Croce > > Probably just done originally for compatibility in some way with ip > route. > Not sure if it really needs to be documented. Maybe not in the output help, but in the manpage we should state it. -- Matteo Croce per aspera ad upstream
[PATCH iproute2] iplink: document the 'link change' subcommand
ip link can set parameters both via the 'set' and 'change' keyword. In fact, 'change' is an alias for 'set'. Document this in the help and manpage. Fixes: 1d93483985f0 ("iplink: use netlink for link configuration") Signed-off-by: Matteo Croce --- ip/iplink.c | 4 ++-- man/man8/ip-link.8.in | 7 ++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/ip/iplink.c b/ip/iplink.c index 212a0885..be237c93 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -64,12 +64,12 @@ void iplink_usage(void) "\n" " ip link delete { DEVICE | dev DEVICE | group DEVGROUP } type TYPE [ ARGS ]\n" "\n" - " ip link set { DEVICE | dev DEVICE | group DEVGROUP }\n" + " ip link { set | change } { DEVICE | dev DEVICE | group DEVGROUP }\n" " [ { up | down } ]\n" " [ type TYPE ARGS ]\n"); } else fprintf(stderr, - "Usage: ip link set DEVICE [ { up | down } ]\n"); + "Usage: ip link { set | change } DEVICE [ { up | down } ]\n"); fprintf(stderr, " [ arp { on | off } ]\n" diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in index 883d8807..d4be1a0e 100644 --- a/man/man8/ip-link.8.in +++ b/man/man8/ip-link.8.in @@ -1802,7 +1802,8 @@ deleted since it is the default group. .BI type " TYPE " specifies the type of the device. -.SS ip link set - change device attributes +.SS ip link set +.SS ip link change - change device attributes .PP .B Warning: @@ -1815,6 +1816,10 @@ can move the system to an unpredictable state. The solution is to avoid changing several parameters with one .B ip link set call. +.B set +and +.B change +are synonyms and have identical syntax and behavior. .TP .BI dev " DEVICE " -- 2.21.0
[PATCH iproute2 v2] utils: don't match empty strings as prefixes
iproute has an utility function which checks if a string is a prefix for another one, to allow use of abbreviated commands, e.g. 'addr' or 'a' instead of 'address'. This routine unfortunately considers an empty string as prefix of any pattern, leading to undefined behaviour when an empty argument is passed to ip: # ip '' 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever # tc '' qdisc noqueue 0: dev lo root refcnt 2 # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0 # ip addr show dev dummy0 6: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0 valid_lft forever preferred_lft forever Rewrite matches() so it takes care of an empty input, and doesn't scan the input strings three times: the actual implementation does 2 strlen and a memcpy to accomplish the same task. Signed-off-by: Matteo Croce --- include/utils.h | 2 +- lib/utils.c | 15 ++- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/include/utils.h b/include/utils.h index 1d9c1127..794d3605 100644 --- a/include/utils.h +++ b/include/utils.h @@ -198,7 +198,7 @@ int nodev(const char *dev); int check_ifname(const char *); int get_ifname(char *, const char *); const char *get_ifname_rta(int ifindex, const struct rtattr *rta); -int matches(const char *arg, const char *pattern); +bool matches(const char *prefix, const char *string); int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits); int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta); diff --git a/lib/utils.c b/lib/utils.c index 5da9a478..9ea21fa1 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -871,13 +871,18 @@ const char *get_ifname_rta(int ifindex, const struct rtattr *rta) return name; } -int matches(const char *cmd, const char *pattern) +/* Returns false if 'prefix' is a not empty prefix of 'string'. + */ +bool matches(const char *prefix, const char *string) { - int len = strlen(cmd); + if (!*prefix) + return true; + while (*string && *prefix == *string) { + prefix++; + string++; + } - if (len > strlen(pattern)) - return -1; - return memcmp(pattern, cmd, len); + return !!*prefix; } int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits) -- 2.21.0
Re: [PATCH iproute2-next] Makefile: pass -pipe to the compiler
On Tue, Jun 11, 2019 at 9:28 PM Stephen Hemminger wrote: > > On Tue, 11 Jun 2019 20:05:13 +0200 > Matteo Croce wrote: > > > Pass the -pipe option to GCC, to use pipes instead of temp files. > > On a slow AMD G-T40E CPU we get a non negligible 6% improvement > > in build time. > > > > real1m15,111s > > user1m2,521s > > sys 0m12,465s > > > > real1m10,861s > > user1m2,520s > > sys 0m12,901s > > > > Signed-off-by: Matteo Croce > > Why bother, on my machine (make -j12). > > Before > real0m6.320s > user0m30.674s > sys 0m3.649s > > > After (with -pipe) > real0m6.158s > user0m31.197s > sys 0m3.532s > > > So it is slower. Get a faster disk :-) > I did it :) root@apu:~# hdparm -t /dev/sda /dev/sda: Timing buffered disk reads: 1086 MB in 3.00 seconds = 361.58 MB/sec No change at all thought. It's really a CPU bound job, and this machine is not a number cruncher: root@apu:~# dd if=/dev/zero bs=1G count=1 status=none |time -p sha1sum 2a492f15396a6768bcbca016993f4b4c8b0b5307 - real 16.00 user 12.38 sys 2.05 I think that the slight increase is due to the fact that the processes starts in parallel, instead of being serialized. > Maybe allow "EXTRA_CFLAGS" to be passed to Makefile for those that > have a burning need for this. Just out of curiosity, I checked the linux history repo to discover when it was introduced in the kernel, it tooks me a while to really find it: Linux-0.99.13k (September 19, 1993) -- Matteo Croce per aspera ad upstream
Re: [PATCH iproute2] utils: don't match empty strings as prefixes
On Wed, Jul 10, 2019 at 1:18 AM Matteo Croce wrote: > > On Tue, Jul 9, 2019 at 11:38 PM Stephen Hemminger > wrote: > > > > On Tue, 9 Jul 2019 22:40:40 +0200 > > Matteo Croce wrote: > > > > > iproute has an utility function which checks if a string is a prefix for > > > another one, to allow use of abbreviated commands, e.g. 'addr' or 'a' > > > instead of 'address'. > > > > > > This routine unfortunately considers an empty string as prefix > > > of any pattern, leading to undefined behaviour when an empty > > > argument is passed to ip: > > > > > > # ip '' > > > 1: lo: mtu 65536 qdisc noqueue state UNKNOWN > > > group default qlen 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > inet 127.0.0.1/8 scope host lo > > >valid_lft forever preferred_lft forever > > > inet6 ::1/128 scope host > > >valid_lft forever preferred_lft forever > > > > > > # tc '' > > > qdisc noqueue 0: dev lo root refcnt 2 > > > > > > # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0 > > > # ip addr show dev dummy0 > > > 6: dummy0: mtu 1500 qdisc noop state DOWN group > > > default qlen 1000 > > > link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff > > > inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0 > > >valid_lft forever preferred_lft forever > > > > > > Rewrite matches() so it takes care of an empty input, and doesn't > > > scan the input strings three times: the actual implementation > > > does 2 strlen and a memcpy to accomplish the same task. > > > > > > Signed-off-by: Matteo Croce > > > --- > > > include/utils.h | 2 +- > > > lib/utils.c | 14 +- > > > 2 files changed, 10 insertions(+), 6 deletions(-) > > > > > > diff --git a/include/utils.h b/include/utils.h > > > index 927fdc17..f4d12abb 100644 > > > --- a/include/utils.h > > > +++ b/include/utils.h > > > @@ -198,7 +198,7 @@ int nodev(const char *dev); > > > int check_ifname(const char *); > > > int get_ifname(char *, const char *); > > > const char *get_ifname_rta(int ifindex, const struct rtattr *rta); > > > -int matches(const char *arg, const char *pattern); > > > +int matches(const char *prefix, const char *string); > > > int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int > > > bits); > > > int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta); > > > > > > diff --git a/lib/utils.c b/lib/utils.c > > > index be0f11b0..73ce19bb 100644 > > > --- a/lib/utils.c > > > +++ b/lib/utils.c > > > @@ -887,13 +887,17 @@ const char *get_ifname_rta(int ifindex, const > > > struct rtattr *rta) > > > return name; > > > } > > > > > > -int matches(const char *cmd, const char *pattern) > > > +/* Check if 'prefix' is a non empty prefix of 'string' */ > > > +int matches(const char *prefix, const char *string) > > > { > > > - int len = strlen(cmd); > > > + if (!*prefix) > > > + return 1; > > > + while(*string && *prefix == *string) { > > > + prefix++; > > > + string++; > > > + } > > > > > > - if (len > strlen(pattern)) > > > - return -1; > > > - return memcmp(pattern, cmd, len); > > > + return *prefix; > > > } > > > > > > int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits) > > > > ERROR: space required before the open parenthesis '(' > > #134: FILE: lib/utils.c:895: > > + while(*string && *prefix == *string) { > > > > total: 1 errors, 1 warnings, 30 lines checked > > > > The empty prefix string is a bug and should not be allowed. > > Also return value should be same as old code (yours isn't). > > > > > > > > The old return value was the difference between the first pair of > bytes, according to the memcmp manpage. > All calls only checks if the matches() return value is 0 or not 0: > > iproute2$ git grep 'matches(' |grep -v -e '== 0' -e '= 0' -e '!matches(' > include/utils.h:int matches(const char *prefix, const char *string); > include/xtables.h:extern void xtables_register_matches(struct > xtables_match *, unsigned int); > lib/color.c:if (matches(dup, "-color")) > lib/utils.c:int matches(const char *prefix, const char *string) > tc/tc.c:if (matches(argv[0], iter->c)) > > Is it a problem if it returns a non negative value for non matching strings? > > Regards, > > > -- > Matteo Croce > per aspera ad upstream Hi Stephen, should I send a v2 which keeps the old behaviour, even if noone checks for all the values? Just to clarify, the old behaviour of matches(cmd, pattern) was: -1 if len(cmd) > len(pattern) 0 if pattern is equal to cmd 0 if pattern starts with cmd < 0 if pattern is alphabetically lower than cmd > 0 if pattern is alphabetically higher than cmd Regards, -- Matteo Croce per aspera ad upstream
Re: [PATCH iproute2] utils: don't match empty strings as prefixes
On Tue, Jul 9, 2019 at 11:38 PM Stephen Hemminger wrote: > > On Tue, 9 Jul 2019 22:40:40 +0200 > Matteo Croce wrote: > > > iproute has an utility function which checks if a string is a prefix for > > another one, to allow use of abbreviated commands, e.g. 'addr' or 'a' > > instead of 'address'. > > > > This routine unfortunately considers an empty string as prefix > > of any pattern, leading to undefined behaviour when an empty > > argument is passed to ip: > > > > # ip '' > > 1: lo: mtu 65536 qdisc noqueue state UNKNOWN > > group default qlen 1000 > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > inet 127.0.0.1/8 scope host lo > >valid_lft forever preferred_lft forever > > inet6 ::1/128 scope host > >valid_lft forever preferred_lft forever > > > > # tc '' > > qdisc noqueue 0: dev lo root refcnt 2 > > > > # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0 > > # ip addr show dev dummy0 > > 6: dummy0: mtu 1500 qdisc noop state DOWN group > > default qlen 1000 > > link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff > > inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0 > >valid_lft forever preferred_lft forever > > > > Rewrite matches() so it takes care of an empty input, and doesn't > > scan the input strings three times: the actual implementation > > does 2 strlen and a memcpy to accomplish the same task. > > > > Signed-off-by: Matteo Croce > > --- > > include/utils.h | 2 +- > > lib/utils.c | 14 +- > > 2 files changed, 10 insertions(+), 6 deletions(-) > > > > diff --git a/include/utils.h b/include/utils.h > > index 927fdc17..f4d12abb 100644 > > --- a/include/utils.h > > +++ b/include/utils.h > > @@ -198,7 +198,7 @@ int nodev(const char *dev); > > int check_ifname(const char *); > > int get_ifname(char *, const char *); > > const char *get_ifname_rta(int ifindex, const struct rtattr *rta); > > -int matches(const char *arg, const char *pattern); > > +int matches(const char *prefix, const char *string); > > int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits); > > int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta); > > > > diff --git a/lib/utils.c b/lib/utils.c > > index be0f11b0..73ce19bb 100644 > > --- a/lib/utils.c > > +++ b/lib/utils.c > > @@ -887,13 +887,17 @@ const char *get_ifname_rta(int ifindex, const struct > > rtattr *rta) > > return name; > > } > > > > -int matches(const char *cmd, const char *pattern) > > +/* Check if 'prefix' is a non empty prefix of 'string' */ > > +int matches(const char *prefix, const char *string) > > { > > - int len = strlen(cmd); > > + if (!*prefix) > > + return 1; > > + while(*string && *prefix == *string) { > > + prefix++; > > + string++; > > + } > > > > - if (len > strlen(pattern)) > > - return -1; > > - return memcmp(pattern, cmd, len); > > + return *prefix; > > } > > > > int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits) > > ERROR: space required before the open parenthesis '(' > #134: FILE: lib/utils.c:895: > + while(*string && *prefix == *string) { > > total: 1 errors, 1 warnings, 30 lines checked > > The empty prefix string is a bug and should not be allowed. > Also return value should be same as old code (yours isn't). > > > The old return value was the difference between the first pair of bytes, according to the memcmp manpage. All calls only checks if the matches() return value is 0 or not 0: iproute2$ git grep 'matches(' |grep -v -e '== 0' -e '= 0' -e '!matches(' include/utils.h:int matches(const char *prefix, const char *string); include/xtables.h:extern void xtables_register_matches(struct xtables_match *, unsigned int); lib/color.c:if (matches(dup, "-color")) lib/utils.c:int matches(const char *prefix, const char *string) tc/tc.c:if (matches(argv[0], iter->c)) Is it a problem if it returns a non negative value for non matching strings? Regards, -- Matteo Croce per aspera ad upstream
[PATCH iproute2] utils: don't match empty strings as prefixes
iproute has an utility function which checks if a string is a prefix for another one, to allow use of abbreviated commands, e.g. 'addr' or 'a' instead of 'address'. This routine unfortunately considers an empty string as prefix of any pattern, leading to undefined behaviour when an empty argument is passed to ip: # ip '' 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever # tc '' qdisc noqueue 0: dev lo root refcnt 2 # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0 # ip addr show dev dummy0 6: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0 valid_lft forever preferred_lft forever Rewrite matches() so it takes care of an empty input, and doesn't scan the input strings three times: the actual implementation does 2 strlen and a memcpy to accomplish the same task. Signed-off-by: Matteo Croce --- include/utils.h | 2 +- lib/utils.c | 14 +- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/include/utils.h b/include/utils.h index 927fdc17..f4d12abb 100644 --- a/include/utils.h +++ b/include/utils.h @@ -198,7 +198,7 @@ int nodev(const char *dev); int check_ifname(const char *); int get_ifname(char *, const char *); const char *get_ifname_rta(int ifindex, const struct rtattr *rta); -int matches(const char *arg, const char *pattern); +int matches(const char *prefix, const char *string); int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits); int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta); diff --git a/lib/utils.c b/lib/utils.c index be0f11b0..73ce19bb 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -887,13 +887,17 @@ const char *get_ifname_rta(int ifindex, const struct rtattr *rta) return name; } -int matches(const char *cmd, const char *pattern) +/* Check if 'prefix' is a non empty prefix of 'string' */ +int matches(const char *prefix, const char *string) { - int len = strlen(cmd); + if (!*prefix) + return 1; + while(*string && *prefix == *string) { + prefix++; + string++; + } - if (len > strlen(pattern)) - return -1; - return memcmp(pattern, cmd, len); + return *prefix; } int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits) -- 2.21.0
[PATCH net v2] ipv4: don't set IPv6 only flags to IPv4 addresses
Avoid the situation where an IPV6 only flag is applied to an IPv4 address: # ip addr add 192.0.2.1/24 dev dummy0 nodad home mngtmpaddr noprefixroute # ip -4 addr show dev dummy0 2: dummy0: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global noprefixroute dummy0 valid_lft forever preferred_lft forever Or worse, by sending a malicious netlink command: # ip -4 addr show dev dummy0 2: dummy0: mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 inet 192.0.2.1/24 scope global nodad optimistic dadfailed home tentative mngtmpaddr noprefixroute stable-privacy dummy0 valid_lft forever preferred_lft forever Signed-off-by: Matteo Croce --- net/ipv4/devinet.c | 8 1 file changed, 8 insertions(+) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index c6bd0f7a020a..c5ebfa199794 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -62,6 +62,11 @@ #include #include +#define IPV6ONLY_FLAGS \ + (IFA_F_NODAD | IFA_F_OPTIMISTIC | IFA_F_DADFAILED | \ +IFA_F_HOMEADDRESS | IFA_F_TENTATIVE | \ +IFA_F_MANAGETEMPADDR | IFA_F_STABLE_PRIVACY) + static struct ipv4_devconf ipv4_devconf = { .data = { [IPV4_DEVCONF_ACCEPT_REDIRECTS - 1] = 1, @@ -468,6 +473,9 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh, ifa->ifa_flags &= ~IFA_F_SECONDARY; last_primary = &in_dev->ifa_list; + /* Don't set IPv6 only flags to IPv4 addresses */ + ifa->ifa_flags &= ~IPV6ONLY_FLAGS; + for (ifap = &in_dev->ifa_list; (ifa1 = *ifap) != NULL; ifap = &ifa1->ifa_next) { if (!(ifa1->ifa_flags & IFA_F_SECONDARY) && -- 2.21.0
Re: [RFC iproute2] netns: add mounting state file for each netns
On Mon, Jul 1, 2019 at 2:38 PM Nicolas Dichtel wrote: > > Le 30/06/2019 à 21:29, Matteo Croce a écrit : > > When ip creates a netns, there is a small time interval between the > > placeholder file creation in NETNS_RUN_DIR and the bind mount from /proc. > > > > Add a temporary file named .mounting-$netns which gets deleted after the > > bind mount, so watching for delete event matching the .mounting-* name > > will notify watchers only after the bind mount has been done. > Probably a naive question, but why creating those '.mounting-$netns' files in > the directory where netns are stored? Why not another directory, something > like > /var/run/netns-monitor/? > > > Regards, > Nicolas Yes, would work too. But ideally I'd wait for the mount inotify notifications. -- Matteo Croce per aspera ad upstream
[RFC iproute2] netns: add mounting state file for each netns
When ip creates a netns, there is a small time interval between the placeholder file creation in NETNS_RUN_DIR and the bind mount from /proc. Add a temporary file named .mounting-$netns which gets deleted after the bind mount, so watching for delete event matching the .mounting-* name will notify watchers only after the bind mount has been done. Signed-off-by: Matteo Croce --- ip/ipnetns.c | 37 ++--- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/ip/ipnetns.c b/ip/ipnetns.c index a883f210..23b95173 100644 --- a/ip/ipnetns.c +++ b/ip/ipnetns.c @@ -24,6 +24,8 @@ #include "namespace.h" #include "json_print.h" +#define TMP_PREFIX ".mounting-" + static int usage(void) { fprintf(stderr, @@ -47,6 +49,10 @@ static struct rtnl_handle rtnsh = { .fd = -1 }; static int have_rtnl_getnsid = -1; static int saved_netns = -1; +static int is_mounting_stab(const char *name) { + return !strncmp(name, TMP_PREFIX, sizeof(TMP_PREFIX) - 1); +} + static int ipnetns_accept_msg(struct rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg) { @@ -379,6 +385,8 @@ static int netns_list(int argc, char **argv) continue; if (strcmp(entry->d_name, "..") == 0) continue; + if (is_mounting_stab(entry->d_name)) + continue; open_json_object(NULL); print_string(PRINT_ANY, "name", @@ -676,7 +684,7 @@ static int netns_add(int argc, char **argv, bool create) * userspace tweaks like remounting /sys, or bind mounting * a new /etc/resolv.conf can be shared between users. */ - char netns_path[PATH_MAX], proc_path[PATH_MAX]; + char netns_path[PATH_MAX], tmp_path[PATH_MAX], proc_path[PATH_MAX]; const char *name; pid_t pid; int fd; @@ -701,6 +709,7 @@ static int netns_add(int argc, char **argv, bool create) name = argv[0]; snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name); + snprintf(tmp_path, sizeof(tmp_path), "%s/"TMP_PREFIX"%s", NETNS_RUN_DIR, name); if (create_netns_dir()) return -1; @@ -737,6 +746,14 @@ static int netns_add(int argc, char **argv, bool create) } close(fd); + fd = open(tmp_path, O_RDONLY|O_CREAT|O_EXCL, 0); + if (fd < 0) { + fprintf(stderr, "Cannot create namespace file \"%s\": %s\n", + tmp_path, strerror(errno)); + goto out_delete; + } + close(fd); + if (create) { netns_save(); if (unshare(CLONE_NEWNET) < 0) { @@ -757,6 +774,7 @@ static int netns_add(int argc, char **argv, bool create) goto out_delete; } netns_restore(); + unlink(tmp_path); return 0; out_delete: @@ -767,6 +785,10 @@ out_delete: fprintf(stderr, "Cannot remove namespace file \"%s\": %s\n", netns_path, strerror(errno)); } + if (unlink(tmp_path) < 0) { + fprintf(stderr, "Cannot remove namespace file \"%s\": %s\n", + tmp_path, strerror(errno)); + } return -1; } @@ -849,7 +871,7 @@ static int netns_monitor(int argc, char **argv) if (create_netns_dir()) return -1; - if (inotify_add_watch(fd, NETNS_RUN_DIR, IN_CREATE | IN_DELETE) < 0) { + if (inotify_add_watch(fd, NETNS_RUN_DIR, IN_DELETE) < 0) { fprintf(stderr, "inotify_add_watch failed: %s\n", strerror(errno)); return -1; @@ -865,9 +887,9 @@ static int netns_monitor(int argc, char **argv) for (event = (struct inotify_event *)buf; (char *)event < &buf[len]; event = (struct inotify_event *)((char *)event + sizeof(*event) + event->len)) { - if (event->mask & IN_CREATE) - printf("add %s\n", event->name); - if (event->mask & IN_DELETE) + if (is_mounting_stab(event->name)) + printf("add %s\n", event->name + sizeof(TMP_PREFIX) - 1); + else printf("delete %s\n", event->name); } } @@ -876,8 +898,9 @@ static int netns_monitor(int argc, char **argv) static int invalid_name(const char *name) { - return !*name || strlen(name) > NAME_MAX || - strchr(name, '/') || !strcmp(name, ".") || !strcmp(name, ".."); + return !*name || strlen(name) + sizeof(TMP_PREFIX) - 1 > NAME_MAX || + strchr(name, '/') || !strcmp(name, ".") || !strcmp(name, "..") || + is_mounting_stab(name); } int do_netns(int argc, char **argv) -- 2.21.0
Re: [RFC iproute2 1/1] ip: netns: add mounted state file for each netns
On Fri, Jun 28, 2019 at 7:06 PM Matteo Croce wrote: > > On Fri, Jun 28, 2019 at 6:27 PM David Howells wrote: > > > > Nicolas Dichtel wrote: > > > > > David Howells was working on a mount notification mechanism: > > > https://lwn.net/Articles/760714/ > > > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications > > > > > > I don't know what is the status of this series. > > > > It's still alive. I just posted a new version on it. I'm hoping, possibly > > futiley, to get it in in this merge window. > > > > David > > Hi all, > > this could cause a clash if I create a netns with name ending with .mounted. > > $ sudo ip/ip netns add ns1.mounted > $ sudo ip/ip netns add ns1 > Cannot create namespace file "/var/run/netns/ns1.mounted": File exists > Cannot remove namespace file "/var/run/netns/ns1.mounted": Device or > resource busy > > If you want to go along this road, please either: > - disallow netns creation with names ending with .mounted > - or properly document it in the manpage > > Regards, > -- > Matteo Croce > per aspera ad upstream BTW, this breaks the namespace listing: # ip netns add test # ip netns list Error: Peer netns reference is invalid. Error: Peer netns reference is invalid. test.mounted test A better choice IMHO could be to create a temporary file before the placeholder, and delete it after the bind mount, so an inotify watcher can listen for the delete event. For example, when creating the namespace "foo": - create /var/run/netns/.foo.mounting - create /var/run/netns/foo - bind mount from /proc/.. to /var/run/netns/foo - remove /var/run/netns/.foo.mounting and exclude .*.mounting from the netns listing Or, announce netns creation/deletion in some other way (dbus?). Regards, -- Matteo Croce per aspera ad upstream
Re: [RFC iproute2 1/1] ip: netns: add mounted state file for each netns
On Fri, Jun 28, 2019 at 6:27 PM David Howells wrote: > > Nicolas Dichtel wrote: > > > David Howells was working on a mount notification mechanism: > > https://lwn.net/Articles/760714/ > > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications > > > > I don't know what is the status of this series. > > It's still alive. I just posted a new version on it. I'm hoping, possibly > futiley, to get it in in this merge window. > > David Hi all, this could cause a clash if I create a netns with name ending with .mounted. $ sudo ip/ip netns add ns1.mounted $ sudo ip/ip netns add ns1 Cannot create namespace file "/var/run/netns/ns1.mounted": File exists Cannot remove namespace file "/var/run/netns/ns1.mounted": Device or resource busy If you want to go along this road, please either: - disallow netns creation with names ending with .mounted - or properly document it in the manpage Regards, -- Matteo Croce per aspera ad upstream
Re: [PATCH iproute2] testsuite: don't clobber /tmp
On Thu, Jun 13, 2019 at 7:15 PM Matteo Croce wrote: > > On Wed, Jun 12, 2019 at 8:20 PM Stephen Hemminger > wrote: > > > > On Wed, 12 Jun 2019 19:32:29 +0200 > > Matteo Croce wrote: > > > > > On Wed, Jun 12, 2019 at 6:04 PM Matteo Croce wrote: > > > > > > > > On Wed, Jun 12, 2019 at 5:55 PM Stephen Hemminger > > > > wrote: > > > > > > > > > > On Tue, 11 Jun 2019 20:03:26 +0200 > > > > > Matteo Croce wrote: > > > > > > > > > > > Even if not running the testsuite, every build will leave > > > > > > a stale tc_testkenv.* file in the system temp directory. > > > > > > Conditionally create the temp file only if we're running the > > > > > > testsuite. > > > > > > > > > > > > Signed-off-by: Matteo Croce > > > > > > --- > > > > > > testsuite/Makefile | 5 - > > > > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/testsuite/Makefile b/testsuite/Makefile > > > > > > index 7f247bbc..5353244b 100644 > > > > > > --- a/testsuite/Makefile > > > > > > +++ b/testsuite/Makefile > > > > > > @@ -14,7 +14,9 @@ TESTS_DIR := $(dir $(TESTS)) > > > > > > > > > > > > IPVERS := $(filter-out iproute2/Makefile,$(wildcard iproute2/*)) > > > > > > > > > > > > -KENVFN := $(shell mktemp /tmp/tc_testkenv.XX) > > > > > > +ifeq ($(MAKECMDGOALS),alltests) > > > > > > + KENVFN := $(shell mktemp /tmp/tc_testkenv.XX) > > > > > > +endif > > > > > > ifneq (,$(wildcard /proc/config.gz)) > > > > > > KCPATH := /proc/config.gz > > > > > > else > > > > > > @@ -94,3 +96,4 @@ endif > > > > > > rm "$$TMP_ERR" "$$TMP_OUT"; \ > > > > > > sudo dmesg > $(RESULTS_DIR)/$@.$$o.dmesg; \ > > > > > > done > > > > > > + @$(RM) $(KENVFN) > > > > > > > > > > My concern is that there are several targets in this one Makefile. > > > > > > > > > > Why not use -u which gives name but does not create the file? > > > > > > > > As the manpage says, this is unsafe, as a file with the same name can > > > > be created in the meantime. > > > > Another option is to run the mktemp in the target shell, but this will > > > > require to escape every single end of line to make it a single shell > > > > command, e.g.: > > > > > > > > KENVFN=$$(mktemp /tmp/tc_testkenv.XX); \ > > > > if [ "$(KCPATH)" = "/proc/config.gz" ]; then \ > > > > gunzip -c $(KCPATH) >$$KENVFN; \ > > > > ... > > > > done ; \ > > > > $(RM) $$KENVFN > > > > > > > > -- > > > > Matteo Croce > > > > per aspera ad upstream > > > > > > Anyway, looking for "tc" instead of "alltests" is probably better, as > > > it only runs mktemp when at least the tc test is selected, both > > > manually or via make check from topdir, eg.g > > > > > > ifeq ($(MAKECMDGOALS),tc) > > > > > > Do you agree? > > > > Why use /tmp at all for this config file? > > To me any path could work, both /tmp or in the current dir, I have no > preference. > The important thing is to remove them wherever they are, as clobbering > the build dir is bad as messing /tmp. > > Anyway, I double checked, and the only target which uses that > temporary file is 'alltests' so, if the path is ok, I think that the > condition "ifeq ($(MAKECMDGOALS),alltests)" is the only one which > fixes the issue and keeps the behaviour unaltered. > I did some quick tests and it works for me. > > Bye, > -- > Matteo Croce > per aspera ad upstream Hi, any more thoughts about this patch? -- Matteo Croce per aspera ad upstream