Re: [Query] Delayed vxlan socket creation?
On 2016-12-15 01:24, Cong Wang wrote:
> On Tue, Dec 13, 2016 at 11:49 PM, Du, Fan wrote:
>> Hi
>> I'm interested in one Docker issue[1] which looks related to kernel vxlan
>> socket creation as described in the thread.
>> From my limited knowledge here, socket creation is synchronous, and after
>> the *socket* syscall, the sock handle will be valid and ready to link up.
>
> You need to read the code. A vxlan tunnel is a UDP tunnel; it needs a
> kernel socket (and a port) to set up UDP communication, unlike a GRE
> tunnel etc.

I checked: the fix is merged in 4.0, and my code base is pretty new, so
that is why I failed to see the work queue stuff in driver/net/vxlan.c.

>> Somehow I'm not sure of the detailed scenario here, and which commit
>> possibly fixed it, and how?
>> Thanks!
>>
>> Quoted analysis:
>> --
>> (Found in kernel 3.13)
>> The issue happens because in older kernels, when a vxlan interface is
>> created, the socket creation is queued up in a worker thread which
>> actually creates the socket. But this needs to happen before we bring
>> up the link on the vxlan interface. If by some chance the worker thread
>> hasn't completed the creation of the socket before we did link up, then
>> the kernel checks whether the socket was created and, if not, returns
>> ENOTCONN. This was a bug in the kernel which got fixed in later kernels.
>> That is why retrying with a timer fixes the issue.
>
> This was introduced by commit 1c51a9159ddefa5119724a4c7da3fd3ef44b68d5
> and later fixed by commit 56ef9c909b40483d2c8cb63fcbf83865f162d5ec.

Believe in Brother Cong, and gain eternal life.
Thanks for the offending commit id!
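To make the race concrete, here is a minimal userspace analogue, assuming only what the analysis above describes (socket creation deferred to a worker, and a link-up path that fails with ENOTCONN until the worker has run); the names are illustrative, not the real vxlan.c code:

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int vxlan_sock_ready;             /* stands in for the UDP socket */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* The deferred work: in pre-4.0 kernels this ran in a workqueue. */
    static void *sock_create_worker(void *arg)
    {
        (void)arg;
        sleep(1);                            /* socket creation takes a while */
        pthread_mutex_lock(&lock);
        vxlan_sock_ready = 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Analogue of the link-up path: fails until the worker has run. */
    static int link_up(void)
    {
        int ret;

        pthread_mutex_lock(&lock);
        ret = vxlan_sock_ready ? 0 : -ENOTCONN;
        pthread_mutex_unlock(&lock);
        return ret;
    }

    int main(void)
    {
        pthread_t worker;

        pthread_create(&worker, NULL, sock_create_worker, NULL);
        if (link_up() == -ENOTCONN)          /* immediate link-up loses the race */
            printf("link up failed: ENOTCONN\n");
        pthread_join(worker, NULL);
        if (link_up() == 0)                  /* a later retry wins */
            printf("link up succeeded after retry\n");
        return 0;
    }

Build with cc -pthread. The first link_up() reliably loses the race, and the retry after the worker completes succeeds, which is exactly why the timer-based retry in the Docker workaround is effective.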
Re: [Query] Delayed vxlan socket creation?
On 2016-12-14 17:29, Jiri Benc wrote:
> On Wed, 14 Dec 2016 07:49:24, Du, Fan wrote:
>> I'm interested in one Docker issue[1] which looks related to kernel vxlan
>> socket creation as described in the thread.
>> From my limited knowledge here, socket creation is synchronous, and after
>> the *socket* syscall, the sock handle will be valid and ready to link up.
>> Somehow I'm not sure of the detailed scenario here, and which commit
>> possibly fixed it, and how?
>
> baf606d9c9b1^..56ef9c909b40
>
> Jiri

Thanks a lot Jiri!
[Query] Delayed vxlan socket creation?
Hi

I'm interested in one Docker issue[1] which looks related to kernel vxlan
socket creation as described in the thread.

From my limited knowledge here, socket creation is synchronous, and after
the *socket* syscall, the sock handle will be valid and ready to link up.

Somehow I'm not sure of the detailed scenario here, and which commit
possibly fixed it, and how?

Thanks!

Quoted analysis:
--
(Found in kernel 3.13)
The issue happens because in older kernels, when a vxlan interface is
created, the socket creation is queued up in a worker thread which
actually creates the socket. But this needs to happen before we bring up
the link on the vxlan interface. If by some chance the worker thread
hasn't completed the creation of the socket before we did link up, then
the kernel checks whether the socket was created and, if not, returns
ENOTCONN. This was a bug in the kernel which got fixed in later kernels.
That is why retrying with a timer fixes the issue.

[1]: https://github.com/docker/libnetwork/issues/1247
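For completeness, the retry-with-a-timer workaround described in the quoted analysis can be sketched in userspace. This is a hedged sketch, not libnetwork's actual code: it assumes an already-created interface named vxlan0 (the name is illustrative) and uses the standard SIOCGIFFLAGS/SIOCSIFFLAGS ioctls, retrying only while the kernel reports ENOTCONN:

    #include <errno.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Keep retrying link-up while the kernel reports ENOTCONN, i.e. while
     * its worker has not yet created the vxlan UDP socket. */
    static int link_up_retry(const char *ifname, int retries)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        while (retries-- > 0) {
            if (ioctl(fd, SIOCGIFFLAGS, &ifr) == 0) {
                ifr.ifr_flags |= IFF_UP;
                if (ioctl(fd, SIOCSIFFLAGS, &ifr) == 0) {
                    close(fd);
                    return 0;            /* link is up */
                }
            }
            if (errno != ENOTCONN)
                break;                   /* a real failure; give up */
            usleep(100 * 1000);          /* back off, then retry */
        }
        close(fd);
        return -1;
    }

    int main(void)
    {
        return link_up_retry("vxlan0", 10) ? 1 : 0;
    }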
Re: GSO packets on lower MTU retaining gso_size?
On 2016/6/7 14:05, Yuval Mintz wrote:
> While experimenting with vxlan tunnels, I've reached a topology where
> the vxlan interface's MTU was 1500 while the base interface's was
> smaller [600]. While 'regular' packets broke via ip-fragmentation,
> GSO SKBs passing from the vxlan interface to the base interface
> remained whole, and their 'gso_size' remained matching that of the
> vxlan interface's MTU; this caused the HW to drop said packets, as it
> would have resulted in the device sending to the line packets with
> length larger than the MTU.
>
> Is this broken on the udp-tunnel transmit path, the setup, or the
> driver [qede]?

I believe it's identical to an issue I met before[1]; the owner of the
offending code believed a host can't generate packets larger than the
underlying NIC's MTU, and refused to do the GSO fix here.

[1]: https://patchwork.ozlabs.org/patch/415791/
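To spell out the arithmetic: each GSO segment becomes one on-wire packet of gso_size bytes of L4 payload plus the L3/L4 (and, here, tunnel) headers, and that total must fit the egress MTU. A minimal sketch of the check, with illustrative header sizes for the vxlan-over-600-MTU case (the byte counts are assumptions, not taken from the report):

    #include <stdbool.h>
    #include <stdio.h>

    /* One GSO segment becomes one packet of (payload + L3/L4/tunnel header)
     * bytes, excluding the L2 header; that total must fit the egress MTU. */
    static bool gso_segment_fits(unsigned gso_size, unsigned hdr_len,
                                 unsigned mtu)
    {
        return gso_size + hdr_len <= mtu;
    }

    int main(void)
    {
        /* Assumed overhead: outer IPv4 (20) + UDP (8) + vxlan (8) +
         * inner Ethernet (14) + inner IPv4 (20) + TCP (20) = 90 bytes. */
        unsigned hdr_len = 90;
        unsigned gso_size = 1500 - 40;   /* rough inner TCP MSS for MTU 1500 */

        printf("fits base MTU 600?  %s\n",
               gso_segment_fits(gso_size, hdr_len, 600) ? "yes" : "no");
        printf("fits base MTU 1600? %s\n",
               gso_segment_fits(gso_size, hdr_len, 1600) ? "yes" : "no");
        return 0;
    }

With gso_size inherited from the vxlan device, every segment comes out around 1550 bytes under these assumed header sizes, far above the 600-byte base MTU, so the hardware drop is the expected symptom unless something re-segments or fragments on the way down.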
Re: [PATCH net-next] net: hns: optimize XGE capability by reducing cpu usage
On 2015/12/8 14:22, Yankejian (Hackim Yim) wrote:
> On 2015/12/7 16:58, Du, Fan wrote:
>> On 2015/12/5 15:32, yankejian wrote:
>>> here is the patch raising the performance of XGE by:
>>> 1) changes the page management method for enet memory, and
>>> 2) reduces the count of rmb, and
>>> 3) adds memory prefetching
>>
>> Any numbers on how much it boosts performance?
>
> it is almost the same as 82599.

I mean how much it improves performance *BEFORE* and *AFTER* this patch
for the Huawei XGE chip, because the commit log states it is "raising
the performance" but did not give numbers from the testing.
Re: [PATCH net-next] net: hns: optimize XGE capability by reducing cpu usage
On 2015/12/5 15:32, yankejian wrote:
> here is the patch raising the performance of XGE by:
> 1) changes the page management method for enet memory, and
> 2) reduces the count of rmb, and
> 3) adds memory prefetching

Any numbers on how much it boosts performance?

> Signed-off-by: yankejian
> ---
>  drivers/net/ethernet/hisilicon/hns/hnae.h         |  5 +-
>  drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  1 -
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c     | 79 +++
>  3 files changed, 55 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
> index d1f3316..6ca94dc 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hnae.h
> +++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
> @@ -341,7 +341,8 @@ struct hnae_queue {
>  	void __iomem *io_base;
>  	phys_addr_t phy_base;
>  	struct hnae_ae_dev *dev;	/* the device who use this queue */
> -	struct hnae_ring rx_ring, tx_ring;
> +	struct hnae_ring rx_ring cacheline_internodealigned_in_smp;
> +	struct hnae_ring tx_ring cacheline_internodealigned_in_smp;
>  	struct hnae_handle *handle;
>  };
>
> @@ -597,11 +598,9 @@ static inline void hnae_replace_buffer(struct hnae_ring *ring, int i,
>  					   struct hnae_desc_cb *res_cb)
>  {
>  	struct hnae_buf_ops *bops = ring->q->handle->bops;
> -	struct hnae_desc_cb tmp_cb = ring->desc_cb[i];
>
>  	bops->unmap_buffer(ring, &ring->desc_cb[i]);
>  	ring->desc_cb[i] = *res_cb;
> -	*res_cb = tmp_cb;
>  	ring->desc[i].addr = (__le64)ring->desc_cb[i].dma;
>  	ring->desc[i].rx.ipoff_bnum_pid_flag = 0;
>  }
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
> index 77c6edb..522b264 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
> @@ -341,7 +341,6 @@ void hns_ae_toggle_ring_irq(struct hnae_ring *ring, u32 mask)
>  	else
>  		flag = RCB_INT_FLAG_RX;
>
> -	hns_rcb_int_clr_hw(ring->q, flag);
>  	hns_rcb_int_ctrl_hw(ring->q, flag, mask);
>  }
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> index cad2663..e2be510 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> @@ -33,6 +33,7 @@
>  #define RCB_IRQ_NOT_INITED 0
>  #define RCB_IRQ_INITED 1
> +#define HNS_BUFFER_SIZE_2048 2048
>  #define BD_MAX_SEND_SIZE 8191
>  #define SKB_TMP_LEN(SKB) \
> @@ -491,13 +492,51 @@ static unsigned int hns_nic_get_headlen(unsigned char *data, u32 flag,
>  	return max_size;
>  }
>
> -static void
> -hns_nic_reuse_page(struct hnae_desc_cb *desc_cb, int tsize, int last_offset)
> +static void hns_nic_reuse_page(struct sk_buff *skb, int i,
> +			       struct hnae_ring *ring, int pull_len,
> +			       struct hnae_desc_cb *desc_cb)
>  {
> +	struct hnae_desc *desc;
> +	int truesize, size;
> +	int last_offset = 0;
> +
> +	desc = &ring->desc[ring->next_to_clean];
> +	size = le16_to_cpu(desc->rx.size);
> +
> +#if (PAGE_SIZE < 8192)
> +	if (hnae_buf_size(ring) == HNS_BUFFER_SIZE_2048) {
> +		truesize = hnae_buf_size(ring);
> +	} else {
> +		truesize = ALIGN(size, L1_CACHE_BYTES);
> +		last_offset = hnae_page_size(ring) - hnae_buf_size(ring);
> +	}
> +
> +#else
> +	truesize = ALIGN(size, L1_CACHE_BYTES);
> +	last_offset = hnae_page_size(ring) - hnae_buf_size(ring);
> +#endif
> +
> +	skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len,
> +			size - pull_len, truesize - pull_len);
> +
>  	/* avoid re-using remote pages,flag default unreuse */
>  	if (likely(page_to_nid(desc_cb->priv) == numa_node_id())) {
> +#if (PAGE_SIZE < 8192)
> +		if (hnae_buf_size(ring) == HNS_BUFFER_SIZE_2048) {
> +			/* if we are only owner of page we can reuse it */
> +			if (likely(page_count(desc_cb->priv) == 1)) {
> +				/* flip page offset to other buffer */
> +				desc_cb->page_offset ^= truesize;
> +
> +				desc_cb->reuse_flag = 1;
> +				/* bump ref count on page before it is given*/
> +				get_page(desc_cb->priv);
> +			}
> +			return;
> +		}
> +#endif
>  		/* move offset up to the next cache line */
> -		desc_cb->page_offset += tsize;
> +		desc_cb->page_offset += truesize;
>
>  		if (desc_cb->page_offset <= last_offset) {
>  			desc_cb->reuse_flag = 1;
> @@ -529,11 +568,10 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
>  	struct hnae_desc *desc;
>  	struct hnae_desc_cb *desc_cb;
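The interesting part of the rework above is the page-recycling scheme in hns_nic_reuse_page(): with 2048-byte buffers carved out of a 4096-byte page, the driver XOR-flips page_offset between the two halves and takes another page reference while it is the sole owner. A standalone sketch of just that technique, with illustrative names (not the driver's code):

    #include <stdio.h>

    #define BUF_SZ 2048                  /* HNS_BUFFER_SIZE_2048 analogue */

    struct desc_cb_sketch {
        unsigned page_offset;            /* which half of the page is in use */
        int page_refs;                   /* stands in for page_count() */
    };

    /* If we are the sole owner of the page, flip to the other half and take
     * another reference so the page outlives the half handed to the stack. */
    static int try_reuse(struct desc_cb_sketch *cb)
    {
        if (cb->page_refs != 1)
            return 0;                    /* someone else still holds the page */
        cb->page_offset ^= BUF_SZ;       /* 0 <-> 2048 */
        cb->page_refs++;                 /* get_page() analogue */
        return 1;
    }

    int main(void)
    {
        struct desc_cb_sketch cb = { .page_offset = 0, .page_refs = 1 };

        if (try_reuse(&cb))
            printf("reused, next offset = %u\n", cb.page_offset);
        return 0;
    }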
RE: [RFC net-next] xfrm: refactory to avoid state tasklet scheduling errors
>-----Original Message-----
>From: Giuseppe Cantavenera [mailto:giuseppe.cantaven...@azcom.it]
>Sent: Tuesday, July 7, 2015 3:43 PM
>To: netdev@vger.kernel.org
>Cc: Giuseppe Cantavenera; Steffen Klassert; David S. Miller; Du, Fan; Alexander
>Sverdlin; Matija Glavinic Pecotic; Giuseppe Cantavenera; Nicholas Faustini
>Subject: [RFC net-next] xfrm: refactory to avoid state tasklet scheduling
>errors
>
>The SA state is managed by a tasklet scheduled relying on the wall clock.
>Previous changes have already tried to address bugs when the system time
>is changed, but some error conditions still exist, because the logic is
>still coupled with the wall time.
>
>If the time is changed between when the SA is created and when the tasklet
>timer is started for the first time, the SA scheduling will be broken:
>either the SA will expire and never be recreated, or it will expire at
>an unexpected time. The reason is that x->curlft.add_time will not be valid
>when the "next" variable is computed for the very first time
>in xfrm_timer_handler().
>
>Fix this behaviour by avoiding reliance on the system time.
>Stick to relative time intervals and realise a total decoupling
>from the wall time.
>
>Based on another patch written and published by
>Fan Du (fan...@intel.com) in 2013 but never merged:
>part of the code is preserved, some rewritten and improved.
>Changes to the logic accounting for the use_time expiration.
>Here we allow both add_time and use_time expirations to be set.
>
>Cc: Steffen Klassert
>Cc: David S. Miller
>Cc: Fan Du
>Cc: Alexander Sverdlin
>Cc: Matija Glavinic Pecotic
>Signed-off-by: Giuseppe Cantavenera
>Signed-off-by: Nicholas Faustini
>---
>
>Hello,
>
>we also met the same bug Fan Du did a while ago.
>Two solutions were proposed in the past:
>either forcibly mark as expired all of the keys every time the clock is set,
>or replace the existing timers with relative ones.
>
>The former would introduce unexpected behaviour
>(the keys would keep expiring when they shouldn't) and does not address the
>real problem: THE COUPLING between the SA scheduling and the wall timer.
>Actually it introduces even more of that.
>
>The latter is robust, extremely lightweight and maintainable, and preserves
>the expected behaviour; that's why we preferred it.
>
>Any feedback or any other idea is greatly appreciated.

Thanks for keeping at this issue, as I did two years ago.

The maintainers' objection to the original approach was that it complicates
the logic to the point of requiring extra maintenance effort; that is, the
effort is not worthwhile against the trouble it might introduce in the
future.

Another approach you can try is using monotonic boot time (which also
counts suspend time) to mark the lifetime of the SA; the timer handler
logic would then be much simpler and smaller than it is now, and naturally
robust. The cost is that displaying the SA lifetime with setkey, and SA
migration, have to be taken care of, since the SA lifetime would then be
boot time, not wall time.

>Thanks,
>Regards,
>Giuseppe
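In userspace terms, the boot-time suggestion above maps onto CLOCK_BOOTTIME: it is monotonic, keeps counting across suspend, and is immune to settimeofday()/NTP steps, whereas CLOCK_MONOTONIC stops during suspend and CLOCK_REALTIME jumps with clock changes. A hedged sketch of expiry bookkeeping against it (an analogue, not the kernel xfrm code):

    #define _GNU_SOURCE                  /* for CLOCK_BOOTTIME on older glibc */
    #include <stdio.h>
    #include <time.h>

    /* Deadlines kept against CLOCK_BOOTTIME survive wall-clock changes and
     * keep counting across suspend. */
    static double boottime_now(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_BOOTTIME, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        double deadline = boottime_now() + 3600.0;  /* e.g. SA hard expiry */

        /* Even if the wall clock is stepped now, the remaining lifetime
         * computed this way stays correct. */
        printf("seconds until expiry: %.0f\n", deadline - boottime_now());
        return 0;
    }

The cost named above shows up here too: a boot-time stamp means anything that displays or migrates SA lifetimes has to convert, because the stored value is no longer wall time.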
RE: [PATCH] xfrm: fix a race in xfrm_state_lookup_byspi
>-----Original Message-----
>From: roy.qing...@gmail.com [mailto:roy.qing...@gmail.com]
>Sent: Wednesday, April 29, 2015 8:43 AM
>To: netdev@vger.kernel.org
>Cc: Du, Fan; steffen.klass...@secunet.com
>Subject: [PATCH] xfrm: fix a race in xfrm_state_lookup_byspi
>
>From: Li RongQing
>
>The returned xfrm_state should be held before unlocking xfrm_state_lock,
>otherwise the returned xfrm_state may be released.
>
>Fixes: c454997e6 ("{pktgen, xfrm} Introduce xfrm_state_lookup_byspi..")
>Cc: Fan Du
>Signed-off-by: Li RongQing

Acked-by: Fan Du

>---
> net/xfrm/xfrm_state.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
>index f5e39e3..96688cd 100644
>--- a/net/xfrm/xfrm_state.c
>+++ b/net/xfrm/xfrm_state.c
>@@ -927,8 +927,8 @@ struct xfrm_state *xfrm_state_lookup_byspi(struct net *net, __be32 spi,
> 		    x->id.spi != spi)
> 			continue;
>
>-		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
> 		xfrm_state_hold(x);
>+		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
> 		return x;
> 	}
> 	spin_unlock_bh(&net->xfrm.xfrm_state_lock);
>--
>2.1.0
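The race and its fix are an instance of a general lookup rule: take the reference while still holding the lock that protects the object's lifetime, otherwise the object can be freed in the window between the unlock and the return. A generic sketch of the pattern (a pthread analogue with illustrative names, not the xfrm code):

    #include <pthread.h>
    #include <stdio.h>

    struct obj {
        int refcnt;                  /* protected by table_lock */
        int key;
        struct obj *next;
    };

    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct obj *table;

    /* Take the reference *before* dropping the lock, exactly as the fix
     * moves xfrm_state_hold() above spin_unlock_bh(). */
    static struct obj *lookup_hold(int key)
    {
        struct obj *o;

        pthread_mutex_lock(&table_lock);
        for (o = table; o; o = o->next) {
            if (o->key != key)
                continue;
            o->refcnt++;             /* xfrm_state_hold() analogue */
            pthread_mutex_unlock(&table_lock);
            return o;                /* can no longer be freed under us */
        }
        pthread_mutex_unlock(&table_lock);
        return NULL;
    }

    int main(void)
    {
        struct obj o = { .refcnt = 1, .key = 42, .next = NULL };

        table = &o;
        return lookup_hold(42) ? 0 : 1;
    }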