Re: [Query] Delayed vxlan socket creation?

2016-12-15 Thread Du, Fan



On 2016-12-15 01:24, Cong Wang wrote:

On Tue, Dec 13, 2016 at 11:49 PM, Du, Fan wrote:

Hi

I'm interested in a Docker issue [1] that looks related to kernel vxlan
socket creation, as described in the thread. From my limited knowledge,
socket creation is synchronous, and after the *socket* syscall the sock
handle is valid and ready for link-up.

You need to read the code. A vxlan tunnel is a UDP tunnel, so it needs a kernel
socket (and a port) to set up UDP communication, unlike GRE tunnels etc.

I checked, and the fix was merged in 4.0. My code base is pretty new,
which is why I failed to see the work queue stuff in drivers/net/vxlan.c.

I'm not sure about the detailed scenario here, or which commit might have
fixed it.
Thanks!

Quoted analysis:
--
(Found in kernel 3.13)
The issue happens because in older kernels, when a vxlan interface is created,
the socket creation is queued to a worker thread, which actually creates
the socket. But this needs to happen before we bring up the link on the vxlan
interface. If by some chance the worker thread hasn't finished creating the
socket before we bring the link up, then on link-up the kernel checks whether
the socket was created and, if not, returns ENOTCONN. This was a bug in the
kernel which got fixed in later kernels. That is why retrying with a timer
fixes the issue.


This was introduced by commit 1c51a9159ddefa5119724a4c7da3fd3ef44b68d5
and later fixed by commit 56ef9c909b40483d2c8cb63fcbf83865f162d5ec.

Trust in Brother Cong and live forever.
Thanks for the offending commit id!




Re: [Query] Delayed vxlan socket creation?

2016-12-15 Thread Du, Fan



On 2016-12-14 17:29, Jiri Benc wrote:

On Wed, 14 Dec 2016 07:49:24 +, Du, Fan wrote:

I'm interested in a Docker issue [1] that looks related to kernel vxlan
socket creation, as described in the thread. From my limited knowledge,
socket creation is synchronous, and after the *socket* syscall the sock
handle is valid and ready for link-up.

I'm not sure about the detailed scenario here, or which commit might have
fixed it.

baf606d9c9b1^..56ef9c909b40

  Jiri


Thanks a lot Jiri!


[Query] Delayed vxlan socket creation?

2016-12-13 Thread Du, Fan
Hi

I'm interested in a Docker issue [1] that looks related to kernel vxlan
socket creation, as described in the thread. From my limited knowledge,
socket creation is synchronous, and after the *socket* syscall the sock
handle is valid and ready for link-up.

I'm not sure about the detailed scenario here, or which commit might have
fixed it.
Thanks!

Quoted analysis:
--
(Found in kernel 3.13)
The issue happens because in older kernels, when a vxlan interface is created,
the socket creation is queued to a worker thread, which actually creates
the socket. But this needs to happen before we bring up the link on the vxlan
interface. If by some chance the worker thread hasn't finished creating the
socket before we bring the link up, then on link-up the kernel checks whether
the socket was created and, if not, returns ENOTCONN. This was a bug in the
kernel which got fixed in later kernels. That is why retrying with a timer
fixes the issue.
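
The retry workaround described in the quoted analysis can be sketched in
userspace C as follows. This is only an illustration under assumptions: the
callback type link_up_fn, the helper link_up_with_retry and the fake
fake_link_up are hypothetical names, not libnetwork or kernel code, and the
callback is assumed to return 0 or a negative errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success or a negative errno. */
typedef int (*link_up_fn)(void *ctx);

/*
 * Retry link-up while it fails with -ENOTCONN, sleeping delay_ms between
 * attempts. Any other result (success or a different error) is returned
 * immediately. Returns the result of the last attempt.
 */
static int link_up_with_retry(link_up_fn up, void *ctx,
                              int max_attempts, long delay_ms)
{
    int ret = -EINVAL;

    assert(up != NULL && max_attempts > 0 && delay_ms >= 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        ret = up(ctx);
        if (ret != -ENOTCONN)
            break;                  /* success or a non-retryable error */
        if (attempt + 1 < max_attempts) {
            struct timespec ts = {
                .tv_sec = delay_ms / 1000,
                .tv_nsec = (delay_ms % 1000) * 1000000L,
            };
            nanosleep(&ts, NULL);   /* give the worker time to run */
        }
    }
    return ret;
}

/* Fake link-up for demonstration: fails until the counter reaches zero. */
static int fake_link_up(void *ctx)
{
    int *pending = ctx;

    if (*pending > 0) {
        (*pending)--;
        return -ENOTCONN;           /* socket not created yet */
    }
    return 0;
}
```

Bounding the number of attempts keeps a genuinely broken setup from looping
forever; on kernels that contain the fix, the first attempt should succeed
and the loop adds no delay.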

[1]: https://github.com/docker/libnetwork/issues/1247



Re: GSO packets on lower MTU retaining gso_size?

2016-06-06 Thread Du, Fan



On 2016/6/7 14:05, Yuval Mintz wrote:

While experimenting with Vxlan tunnels, I've reached a topology where the
Vxlan interface's MTU was 1500 while base-interface was smaller [600].

While 'regular' packets broke via ip-fragmentation, GSO SKBs passing from
the vxlan interface to the base interface remained whole, and their
`gso_size' remained matching to that of the vxlan-interface's MTU;
This caused the HW to drop said packets, as it would have resulted with
the device sending to the line packets with length larger than the mtu.

Is this broken on the udp-tunnel transmit path, the setup or the driver [qede]?


I believe it's identical to an issue I hit before [1]. The owner of the
offending code believes a host can't generate packets larger than the
underlying NIC MTU, and refuses to do GSO here.

[1]: https://patchwork.ozlabs.org/patch/415791/
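
To make the size mismatch concrete, here is a rough arithmetic sketch in C.
It assumes IPv4 for both inner and outer headers, no IP or TCP options, and
that gso_size equals the TCP MSS derived from the vxlan device's MTU. The
helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Header sizes in bytes, assuming IPv4 and TCP without options. */
enum {
    ETH_HDR   = 14,
    IPV4_HDR  = 20,
    TCP_HDR   = 20,
    UDP_HDR   = 8,
    VXLAN_HDR = 8,
    /* Outer IPv4 + UDP + VXLAN + inner Ethernet */
    VXLAN_OVERHEAD = IPV4_HDR + UDP_HDR + VXLAN_HDR + ETH_HDR,
};

/* gso_size (the MSS) a TCP sender derives from the tunnel device MTU. */
static int gso_size_for_mtu(int tunnel_mtu)
{
    assert(tunnel_mtu > IPV4_HDR + TCP_HDR);
    return tunnel_mtu - IPV4_HDR - TCP_HDR;
}

/* IP-level length of one encapsulated segment as seen by the base device. */
static int encap_segment_len(int gso_size)
{
    return gso_size + TCP_HDR + IPV4_HDR + VXLAN_OVERHEAD;
}

/* True if each segment cut from the GSO skb fits the base-device MTU. */
static bool segment_fits(int gso_size, int base_mtu)
{
    return encap_segment_len(gso_size) <= base_mtu;
}
```

With the numbers from the report, gso_size_for_mtu(1500) is 1460 and
encap_segment_len(1460) is 1550, well above the 600-byte base MTU. Under
these assumptions the largest gso_size that would still fit is 510.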




Re: [PATCH net-next] net: hns: optimize XGE capability by reducing cpu usage

2015-12-07 Thread Du, Fan



On 2015/12/8 14:22, Yankejian (Hackim Yim) wrote:


On 2015/12/7 16:58, Du, Fan wrote:

>
>
>On 2015/12/5 15:32, yankejian wrote:

>>here is the patch raising the performance of XGE by:
>>1)changes the page management method for enet memory, and
>>2)reduces the count of rmb, and
>>3)adds Memory prefetching

>
>Any numbers on how much it boosts performance?
>

it is almost the same as 82599.


I mean how much performance improves *BEFORE* versus *AFTER* this patch
on the Huawei XGE chip. The commit log states that it is "raising the
performance" but didn't give any test numbers.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] net: hns: optimize XGE capability by reducing cpu usage

2015-12-07 Thread Du, Fan



On 2015/12/5 15:32, yankejian wrote:

here is the patch raising the performance of XGE by:
1)changes the page management method for enet memory, and
2)reduces the count of rmb, and
3)adds Memory prefetching


Any numbers on how much it boosts performance?


Signed-off-by: yankejian 
---
  drivers/net/ethernet/hisilicon/hns/hnae.h |  5 +-
  drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c |  1 -
  drivers/net/ethernet/hisilicon/hns/hns_enet.c | 79 +++
  3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index d1f3316..6ca94dc 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -341,7 +341,8 @@ struct hnae_queue {
 	void __iomem *io_base;
 	phys_addr_t phy_base;
 	struct hnae_ae_dev *dev;	/* the device who use this queue */
-	struct hnae_ring rx_ring, tx_ring;
+	struct hnae_ring rx_ring ____cacheline_internodealigned_in_smp;
+	struct hnae_ring tx_ring ____cacheline_internodealigned_in_smp;
 	struct hnae_handle *handle;
 };

@@ -597,11 +598,9 @@ static inline void hnae_replace_buffer(struct hnae_ring *ring, int i,
 					struct hnae_desc_cb *res_cb)
 {
 	struct hnae_buf_ops *bops = ring->q->handle->bops;
-	struct hnae_desc_cb tmp_cb = ring->desc_cb[i];

 	bops->unmap_buffer(ring, &ring->desc_cb[i]);
 	ring->desc_cb[i] = *res_cb;
-	*res_cb = tmp_cb;
 	ring->desc[i].addr = (__le64)ring->desc_cb[i].dma;
 	ring->desc[i].rx.ipoff_bnum_pid_flag = 0;
 }
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index 77c6edb..522b264 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -341,7 +341,6 @@ void hns_ae_toggle_ring_irq(struct hnae_ring *ring, u32 mask)
 	else
 		flag = RCB_INT_FLAG_RX;

-	hns_rcb_int_clr_hw(ring->q, flag);
 	hns_rcb_int_ctrl_hw(ring->q, flag, mask);
 }

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index cad2663..e2be510 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -33,6 +33,7 @@

 #define RCB_IRQ_NOT_INITED 0
 #define RCB_IRQ_INITED 1
+#define HNS_BUFFER_SIZE_2048 2048

 #define BD_MAX_SEND_SIZE 8191
 #define SKB_TMP_LEN(SKB) \
@@ -491,13 +492,51 @@ static unsigned int hns_nic_get_headlen(unsigned char *data, u32 flag,
 	return max_size;
 }

-static void
-hns_nic_reuse_page(struct hnae_desc_cb *desc_cb, int tsize, int last_offset)
+static void hns_nic_reuse_page(struct sk_buff *skb, int i,
+			       struct hnae_ring *ring, int pull_len,
+			       struct hnae_desc_cb *desc_cb)
 {
+	struct hnae_desc *desc;
+	int truesize, size;
+	int last_offset = 0;
+
+	desc = &ring->desc[ring->next_to_clean];
+	size = le16_to_cpu(desc->rx.size);
+
+#if (PAGE_SIZE < 8192)
+	if (hnae_buf_size(ring) == HNS_BUFFER_SIZE_2048) {
+		truesize = hnae_buf_size(ring);
+	} else {
+		truesize = ALIGN(size, L1_CACHE_BYTES);
+		last_offset = hnae_page_size(ring) - hnae_buf_size(ring);
+	}
+
+#else
+	truesize = ALIGN(size, L1_CACHE_BYTES);
+	last_offset = hnae_page_size(ring) - hnae_buf_size(ring);
+#endif
+
+	skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len,
+			size - pull_len, truesize - pull_len);
+
 	/* avoid re-using remote pages,flag default unreuse */
 	if (likely(page_to_nid(desc_cb->priv) == numa_node_id())) {
+#if (PAGE_SIZE < 8192)
+		if (hnae_buf_size(ring) == HNS_BUFFER_SIZE_2048) {
+			/* if we are only owner of page we can reuse it */
+			if (likely(page_count(desc_cb->priv) == 1)) {
+				/* flip page offset to other buffer */
+				desc_cb->page_offset ^= truesize;
+
+				desc_cb->reuse_flag = 1;
+				/* bump ref count on page before it is given*/
+				get_page(desc_cb->priv);
+			}
+			return;
+		}
+#endif
 		/* move offset up to the next cache line */
-		desc_cb->page_offset += tsize;
+		desc_cb->page_offset += truesize;

 		if (desc_cb->page_offset <= last_offset) {
 			desc_cb->reuse_flag = 1;
@@ -529,11 +568,10 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	struct hnae_desc *desc;
 	struct hnae_desc_cb *desc_cb;
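
The half-page flip added in the hns_nic_reuse_page() hunk above
(desc_cb->page_offset ^= truesize) can be illustrated in isolation. This is a
rough userspace sketch, assuming a 4096-byte page split into two 2048-byte
buffers; struct rx_buf and rx_buf_try_reuse are hypothetical names, and
page_refs merely simulates what page_count() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PAGE_BYTES = 4096, BUF_BYTES = 2048 };

_Static_assert(2 * BUF_BYTES == PAGE_BYTES, "page must hold two buffers");

/* Hypothetical stand-in for the driver's per-descriptor state. */
struct rx_buf {
    unsigned int page_offset;   /* 0 or BUF_BYTES */
    int page_refs;              /* simulated page_count() */
    bool reuse;                 /* may the page be handed back to the ring? */
};

/*
 * Called after the current half of the page has been attached to an skb.
 * The page is reused only if the driver is its sole owner; in that case
 * the offset flips to the other half and an extra reference is taken.
 */
static void rx_buf_try_reuse(struct rx_buf *b)
{
    assert(b != NULL);
    assert(b->page_offset == 0 || b->page_offset == BUF_BYTES);

    b->reuse = false;
    if (b->page_refs != 1)
        return;                     /* someone else still holds the page */

    b->page_offset ^= BUF_BYTES;    /* 0 <-> 2048 */
    b->page_refs++;                 /* reference for the half just given */
    b->reuse = true;
}
```

Because the buffer size is exactly half the page, the XOR toggles between
the two halves without any bounds check.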

RE: [RFC net-next] xfrm: refactory to avoid state tasklet scheduling errors

2015-07-14 Thread Du, Fan


>-----Original Message-----
>From: Giuseppe Cantavenera [mailto:giuseppe.cantaven...@azcom.it]
>Sent: Tuesday, July 7, 2015 3:43 PM
>To: netdev@vger.kernel.org
>Cc: Giuseppe Cantavenera; Steffen Klassert; David S. Miller; Du, Fan; Alexander
>Sverdlin; Matija Glavinic Pecotic; Giuseppe Cantavenera; Nicholas Faustini
>Subject: [RFC net-next] xfrm: refactory to avoid state tasklet scheduling 
>errors
>
>The SA state is managed by a tasklet scheduled relying on the wall clock.
>Previous changes have already tried to address bugs
>when the system time is changed but some error conditions still exist,
>because the logic is still coupled with the wall time.
>
>If the time is changed in between the SA is created and the tasklet timer
>is started for the first time, the SA scheduling will be broken:
>either the SA will expire and never be recreated, or it will expire at
>an unexpected time.  The reason is that x->curlft.add_time will not be valid
>when the "next" variable is computed for the very first time
>in xfrm_timer_handler().
>
>Fix this behaviour by avoiding to rely on the system time.
>Stick to relative time intervals and realise a total decoupling
>from the wall time.
>
>Based on another patch written and published by
>Fan Du (fan...@intel.com) in 2013 but never merged:
>part of the code preserved, some rewritten and improved.
>Changes to the logic accounting for the use_time expiration.
>Here we allow both add_time and use_time expirations to be set.
>
>Cc: Steffen Klassert 
>Cc: David S. Miller 
>Cc: Fan Du 
>Cc: Alexander Sverdlin 
>Cc: Matija Glavinic Pecotic 
>Signed-off-by: Giuseppe Cantavenera 
>Signed-off-by: Nicholas Faustini 
>---
>
>Hello,
>
>we also meet the same bug Fan Du did a while ago.
>Two solutions were proposed in the past:
>either forcibly mark as expired all of the keys every time the clock is set,
>or replace the existing timers with relative ones.
>
>The former would introduce unexpected behaviour
>(the keys would keep expiring when they shouldn't) and does not address the
>real problem: THE COUPLING between the SA scheduling and the wall timer.
>Actually it introduces even more of that.
>
>The latter is robust, extremely lightweight and maintainable, and preserves the
>expected behaviour, that's why we preferred it.
>
>Any feedback or any other idea is greatly appreciated.

Thanks for continuing to work on this issue, as I did two years ago.

The maintainers' objection to the original approach was that it complicates
the logic to a degree that requires extra maintenance effort; that is, the
effort isn't worthwhile given the trouble it might introduce in the future.

Another approach you can try is to use monotonic boot time (which also counts
time spent in suspend) to mark the lifetime of an SA. The timer handler logic
then becomes much simpler and smaller than it is now, and naturally robust.
The cost is that SA lifetime display by setkey and SA migration have to be
taken care of, since the SA lifetime is then in boot time, not wall time.
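
A minimal userspace sketch of the boot-time idea follows. It is an
illustration under assumptions, not the xfrm implementation: struct
sa_lifetime and the sa_* helpers are hypothetical, and the code falls back to
CLOCK_MONOTONIC (which does not count suspend time) where CLOCK_BOOTTIME is
not available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Prefer CLOCK_BOOTTIME; CLOCK_MONOTONIC is only a portability fallback. */
#ifdef CLOCK_BOOTTIME
#define SA_CLOCK CLOCK_BOOTTIME
#else
#define SA_CLOCK CLOCK_MONOTONIC
#endif

/* Hypothetical stand-in for an SA's lifetime bookkeeping. */
struct sa_lifetime {
    int64_t add_boot_sec;       /* boot-time seconds when the SA was added */
    int64_t hard_lifetime_sec;  /* hard lifetime in seconds, 0 = no limit */
};

/* Current time on the chosen clock, in whole seconds. */
static int64_t sa_clock_now(void)
{
    struct timespec ts;
    int rc = clock_gettime(SA_CLOCK, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec;
}

/* Seconds until hard expiry; zero or negative means already expired. */
static int64_t sa_seconds_left(const struct sa_lifetime *lt, int64_t now)
{
    if (lt->hard_lifetime_sec == 0)
        return INT64_MAX;
    return lt->add_boot_sec + lt->hard_lifetime_sec - now;
}

static bool sa_expired(const struct sa_lifetime *lt, int64_t now)
{
    return sa_seconds_left(lt, now) <= 0;
}

/* For display only: map the boot-time add time back to wall time. */
static int64_t sa_add_wall_sec(const struct sa_lifetime *lt,
                               int64_t now_boot, int64_t now_wall)
{
    return now_wall - (now_boot - lt->add_boot_sec);
}
```

The expiry check never reads the wall clock, so setting the system time
cannot disturb it. Wall time appears only in the display helper, which is
where the setkey and migration cost mentioned above shows up.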


>Thanks,
>Regards,
>Giuseppe


RE: [PATCH] xfrm: fix a race in xfrm_state_lookup_byspi

2015-04-28 Thread Du, Fan

>-----Original Message-----
>From: roy.qing...@gmail.com [mailto:roy.qing...@gmail.com]
>Sent: Wednesday, April 29, 2015 8:43 AM
>To: netdev@vger.kernel.org
>Cc: Du, Fan; steffen.klass...@secunet.com
>Subject: [PATCH] xfrm: fix a race in xfrm_state_lookup_byspi
>
>From: Li RongQing 
>
>The returned xfrm_state should be held before unlocking xfrm_state_lock,
>otherwise the returned xfrm_state may be released.
>
>Fixes: c454997e6[{pktgen, xfrm} Introduce xfrm_state_lookup_byspi..]
>Cc: Fan Du 
>Signed-off-by: Li RongQing 

Acked-by: Fan Du 
 

>---
> net/xfrm/xfrm_state.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
>index f5e39e3..96688cd 100644
>--- a/net/xfrm/xfrm_state.c
>+++ b/net/xfrm/xfrm_state.c
>@@ -927,8 +927,8 @@ struct xfrm_state *xfrm_state_lookup_byspi(struct net *net, __be32 spi,
> 		    x->id.spi != spi)
> 			continue;
>
>-		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
> 		xfrm_state_hold(x);
>+		spin_unlock_bh(&net->xfrm.xfrm_state_lock);
> 		return x;
> 	}
> 	spin_unlock_bh(&net->xfrm.xfrm_state_lock);
>--
>2.1.0
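
The ordering matters because, once the lock is dropped, another context may
remove the state and drop the last reference before the caller takes its own.
A minimal userspace sketch of the corrected ordering follows, using a pthread
mutex in place of the spinlock and a plain counter in place of the kernel's
reference count. struct state, state_insert and state_lookup_byspi are
hypothetical stand-ins, not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct xfrm_state. */
struct state {
    uint32_t spi;
    int refcnt;             /* protected by table_lock in this sketch */
    struct state *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct state *table_head;

/* Add a state to the table; the table keeps the caller's reference. */
static void state_insert(struct state *x)
{
    assert(x != NULL);

    pthread_mutex_lock(&table_lock);
    x->next = table_head;
    table_head = x;
    pthread_mutex_unlock(&table_lock);
}

/* Look up by SPI and return the state with an extra reference held. */
static struct state *state_lookup_byspi(uint32_t spi)
{
    struct state *x;

    pthread_mutex_lock(&table_lock);
    for (x = table_head; x != NULL; x = x->next) {
        if (x->spi != spi)
            continue;

        x->refcnt++;                        /* hold first ... */
        pthread_mutex_unlock(&table_lock);  /* ... then unlock */
        return x;
    }
    pthread_mutex_unlock(&table_lock);
    return NULL;
}
```

Taking the reference while the lock is still held means the entry cannot be
unlinked and freed between finding it and pinning it, which is the window
the patch closes.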
