Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()

2017-02-01 Thread Eric Dumazet
On Wed, 2017-02-01 at 18:29 -0500, Boris Ostrovsky wrote:

> 
> I could not convince myself that napi_synchronize() is sufficient here
> (mostly because I am not familiar with napi flow). At the same time I
> would rather not make changes in anticipation of possible disappearance
> of netif_carrier_ok() in the future so I'd like this patch to go in as is.
> 
> Unless there are other problems with the patch or if Eric (or others)
> feel strongly about usage of netif_carrier_ok() here.
> 

No strong feelings from me.
We probably have more serious issues to fix anyway.



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()

2017-01-30 Thread Eric Dumazet
On Mon, 2017-01-30 at 13:23 -0500, Boris Ostrovsky wrote:

> We do netif_carrier_off() first thing in xennet_disconnect_backend() and
> the only place where the timer is rearmed is xennet_alloc_rx_buffers(),
> which is guarded by netif_carrier_ok() check.

Oh well, testing netif_carrier_ok() in packet processing fast path looks
unusual and a waste of cpu cycles. I've never seen that pattern before.

If one day, we remove this netif_carrier_ok() test during a cleanup,
then the race window will open again.





Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()

2017-01-30 Thread Eric Dumazet
On Mon, 2017-01-30 at 12:45 -0500, Boris Ostrovsky wrote:
> rx_refill_timer should be deleted as soon as we disconnect from the
> backend since otherwise it is possible for the timer to go off before
> we get to xennet_destroy_queues(). If this happens we may dereference
> queue->rx.sring which is set to NULL in xennet_disconnect_backend().
> 
> Signed-off-by: Boris Ostrovsky 
> CC: sta...@vger.kernel.org
> ---
>  drivers/net/xen-netfront.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 8315fe7..722fe9f 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1379,6 +1379,8 @@ static void xennet_disconnect_backend(struct netfront_info *info)
>   for (i = 0; i < num_queues && info->queues; ++i) {
>   struct netfront_queue *queue = &info->queues[i];
>  
> + del_timer_sync(&queue->rx_refill_timer);
> +

If napi_disable() was not called before this del_timer_sync(), another
RX might come here and rearm rx_refill_timer.

>   if (queue->tx_irq && (queue->tx_irq == queue->rx_irq))
>   unbind_from_irqhandler(queue->tx_irq, queue);
>   if (queue->tx_irq && (queue->tx_irq != queue->rx_irq)) {
> @@ -1733,7 +1735,6 @@ static void xennet_destroy_queues(struct netfront_info *info)
>  
>   if (netif_running(info->netdev))
>   napi_disable(&queue->napi);
> - del_timer_sync(&queue->rx_refill_timer);
>   netif_napi_del(&queue->napi);
>   }
>  
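The ordering Eric is pointing at can be sketched as a toy, single-threaded model (the struct and function names below are illustrative stand-ins, not the real netfront symbols):

```c
#include <stdbool.h>

/* Toy model of the teardown race: while NAPI is live, the RX path may
 * rearm the refill timer, so the timer must be deleted only after NAPI
 * has been disabled. */
struct toy_queue {
    bool napi_enabled;   /* models NAPI being active */
    bool timer_armed;    /* models rx_refill_timer pending */
};

/* Models the RX path: while NAPI runs it may rearm the refill timer,
 * as xennet_alloc_rx_buffers() does. */
static void rx_poll(struct toy_queue *q)
{
    if (q->napi_enabled)
        q->timer_armed = true;
}

/* Wrong order: timer deleted first, RX still live and rearms it. */
static bool teardown_timer_first(struct toy_queue *q)
{
    q->timer_armed = false;      /* del_timer_sync() */
    rx_poll(q);                  /* an RX event sneaks in */
    q->napi_enabled = false;     /* napi_disable() - too late */
    return q->timer_armed;       /* true: timer can still fire */
}

/* Safe order: disable NAPI first, then delete the timer. */
static bool teardown_napi_first(struct toy_queue *q)
{
    q->napi_enabled = false;     /* napi_disable() */
    rx_poll(q);                  /* RX can no longer rearm */
    q->timer_armed = false;      /* del_timer_sync() */
    return q->timer_armed;       /* false: nothing left pending */
}
```

Boris's argument is that the `netif_carrier_ok()` check in `xennet_alloc_rx_buffers()` closes the same window; the model above only illustrates why the deletion order matters if that guard ever disappears.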





Re: [Xen-devel] [PATCH net-next] xen-netfront: reject short packets and handle non-linear packets

2017-01-25 Thread Eric Dumazet
On Wed, 2017-01-25 at 16:26 +, Paul Durrant wrote:
> Sowmini points out two vulnerabilities in xen-netfront:
> 
> a) The code assumes that skb->len is at least ETH_HLEN.
> b) The code assumes that at least ETH_HLEN octets are in the linear
>    part of the socket buffer.
> 
> This patch adds tests for both of these, and in the case of the latter
> pulls sufficient bytes into the linear area.
> 
> Signed-off-by: Paul Durrant 
> Reported-by: Sowmini Varadhan 
> Tested-by: Sowmini Varadhan 
> ---
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> ---
>  drivers/net/xen-netfront.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 40f26b6..0478809 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -567,6 +567,10 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>   u16 queue_index;
>   struct sk_buff *nskb;
>  
> + /* Basic sanity check */
> + if (unlikely(skb->len < ETH_HLEN))
> + goto drop;
> +
>   /* Drop the packet if no queues are set up */
>   if (num_queues < 1)
>   goto drop;
> @@ -609,6 +613,11 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>   }
>  
>   len = skb_headlen(skb);
> + if (unlikely(len < ETH_HLEN)) {
> + if (!__pskb_pull_tail(skb, ETH_HLEN - len))
> + goto drop;
> + len = ETH_HLEN;
> + }

Looks like duplicated code, and buggy, considering the code above

page = virt_to_page(skb->data);
offset = offset_in_page(skb->data);

Your patch might end up with skb->data/head being reallocated, and use
after free would happen.

What about something like that ?

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 40f26b69beb11459f0566fc1d1d739aa75e643bf..99a67fe4de86d3141169143b0820d00968cb09f2 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -583,6 +583,8 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
skb->len);
goto drop;
}
+   if (!pskb_may_pull(skb, ETH_HLEN))
+   goto drop;
 
slots = xennet_count_skb_slots(skb);
if (unlikely(slots > MAX_XEN_SKB_FRAGS + 1)) {
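The reallocation hazard Eric describes has a plain-C analogue. Here `pull` is a hypothetical stand-in for `__pskb_pull_tail()`, not a kernel API; like the real function it may move the underlying buffer, so pointers derived beforehand go stale:

```c
#include <stdlib.h>

/* May move the buffer, just as __pskb_pull_tail() may reallocate
 * skb->head.  Error handling elided for brevity. */
static char *pull(char *buf, size_t *cap, size_t want)
{
    if (*cap >= want)
        return buf;               /* no reallocation needed */
    buf = realloc(buf, want);     /* may move the block */
    *cap = want;
    return buf;
}
```

The bug pattern is `char *p = buf; buf = pull(buf, &cap, n); use(p);` — the fix is exactly what Eric's diff does: perform the pull first, and derive `page`/`offset` from `skb->data` only afterwards.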





Re: [Xen-devel] [PATCH] xen/blkback: use rb_entry()

2016-12-20 Thread Eric Dumazet
On Tue, 2016-12-20 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 20, 2016 at 05:44:06PM +, Roger Pau Monné wrote:
> > On Tue, Dec 20, 2016 at 11:47:03AM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Dec 20, 2016 at 10:02:19PM +0800, Geliang Tang wrote:
> > > > To make the code clearer, use rb_entry() instead of container_of() to
> > > > deal with rbtree.
> > > 
> > > That is OK but I think 'container_of' is more clear.
> > > 
> > > Roger, thoughts?
> > 
> > I think so: container_of is a global macro that's widely used and everyone
> > knows; rb_entry OTOH is not, and its use doesn't really simplify the code
> > at all. I'm not really opposed, but it seems kind of a pointless change
> > (not that it's wrong).
> 
>  I concur.
> 
> Geliang Tang,
> 
> Thank you for the patch but there is no need for it.

The same could be said of list_entry()

#define hlist_entry(ptr, type, member) container_of(ptr,type,member)

#define list_entry(ptr, type, member) container_of(ptr, type, member)

# git grep -n list_entry | wc -l
3636

rb_entry() will probably make its way everywhere.
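Both helpers boil down to the same pointer arithmetic. A minimal userspace sketch of `rb_entry()`/`container_of()` (with a stub node type standing in for `struct rb_node`):

```c
#include <stddef.h>

/* Simplified version of the kernel's container_of() */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
/* rb_entry() is just an alias, exactly like list_entry() */
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

struct rb_node_stub { int color; };  /* stand-in for struct rb_node */

struct item {
    int key;
    struct rb_node_stub node;        /* node embedded in the object */
};

/* Recover the containing object from a pointer to its embedded node. */
static struct item *item_from_node(struct rb_node_stub *n)
{
    return rb_entry(n, struct item, node);
}
```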






Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Eric Dumazet
On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
 Saturday, August 15, 2015, 12:39:25 AM, you wrote:
 
  On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
  On 2015-08-13 00:41, Eric Dumazet wrote:
   On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
   
   Thanks for the reminder, but luckily i was aware of that,
   seen enough of your replies asking for patches to be resubmitted
   against the other tree ;)
   Kernel with patch is currently running so fingers crossed.
   
   Thanks for testing. I am definitely interested knowing your results.
  
  Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
  breaking things
  (have to test if a revert helps) i get this in some guests:
 
 
  Yes, this was fixed by :
  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
 
 
 Hi Eric,
 
 With that patch i had a crash again this night, see below.
 
 --
 Sander
 
 [177459.188808] general protection fault:  [#1] SMP 
 [177459.199746] Modules linked in:
 [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
 4.2.0-rc6-20150815-linus-doflr-net+ #1
 [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
 V1.8B1 09/13/2010
 [177459.232247] task: 8221a580 ti: 8220 task.ti: 
 8220
 [177459.242931] RIP: e030:[8110eb58]  [8110eb58] 
 detach_if_pending+0x18/0x80
 [177459.253503] RSP: e02b:88005f6039d8  EFLAGS: 00010086
 [177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: 
 dead00200200
 [177459.274599] RDX:  RSI: 88005f60e5c0 RDI: 
 880004901420
 [177459.285122] RBP: 88005f6039d8 R08: 0001 R09: 
 
 [177459.295286] R10: 0003 R11: 880004901394 R12: 
 0003
 [177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 
 88005f60e5c0
 [177459.315345] FS:  7f51317ec700() GS:88005f60() 
 knlGS:
 [177459.325340] CS:  e033 DS:  ES:  CR0: 8005003b
 [177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 
 0660
 [177459.345129] Stack:
 [177459.354783]  88005f603a28 8110ee7f 810fb261 
 0200
 [177459.364505]  0003 880004901380 0003 
 8800567d0d00
 [177459.374064]  07b98a00  88005f603a58 
 819b3eb3
 [177459.383532] Call Trace:
 [177459.392878]  IRQ 
 [177459.392935]  [8110ee7f] mod_timer_pending+0x3f/0xe0
 [177459.411058]  [810fb261] ? 
 __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
 [177459.419876]  [819b3eb3] __nf_ct_refresh_acct+0xa3/0xb0
 [177459.428642]  [819baafb] tcp_packet+0xb3b/0x1290
 [177459.437285]  [81a2535e] ? ip_output+0x5e/0xc0
 [177459.445845]  [810ca8ca] ? __local_bh_enable_ip+0x2a/0x90
 [177459.454331]  [819b35a9] ? __nf_conntrack_find_get+0x129/0x2a0
 [177459.462642]  [819b549c] nf_conntrack_in+0x29c/0x7c0
 [177459.470711]  [81a65e9c] ipv4_conntrack_local+0x4c/0x50
 [177459.478753]  [819ad67c] nf_iterate+0x4c/0x80
 [177459.486726]  [81102437] ? generic_handle_irq+0x27/0x40
 [177459.494634]  [819ad714] nf_hook_slow+0x64/0xc0
 [177459.502486]  [81a22d40] __ip_local_out_sk+0x90/0xa0
 [177459.510248]  [81a22c40] ? ip_forward_options+0x1a0/0x1a0
 [177459.517782]  [81a22d66] ip_local_out_sk+0x16/0x40
 [177459.525044]  [81a2343d] ip_queue_xmit+0x14d/0x350
 [177459.532247]  [81a3ae7e] tcp_transmit_skb+0x48e/0x960
 [177459.539413]  [81a3cddb] tcp_xmit_probe_skb+0xdb/0xf0
 [177459.546389]  [81a3dffb] tcp_write_wakeup+0x5b/0x150
 [177459.553061]  [81a3e51b] tcp_keepalive_timer+0x1fb/0x230
 [177459.559761]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
 [177459.566447]  [8110f3c7] call_timer_fn.isra.27+0x17/0x80
 [177459.573121]  [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
 [177459.579778]  [8110f55d] run_timer_softirq+0x12d/0x200
 [177459.586448]  [810ca6c3] __do_softirq+0x103/0x210
 [177459.593138]  [810ca9cb] irq_exit+0x4b/0xa0
 [177459.599783]  [814f05d4] xen_evtchn_do_upcall+0x34/0x50
 [177459.606300]  [81af93ae] xen_do_hypervisor_callback+0x1e/0x40
 [177459.612583]  EOI 
 [177459.612637]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [177459.625010]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [177459.631157]  [81008d60] ? xen_safe_halt+0x10/0x20
 [177459.637158]  [810188d3] ? default_idle+0x13/0x20
 [177459.643072]  [81018e1a] ? arch_cpu_idle+0xa/0x10
 [177459.648809]  [810f8e7e] ? default_idle_call+0x2e/0x50
 [177459.654650]  [810f9112] ? cpu_startup_entry+0x272/0x2e0
 [177459.660488]  [81ae79f7] ? rest_init+0x77/0x80
 [177459.666297]  [82312f58] ? start_kernel

Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Eric Dumazet
On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
 This is very similar to the behavior I am seeing in this bug:
 
 https://bugzilla.kernel.org/show_bug.cgi?id=102911

OK, but have you applied the fix ?

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

It will be part of the next net pull request from David Miller to Linus Torvalds.





Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-17 Thread Eric Dumazet
From: Eric Dumazet eduma...@google.com

On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:
 Monday, August 17, 2015, 4:21:47 PM, you wrote:
 
  On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
  This is very similar to the behavior I am seeing in this bug:
  
  https://bugzilla.kernel.org/show_bug.cgi?id=102911
 
  OK, but have you applied the fix ?
 
  http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
 
  It will be part of the next net pull request from David Miller to Linus Torvalds.
 
 
 I did have that patch in for my last report.
 But i don't think he had (looking at the second part of his oops).
  

Then can you try following fix as well ?

Thanks !

[PATCH] timer: fix a race in __mod_timer()

lock_timer_base() can not catch the following :

CPU1 (in __mod_timer())
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  sees timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                       return base; // oops, wrong base
timer->flags |= base->cpu; // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb (timer: Reduce timer migration overhead if disabled)
Signed-off-by: Eric Dumazet eduma...@google.com
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/timer.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa9faf7..84190f02b521 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}
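The "one go" requirement can be illustrated with a toy model, in which an observer callback stands in for CPU2's lockless read in lock_timer_base(); the flag values are made up for the example:

```c
#include <stdint.h>

#define TOY_BASEMASK 0xff   /* illustrative analogue of TIMER_BASEMASK */

static uint32_t observed;
static void observer(uint32_t flags) { observed = flags; }

/* Two-step update: an observer between the steps sees the base bits
 * cleared but the new cpu not yet set - the "wrong base" window. */
static uint32_t update_in_two_steps(uint32_t flags, uint32_t cpu)
{
    flags &= ~TOY_BASEMASK;
    observer(flags);              /* CPU2 reads here: torn value */
    flags |= cpu;
    return flags;
}

/* One-go update (what the combined WRITE_ONCE gives): any observer
 * sees either the old value or the complete new one, never a mix. */
static uint32_t update_in_one_go(uint32_t flags, uint32_t cpu)
{
    uint32_t next = (flags & ~TOY_BASEMASK) | cpu;
    observer(next);               /* only a complete value is visible */
    return next;
}
```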
 





Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80

2015-08-14 Thread Eric Dumazet
On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
 On 2015-08-13 00:41, Eric Dumazet wrote:
  On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
  
  Thanks for the reminder, but luckily i was aware of that,
  seen enough of your replies asking for patches to be resubmitted
  against the other tree ;)
  Kernel with patch is currently running so fingers crossed.
  
  Thanks for testing. I am definitely interested knowing your results.
 
 Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is 
 breaking things
 (have to test if a revert helps) i get this in some guests:


Yes, this was fixed by :
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af





Re: [Xen-devel] [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()

2015-07-15 Thread Eric Dumazet
On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
 These functions check should_resched() before unlocking spinlock/bh-enable:
 preempt_count is always non-zero => should_resched() always returns false.
 cond_resched_lock() worked iff spin_needbreak is set.

Interesting, this definitely used to work (linux-3.11)

Any idea of which commit broke things ?






Re: [Xen-devel] [PATCH] net/bridge: Add missing in6_dev_put in br_validate_ipv6

2015-07-06 Thread Eric Dumazet
On Mon, 2015-07-06 at 11:35 +0100, Julien Grall wrote:

 __in6_dev_get requires to hold rcu_read_lock or RTNL. My knowledge on
 this code is very limited. Are we sure that one this lock is hold? At
 first glance, I wasn't able to find one.

You could play it safe ;)

diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 6d12d2675c80..90e8ccc21cc5 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -104,10 +104,12 @@ int br_validate_ipv6(struct sk_buff *skb)
 {
const struct ipv6hdr *hdr;
 	struct net_device *dev = skb->dev;
-	struct inet6_dev *idev = in6_dev_get(skb->dev);
+   struct inet6_dev *idev;
u32 pkt_len;
u8 ip6h_len = sizeof(struct ipv6hdr);
 
+   rcu_read_lock();
+   idev = __in6_dev_get(dev);
if (!pskb_may_pull(skb, ip6h_len))
goto inhdr_error;
 
@@ -140,11 +142,13 @@ int br_validate_ipv6(struct sk_buff *skb)
/* No IP options in IPv6 header; however it should be
 * checked if some next headers need special treatment
 */
+   rcu_read_unlock();
return 0;
 
 inhdr_error:
IP6_INC_STATS_BH(dev_net(dev), idev, IPSTATS_MIB_INHDRERRORS);
 drop:
+   rcu_read_unlock();
return -1;
 }
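The shape of that diff — take the read-side lock before the lookup and drop it on every exit path — can be checked with a small userspace sketch, where a counter stands in for the real RCU primitives:

```c
#include <stdbool.h>

static int rcu_depth;  /* mock nesting counter for the read-side lock */
static void mock_rcu_read_lock(void)   { rcu_depth++; }
static void mock_rcu_read_unlock(void) { rcu_depth--; }

/* Mirrors the control flow of br_validate_ipv6() after the patch:
 * lock before the idev lookup, unlock on success and on both error
 * paths (inhdr_error falls through to drop, as in the original). */
static int validate(bool header_ok, bool len_ok)
{
    mock_rcu_read_lock();            /* before __in6_dev_get() */
    if (!header_ok)
        goto inhdr_error;
    if (!len_ok)
        goto drop;
    mock_rcu_read_unlock();          /* success path */
    return 0;
inhdr_error:
    /* stats are bumped against idev while still under the lock */
drop:
    mock_rcu_read_unlock();          /* error paths unlock too */
    return -1;
}
```

Whatever path is taken, the lock depth must return to zero — exactly the invariant the `rcu_read_unlock()` placement in the diff preserves.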
 





Re: [Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()

2015-07-06 Thread Eric Dumazet
On Mon, 2015-07-06 at 16:26 +0800, Bob Liu wrote:
 Hi,
 
 I tried to run the latest kernel v4.2-rc1, but often got below panic during 
 system boot.
 
 [   42.118983] BUG: unable to handle kernel paging request at 003f
 [   42.119008] IP: [8161cfd0] __netdev_pick_tx+0x70/0x120
 [   42.119023] PGD 0 
 [   42.119026] Oops:  [#1] PREEMPT SMP 
 [   42.119031] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support 
 x86_pkg_temp_thermal coretemp pcspkr crc32_pclmul crc32c_intel 
 ghash_clmulni_intel ixgbe ptp pps_core cdc_ether usbnet mii mdio sb_edac dca 
 edac_core wmi i2c_i801 tpm_tis tpm lpc_ich mfd_core ipmi_si ipmi_msghandler 
 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput usb_storage mgag200 
 i2c_algo_bit drm_kms_helper ttm drm i2c_core nvme mpt2sas raid_class 
 scsi_transport_sas
 [   42.119073] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.2.0-rc1 #80
 [   42.119077] Hardware name: Oracle Corporation SUN SERVER X4-4/ASSY,MB WITH 
 TRAY, BIOS 24030400 08/22/2014
 [   42.119081] task: 880300b84000 ti: 880300b9 task.ti: 
 880300b9
 [   42.119085] RIP: e030:[8161cfd0]  [8161cfd0] 
 __netdev_pick_tx+0x70/0x120
 [   42.119091] RSP: e02b:880306d03868  EFLAGS: 00010206
 [   42.119093] RAX: 8802f676b6b0 RBX: 003f RCX: 
 8161cf60
 [   42.119097] RDX: 001c RSI: 8802fe24c900 RDI: 
 8802f96c
 [   42.119100] RBP: 880306d038a8 R08: 00023240 R09: 
 8160fb1c
 [   42.119104] R10:  R11:  R12: 
 8802fe24c900
 [   42.119107] R13:  R14:  R15: 
 8802f96c
 [   42.119121] FS:  () GS:880306d0() 
 knlGS:
 [   42.119124] CS:  e033 DS: 002b ES: 002b CR0: 80050033
 [   42.119127] CR2: 003f CR3: 01c1c000 CR4: 
 00042660
 [   42.119130] Stack:
 [   42.119132]  81d63850 8802f63040a0 880306d03888 
 8802fe24c900
 [   42.119137]  000e  8802f96c 
 8802fe24c400
 [   42.119141]  880306d038e8 a028bea4 8189cfe0 
 81d1b900
 [   42.119146] Call Trace:
 [   42.119149]  IRQ 
 [   42.119160]  [a028bea4] ixgbe_select_queue+0xc4/0x150 [ixgbe]
 [   42.119167]  [816240ee] netdev_pick_tx+0x5e/0xf0
 [   42.119170]  [81624210] __dev_queue_xmit+0x90/0x560
 [   42.119174]  [816246f3] dev_queue_xmit_sk+0x13/0x20
 [   42.119181]  [a02d2b3a] br_dev_queue_push_xmit+0x4a/0x80 [bridge]
 [   42.119186]  [a02d2cca] br_forward_finish+0x2a/0x80 [bridge]
 [   42.119191]  [a02d2da8] __br_forward+0x88/0x110 [bridge]
 [   42.119198]  [8160e18e] ? __skb_clone+0x2e/0x140
 [   42.119202]  [8160fb33] ? skb_clone+0x63/0xa0
 [   42.119206]  [a02d2d20] ? br_forward_finish+0x80/0x80 [bridge]
 [   42.119211]  [a02d2ac7] deliver_clone+0x37/0x60 [bridge]
 [   42.119215]  [a02d2c38] br_flood+0xc8/0x130 [bridge]
 [   42.119220]  [a02d2d20] ? br_forward_finish+0x80/0x80 [bridge]
 [   42.119255]  [a02d3229] br_flood_forward+0x19/0x20 [bridge]
 [   42.119260]  [a02d4188] br_handle_frame_finish+0x258/0x590 
 [bridge]
 [   42.119266]  [8172b5d0] ? get_partial_node.isra.63+0x1b7/0x1d4
 [   42.119272]  [a02d4606] br_handle_frame+0x146/0x270 [bridge]
 [   42.119277]  [8168ed39] ? udp_gro_receive+0x129/0x150
 [   42.119281]  [81621836] __netif_receive_skb_core+0x1d6/0xa20
 [   42.119286]  [81697a1d] ? inet_gro_receive+0x9d/0x230
 [   42.119290]  [81622098] __netif_receive_skb+0x18/0x60
 [   42.119294]  [81622113] netif_receive_skb_internal+0x33/0xb0
 [   42.119297]  [81622d3f] napi_gro_receive+0xbf/0x110
 [   42.119303]  [a028def0] ixgbe_clean_rx_irq+0x490/0x9e0 [ixgbe]
 [   42.119308]  [a028f0c0] ixgbe_poll+0x420/0x790 [ixgbe]
 [   42.119312]  [8162255d] net_rx_action+0x15d/0x340
 [   42.119321]  [81095426] __do_softirq+0xe6/0x2f0
 [   42.119324]  [81095904] irq_exit+0xf4/0x100
 [   42.119333]  [814275c9] xen_evtchn_do_upcall+0x39/0x50
 [   42.119340]  [817367de] xen_do_hypervisor_callback+0x1e/0x30
 [   42.119343]  EOI 
 [   42.119348]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [   42.119351]  [810013aa] ? xen_hypercall_sched_op+0xa/0x20
 [   42.119356]  [8100bbf0] ? xen_safe_halt+0x10/0x20
 [   42.119362]  [8101feab] ? default_idle+0x1b/0xf0
 [   42.119365]  [8102062f] ? arch_cpu_idle+0xf/0x20
 [   42.119370]  [810d273b] ? default_idle_call+0x3b/0x50
 [   42.119374]  [810d2a7f] ? cpu_startup_entry+0x2bf/0x350
 [   42.119379]  [8101290a] ? cpu_bringup_and_idle+0x2a/0x40
 [   42.119382] Code: 8b 87 e8 03 00 00 48 85 c0 0f 84 af 00 00 00 41 8b 94 24 
 ac 00 00 00 83 ea 01 48 8d 44 d0 10 48 8b 18 

Re: [Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()

2015-07-06 Thread Eric Dumazet
On Mon, 2015-07-06 at 19:13 +0800, Bob Liu wrote:

 
 Thank you for the quick fix!
 Tested by rebooting several times and didn't hit this panic any more.

Thanks Bob, I will submit an official patch then ;)





Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-06-02 Thread Eric Dumazet
On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
 Hi Eric
 
 Sorry for coming late to the discussion.
 
 On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
  On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
  
   He suggested that after he'd been prodded by 4 more e-mails in which two
   of us guessed what he was trying to get at.  That's what I was
   complaining about.
  
  My big complain is that I suggested to test to double the sysctl, which
  gave good results.
  
 
 Do I understand correctly that it's acceptable to you to double the size
 of the buffer? If so I will send a patch to do that.

Absolutely.





Re: [Xen-devel] [PATCH] xen-netfront: reset skb transport header before checksum

2015-05-11 Thread Eric Dumazet
On Mon, 2015-05-11 at 18:34 -0700, Venkat Venkatsubra wrote:
 In ed1f50c3a (net: add skb_checksum_setup) some checksum functions
 were introduced in core. Subsequent change b5cf66cd1 (xen-netfront:
 use new skb_checksum_setup function) made use of those functions to
 replace its own implementation. With that change ip_hdr() and tcp_hdr()
 were not pointing at the right place. d554f73df (xen-netfront: reset
 skb network header before checksum) fixed the ip_hdr().
 
 This patch fixes tcp_hdr().
 
 We saw this problem when LRO was enabled on the host and it results in
 netfront setting skb->ip_summed to CHECKSUM_UNNECESSARY.
 
 Signed-off-by: Venkat Venkatsubra venkat.x.venkatsu...@oracle.com
 ---
  drivers/net/xen-netfront.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
 index 3f45afd..ed5c242 100644
 --- a/drivers/net/xen-netfront.c
 +++ b/drivers/net/xen-netfront.c
 @@ -889,6 +889,7 @@ static int handle_incoming_queue(struct netfront_queue *queue,
 	/* Ethernet work: Delayed to here as it peeks the header. */
 	skb->protocol = eth_type_trans(skb, queue->info->netdev);
 	skb_reset_network_header(skb);
 +	skb_set_transport_header(skb, ip_hdrlen(skb));
 
 	if (checksum_setup(queue->info->netdev, skb)) {
 		kfree_skb(skb);

This looks bogus for IPv6.

I would rather fix skb_checksum_setup_ip() and
skb_checksum_setup_ipv6(), since you already use correctly
skb_maybe_pull_tail() there.

In general, setting network and transport header at this point seems
extra work.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-16 Thread Eric Dumazet
On Thu, 2015-04-16 at 12:39 +0100, George Dunlap wrote:
 On 04/15/2015 07:17 PM, Eric Dumazet wrote:
  Do not expect me to fight bufferbloat alone. Be part of the challenge,
  instead of trying to get back to proven bad solutions.
 
 I tried that.  I wrote a description of what I thought the situation
 was, so that you could correct me if my understanding was wrong, and
 then what I thought we could do about it.  You apparently didn't even
 read it, but just pointed me to a single cryptic comment that doesn't
 give me enough information to actually figure out what the situation is.
 
 We all agree that bufferbloat is a problem for everybody, and I can
 definitely understand the desire to actually make the situation better
 rather than dying the death of a thousand exceptions.
 
 If you want help fighting bufferbloat, you have to educate people to
 help you; or alternately, if you don't want to bother educating people,
 you have to fight it alone -- or lose the battle due to having a
 thousand exceptions.
 
 So, back to TSQ limits.  What's so magical about 2 packets being *in the
 device itself*?  And what does 1ms, or 2*64k packets (the default for
 tcp_limit_output_bytes), have anything to do with it?
 
 Your comment lists three benefits:
 1. better RTT estimation
 2. faster recovery
 3. high rates
 
 #3 is just marketing fluff; it's also contradicted by the statement that
 immediately follows it -- i.e., there are drivers for which the
 limitation does *not* give high rates.
 
 #1, as far as I can tell, has to do with measuring the *actual* minimal
 round trip time of an empty pipe, rather than the round trip time you
 get when there's 512MB of packets in the device buffer.  If a device has
 a large internal buffer, then having a large number of packets
 outstanding means that the measured RTT is skewed.
 
 The goal here, I take it, is to have this pipe *exactly* full; having
 it significantly more than full is what leads to bufferbloat.
 
 #2 sounds like you're saying that if there are too many packets
 outstanding when you discover that you need to adjust things, that it
 takes a long time for your changes to have an effect; i.e., if you have
 5ms of data in the pipe, it will take at least 5ms for your reduced
 transmmission rate to actually have an effect.
 
 Is that accurate, or have I misunderstood something?

#2 means that :

If you have an outstanding queue of 500 packets for a flow in qdisc.

A rtx has to be done, because we receive a SACK.

The rtx is queued _after_ the previous 500 packets.

500 packets have to be drained before rtx can be sent and eventually
reach destination.


These 500 packets will likely be dropped because the destination cannot
process them before the rtx.

2 TSO packets are already 90 packets (MSS=1448). It is not small, but a
good compromise allowing line rate even on 40Gbit NIC.
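The arithmetic behind that figure, as a quick sketch — two 64KB TSO super-packets divided by the MSS:

```c
/* Number of wire-size (MSS) segments carried by a given amount of
 * TSO payload.  Two 64KB super-packets at MSS=1448 is ~90 segments. */
static int tso_segments(int tso_bytes, int mss)
{
    return tso_bytes / mss;
}
```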


#1 is not marketing. It is hugely relevant.

You might use cubic as the default congestion control, you have to
understand we work hard on delay based cc, as losses are no longer a way
to measure congestion in modern networks.

Vegas and delay gradient congestion depends on precise RTT measures.

I added usec RTT estimations (instead of jiffies based rtt samples) to
increase resolution by 3 order of magnitude, not for marketing, but
because it had to be done when DC communications have typical rtt of 25
usec these days.

And jitter in host queues is not nice and must be kept at the minimum.

You do not have the whole picture, but this tight bufferbloat control is
one step before we can replace cubic by new upcoming cc, that many
companies are actively developing and testing.

The steps are the following :

1) TCP Small queues
2) FQ/pacing
3) TSO auto sizing
3) usec rtt estimations
4) New revolutionary cc module currently under test at Google,
   but others have alternatives.


The fact that few drivers have bugs should not stop this effort.

If you guys are in the Bay area, we would be happy to host a meeting
where we can present you how our work reduced packet drops in our
networks by 2 order of magnitude, and increased capacity by 40 or 50%.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-16 Thread Eric Dumazet
On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:

 He suggested that after he'd been prodded by 4 more e-mails in which two
 of us guessed what he was trying to get at.  That's what I was
 complaining about.

My big complain is that I suggested to test to double the sysctl, which
gave good results.

Then you provided a patch using an 8x factor. How does that sound ?

Next time I ask a raise, I should try a 8x factor as well, who knows,
it might be accepted.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote:
 On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet eric.duma...@gmail.com wrote:
  On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
 
  Is the problem perhaps that netback/netfront delays TX completion?
  Would it be better to see if that can be addressed properly, so that
  the original purpose of the patch (fighting bufferbloat) can be
  achieved while not degrading performance for Xen?  Or at least, so
  that people get decent perfomance out of the box without having to
  tweak TCP parameters?
 
  Sure, please provide a patch, that does not break back pressure.
 
  But just in case, if Xen performance relied on bufferbloat, it might be
  very difficult to reach a stable equilibrium : Any small change in stack
  or scheduling might introduce a significant difference in 'raw
  performance'.
 
 So help me understand this a little bit here.  tcp_limit_output_bytes
 limits the amount of data allowed to be in-transit between a send()
 and the wire, is that right?
 
 And so the bufferbloat problem you're talking about here are TCP
 buffers inside the kernel, and/or buffers in the NIC, is that right?
 
 So ideally, you want this to be large enough to fill the pipeline
 all the way from send() down to actually getting out on the wire;
 otherwise, you'll have gaps in the pipeline, and the machinery won't
 be working at full throttle.
 
 And the reason it's a problem is that many NICs now come with large
 send buffers; and effectively what happens then is that this makes the
 pipeline longer -- as the buffer fills up, the time between send()
 and the wire is increased.  This increased latency causes delays in
 round-trip-times and interferes with the mechanisms TCP uses to try to
 determine what the actual sustainable rate of data trasmission is.
 
 By limiting the number of in-transit bytes, you make sure that
 neither the kernel nor the NIC are going to have packets queues up for
 long lengths of time in buffers, and you keep this pipeline as close
 to the actual minimal length of the pipeline as possible.  And it
 sounds like for your 40G NIC, 128k is big enough to fill the pipeline
 without unduly making it longer by introducing buffering.
 
 Is that an accurate picture of what you're trying to achieve?
 
 But the problem for xennet (and a number of other drivers), as I
 understand it, is that at the moment the pipeline itself is just
 longer -- it just takes a longer time from the time you send a packet
 to the time it actually gets out on the wire.
 
 So it's not actually accurate to say that Xen performance relies on
 bufferbloat.  There's no buffering involved -- the pipeline is just
 longer, and so to fill up the pipeline you need more data.
 
 Basically, to maximize throughput while minimizing buffering, for
 *any* connection, tcp_limit_output_bytes should ideally be around
 (min_tx_latency * max_bandwidth).  For physical NICs, the minimum
 latency is really small, but for xennet -- and I'm guessing for a lot
 of virtualized cards -- the min_tx_latency will be a lot higher,
 requiring a much higher ideal tcp_limit_output value.
 
 Rather than trying to pick a single value which will be good for all
 NICs, it seems like it would make more sense to have this vary
 depending on the parameters of the NIC.  After all, for NICs that have
 low throughput -- say, old 100Mbit NICs -- even 128k may still
 introduce a significant amount of buffering.
 
 Obviously one solution would be to allow the drivers themselves to set
 the tcp_limit_output_bytes, but that seems like a maintenance
 nightmare.
 
 Another simple solution would be to allow drivers to indicate whether
 they have a high transmit latency, and have the kernel use a higher
 value by default when that's the case.
 
 Probably the most sustainable solution would be to have the networking
 layer keep track of the average and minimum transmit latencies, and
 automatically adjust tcp_limit_output_bytes based on that.  (Keeping
 the minimum as well as the average because the whole problem with
 bufferbloat is that the more data you give it, the longer the apparent
 pipeline becomes.)
 
 Thoughts?

My thoughts are that instead of these long talks you guys should read the
code:

/* TCP Small Queues :
 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 * This allows for :
 *  - better RTT estimation and ACK scheduling
 *  - faster recovery
 *  - high rates
 * Alas, some drivers / subsystems require a fair amount
 * of queued bytes to ensure line rate.
 * One example is wifi aggregation (802.11 AMPDU)
 */
limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);


Then you'll see that most of your questions are already answered.

Feel free to try to improve the behavior, if it does not hurt critical
workloads like TCP_RR, where we send very small messages, millions of
times per second.

Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 15:36 +0100, Ian Campbell wrote:
 On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote:
  On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley
 [...]
   From a networking point of view, the backend is a switch. Is it OK to
   consider the packet to have been transmitted from the guest point of
   view once the backend is aware of the packet?
  
   This would help justify the skb_orphan() in the frontend.
  
  This sounds sensible to me, particularly if virtio_net is already doing it.
 
 I also find Malcolm's argument above pretty compelling.

Yes, and then you'll have to help the virtio ongoing effort trying to
get rid of this skb_orphan()

Basically you're adding head of line blocking, as a single flow is able
to fill your queue.



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
 On 04/15/2015 05:38 PM, Eric Dumazet wrote:
  My thoughts are that instead of these long talks you guys should read the
  code:
  
  /* TCP Small Queues :
   * Control number of packets in qdisc/devices to two packets / or ~1 ms.
   * This allows for :
   *  - better RTT estimation and ACK scheduling
   *  - faster recovery
   *  - high rates
   * Alas, some drivers / subsystems require a fair amount
   * of queued bytes to ensure line rate.
   * One example is wifi aggregation (802.11 AMPDU)
   */
  limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
  limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
  
  
  Then you'll see that most of your questions are already answered.
  
  Feel free to try to improve the behavior, if it does not hurt critical 
  workloads
 like TCP_RR, where we send very small messages, millions of times per 
  second.
 
 First of all, with regard to critical workloads, once this patch gets
 into distros, *normal TCP streams* on every VM running on Amazon,
 Rackspace, Linode, etc. will get a 30% hit in performance *by default*.
 Normal TCP streams on xennet *are* a critical workload, and deserve the
 same kind of accommodation as TCP_RR (if not more).  The same goes for
 virtio_net.
 
 Secondly, according to Stefano's and Jonathan's tests,
 tcp_limit_output_bytes completely fixes the problem for Xen.
 
 Which means that max(2*skb->truesize, sk->sk_pacing_rate >> 10) is
 *already* larger for Xen; that calculation mentioned in the comment is
 *already* doing the right thing.
 
 As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
 automatic TSQ calculation which is actually choosing an effective value
 for xennet.
 
 It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
 actual maximum limit.  I went back and looked at the original patch
 which introduced it (46d3ceabd), and it looks to me like it was designed
 to be a rough, quick estimate of two packets outstanding (by choosing
 the maximum size of the packet, 64k, and multiplying it by two).
 
 Now that you have a better algorithm -- the size of 2 actual packets or
 the amount transmitted in 1ms -- it seems like the default
 sysctl_tcp_limit_output_bytes should be higher, and let the automatic
 TSQ you have on the first line throttle things down when necessary.


I asked you guys to make a test by increasing
sysctl_tcp_limit_output_bytes

You have no need to explain to me the code I wrote, thank you.





Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:

 Well, I'm not sure that it is George and Jonathan themselves who don't 
 want to change a sysctl, but the customers who would have to tweak that 
 in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP
sysctls.





Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:

 Which means that max(2*skb->truesize, sk->sk_pacing_rate >> 10) is
 *already* larger for Xen; that calculation mentioned in the comment is
 *already* doing the right thing.

Sigh.

1ms of traffic at 40Gbit is 5 MBytes

The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is
to provide the limitation of ~2 TSO packets, which _also_ is documented.

Without this limitation, 5 MBytes could translate to : Fill the queue,
do not limit.

If a particular driver needs to extend the limit, fine, document it and
take actions.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
 
  Have you tested this patch on a NIC without GSO/TSO ?
 
  This would allow more than 500 packets for a single flow.
 
  Hello bufferbloat.
 
 Wouldn't the fq_codel qdisc on that interface address that problem?

Last time I checked, default qdisc was pfifo_fast.

These guys do not want to change a sysctl, how pfifo_fast will magically
become fq_codel?






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:

 So you'd be OK with a patch like this?  (With perhaps a better changelog?)
 
  -George
 
 ---
 TSQ: Raise default static TSQ limit
 
 A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
 size of actual packets and the amount of data being transmitted.
 Raise the default static limit to allow that new limit to actually
 come into effect.
 
 This fixes a regression where NICs with large transmit completion
 times (such as xennet) had a 30% hit unless the user manually tweaked
 the value in /proc.
 
 Signed-off-by: George Dunlap george.dun...@eu.citrix.com
 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 1db253e..8ad7cdf 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
   */
  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
 
 -/* Default TSQ limit of two TSO segments */
 -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
 +/* Static TSQ limit.  A more dynamic limit is calculated in
 tcp_write_xmit. */
 +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
 
  /* This limits the percentage of the congestion window which we
   * will allow a single TSO frame to consume.  Building TSO frames
 

Have you tested this patch on a NIC without GSO/TSO ?

This would allow more than 500 packets for a single flow.

Hello bufferbloat.

So my answer to this patch is a no.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:

 Maybe you should stop wasting all of our time and just tell us what
 you're thinking.

I think you are making me waste my time.

I already gave all the hints in prior discussions.

Rome was not built in one day.





Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-15 Thread Eric Dumazet
On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
 Eric Dumazet eric.duma...@gmail.com wrote:
 
  We already have netdev->gso_max_size and netdev->gso_max_segs
  which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs
 
 It is quite dangerous to attempt tricks like this because a
 tc redirection or netfilter nat could change the destination
 device rendering such hints incorrect.

Right but we are talking of performance hints, on quite basic VM setup.

Here the guest would use xen and this hint would apply.







Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

2015-04-13 Thread Eric Dumazet
On Mon, 2015-04-13 at 14:46 +0100, David Vrabel wrote:

  And the proof-of-concept patch for idea (b) I used was:
  
  @@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb,
  struct net_device *dev)
   goto drop;
   }
  
  +skb_orphan(skb);
  +
   page = virt_to_page(skb->data);
   offset = offset_in_page(skb->data);
   len = skb_headlen(skb);
 
  No. This is a bunch of allocations and a full memcpy of all the frags.

skb_orphan() does nothing like that.

But the main concern here is it basically breaks back pressure.

And you do not want this, unless there is no other choice.





Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-13 Thread Eric Dumazet
On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:

 Is the problem perhaps that netback/netfront delays TX completion?
 Would it be better to see if that can be addressed properly, so that
 the original purpose of the patch (fighting bufferbloat) can be
 achieved while not degrading performance for Xen?  Or at least, so
 that people get decent perfomance out of the box without having to
 tweak TCP parameters?

Sure, please provide a patch, that does not break back pressure.

But just in case, if Xen performance relied on bufferbloat, it might be
very difficult to reach a stable equilibrium : Any small change in stack
or scheduling might introduce a significant difference in 'raw
performance'.






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-09 Thread Eric Dumazet
On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
 Hi all,
 
 I found a performance regression when running netperf -t TCP_MAERTS from
 an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
 virtual machine are 30% slower than v3.18.
 
 Through bisection I found that the perf regression is caused by the
 presence of the following commit in the guest kernel:
 
 
 commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
 Author: Eric Dumazet eduma...@google.com
 Date:   Sun Dec 7 12:22:18 2014 -0800
 
 tcp: refine TSO autosizing
 
 
 A simple revert would fix the issue.
 
 Does anybody have any ideas on what could be the cause of the problem?
 Suggestions on what to do to fix it?

You sent this to lkml while networking discussions are on netdev.

This topic had been discussed on netdev multiple times.

This commit restored original TCP Small Queue behavior, which is the
first step to fight bufferbloat.

Some network drivers are known to be problematic because of a delayed TX
completion.

So far this commit did not impact max single flow throughput on 40Gb
mlx4 NIC. (ie : line rate is possible)

Try to tweak /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
makes a difference ?






Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen

2015-04-09 Thread Eric Dumazet
On Thu, 2015-04-09 at 17:36 +0100, Stefano Stabellini wrote:

 A very big difference:
 
 echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
 brings us much closer to the original performance, the slowdown is just
 8%

Cool.

 
 echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
 fills the gap entirely, same performance as before "refine TSO
 autosizing"


Sure, this basically disables TCP Small Queue and select the opposite :

Favor single flow throughput and huge latencies (bufferbloat)


 
 
 What would be the next step for here?  Should I just document this as an
 important performance tweaking step for Xen, or is there something else
 we can do?

I guess this is a reasonable choice.

Note that /proc/sys/net/ipv4/tcp_limit_output_bytes is already
documented.






Re: [Xen-devel] [PATCH] xen-netfront: transmit fully GSO-sized packets

2015-03-26 Thread Eric Dumazet
On Thu, 2015-03-26 at 11:13 +, Jonathan Davies wrote:
 xen-netfront limits transmitted skbs to be at most 44 segments in size.
 However, GSO permits up to 65536 bytes, which means a maximum of 45
 segments of 1448 bytes each. This slight reduction in the size of packets
 means a slight loss in efficiency.
 
 Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
 XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
 where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.
 
 The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since 
 c/s
 6c09fa09d) in determining when to split an skb into two is
 sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.
 
 So the maximum permitted size of an skb is calculated to be
 (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.
 
 Intuitively, this looks like the wrong formula -- we don't need two TCP 
 headers.
 Instead, there is no need to deviate from the default gso_max_size of 65536 as
 this already accommodates the size of the header.
 
 Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
 of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
 skbs of up to 65160 bytes (45 segments of 1448 bytes each).
 
 Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header")
 Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
 ---
  drivers/net/xen-netfront.c | 2 --
  1 file changed, 2 deletions(-)
 
 diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
 index e9b960f..fb6e978 100644
 --- a/drivers/net/xen-netfront.c
 +++ b/drivers/net/xen-netfront.c
 @@ -1279,8 +1279,6 @@ static struct net_device *xennet_create_dev(struct 
 xenbus_device *dev)
   netdev->ethtool_ops = xennet_ethtool_ops;
   SET_NETDEV_DEV(netdev, &dev->dev);
  
 - netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
 -
   np->netdev = netdev;
  
   netif_carrier_off(netdev);

Hmm, this partially reverts commit
9ecd1a75d977e2e8c48139c7d3efed183f898d94



Why xennet_change_mtu() is not changed by your patch ?






Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes

2015-03-26 Thread Eric Dumazet
On Thu, 2015-03-26 at 16:46 +, Jonathan Davies wrote:
 Network drivers with slow TX completion can experience poor network transmit
 throughput, limited by hitting the sk_wmem_alloc limit check in 
 tcp_write_xmit.
 The limit is 128 KB (by default), which means we are limited to two 64 KB skbs
 in-flight. This has been observed to limit transmit throughput with 
 xen-netfront
 because its TX completion can be relatively slow compared to physical NIC
 drivers.
 
 There have been several modifications to the calculation of the sk_wmem_alloc
 limit in the past. Here is a brief history:
 
  * Since TSQ was introduced, the queue size limit was
sysctl_tcp_limit_output_bytes.
 
  * Commit c9eeec26 (tcp: TSQ can use a dynamic limit) made the limit
max(skb->truesize, sk->sk_pacing_rate >> 10).
This allows more packets in-flight according to the estimated rate.
 
  * Commit 98e09386 (tcp: tsq: restore minimal amount of queueing) made the
limit
max_t(unsigned int, sysctl_tcp_limit_output_bytes,
sk->sk_pacing_rate >> 10).
This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed
more if rate estimation shows this to be worthwhile.
 
  * Commit 605ad7f1 (tcp: refine TSO autosizing) made the limit
min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10),
   sysctl_tcp_limit_output_bytes).
This meant that the limit can never exceed sysctl_tcp_limit_output_bytes,
regardless of what rate estimation suggests. It's not clear from the commit
message why this significant change was justified, changing
sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.
 
 This patch restores the behaviour that allows the limit to grow above
 sysctl_tcp_limit_output_bytes according to the rate estimation.
 
 This has been measured to improve xen-netfront throughput from a domU to dom0
 from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to
 another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the
 latter case, TX completion is especially slow, explaining the large 
 improvement.
 These values were measured against 4.0-rc5 using iperf -c <ip> -i 1 using
 CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of 
 Xeon
 E5-2650 v3 CPUs.
 
 Fixes: 605ad7f184b6 (tcp: refine TSO autosizing)
 Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
 ---
  net/ipv4/tcp_output.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index 1db253e..3a49af8 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
 int mss_now, int nonagle,
* One example is wifi aggregation (802.11 AMPDU)
*/
   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
  - limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
  + limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
   if (atomic_read(&sk->sk_wmem_alloc) > limit) {
   set_bit(TSQ_THROTTLED, &tp->tsq_flags);


That will get a NACK from me and the Google TCP team of course.

Try your patch with low throughput flows, like 64kbps, GSO off...

And observe TCP timestamps and rtt estimations, critical for vegas or
any CC based on delay.

I get line rate on 40Gbps NIC using this (low) limit of 2 TSO packets.

This topic was discussed for wifi recently, and I suggested multiple
ways to solve the problem on problematic drivers.

There is a reason sysctl_tcp_limit_output_bytes exists : You can change
its value to whatever you want.

vi +652 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
gets losses notifications. With SNDBUF autotuning, this can
result in a large amount of packets queued in qdisc/device
on the local machine, hurting latency of other flows, for
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
Default: 131072

Documentation is pretty clear : This is the max value, not a min one.








Re: [Xen-devel] net: Detect drivers that reschedule NAPI and exhaust budget

2014-12-20 Thread Eric Dumazet
On Sat, 2014-12-20 at 17:55 +1100, Herbert Xu wrote:

 -- >8 --
 The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
 interrupt masking in NAPI) required drivers to leave poll_list
 empty if the entire budget is consumed.
 
 We have already had two broken drivers so let's add a check for
 this.
 
 Signed-off-by: Herbert Xu herb...@gondor.apana.org.au
 
 diff --git a/net/core/dev.c b/net/core/dev.c
 index f411c28..47fdc5c 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -4620,7 +4620,13 @@ static void net_rx_action(struct softirq_action *h)
*/
   napi_gro_flush(n, HZ >= 1000);
   }
  - list_add_tail(&n->poll_list, &repoll);
  + /* Some drivers may have called napi_schedule
  +  * prior to exhausting their budget.
  +  */
  + if (unlikely(!list_empty(&n->poll_list)))
  + pr_warn("%s: Budget exhausted after napi rescheduled\n",
  +  n->dev ? n->dev->name : "backlog");
  + else
  + list_add_tail(&n->poll_list, &repoll);
   }
   }
  
 Thanks,

OK, but could you :
1) use pr_warn_once()
2) split the too long line
pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
 n->dev ? n->dev->name : "backlog");

Thanks Herbert !





Re: [Xen-devel] net: Detect drivers that reschedule NAPI and exhaust budget

2014-12-19 Thread Eric Dumazet
On Sat, 2014-12-20 at 11:36 +1100, Herbert Xu wrote:
 On Sat, Dec 20, 2014 at 11:23:27AM +1100, Herbert Xu wrote:
 
  A similar bug exists in virtio_net.
 
 In order to detect other drivers doing this we should add something
 like this.
 
 -- >8 --
 The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
 interrupt masking in NAPI) required drivers to leave poll_list
 empty if the entire budget is consumed.
 
 We have already had two broken drivers so let's add a check for
 this.
  
 Signed-off-by: Herbert Xu herb...@gondor.apana.org.au
 
 diff --git a/net/core/dev.c b/net/core/dev.c
 index f411c28..88f9725 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -4620,7 +4620,11 @@ static void net_rx_action(struct softirq_action *h)
*/
   napi_gro_flush(n, HZ >= 1000);
   }
  - list_add_tail(&n->poll_list, &repoll);
  + /* Some drivers may have called napi_schedule
  +  * prior to exhausting their budget.
  +  */
  + if (!WARN_ON_ONCE(!list_empty(&n->poll_list)))
  + list_add_tail(&n->poll_list, &repoll);
   }
   }
  

I do not think stack trace will point to the buggy driver ?

IMO it would be better to print a single line with the netdev name ?



