Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
On Wed, 2017-02-01 at 18:29 -0500, Boris Ostrovsky wrote:
> I could not convince myself that napi_synchronize() is sufficient here
> (mostly because I am not familiar with napi flow). At the same time I
> would rather not make changes in anticipation of possible disappearance
> of netif_carrier_ok() in the future, so I'd like this patch to go in as is.
>
> Unless there are other problems with the patch, or if Eric (or others)
> feel strongly about usage of netif_carrier_ok() here.

No strong feelings from me. We probably have more serious issues to fix anyway.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
On Mon, 2017-01-30 at 13:23 -0500, Boris Ostrovsky wrote:
> We do netif_carrier_off() first thing in xennet_disconnect_backend() and
> the only place where the timer is rearmed is xennet_alloc_rx_buffers(),
> which is guarded by a netif_carrier_ok() check.

Oh well, testing netif_carrier_ok() in the packet-processing fast path looks unusual and is a waste of CPU cycles. I've never seen that pattern before.

If one day we remove this netif_carrier_ok() test during a cleanup, the race window will open again.
Re: [Xen-devel] [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
On Mon, 2017-01-30 at 12:45 -0500, Boris Ostrovsky wrote:
> rx_refill_timer should be deleted as soon as we disconnect from the
> backend since otherwise it is possible for the timer to go off before
> we get to xennet_destroy_queues(). If this happens we may dereference
> queue->rx.sring which is set to NULL in xennet_disconnect_backend().
>
> Signed-off-by: Boris Ostrovsky
> CC: sta...@vger.kernel.org
> ---
>  drivers/net/xen-netfront.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 8315fe7..722fe9f 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1379,6 +1379,8 @@ static void xennet_disconnect_backend(struct netfront_info *info)
>  	for (i = 0; i < num_queues && info->queues; ++i) {
>  		struct netfront_queue *queue = &info->queues[i];
>  
> +		del_timer_sync(&queue->rx_refill_timer);
> +

If napi_disable() was not called before this del_timer_sync(), another RX might come here and rearm rx_refill_timer.

>  		if (queue->tx_irq && (queue->tx_irq == queue->rx_irq))
>  			unbind_from_irqhandler(queue->tx_irq, queue);
>  		if (queue->tx_irq && (queue->tx_irq != queue->rx_irq)) {
> @@ -1733,7 +1735,6 @@ static void xennet_destroy_queues(struct netfront_info *info)
>  
>  		if (netif_running(info->netdev))
>  			napi_disable(&queue->napi);
> -		del_timer_sync(&queue->rx_refill_timer);
>  		netif_napi_del(&queue->napi);
>  	}
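[The ordering constraint Eric points out can be sketched in plain userspace C. This is illustrative only, not the driver's code: `rx_poll()` stands in for the NAPI RX path that can call mod_timer(), and the two booleans stand in for NAPI state and the armed timer.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: why the timer must be deleted only after the
 * producer that can rearm it has been quiesced. */
struct queue {
    bool napi_enabled;   /* stands in for NAPI polling being active   */
    bool timer_pending;  /* stands in for rx_refill_timer being armed */
};

/* RX path: may rearm the refill timer while polling is enabled,
 * like xennet_alloc_rx_buffers() calling mod_timer(). */
static void rx_poll(struct queue *q)
{
    if (q->napi_enabled)
        q->timer_pending = true;
}

static void napi_disable_q(struct queue *q)   { q->napi_enabled = false; }
static void del_timer_sync_q(struct queue *q) { q->timer_pending = false; }

/* Racy order: delete the timer first, then a concurrent RX pass rearms it. */
static bool teardown_racy(struct queue *q)
{
    del_timer_sync_q(q);
    rx_poll(q);               /* concurrent RX slips in */
    napi_disable_q(q);
    return q->timer_pending;  /* true: timer survived teardown */
}

/* Safe order: quiesce the producer, then delete the timer. */
static bool teardown_safe(struct queue *q)
{
    napi_disable_q(q);
    rx_poll(q);               /* concurrent RX can no longer rearm */
    del_timer_sync_q(q);
    return q->timer_pending;  /* false: timer stays gone */
}
```

The sketch only shows the ordering dependency; the real question in the thread is whether the guard (netif_carrier_ok() there, napi_disable() here) is guaranteed to run first on every path.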
Re: [Xen-devel] [PATCH net-next] xen-netfront: reject short packets and handle non-linear packets
On Wed, 2017-01-25 at 16:26 +0000, Paul Durrant wrote:
> Sowmini points out two vulnerabilities in xen-netfront:
>
> a) The code assumes that skb->len is at least ETH_HLEN.
> b) The code assumes that at least ETH_HLEN octets are in the linear
>    portion of the socket buffer.
>
> This patch adds tests for both of these, and in the case of the latter
> pulls sufficient bytes into the linear area.
>
> Signed-off-by: Paul Durrant
> Reported-by: Sowmini Varadhan
> Tested-by: Sowmini Varadhan
> ---
> Cc: Boris Ostrovsky
> Cc: Juergen Gross
> ---
>  drivers/net/xen-netfront.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 40f26b6..0478809 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -567,6 +567,10 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	u16 queue_index;
>  	struct sk_buff *nskb;
>  
> +	/* Basic sanity check */
> +	if (unlikely(skb->len < ETH_HLEN))
> +		goto drop;
> +
>  	/* Drop the packet if no queues are set up */
>  	if (num_queues < 1)
>  		goto drop;
> @@ -609,6 +613,11 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  
>  	len = skb_headlen(skb);
> +	if (unlikely(len < ETH_HLEN)) {
> +		if (!__pskb_pull_tail(skb, ETH_HLEN - len))
> +			goto drop;
> +		len = ETH_HLEN;
> +	}

Looks like duplicated code, and buggy, considering the code above :

	page = virt_to_page(skb->data);
	offset = offset_in_page(skb->data);

Your patch might end up with skb->data/head being reallocated, and a use-after-free would happen. What about something like that ?
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 40f26b69beb11459f0566fc1d1d739aa75e643bf..99a67fe4de86d3141169143b0820d00968cb09f2 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -583,6 +583,8 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			    skb->len);
 		goto drop;
 	}
+	if (!pskb_may_pull(skb, ETH_HLEN))
+		goto drop;
 
 	slots = xennet_count_skb_slots(skb);
 	if (unlikely(slots > MAX_XEN_SKB_FRAGS + 1)) {
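[The use-after-free Eric warns about comes from caching a pointer derived from skb->data before a pull that may reallocate skb->head. A rough userspace analogue, not skb code — `struct buf` and `pull()` are stand-ins for the skb head and __pskb_pull_tail():]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the skb linear area. */
struct buf {
    char  *data;
    size_t len;
};

/* Grows the "headlen"; may move the underlying memory, just as
 * __pskb_pull_tail() may reallocate skb->head. */
static int pull(struct buf *b, size_t extra)
{
    char *n = realloc(b->data, b->len + extra);
    if (!n)
        return -1;
    memset(n + b->len, 0, extra);   /* zero the newly pulled bytes */
    b->data = n;
    b->len += extra;
    return 0;
}

/* The bug pattern in the rejected patch, in comment form:
 *
 *     char *cached = b->data;   // like page/offset from virt_to_page(skb->data)
 *     pull(b, 14);              // may realloc: 'cached' is now dangling
 *     use(cached);              // use-after-free
 *
 * Eric's version avoids it by doing the pull (pskb_may_pull) once, early,
 * before any pointer into the buffer is computed. */
```

Re-reading b->data after pull() is always safe; only a copy taken before the call can dangle.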
Re: [Xen-devel] [PATCH] xen/blkback: use rb_entry()
On Tue, 2016-12-20 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 20, 2016 at 05:44:06PM +0000, Roger Pau Monné wrote:
> > On Tue, Dec 20, 2016 at 11:47:03AM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Dec 20, 2016 at 10:02:19PM +0800, Geliang Tang wrote:
> > > > To make the code clearer, use rb_entry() instead of container_of() to
> > > > deal with rbtree.
> > > That is OK but I think 'container_of' is more clear.
> > > Roger, thoughts?
> > I think so; container_of is a global macro that's widely used and
> > everyone knows, while rb_entry OTOH is not, and its use doesn't really
> > simplify the code at all. I'm not really opposed, but it seems kind of
> > a pointless change (not that it's wrong).
> I concur.
>
> Geliang Tang,
>
> Thank you for the patch but there is no need for it.

The same could be said of list_entry() :

#define hlist_entry(ptr, type, member) container_of(ptr,type,member)
#define list_entry(ptr, type, member) container_of(ptr, type, member)

# git grep -n list_entry | wc -l
3636

rb_entry() will probably make its way everywhere.
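[The whole debate is about naming, since rb_entry() expands to container_of(). A minimal userspace sketch of that relationship — `struct pending_req` is a made-up example container, not blkback's actual struct:]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace rendering of the kernel macros under discussion:
 * rb_entry() is container_of() under a domain-specific name. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

/* Simplified rbtree linkage, embedded in a containing object. */
struct rb_node { struct rb_node *left, *right; };

struct pending_req {            /* hypothetical container */
    int id;
    struct rb_node node;        /* embedded rbtree linkage */
};

/* Walk back from the embedded node to the container, as blkback does
 * when iterating its rbtree. */
static int id_of(struct rb_node *n)
{
    return rb_entry(n, struct pending_req, node)->id;
}
```

Either spelling compiles to the same pointer arithmetic; the question on the thread is purely which name reads better.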
Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80
On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
> Saturday, August 15, 2015, 12:39:25 AM, you wrote:
> > On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
> > > On 2015-08-13 00:41, Eric Dumazet wrote:
> > > > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
> > > > > Thanks for the reminder, but luckily i was aware of that, seen
> > > > > enough of your replies asking for patches to be resubmitted
> > > > > against the other tree ;) Kernel with patch is currently running
> > > > > so fingers crossed.
> > > > Thanks for testing. I am definitely interested knowing your results.
> > > Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
> > > breaking things (have to test if a revert helps) i get this in some
> > > guests:
> > Yes, this was fixed by :
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
> Hi Eric,
>
> With that patch i had a crash again this night, see below.
>
> -- Sander
>
> [177459.188808] general protection fault: [#1] SMP
> [177459.199746] Modules linked in:
> [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150815-linus-doflr-net+ #1
> [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
> [177459.232247] task: 8221a580 ti: 8220 task.ti: 8220
> [177459.242931] RIP: e030:[8110eb58] [8110eb58] detach_if_pending+0x18/0x80
> [177459.253503] RSP: e02b:88005f6039d8 EFLAGS: 00010086
> [177459.264051] RAX: 8800584d6580 RBX: 880004901420 RCX: dead00200200
> [177459.274599] RDX: RSI: 88005f60e5c0 RDI: 880004901420
> [177459.285122] RBP: 88005f6039d8 R08: 0001 R09:
> [177459.295286] R10: 0003 R11: 880004901394 R12: 0003
> [177459.305388] R13: 00010ae47040 R14: 07b98a00 R15: 88005f60e5c0
> [177459.315345] FS: 7f51317ec700() GS:88005f60() knlGS:
> [177459.325340] CS: e033 DS: ES: CR0: 8005003b
> [177459.335217] CR2: 010f8000 CR3: 2a154000 CR4: 0660
> [177459.345129] Stack:
> [177459.354783] 88005f603a28 8110ee7f 810fb261 0200
> [177459.364505] 0003 880004901380 0003 8800567d0d00
> [177459.374064] 07b98a00 88005f603a58 819b3eb3
> [177459.383532] Call Trace:
> [177459.392878] IRQ
> [177459.392935] [8110ee7f] mod_timer_pending+0x3f/0xe0
> [177459.411058] [810fb261] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [177459.419876] [819b3eb3] __nf_ct_refresh_acct+0xa3/0xb0
> [177459.428642] [819baafb] tcp_packet+0xb3b/0x1290
> [177459.437285] [81a2535e] ? ip_output+0x5e/0xc0
> [177459.445845] [810ca8ca] ? __local_bh_enable_ip+0x2a/0x90
> [177459.454331] [819b35a9] ? __nf_conntrack_find_get+0x129/0x2a0
> [177459.462642] [819b549c] nf_conntrack_in+0x29c/0x7c0
> [177459.470711] [81a65e9c] ipv4_conntrack_local+0x4c/0x50
> [177459.478753] [819ad67c] nf_iterate+0x4c/0x80
> [177459.486726] [81102437] ? generic_handle_irq+0x27/0x40
> [177459.494634] [819ad714] nf_hook_slow+0x64/0xc0
> [177459.502486] [81a22d40] __ip_local_out_sk+0x90/0xa0
> [177459.510248] [81a22c40] ? ip_forward_options+0x1a0/0x1a0
> [177459.517782] [81a22d66] ip_local_out_sk+0x16/0x40
> [177459.525044] [81a2343d] ip_queue_xmit+0x14d/0x350
> [177459.532247] [81a3ae7e] tcp_transmit_skb+0x48e/0x960
> [177459.539413] [81a3cddb] tcp_xmit_probe_skb+0xdb/0xf0
> [177459.546389] [81a3dffb] tcp_write_wakeup+0x5b/0x150
> [177459.553061] [81a3e51b] tcp_keepalive_timer+0x1fb/0x230
> [177459.559761] [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
> [177459.566447] [8110f3c7] call_timer_fn.isra.27+0x17/0x80
> [177459.573121] [81a3e320] ? tcp_init_xmit_timers+0x20/0x20
> [177459.579778] [8110f55d] run_timer_softirq+0x12d/0x200
> [177459.586448] [810ca6c3] __do_softirq+0x103/0x210
> [177459.593138] [810ca9cb] irq_exit+0x4b/0xa0
> [177459.599783] [814f05d4] xen_evtchn_do_upcall+0x34/0x50
> [177459.606300] [81af93ae] xen_do_hypervisor_callback+0x1e/0x40
> [177459.612583] EOI
> [177459.612637] [810013aa] ? xen_hypercall_sched_op+0xa/0x20
> [177459.625010] [810013aa] ? xen_hypercall_sched_op+0xa/0x20
> [177459.631157] [81008d60] ? xen_safe_halt+0x10/0x20
> [177459.637158] [810188d3] ? default_idle+0x13/0x20
> [177459.643072] [81018e1a] ? arch_cpu_idle+0xa/0x10
> [177459.648809] [810f8e7e] ? default_idle_call+0x2e/0x50
> [177459.654650] [810f9112] ? cpu_startup_entry+0x272/0x2e0
> [177459.660488] [81ae79f7] ? rest_init+0x77/0x80
> [177459.666297] [82312f58] ? start_kernel
Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80
On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
> This is very similar to the behavior I am seeing in this bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=102911

OK, but have you applied the fix ?
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

It will be part of the next net pull request from David Miller to Linus Torvalds.
Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80
From: Eric Dumazet eduma...@google.com

On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:
> Monday, August 17, 2015, 4:21:47 PM, you wrote:
> > On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
> > > This is very similar to the behavior I am seeing in this bug:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=102911
> > OK, but have you applied the fix ?
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> > It will be part of the next net pull request from David Miller to Linus Torvalds.
> I did have that patch in for my last report.

But i don't think he had (looking at the second part of his oops).

Then can you try the following fix as well ? Thanks !

[PATCH] timer: fix a race in __mod_timer()

lock_timer_base() can not catch the following :

CPU1 (in __mod_timer())
	timer->flags |= TIMER_MIGRATING;
	spin_unlock(&base->lock);
	base = new_base;
	spin_lock(&base->lock);
	timer->flags &= ~TIMER_BASEMASK;
					CPU2 (in lock_timer_base())
					sees timer base is cpu0 base
					spin_lock_irqsave(&base->lock, *flags);
					if (timer->flags == tf)
						return base; // oops, wrong base
	timer->flags |= base->cpu; // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet eduma...@google.com
Cc: Thomas Gleixner t...@linutronix.de
---
 kernel/time/timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa9faf7..84190f02b521 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}
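[The bug is a torn update: two read-modify-write steps expose an intermediate flags value to other CPUs. A userspace sketch of the arithmetic — the mask value here is illustrative, not the kernel's actual TIMER_BASEMASK:]

```c
#include <assert.h>
#include <stdint.h>

#define TIMER_BASEMASK 0x3ffu   /* illustrative mask, not the kernel value */

/* The racy version: after step 1 the base bits are cleared but the new
 * cpu bits are not yet set, and another CPU can observe this value. */
static uint32_t two_step_intermediate(uint32_t flags)
{
    flags &= ~TIMER_BASEMASK;   /* step 1: visible to other CPUs */
    return flags;               /* step 2 (|= base->cpu) not done yet */
}

/* The fix: compute the final value and publish it in a single store,
 * which is what WRITE_ONCE(timer->flags, ...) does in the patch. */
static uint32_t one_go(uint32_t flags, uint32_t cpu)
{
    return (flags & ~TIMER_BASEMASK) | cpu;
}
```

With flags 0x405 (upper bits 0x400, base bits 0x005) and new cpu 7, the intermediate value 0x400 carries no valid base at all, while the one-shot update goes straight from 0x405 to 0x407.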
Re: [Xen-devel] Linux 4.2-rc6 regression: RIP: e030:[ffffffff8110fb18] [ffffffff8110fb18] detach_if_pending+0x18/0x80
On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
> On 2015-08-13 00:41, Eric Dumazet wrote:
> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
> > > Thanks for the reminder, but luckily i was aware of that, seen enough
> > > of your replies asking for patches to be resubmitted against the
> > > other tree ;) Kernel with patch is currently running so fingers
> > > crossed.
> > Thanks for testing. I am definitely interested knowing your results.
> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
> breaking things (have to test if a revert helps) i get this in some
> guests:

Yes, this was fixed by :
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
Re: [Xen-devel] [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()
On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
> These functions check should_resched() before unlocking
> spinlock/bh-enable: preempt_count is always non-zero =>
> should_resched() always returns false. cond_resched_lock() worked iff
> spin_needbreak is set.

Interesting, this definitely used to work (linux-3.11).

Any idea of which commit broke things ?
Re: [Xen-devel] [PATCH] net/bridge: Add missing in6_dev_put in br_validate_ipv6
On Mon, 2015-07-06 at 11:35 +0100, Julien Grall wrote:
> __in6_dev_get requires to hold rcu_read_lock or RTNL. My knowledge on
> this code is very limited. Are we sure that one of these locks is held?
> At first glance, I wasn't able to find one.

You could play it safe ;)

diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 6d12d2675c80..90e8ccc21cc5 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -104,10 +104,12 @@ int br_validate_ipv6(struct sk_buff *skb)
 {
 	const struct ipv6hdr *hdr;
 	struct net_device *dev = skb->dev;
-	struct inet6_dev *idev = in6_dev_get(skb->dev);
+	struct inet6_dev *idev;
 	u32 pkt_len;
 	u8 ip6h_len = sizeof(struct ipv6hdr);
 
+	rcu_read_lock();
+	idev = __in6_dev_get(dev);
 	if (!pskb_may_pull(skb, ip6h_len))
 		goto inhdr_error;
 
@@ -140,11 +142,13 @@ int br_validate_ipv6(struct sk_buff *skb)
 	/* No IP options in IPv6 header; however it should be
 	 * checked if some next headers need special treatment
 	 */
+	rcu_read_unlock();
 	return 0;
 
 inhdr_error:
 	IP6_INC_STATS_BH(dev_net(dev), idev, IPSTATS_MIB_INHDRERRORS);
 drop:
+	rcu_read_unlock();
 	return -1;
 }
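[The shape of Eric's suggestion is a lock-balance invariant: every exit path out of the function, including the goto error paths, must drop the read lock taken on entry. A tiny userspace sketch of that control flow — the counter stands in for the RCU read-side critical section, and `validate()` mimics the goto layout above:]

```c
#include <assert.h>

/* Depth counter stands in for being inside rcu_read_lock()/unlock(). */
static int lock_depth;
static void rcu_read_lock_(void)   { lock_depth++; }
static void rcu_read_unlock_(void) { lock_depth--; }

/* Mirrors the goto structure of br_validate_ipv6() in the diff:
 * success path and both error labels all drop the lock exactly once. */
static int validate(int hdr_ok, int len_ok)
{
    rcu_read_lock_();
    if (!hdr_ok)
        goto inhdr_error;
    if (!len_ok)
        goto drop;
    rcu_read_unlock_();
    return 0;

inhdr_error:
    /* IP6_INC_STATS_BH(...) would run here, still under the lock,
     * which is the point: idev stays valid until the unlock. */
drop:
    rcu_read_unlock_();
    return -1;
}
```

Whatever path is taken, lock_depth returns to zero; falling through from inhdr_error to drop shares the single unlock, just as in the proposed diff.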
Re: [Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()
On Mon, 2015-07-06 at 16:26 +0800, Bob Liu wrote:
> Hi,
>
> I tried to run the latest kernel v4.2-rc1, but often got below panic
> during system boot.
>
> [ 42.118983] BUG: unable to handle kernel paging request at 003f
> [ 42.119008] IP: [8161cfd0] __netdev_pick_tx+0x70/0x120
> [ 42.119023] PGD 0
> [ 42.119026] Oops: [#1] PREEMPT SMP
> [ 42.119031] Modules linked in: bridge stp llc iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp pcspkr crc32_pclmul crc32c_intel ghash_clmulni_intel ixgbe ptp pps_core cdc_ether usbnet mii mdio sb_edac dca edac_core wmi i2c_i801 tpm_tis tpm lpc_ich mfd_core ipmi_si ipmi_msghandler shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput usb_storage mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core nvme mpt2sas raid_class scsi_transport_sas
> [ 42.119073] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.2.0-rc1 #80
> [ 42.119077] Hardware name: Oracle Corporation SUN SERVER X4-4/ASSY,MB WITH TRAY, BIOS 24030400 08/22/2014
> [ 42.119081] task: 880300b84000 ti: 880300b9 task.ti: 880300b9
> [ 42.119085] RIP: e030:[8161cfd0] [8161cfd0] __netdev_pick_tx+0x70/0x120
> [ 42.119091] RSP: e02b:880306d03868 EFLAGS: 00010206
> [ 42.119093] RAX: 8802f676b6b0 RBX: 003f RCX: 8161cf60
> [ 42.119097] RDX: 001c RSI: 8802fe24c900 RDI: 8802f96c
> [ 42.119100] RBP: 880306d038a8 R08: 00023240 R09: 8160fb1c
> [ 42.119104] R10: R11: R12: 8802fe24c900
> [ 42.119107] R13: R14: R15: 8802f96c
> [ 42.119121] FS: () GS:880306d0() knlGS:
> [ 42.119124] CS: e033 DS: 002b ES: 002b CR0: 80050033
> [ 42.119127] CR2: 003f CR3: 01c1c000 CR4: 00042660
> [ 42.119130] Stack:
> [ 42.119132] 81d63850 8802f63040a0 880306d03888 8802fe24c900
> [ 42.119137] 000e 8802f96c 8802fe24c400
> [ 42.119141] 880306d038e8 a028bea4 8189cfe0 81d1b900
> [ 42.119146] Call Trace:
> [ 42.119149] IRQ
> [ 42.119160] [a028bea4] ixgbe_select_queue+0xc4/0x150 [ixgbe]
> [ 42.119167] [816240ee] netdev_pick_tx+0x5e/0xf0
> [ 42.119170] [81624210] __dev_queue_xmit+0x90/0x560
> [ 42.119174] [816246f3] dev_queue_xmit_sk+0x13/0x20
> [ 42.119181] [a02d2b3a] br_dev_queue_push_xmit+0x4a/0x80 [bridge]
> [ 42.119186] [a02d2cca] br_forward_finish+0x2a/0x80 [bridge]
> [ 42.119191] [a02d2da8] __br_forward+0x88/0x110 [bridge]
> [ 42.119198] [8160e18e] ? __skb_clone+0x2e/0x140
> [ 42.119202] [8160fb33] ? skb_clone+0x63/0xa0
> [ 42.119206] [a02d2d20] ? br_forward_finish+0x80/0x80 [bridge]
> [ 42.119211] [a02d2ac7] deliver_clone+0x37/0x60 [bridge]
> [ 42.119215] [a02d2c38] br_flood+0xc8/0x130 [bridge]
> [ 42.119220] [a02d2d20] ? br_forward_finish+0x80/0x80 [bridge]
> [ 42.119255] [a02d3229] br_flood_forward+0x19/0x20 [bridge]
> [ 42.119260] [a02d4188] br_handle_frame_finish+0x258/0x590 [bridge]
> [ 42.119266] [8172b5d0] ? get_partial_node.isra.63+0x1b7/0x1d4
> [ 42.119272] [a02d4606] br_handle_frame+0x146/0x270 [bridge]
> [ 42.119277] [8168ed39] ? udp_gro_receive+0x129/0x150
> [ 42.119281] [81621836] __netif_receive_skb_core+0x1d6/0xa20
> [ 42.119286] [81697a1d] ? inet_gro_receive+0x9d/0x230
> [ 42.119290] [81622098] __netif_receive_skb+0x18/0x60
> [ 42.119294] [81622113] netif_receive_skb_internal+0x33/0xb0
> [ 42.119297] [81622d3f] napi_gro_receive+0xbf/0x110
> [ 42.119303] [a028def0] ixgbe_clean_rx_irq+0x490/0x9e0 [ixgbe]
> [ 42.119308] [a028f0c0] ixgbe_poll+0x420/0x790 [ixgbe]
> [ 42.119312] [8162255d] net_rx_action+0x15d/0x340
> [ 42.119321] [81095426] __do_softirq+0xe6/0x2f0
> [ 42.119324] [81095904] irq_exit+0xf4/0x100
> [ 42.119333] [814275c9] xen_evtchn_do_upcall+0x39/0x50
> [ 42.119340] [817367de] xen_do_hypervisor_callback+0x1e/0x30
> [ 42.119343] EOI
> [ 42.119348] [810013aa] ? xen_hypercall_sched_op+0xa/0x20
> [ 42.119351] [810013aa] ? xen_hypercall_sched_op+0xa/0x20
> [ 42.119356] [8100bbf0] ? xen_safe_halt+0x10/0x20
> [ 42.119362] [8101feab] ? default_idle+0x1b/0xf0
> [ 42.119365] [8102062f] ? arch_cpu_idle+0xf/0x20
> [ 42.119370] [810d273b] ? default_idle_call+0x3b/0x50
> [ 42.119374] [810d2a7f] ? cpu_startup_entry+0x2bf/0x350
> [ 42.119379] [8101290a] ? cpu_bringup_and_idle+0x2a/0x40
> [ 42.119382] Code: 8b 87 e8 03 00 00 48 85 c0 0f 84 af 00 00 00 41 8b 94 24 ac 00 00 00 83 ea 01 48 8d 44 d0 10 48 8b 18
Re: [Xen-devel] BUG: unable to handle kernel NULL pointer in __netdev_pick_tx()
On Mon, 2015-07-06 at 19:13 +0800, Bob Liu wrote:
> Thank you for the quick fix! Tested by rebooting several times and
> didn't hit this panic any more.

Thanks Bob, I will submit an official patch then ;)
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
> Hi Eric,
>
> Sorry for coming late to the discussion.
>
> On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> > > He suggested that after he'd been prodded by 4 more e-mails in which
> > > two of us guessed what he was trying to get at. That's what I was
> > > complaining about.
> > My big complaint is that I suggested to test to double the sysctl,
> > which gave good results.
>
> Do I understand correctly that it's acceptable to you to double the size
> of the buffer? If so I will send a patch to do that.

Absolutely.
Re: [Xen-devel] [PATCH] xen-netfront: reset skb transport header before checksum
On Mon, 2015-05-11 at 18:34 -0700, Venkat Venkatsubra wrote:
> In ed1f50c3a (net: add skb_checksum_setup) some checksum functions were
> introduced in core. Subsequent change b5cf66cd1 (xen-netfront: use new
> skb_checksum_setup function) made use of those functions to replace its
> own implementation. With that change ip_hdr() and tcp_hdr() were not
> pointing at the right place. d554f73df (xen-netfront: reset skb network
> header before checksum) fixed the ip_hdr(). This patch fixes tcp_hdr().
>
> We saw this problem when LRO was enabled on the host, and it results in
> netfront setting skb->ip_summed to CHECKSUM_UNNECESSARY.
>
> Signed-off-by: Venkat Venkatsubra venkat.x.venkatsu...@oracle.com
> ---
>  drivers/net/xen-netfront.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 3f45afd..ed5c242 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -889,6 +889,7 @@ static int handle_incoming_queue(struct netfront_queue *queue,
>  		/* Ethernet work: Delayed to here as it peeks the header. */
>  		skb->protocol = eth_type_trans(skb, queue->info->netdev);
>  		skb_reset_network_header(skb);
> +		skb_set_transport_header(skb, ip_hdrlen(skb));
>  
>  		if (checksum_setup(queue->info->netdev, skb)) {
>  			kfree_skb(skb);

This looks bogus for IPv6.

I would rather fix skb_checksum_setup_ip() and skb_checksum_setup_ipv6(), since you already use skb_maybe_pull_tail() correctly there.

In general, setting network and transport headers at this point seems extra work.
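[Why ip_hdrlen() is the wrong offset for IPv6: the IPv4 header length comes from the per-packet IHL field, while IPv6 has a fixed 40-byte header that may be followed by a chain of extension headers which must be walked. A small sketch of the arithmetic (illustrative, not skb code):]

```c
#include <assert.h>
#include <stdint.h>

/* IPv4: header length is the low nibble of the first byte (IHL),
 * counted in 32-bit words — 20 to 60 bytes. This is what ip_hdrlen()
 * reads, and it only exists in an IPv4 header. */
static unsigned int ipv4_hdrlen(uint8_t version_ihl)
{
    return (version_ihl & 0x0fu) * 4;
}

/* IPv6: the fixed header is always 40 bytes, but the transport header
 * does NOT necessarily start there — extension headers (hop-by-hop,
 * routing, fragment, ...) can sit in between and must be parsed. */
enum { IPV6_FIXED_HDRLEN = 40 };
```

So skb_set_transport_header(skb, ip_hdrlen(skb)) reads an IHL nibble that isn't there on an IPv6 packet, which is why Eric prefers fixing the protocol-aware skb_checksum_setup_ip()/skb_checksum_setup_ipv6() helpers instead.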
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, 2015-04-16 at 12:39 +0100, George Dunlap wrote:
> On 04/15/2015 07:17 PM, Eric Dumazet wrote:
> > Do not expect me to fight bufferbloat alone. Be part of the challenge,
> > instead of trying to get back to proven bad solutions.
> I tried that. I wrote a description of what I thought the situation was,
> so that you could correct me if my understanding was wrong, and then
> what I thought we could do about it. You apparently didn't even read it,
> but just pointed me to a single cryptic comment that doesn't give me
> enough information to actually figure out what the situation is.
>
> We all agree that bufferbloat is a problem for everybody, and I can
> definitely understand the desire to actually make the situation better
> rather than dying the death of a thousand exceptions. If you want help
> fighting bufferbloat, you have to educate people to help you; or
> alternately, if you don't want to bother educating people, you have to
> fight it alone -- or lose the battle due to having a thousand exceptions.
>
> So, back to TSQ limits. What's so magical about 2 packets being *in the
> device itself*? And what does 1ms, or 2*64k packets (the default for
> tcp_limit_output_bytes), have anything to do with it?
>
> Your comment lists three benefits:
>  1. better RTT estimation
>  2. faster recovery
>  3. high rates
>
> #3 is just marketing fluff; it's also contradicted by the statement that
> immediately follows it -- i.e., there are drivers for which the
> limitation does *not* give high rates.
>
> #1, as far as I can tell, has to do with measuring the *actual* minimal
> round-trip time of an empty pipe, rather than the round-trip time you
> get when there's 512MB of packets in the device buffer. If a device has
> a large internal buffer, then having a large number of packets
> outstanding means that the measured RTT is skewed. The goal here, I take
> it, is to have this pipe *exactly* full; having it significantly more
> than full is what leads to bufferbloat.
>
> #2 sounds like you're saying that if there are too many packets
> outstanding when you discover that you need to adjust things, it takes a
> long time for your changes to have an effect; i.e., if you have 5ms of
> data in the pipe, it will take at least 5ms for your reduced
> transmission rate to actually have an effect. Is that accurate, or have
> I misunderstood something?

#2 means that :

If you have an outstanding queue of 500 packets for a flow in qdisc, and a rtx has to be done because we receive a SACK, the rtx is queued _after_ the previous 500 packets. 500 packets have to be drained before the rtx can be sent and eventually reach the destination. These 500 packets will likely be dropped because the destination cannot process them before the rtx.

2 TSO packets are already 90 packets (MSS=1448). It is not small, but a good compromise allowing line rate even on a 40Gbit NIC.

#1 is not marketing. It is hugely relevant. You might use cubic as the default congestion control, but you have to understand we work hard on delay-based cc, as losses are no longer a way to measure congestion in modern networks. Vegas and delay-gradient congestion control depend on precise RTT measures.

I added usec RTT estimations (instead of jiffies-based rtt samples) to increase resolution by 3 orders of magnitude, not for marketing, but because it had to be done when DC communications have typical rtt of 25 usec these days. And jitter in host queues is not nice and must be kept at the minimum.

You do not have the whole picture, but this tight bufferbloat control is one step before we can replace cubic by new upcoming cc, which many companies are actively developing and testing. The steps are the following :

1) TCP Small Queues
2) FQ/pacing
3) TSO auto sizing
4) usec rtt estimations
5) New revolutionary cc module currently under test at Google, but others have alternatives.

The fact that a few drivers have bugs should not stop this effort.

If you guys are in the Bay Area, we would be happy to host a meeting where we can present how our work reduced packet drops in our networks by two orders of magnitude, and increased capacity by 40 or 50%.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> He suggested that after he'd been prodded by 4 more e-mails in which two
> of us guessed what he was trying to get at. That's what I was
> complaining about.

My big complaint is that I suggested to test doubling the sysctl, which gave good results. Then you provided a patch using a 8x factor. How does that sound ?

Next time I ask for a raise, I should try a 8x factor as well; who knows, it might be accepted.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote:
> On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet eric.duma...@gmail.com wrote:
> > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
> > > Is the problem perhaps that netback/netfront delays TX completion?
> > > Would it be better to see if that can be addressed properly, so that
> > > the original purpose of the patch (fighting bufferbloat) can be
> > > achieved while not degrading performance for Xen? Or at least, so
> > > that people get decent performance out of the box without having to
> > > tweak TCP parameters?
> > Sure, please provide a patch, that does not break back pressure.
> > But just in case, if Xen performance relied on bufferbloat, it might be
> > very difficult to reach a stable equilibrium : any small change in
> > stack or scheduling might introduce a significant difference in 'raw
> > performance'.
>
> So help me understand this a little bit here. tcp_limit_output_bytes
> limits the amount of data allowed to be in-transit between a send() and
> the wire, is that right? And so the bufferbloat problem you're talking
> about here concerns TCP buffers inside the kernel, and/or buffers in the
> NIC, is that right?
>
> So ideally, you want this to be large enough to fill the pipeline all
> the way from send() down to actually getting out on the wire; otherwise,
> you'll have gaps in the pipeline, and the machinery won't be working at
> full throttle.
>
> And the reason it's a problem is that many NICs now come with large send
> buffers; and effectively what happens then is that this makes the
> pipeline longer -- as the buffer fills up, the time between send() and
> the wire is increased. This increased latency causes delays in
> round-trip-times and interferes with the mechanisms TCP uses to try to
> determine what the actual sustainable rate of data transmission is.
>
> By limiting the number of in-transit bytes, you make sure that neither
> the kernel nor the NIC is going to have packets queued up for long
> lengths of time in buffers, and you keep this pipeline as close to the
> actual minimal length of the pipeline as possible. And it sounds like
> for your 40G NIC, 128k is big enough to fill the pipeline without unduly
> making it longer by introducing buffering.
>
> Is that an accurate picture of what you're trying to achieve?
>
> But the problem for xennet (and a number of other drivers), as I
> understand it, is that at the moment the pipeline itself is just longer
> -- it just takes a longer time from the time you send a packet to the
> time it actually gets out on the wire. So it's not actually accurate to
> say that "Xen performance relies on bufferbloat". There's no buffering
> involved -- the pipeline is just longer, and so to fill up the pipeline
> you need more data.
>
> Basically, to maximize throughput while minimizing buffering, for *any*
> connection, tcp_limit_output_bytes should ideally be around
> (min_tx_latency * max_bandwidth). For physical NICs, the minimum latency
> is really small, but for xennet -- and I'm guessing for a lot of
> virtualized cards -- the min_tx_latency will be a lot higher, requiring
> a much higher ideal tcp_limit_output value.
>
> Rather than trying to pick a single value which will be good for all
> NICs, it seems like it would make more sense to have this vary depending
> on the parameters of the NIC. After all, for NICs that have low
> throughput -- say, old 100MiB NICs -- even 128k may still introduce a
> significant amount of buffering.
>
> Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance nightmare.
>
> Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case.
>
> Probably the most sustainable solution would be to have the networking
> layer keep track of the average and minimum transmit latencies, and
> automatically adjust tcp_limit_output_bytes based on that. (Keeping the
> minimum as well as the average because the whole problem with
> bufferbloat is that the more data you give it, the longer the apparent
> pipeline becomes.)
>
> Thoughts?

My thoughts are that instead of these long talks you guys should read the code :

	/* TCP Small Queues :
	 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
	 * This allows for :
	 *  - better RTT estimation and ACK scheduling
	 *  - faster recovery
	 *  - high rates
	 * Alas, some drivers / subsystems require a fair amount
	 * of queued bytes to ensure line rate.
	 * One example is wifi aggregation (802.11 AMPDU)
	 */
	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
	limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

Then you'll see that most of your questions are already answered. Feel free
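[The two kernel lines quoted in this exchange are easy to evaluate by hand. Here is a userspace rendering with illustrative numbers (125000 B/s ≈ 1 Mbit/s, 1250000000 B/s ≈ 10 Gbit/s; the function name is ours, not the kernel's):]

```c
#include <assert.h>
#include <stdint.h>

static uint32_t u32_max(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t u32_min(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Userspace rendering of:
 *   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 *   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 * pacing_rate >> 10 is roughly 1 ms worth of bytes at the pacing rate;
 * the floor is two skbs, the ceiling is the sysctl. */
static uint32_t tsq_limit(uint32_t truesize, uint32_t pacing_rate,
                          uint32_t sysctl_limit)
{
    uint32_t limit = u32_max(2 * truesize, pacing_rate >> 10);
    return u32_min(limit, sysctl_limit);
}
```

For a slow flow the two-skb floor dominates; for a fast flow the per-rate term grows past the sysctl and gets clamped, which is exactly the clamp the Xen side wants raised.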
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Wed, 2015-04-15 at 15:36 +0100, Ian Campbell wrote:
On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote:
On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley [...]

From a networking point of view, the backend is a switch. Is it OK to consider the packet to have been transmitted from the guest point of view once the backend is aware of the packet? This would help justify the skb_orphan() in the frontend.

This sounds sensible to me, particularly if virtio_net is already doing it. I also find Malcolm's argument above pretty compelling.

Yes, and then you'll have to help the ongoing virtio effort trying to get rid of this skb_orphan(). Basically you're adding head-of-line blocking, as a single flow is able to fill your queue.

___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
On 04/15/2015 05:38 PM, Eric Dumazet wrote:

My thought is that instead of these long talks you guys should read the code:

    /* TCP Small Queues :
     * Control number of packets in qdisc/devices to two packets / or ~1 ms.
     * This allows for :
     *  - better RTT estimation and ACK scheduling
     *  - faster recovery
     *  - high rates
     * Alas, some drivers / subsystems require a fair amount
     * of queued bytes to ensure line rate.
     * One example is wifi aggregation (802.11 AMPDU)
     */
    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
    limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

Then you'll see that most of your questions are already answered. Feel free to try to improve the behavior, if it does not hurt critical workloads like TCP_RR, where we send very small messages, millions of times per second.

First of all, with regard to critical workloads: once this patch gets into distros, *normal TCP streams* on every VM running on Amazon, Rackspace, Linode, etc. will get a 30% hit in performance *by default*. Normal TCP streams on xennet *are* a critical workload, and deserve the same kind of accommodation as TCP_RR (if not more). The same goes for virtio_net.

Secondly, according to Stefano's and Jonathan's tests, raising tcp_limit_output_bytes completely fixes the problem for Xen. Which means that max(2 * skb->truesize, sk->sk_pacing_rate >> 10) is *already* larger for Xen; that calculation mentioned in the comment is *already* doing the right thing. As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an automatic TSQ calculation which is actually choosing an effective value for xennet.

It certainly makes sense for sysctl_tcp_limit_output_bytes to be an actual maximum limit. I went back and looked at the original patch which introduced it (46d3ceabd), and it looks to me like it was designed to be a rough, quick estimate of two packets outstanding (by choosing the maximum size of the packet, 64k, and multiplying it by two).
Now that you have a better algorithm -- the size of 2 actual packets or the amount transmitted in 1 ms -- it seems like the default sysctl_tcp_limit_output_bytes should be higher, and let the automatic TSQ calculation you have on the first line throttle things down when necessary.

I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes. You have no need to explain to me the code I wrote, thank you.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs?

Keep in mind some VM users install custom qdiscs, or even custom TCP sysctls.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
Which means that max(2 * skb->truesize, sk->sk_pacing_rate >> 10) is *already* larger for Xen; that calculation mentioned in the comment is *already* doing the right thing.

Sigh. 1 ms of traffic at 40 Gbit is 5 MBytes. The reason for the cap on /proc/sys/net/ipv4/tcp_limit_output_bytes is to provide the limitation of ~2 TSO packets, which _also_ is documented. Without this limitation, 5 MBytes could translate to: fill the queue, do not limit. If a particular driver needs to extend the limit, fine; document it and take actions.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
Have you tested this patch on a NIC without GSO/TSO? This would allow more than 500 packets for a single flow. Hello bufferbloat.

Wouldn't the fq_codel qdisc on that interface address that problem?

Last time I checked, the default qdisc was pfifo_fast. These guys do not want to change a sysctl; how would pfifo_fast magically become fq_codel?
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:
So you'd be OK with a patch like this? (With perhaps a better changelog?)

-George

---
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the size of actual packets and the amount of data being transmitted. Raise the default static limit to allow that new limit to actually come into effect. This fixes a regression where NICs with large transmit completion times (such as xennet) had a 30% hit unless the user manually tweaked the value in /proc.

Signed-off-by: George Dunlap george.dun...@eu.citrix.com

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
 
-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
 
 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume. Building TSO frames

Have you tested this patch on a NIC without GSO/TSO? This would allow more than 500 packets for a single flow. Hello bufferbloat.

So my answer to this patch is a no.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
Maybe you should stop wasting all of our time and just tell us what you're thinking.

I think you are making me waste my time. I already gave all the hints in prior discussions. Rome was not built in one day.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
Eric Dumazet eric.duma...@gmail.com wrote:
We already have netdev->gso_max_size and netdev->gso_max_segs, which are cached into sk->sk_gso_max_size and sk->sk_gso_max_segs.

It is quite dangerous to attempt tricks like this because a tc redirection or netfilter NAT could change the destination device, rendering such hints incorrect.

Right, but we are talking about performance hints on a quite basic VM setup. Here the guest would use xen and this hint would apply.
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Mon, 2015-04-13 at 14:46 +0100, David Vrabel wrote:
And the proof-of-concept patch for idea (b) I used was:

@@ -552,6 +552,8 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	skb_orphan(skb);
+
 	page = virt_to_page(skb->data);
 	offset = offset_in_page(skb->data);
 	len = skb_headlen(skb);

No. That is a bunch of allocations and a full memcpy of all the frags; skb_orphan() does nothing like that. But the main concern here is that it basically breaks back pressure. And you do not want this, unless there is no other choice.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
Is the problem perhaps that netback/netfront delays TX completion? Would it be better to see if that can be addressed properly, so that the original purpose of the patch (fighting bufferbloat) can be achieved while not degrading performance for Xen? Or at least, so that people get decent performance out of the box without having to tweak TCP parameters?

Sure, please provide a patch that does not break back pressure. But just in case: if Xen performance relied on bufferbloat, it might be very difficult to reach a stable equilibrium. Any small change in the stack or in scheduling might introduce a significant difference in 'raw performance'.
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:
Hi all,

I found a performance regression when running netperf -t TCP_MAERTS from an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the virtual machine are 30% slower than v3.18. Through bisection I found that the regression is caused by the presence of the following commit in the guest kernel:

commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
Author: Eric Dumazet eduma...@google.com
Date: Sun Dec 7 12:22:18 2014 -0800

    tcp: refine TSO autosizing

A simple revert would fix the issue. Does anybody have any ideas on what could be the cause of the problem? Suggestions on what to do to fix it?

You sent this to lkml while networking discussions are on netdev. This topic has been discussed on netdev multiple times. This commit restored the original TCP Small Queues behavior, which is the first step in fighting bufferbloat. Some network drivers are known to be problematic because of delayed TX completion. So far this commit did not impact max single-flow throughput on a 40Gb mlx4 NIC (i.e. line rate is possible). Try tweaking /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it makes a difference?
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, 2015-04-09 at 17:36 +0100, Stefano Stabellini wrote:
A very big difference:

echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes

brings us much closer to the original performance; the slowdown is just 8%.

Cool.

echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes

fills the gap entirely: same performance as before "tcp: refine TSO autosizing".

Sure, this basically disables TCP Small Queues and selects the opposite: favor single-flow throughput and huge latencies (bufferbloat).

What would be the next step from here? Should I just document this as an important performance tweaking step for Xen, or is there something else we can do?

I guess this is a reasonable choice. Note that /proc/sys/net/ipv4/tcp_limit_output_bytes is already documented.
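For anyone wanting to apply Stefano's tweak on their own guests, a sketch of both the runtime and the persistent form follows. The 1 MB value is simply the figure tested in this thread; the drop-in file name is an assumption, and the right value is workload- and NIC-dependent.

```shell
# Runtime tweak (root required); lost on reboot:
echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes

# Persistent form, e.g. in a hypothetical /etc/sysctl.d/90-xen-tsq.conf:
#   net.ipv4.tcp_limit_output_bytes = 1048576
# then apply with: sysctl --system
```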
Re: [Xen-devel] [PATCH] xen-netfront: transmit fully GSO-sized packets
On Thu, 2015-03-26 at 11:13 +0000, Jonathan Davies wrote:
xen-netfront limits transmitted skbs to be at most 44 segments in size. However, GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448 bytes each. This slight reduction in the size of packets means a slight loss in efficiency.

Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER, where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.

The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s 6c09fa09d) in determining when to split an skb into two is sk->sk_gso_max_size - 1 - MAX_TCP_HEADER. So the maximum permitted size of an skb is calculated to be (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.

Intuitively, this looks like the wrong formula -- we don't need two TCP headers. Instead, there is no need to deviate from the default gso_max_size of 65536, as this already accommodates the size of the header.

Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments of 1448 bytes each), as observed via tcpdump. This patch makes netfront send skbs of up to 65160 bytes (45 segments of 1448 bytes each).

Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header")
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
 drivers/net/xen-netfront.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e9b960f..fb6e978 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1279,8 +1279,6 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
 	netdev->ethtool_ops = &xennet_ethtool_ops;
 	SET_NETDEV_DEV(netdev, &dev->dev);
 
-	netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
-
 	np->netdev = netdev;
 
 	netif_carrier_off(netdev);

Hmm, this partially reverts commit 9ecd1a75d977e2e8c48139c7d3efed183f898d94. Why is xennet_change_mtu() not changed by your patch?
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Thu, 2015-03-26 at 16:46 +0000, Jonathan Davies wrote:
Network drivers with slow TX completion can experience poor network transmit throughput, limited by hitting the sk_wmem_alloc limit check in tcp_write_xmit. The limit is 128 KB (by default), which means we are limited to two 64 KB skbs in-flight. This has been observed to limit transmit throughput with xen-netfront because its TX completion can be relatively slow compared to physical NIC drivers.

There have been several modifications to the calculation of the sk_wmem_alloc limit in the past. Here is a brief history:

 * Since TSQ was introduced, the queue size limit was sysctl_tcp_limit_output_bytes.

 * Commit c9eeec26 ("tcp: TSQ can use a dynamic limit") made the limit max(skb->truesize, sk->sk_pacing_rate >> 10). This allows more packets in-flight according to the estimated rate.

 * Commit 98e09386 ("tcp: tsq: restore minimal amount of queueing") made the limit max_t(unsigned int, sysctl_tcp_limit_output_bytes, sk->sk_pacing_rate >> 10). This ensures at least sysctl_tcp_limit_output_bytes in flight but allowed more if rate estimation shows this to be worthwhile.

 * Commit 605ad7f1 ("tcp: refine TSO autosizing") made the limit min_t(u32, max(2 * skb->truesize, sk->sk_pacing_rate >> 10), sysctl_tcp_limit_output_bytes). This meant that the limit can never exceed sysctl_tcp_limit_output_bytes, regardless of what rate estimation suggests.

It's not clear from the commit message why this significant change was justified, changing sysctl_tcp_limit_output_bytes from being a lower bound to an upper bound.

This patch restores the behaviour that allows the limit to grow above sysctl_tcp_limit_output_bytes according to the rate estimation. This has been measured to improve xen-netfront throughput from a domU to dom0 from 5.5 Gb/s to 8.0 Gb/s. Or, in the case of transmitting from one domU to another (on the same host), throughput rose from 2.8 Gb/s to 8.0 Gb/s. In the latter case, TX completion is especially slow, explaining the large improvement.
These values were measured against 4.0-rc5 using iperf -c <ip> -i 1 using CentOS 7.0 VM(s) on Citrix XenServer 6.5 on a Dell R730 host with a pair of Xeon E5-2650 v3 CPUs.

Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing")
Signed-off-by: Jonathan Davies jonathan.dav...@citrix.com
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..3a49af8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2052,7 +2052,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	 * One example is wifi aggregation (802.11 AMPDU)
 	 */
 	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
-	limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
+	limit = max_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 		set_bit(TSQ_THROTTLED, &tp->tsq_flags);

That will get a NACK from me and the Google TCP team, of course. Try your patch with low-throughput flows, like 64 kbps, GSO off, and observe TCP timestamps and RTT estimations, critical for Vegas or any CC based on delay. I get line rate on a 40Gbps NIC using this (low) limit of 2 TSO packets.

This topic was discussed for wifi recently, and I suggested multiple ways to solve the problem in problematic drivers. There is a reason sysctl_tcp_limit_output_bytes exists: you can change its value to whatever you want.

vi +652 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
	Controls TCP Small Queue limit per tcp socket.
	TCP bulk sender tends to increase packets in flight until it
	gets losses notifications. With SNDBUF autotuning, this can
	result in a large amount of packets queued in qdisc/device
	on the local machine, hurting latency of other flows, for
	typical pfifo_fast qdiscs.
	tcp_limit_output_bytes limits the number of bytes on qdisc
	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
	Default: 131072

The documentation is pretty clear: this is the max value, not a min one.
Re: [Xen-devel] net: Detect drivers that reschedule NAPI and exhaust budget
On Sat, 2014-12-20 at 17:55 +1100, Herbert Xu wrote:

-- >8 --
The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c ("net: less interrupt masking in NAPI") required drivers to leave poll_list empty if the entire budget is consumed. We have already had two broken drivers so let's add a check for this.

Signed-off-by: Herbert Xu herb...@gondor.apana.org.au

diff --git a/net/core/dev.c b/net/core/dev.c
index f411c28..47fdc5c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4620,7 +4620,13 @@ static void net_rx_action(struct softirq_action *h)
 		 */
 		napi_gro_flush(n, HZ >= 1000);
 	}
-	list_add_tail(&n->poll_list, &repoll);
+	/* Some drivers may have called napi_schedule
+	 * prior to exhausting their budget.
+	 */
+	if (unlikely(!list_empty(&n->poll_list)))
+		pr_warn("%s: Budget exhausted after napi rescheduled\n",
+			n->dev ? n->dev->name : "backlog");
+	else
+		list_add_tail(&n->poll_list, &repoll);
 }

Thanks, OK, but could you:

1) use pr_warn_once()

2) split the too-long line:

	pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
		     n->dev ? n->dev->name : "backlog");

Thanks Herbert!
Re: [Xen-devel] net: Detect drivers that reschedule NAPI and exhaust budget
On Sat, 2014-12-20 at 11:36 +1100, Herbert Xu wrote:
On Sat, Dec 20, 2014 at 11:23:27AM +1100, Herbert Xu wrote:
A similar bug exists in virtio_net. In order to detect other drivers doing this we should add something like this.

-- >8 --
The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c ("net: less interrupt masking in NAPI") required drivers to leave poll_list empty if the entire budget is consumed. We have already had two broken drivers so let's add a check for this.

Signed-off-by: Herbert Xu herb...@gondor.apana.org.au

diff --git a/net/core/dev.c b/net/core/dev.c
index f411c28..88f9725 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4620,7 +4620,11 @@ static void net_rx_action(struct softirq_action *h)
 		 */
 		napi_gro_flush(n, HZ >= 1000);
 	}
-	list_add_tail(&n->poll_list, &repoll);
+	/* Some drivers may have called napi_schedule
+	 * prior to exhausting their budget.
+	 */
+	if (!WARN_ON_ONCE(!list_empty(&n->poll_list)))
+		list_add_tail(&n->poll_list, &repoll);
 }

I do not think the stack trace will point to the buggy driver? IMO it would be better to print a single line with the netdev name?