bond: take rcu lock in netpoll_send_skb_on_dev
The bonding driver lacks the rcu lock when it calls down into
netdev_lower_get_next_private_rcu from bond_poll_controller, which
results in a trace like:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX: RBX: 880429614560 RCX:
RDX: 0001 RSI: RDI: a184ada0
RBP: c987fa80 R08: 0001 R09:
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15:
FS: () GS:88042f88() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

We're also doing rcu dereferences a layer up in netpoll_send_skb_on_dev
before we call down into netpoll_poll_dev, so just take the lock there.

Suggested-by: Cong Wang
Signed-off-by: Dave Jones

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..692367d7c280 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 	/* It is up to the caller to keep npinfo alive. */
 	struct netpoll_info *npinfo;
 
+	rcu_read_lock_bh();
 	lockdep_assert_irqs_disabled();
 
 	npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 		skb_queue_tail(&npinfo->txq, skb);
 		schedule_delayed_work(&npinfo->tx_work,0);
 	}
+	rcu_read_unlock_bh();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);
Re: bond: take rcu lock in bond_poll_controller
On Fri, Sep 28, 2018 at 12:03:22PM -0700, Cong Wang wrote:
> On Fri, Sep 28, 2018 at 12:02 PM Cong Wang wrote:
> >
> > On Fri, Sep 28, 2018 at 11:26 AM Dave Jones wrote:
> > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > > index 3219a2932463..4f9494381635 100644
> > > --- a/net/core/netpoll.c
> > > +++ b/net/core/netpoll.c
> > > @@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
> > >  	/* It is up to the caller to keep npinfo alive. */
> > >  	struct netpoll_info *npinfo;
> > >
> > > +	rcu_read_lock();
> > >  	lockdep_assert_irqs_disabled();
> > >
> > >  	npinfo = rcu_dereference_bh(np->dev->npinfo);
> >
> > I think you probably need rcu_read_lock_bh() to satisfy
> > rcu_dereference_bh()...
>
> But irq is disabled here, so not sure if rcu_read_lock_bh()
> could cause trouble...

Interesting... I was wondering for a moment why I never got a warning,
then I remembered I disabled lockdep for that machine because nfs spews
stuff.

I'll doublecheck, and post v4.

lol, this looked like a 2 minute fix at first.

	Dave
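For reference, the point being debated above is the pairing between the RCU
lock flavour and the dereference flavour. A minimal sketch of the two
pairings -- illustrative only, with a made-up pointer "gp", not code taken
from the patch:

#include <linux/rcupdate.h>

struct foo;
static struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

static void rcu_pairing_sketch(void)
{
	struct foo *p;

	rcu_read_lock();
	p = rcu_dereference(gp);	/* classic RCU pairing */
	(void)p;
	rcu_read_unlock();

	rcu_read_lock_bh();
	p = rcu_dereference_bh(gp);	/* the _bh pairing Cong is pointing at */
	(void)p;
	rcu_read_unlock_bh();
}

rcu_dereference_bh() is annotated to expect a BH-flavoured read-side
critical section, which is why plain rcu_read_lock() around it still leaves
lockdep unhappy.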
bond: take rcu lock in bond_poll_controller
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like below is shown

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX: RBX: 880429614560 RCX:
RDX: 0001 RSI: RDI: a184ada0
RBP: c987fa80 R08: 0001 R09:
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15:
FS: () GS:88042f88() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Suggested-by: Cong Wang
Signed-off-by: Dave Jones

--
v3: Do this in netpoll_send_skb_on_dev as Cong suggests.

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..4f9494381635 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 	/* It is up to the caller to keep npinfo alive. */
 	struct netpoll_info *npinfo;
 
+	rcu_read_lock();
 	lockdep_assert_irqs_disabled();
 
 	npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 		skb_queue_tail(&npinfo->txq, skb);
 		schedule_delayed_work(&npinfo->tx_work,0);
 	}
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);
Re: bond: take rcu lock in bond_poll_controller
On Fri, Sep 28, 2018 at 10:31:39AM -0700, Cong Wang wrote:
> On Fri, Sep 28, 2018 at 10:25 AM Dave Jones wrote:
> >
> > On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
> > > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones wrote:
> > > >
> > > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
> > > > otherwise a trace like below is shown
> > >
> > > So why not take rcu read lock in netpoll_send_skb_on_dev() where
> > > RCU is also assumed?
> >
> > that does seem to solve the backtrace spew I saw too, so if that's
> > preferable I can respin the patch.
>
> From my observations, netpoll_send_skb_on_dev() does not take
> RCU read lock _and_ it relies on rcu read lock because it calls
> rcu_dereference_bh().
>
> If my observation is correct, you should catch a RCU warning like
> this but within netpoll_send_skb_on_dev().
>
> > > As I said, I can't explain why you didn't trigger the RCU warning in
> > > netpoll_send_skb_on_dev()...
> >
> > netpoll_send_skb_on_dev takes the rcu lock itself.
>
> Could you please point me where exactly is the rcu lock here?
>
> I am too stupid to see it. :)

No, I'm the stupid one. I looked at the tree I had just edited to try
your proposed change. Now that I've untangled myself, I'll repost with
your suggested change.

	Dave
Re: bond: take rcu lock in bond_poll_controller
On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
> On Fri, Sep 28, 2018 at 9:18 AM Dave Jones wrote:
> >
> > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
> > otherwise a trace like below is shown
>
> So why not take rcu read lock in netpoll_send_skb_on_dev() where
> RCU is also assumed?

that does seem to solve the backtrace spew I saw too, so if that's
preferable I can respin the patch.

> As I said, I can't explain why you didn't trigger the RCU warning in
> netpoll_send_skb_on_dev()...

netpoll_send_skb_on_dev takes the rcu lock itself.

	Dave
bond: take rcu lock in bond_poll_controller
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like below is shown

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX: RBX: 880429614560 RCX:
RDX: 0001 RSI: RDI: a184ada0
RBP: c987fa80 R08: 0001 R09:
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15:
FS: () GS:88042f88() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Signed-off-by: Dave Jones

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c05c01a00755..77a3607a7099 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -977,6 +977,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
 		if (bond_3ad_get_active_agg_info(bond, &ad_info))
 			return;
 
+	rcu_read_lock();
 	bond_for_each_slave_rcu(bond, slave, iter) {
 		if (!bond_slave_is_up(slave))
 			continue;
@@ -992,6 +993,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
 		netpoll_poll_dev(slave->dev);
 	}
+	rcu_read_unlock();
 }
 
 static void bond_netpoll_cleanup(struct net_device *bond_dev)
bond: take rcu lock in bond_poll_controller
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like below is shown

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX: RBX: 880429614560 RCX:
RDX: 0001 RSI: RDI: a184ada0
RBP: c987fa80 R08: 0001 R09:
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15:
FS: () GS:88042f88() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Signed-off-by: Dave Jones

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a764a83f99da..519968d4513b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -978,6 +978,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
 		if (bond_3ad_get_active_agg_info(bond, &ad_info))
 			return;
 
+	rcu_read_lock();
 	bond_for_each_slave_rcu(bond, slave, iter) {
 		ops = slave->dev->netdev_ops;
 		if (!bond_slave_is_up(slave) || !ops->ndo_poll_controller)
 			continue;
@@ -998,6 +999,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
 		ops->ndo_poll_controller(slave->dev);
 		up(&ni->dev_lock);
 	}
+	rcu_read_unlock();
 }
 
 static void bond_netpoll_cleanup(struct net_device *bond_dev)
ipset: suspicious RCU usage
= WARNING: suspicious RCU usage 4.16.0-rc3-firewall+ #1 Not tainted - net/netfilter/ipset/ip_set_core.c:1354 suspicious rcu_dereference_protected() usage! \x0aother info that might help us debug this:\x0a \x0arcu_scheduler_active = 2, debug_locks = 1 2 locks held by ipset/16323: #0: nlk_cb_mutex-NETFILTER ){+.+.} , at: [<5e54683c>] netlink_dump+0x1c/0x2b0 #1: (ip_set_ref_lock){++..} , at: [<89f25f26>] ip_set_dump_start+0x133/0x7a0 stack backtrace: CPU: 0 PID: 16323 Comm: ipset Not tainted 4.16.0-rc3-firewall+ #1 Call Trace: dump_stack+0x67/0x8e ip_set_dump_start+0x5f0/0x7a0 ? ip_set_dump_start+0x5/0x7a0 ? __kmalloc_reserve.isra.38+0x29/0x70 ? ksize+0x10/0xa0 ? __alloc_skb+0x90/0x1b0 netlink_dump+0x106/0x2b0 netlink_recvmsg+0x337/0x380 ? copy_msghdr_from_user+0xdb/0x150 ___sys_recvmsg+0xc6/0x160 ? netlink_sendmsg+0x129/0x420 ? SYSC_sendto+0x11b/0x180 __sys_recvmsg+0x51/0x90 do_syscall_64+0x84/0x735 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7fa78e00ca57 RSP: 002b:7fff61cfad78 EFLAGS: 0246 ORIG_RAX: 002f RAX: ffda RBX: 55f22c7c06f0 RCX: 7fa78e00ca57 RDX: RSI: 7fff61cfada0 RDI: 0003 RBP: 55f22c7bf4b8 R08: 7fa78dd1fbe0 R09: 000c R10: R11: 0246 R12: 1000 R13: 55f22c7c04d0 R14: R15: 55f22bf13908
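For context on the splat: rcu_dereference_protected() is the update-side
accessor and takes an explicit lockdep condition; the warning above fires
when that condition does not actually hold on the calling path. A generic
sketch of the pattern with made-up names (this is not the ip_set_core.c
code):

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct cfg;
static struct cfg __rcu *active_cfg;	/* made-up RCU-managed pointer */
static DEFINE_SPINLOCK(cfg_lock);	/* made-up update-side lock */

static struct cfg *get_cfg_locked(void)
{
	/* Only legal while cfg_lock is held; otherwise lockdep emits a
	 * "suspicious rcu_dereference_protected() usage" splat like the
	 * one above.
	 */
	return rcu_dereference_protected(active_cfg,
					 lockdep_is_held(&cfg_lock));
}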
Re: [4.15-rc9] fs_reclaim lockdep trace
On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> Dave, would you try below patch?
>
> >From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Sun, 28 Jan 2018 14:17:14 +0900
> Subject: [PATCH] lockdep: Fix fs_reclaim warning.

Seems to suppress the warning for me.

Tested-by: Dave Jones <da...@codemonkey.org.uk>
Re: [4.15-rc9] fs_reclaim lockdep trace
On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote: > Just triggered this on a server I was rsync'ing to. Actually, I can trigger this really easily, even with an rsync from one disk to another. Though that also smells a little like networking in the traces. Maybe netdev has ideas. The first instance: > > WARNING: possible recursive locking detected > 4.15.0-rc9-backup-debug+ #1 Not tainted > > sshd/24800 is trying to acquire lock: > (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > but task is already holding lock: > (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > other info that might help us debug this: > Possible unsafe locking scenario: > >CPU0 > > lock(fs_reclaim); > lock(fs_reclaim); > > *** DEADLOCK *** > > May be due to missing lock nesting notation > > 2 locks held by sshd/24800: > #0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] > tcp_sendmsg+0x19/0x40 > #1: (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > stack backtrace: > CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 > Call Trace: > dump_stack+0xbc/0x13f > ? _atomic_dec_and_lock+0x101/0x101 > ? fs_reclaim_acquire.part.102+0x5/0x30 > ? print_lock+0x54/0x68 > __lock_acquire+0xa09/0x2040 > ? debug_show_all_locks+0x2f0/0x2f0 > ? mutex_destroy+0x120/0x120 > ? hlock_class+0xa0/0xa0 > ? kernel_text_address+0x5c/0x90 > ? __kernel_text_address+0xe/0x30 > ? unwind_get_return_address+0x2f/0x50 > ? __save_stack_trace+0x92/0x100 > ? graph_lock+0x8d/0x100 > ? check_noncircular+0x20/0x20 > ? __lock_acquire+0x616/0x2040 > ? debug_show_all_locks+0x2f0/0x2f0 > ? __lock_acquire+0x616/0x2040 > ? debug_show_all_locks+0x2f0/0x2f0 > ? print_irqtrace_events+0x110/0x110 > ? active_load_balance_cpu_stop+0x7b0/0x7b0 > ? debug_show_all_locks+0x2f0/0x2f0 > ? mark_lock+0x1b1/0xa00 > ? lock_acquire+0x12e/0x350 > lock_acquire+0x12e/0x350 > ? fs_reclaim_acquire.part.102+0x5/0x30 > ? lockdep_rcu_suspicious+0x100/0x100 > ? set_next_entity+0x20e/0x10d0 > ? mark_lock+0x1b1/0xa00 > ? match_held_lock+0x8d/0x440 > ? mark_lock+0x1b1/0xa00 > ? save_trace+0x1e0/0x1e0 > ? print_irqtrace_events+0x110/0x110 > ? alloc_extent_state+0xa7/0x410 > fs_reclaim_acquire.part.102+0x29/0x30 > ? fs_reclaim_acquire.part.102+0x5/0x30 > kmem_cache_alloc+0x3d/0x2c0 > ? rb_erase+0xe63/0x1240 > alloc_extent_state+0xa7/0x410 > ? lock_extent_buffer_for_io+0x3f0/0x3f0 > ? find_held_lock+0x6d/0xd0 > ? test_range_bit+0x197/0x210 > ? lock_acquire+0x350/0x350 > ? do_raw_spin_unlock+0x147/0x220 > ? do_raw_spin_trylock+0x100/0x100 > ? iotree_fs_info+0x30/0x30 > __clear_extent_bit+0x3ea/0x570 > ? clear_state_bit+0x270/0x270 > ? count_range_bits+0x2f0/0x2f0 > ? lock_acquire+0x350/0x350 > ? rb_prev+0x21/0x90 > try_release_extent_mapping+0x21a/0x260 > __btrfs_releasepage+0xb0/0x1c0 > ? btrfs_submit_direct+0xca0/0xca0 > ? check_new_page_bad+0x1f0/0x1f0 > ? match_held_lock+0xa5/0x440 > ? debug_show_all_locks+0x2f0/0x2f0 > btrfs_releasepage+0x161/0x170 > ? __btrfs_releasepage+0x1c0/0x1c0 > ? page_rmapping+0xd0/0xd0 > ? rmap_walk+0x100/0x100 > try_to_release_page+0x162/0x1c0 > ? generic_file_write_iter+0x3c0/0x3c0 > ? page_evictable+0xcc/0x110 > ? lookup_address_in_pgd+0x107/0x190 > shrink_page_list+0x1d5a/0x2fb0 > ? putback_lru_page+0x3f0/0x3f0 > ? save_trace+0x1e0/0x1e0 > ? _lookup_address_cpa.isra.13+0x40/0x60 > ? debug_show_all_locks+0x2f0/0x2f0 > ? kmem_cache_free+0x8c/0x280 > ? free_extent_state+0x1c8/0x3b0 > ? mark_lock+0x1b1/0xa00 > ? page_rmapping+0xd0/0xd0 > ? 
print_irqtrace_events+0x110/0x110 > ? shrink_node_memcg.constprop.88+0x4c9/0x5e0 > ? shrink_node+0x12d/0x260 > ? try_to_free_pages+0x418/0xaf0 > ? __alloc_pages_slowpath+0x976/0x1790 > ? __alloc_pages_nodemask+0x52c/0x5c0 > ? delete_node+0x28d/0x5c0 > ? find_held_lock+0x6d/0xd0 > ? free_pcppages_bulk+0x381/0x570 > ? lock_acquire+0x350/0x350 > ? do_raw_spin_unlock+0x147/0x220 > ? do_raw_spin_trylock+0x100/0x100 > ? __lock_is_held+0x51/0xc0 > ? _raw_spin_unlock+0x24/0x30 > ? free_pcppages_bulk+0x381/0x570 > ? mark_lock+0x1b1/0xa00 > ? free_compound_page+0x30/0x30 > ? print_irqtrace_events+0x110/0x110 > ? __kernel_map_pages+0x2c9/0x310 > ? mark_lock+0
ipset related DEBUG_VIRTUAL crash.
I have a script that hourly replaces an ipset list. This has been in place for a year or so, but last night it triggered this on 4.14-rc7 [455951.731181] kernel BUG at arch/x86/mm/physaddr.c:26! [455951.737016] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [455951.742525] CPU: 0 PID: 3850 Comm: ipset Not tainted 4.14.0-rc7-firewall+ #1 [455951.753293] task: 88013033cfc0 task.stack: 8801c3d48000 [455951.758567] RIP: 0010:__phys_addr+0x5b/0x80 [455951.763742] RSP: 0018:8801c3d4f528 EFLAGS: 00010287 [455951.768838] RAX: 7800849b62b6 RBX: 849b62b6 RCX: 9f072a5d [455951.773881] RDX: dc00 RSI: dc00 RDI: a06917e0 [455951.778844] RBP: 7800049b62b6 R08: 0002 R09: [455951.783729] R10: R11: R12: 9fca8b05 [455951.788524] R13: 8801ce844268 R14: 049b62b6 R15: 8801ce8442ea [455951.793239] FS: 7fb44e656c80() GS:8801d320() knlGS: [455951.797904] CS: 0010 DS: ES: CR0: 80050033 [455951.802479] CR2: 7ffeeafd70a8 CR3: 0001b6cd2001 CR4: 000606f0 [455951.806998] Call Trace: [455951.811404] kfree+0x4c/0x310 [455951.815714] hash_ip4_ahash_destroy+0x85/0xd0 [455951.819944] hash_ip4_destroy+0x64/0x90 [455951.824069] ip_set_destroy+0x4f0/0x500 [455951.828098] ? ip_set_destroy+0x5/0x500 [455951.832029] ? __rcu_read_unlock+0xd3/0x190 [455951.835867] ? ip_set_utest+0x560/0x560 [455951.839610] ? ip_set_utest+0x560/0x560 [455951.843239] nfnetlink_rcv_msg+0x73e/0x770 [455951.846780] ? nfnetlink_rcv_msg+0x352/0x770 [455951.850229] ? nfnetlink_rcv+0xe90/0xe90 [455951.853571] ? native_sched_clock+0xe8/0x190 [455951.856822] ? lock_release+0x5d3/0x7d0 [455951.859976] netlink_rcv_skb+0x121/0x230 [455951.863037] ? nfnetlink_rcv+0xe90/0xe90 [455951.865999] ? netlink_ack+0x4c0/0x4c0 [455951.868866] ? ns_capable_common+0x68/0xc0 [455951.871638] nfnetlink_rcv+0x1ad/0xe90 [455951.874312] ? lock_acquire+0x380/0x380 [455951.876891] ? __rcu_read_unlock+0xd3/0x190 [455951.879378] ? __rcu_read_lock+0x30/0x30 [455951.881764] ? rcu_is_watching+0xa4/0xf0 [455951.884048] ? netlink_connect+0x1e0/0x1e0 [455951.886236] ? nfnl_err_reset+0x180/0x180 [455951.888329] ? netlink_deliver_tap+0x128/0x560 [455951.890333] ? netlink_deliver_tap+0x5/0x560 [455951.892229] ? iov_iter_advance+0x172/0x7f0 [455951.894029] ? netlink_getname+0x150/0x150 [455951.895736] ? can_nice.part.77+0x20/0x20 [455951.897342] ? iov_iter_copy_from_user_atomic+0x7d0/0x7d0 [455951.898877] ? netlink_trim+0x111/0x1b0 [455951.900394] ? netlink_skb_destructor+0xf0/0xf0 [455951.901908] netlink_unicast+0x2b1/0x340 [455951.903397] ? netlink_detachskb+0x30/0x30 [455951.904862] ? lock_acquire+0x380/0x380 [455951.906299] ? lockdep_rcu_suspicious+0x100/0x100 [455951.907729] netlink_sendmsg+0x4f2/0x650 [455951.909141] ? netlink_broadcast_filtered+0x9e0/0x9e0 [455951.910565] ? _copy_from_user+0x86/0xc0 [455951.911964] ? netlink_broadcast_filtered+0x9e0/0x9e0 [455951.913364] SYSC_sendto+0x2f0/0x3c0 [455951.914741] ? SYSC_connect+0x210/0x210 [455951.916111] ? bad_area_access_error+0x230/0x230 [455951.917479] ? ___sys_recvmsg+0x320/0x320 [455951.918811] ? sock_wake_async+0xc0/0xc0 [455951.920112] ? SyS_brk+0x3ae/0x3d0 [455951.921381] ? prepare_exit_to_usermode+0xde/0x230 [455951.922642] ? enter_from_user_mode+0x30/0x30 [455951.923913] ? mark_held_locks+0x1b/0xa0 [455951.925179] ? entry_SYSCALL_64_fastpath+0x5/0xad [455951.926459] ? trace_hardirqs_on_caller+0x185/0x260 [455951.927747] ? 
trace_hardirqs_on_thunk+0x1a/0x1c [455951.929031] entry_SYSCALL_64_fastpath+0x18/0xad [455951.930314] RIP: 0033:0x7fb44df4ac53 [455951.931592] RSP: 002b:7ffeeafb6a08 EFLAGS: 0246 [455951.932914] ORIG_RAX: 002c [455951.934231] RAX: ffda RBX: 55b8f35d26d0 RCX: 7fb44df4ac53 [455951.935603] RDX: 002c RSI: 55b8f35d14b8 RDI: 0003 [455951.936991] RBP: 55b8f35cf010 R08: 7fb44dc5dbe0 R09: 000c [455951.938387] R10: R11: 0246 R12: 7fb44e43b020 [455951.939795] R13: 7ffeeafb6acc R14: R15: 55b8f1ca68e0 [455951.941208] Code: 80 48 39 eb 72 25 48 c7 c7 09 d6 a4 a0 e8 3e 28 2c 00 0f b6 0d 80 ab 9d 01 48 8d 45 00 48 d3 e8 48 85 c0 75 06 5b 48 89 e8 5d c3 <0f> 0b 48 c7 c7 10 c0 62 a0 e8 a7 2a 2c 00 48 8b 2d 60 95 5b 01 [455951.993251] RIP: __phys_addr+0x5b/0x80 RSP: 8801c3d4f528 [455982.040898] ---[ end trace dfb8a0f07b7c5316 ]--- [459428.674105] == [459428.679829] BUG: KASAN: use-after-free in __mutex_lock+0x26c/0xf30 [459428.685463] Read of size 4 at addr 88013033d020 by task ipset/4611 [459428.696474] CPU: 0 PID: 4611 Comm: ipset Tainted: G D 4.14.0-rc7-firewall+ #1 [459428.707271] Call Trace: [459428.712489]
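For context on the BUG itself: physaddr.c:26 is one of the DEBUG_VIRTUAL
sanity checks in __phys_addr(), which catch an address outside the linear
map (a vmalloc address, for instance) being converted to a physical
address -- here on the kfree() path. A generic sketch of the allocation/free
pairing rule, with made-up names and not necessarily the actual ipset fix:

#include <linux/mm.h>
#include <linux/slab.h>

/* Memory that may have come from vmalloc() must never be handed to
 * kfree(); kvmalloc()/kvfree() keep the allocation and the free
 * consistent regardless of which allocator was used underneath.
 */
static void *table_alloc(size_t size)
{
	return kvmalloc(size, GFP_KERNEL);
}

static void table_free(void *table)
{
	kvfree(table);	/* correct for both kmalloc and vmalloc memory */
}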
Re: [4.14rc6] __tcp_select_window divide by zero.
On Tue, Oct 24, 2017 at 09:00:30AM -0400, Dave Jones wrote:
> divide error: [#1] SMP KASAN
> CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1
> RIP: 0010:__tcp_select_window+0x21f/0x400
> Call Trace:
>  tcp_cleanup_rbuf+0x27d/0x2a0
>  tcp_recvmsg+0x7a9/0x1430
>  inet_recvmsg+0x10b/0x360
>  sock_read_iter+0x19d/0x240
>  do_iter_readv_writev+0x2e4/0x320
>  do_iter_read+0x149/0x280
>  vfs_readv+0x107/0x180
>  do_readv+0xc0/0x1b0
>  do_syscall_64+0x182/0x400
>  entry_SYSCALL64_slow_path+0x25/0x25
> Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 89 ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 29 d0 eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8
> RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418
>
> 	if (window <= free_space - mss || window > free_space)
> 		window = rounddown(free_space, mss);

I'm still hitting this fairly often, so I threw in a debug patch, and
when this happens..

[53182.361210] window: 0 free_space: 0 mss: 0

Any suggestions on what we should default the window size to be in this
situation to avoid the rounddown ?

	Dave
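To make the failure mode concrete: rounddown(x, y) expands to
x - (x % y), so the mss == 0 case in the debug output above divides by
zero on the spot. A hedged sketch of the kind of guard being asked about --
illustrative only, not the fix that was eventually merged:

#include <linux/kernel.h>
#include <linux/types.h>

/* Sketch: whatever default the window gets, it has to be chosen before
 * the rounddown() expression runs with mss == 0.
 */
static u32 select_window_sketch(u32 window, u32 free_space, u32 mss)
{
	if (mss == 0)
		return 0;	/* hypothetical safe default */

	if (window <= free_space - mss || window > free_space)
		window = rounddown(free_space, mss);

	return window;
}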
[net-next] tcp_delack_timer circular locking dependency
[ 105.316650] == [ 105.316818] WARNING: possible circular locking dependency detected [ 105.316986] 4.14.0-rc7-think+ #1 Not tainted [ 105.317108] -- [ 105.317273] swapper/2/0 is trying to acquire lock: [ 105.317407] ( [ 105.317476] slock-AF_INET6 [ 105.317564] ){+.-.} [ 105.317642] , at: [] tcp_delack_timer+0x26/0x130 [ 105.317807] but task is already holding lock: [ 105.317961] ( [ 105.318024] (timer) [ 105.318097] #5 [ 105.318168] ){+.-.} [ 105.318241] , at: [] call_timer_fn+0x5/0x5e0 [ 105.318393] which lock already depends on the new lock. [ 105.318594] the existing dependency chain (in reverse order) is: [ 105.318781] -> #1 [ 105.318879] ( [ 105.318939] (timer) [ 105.319009] #5 [ 105.319068] ){+.-.} [ 105.319137] : [ 105.319195]del_timer_sync+0x3c/0xb0 [ 105.319313]inet_csk_reqsk_queue_drop+0x26c/0x4e0 [ 105.319459]inet_csk_complete_hashdance+0x1e/0x90 [ 105.319598]tcp_check_req+0x787/0x9a0 [ 105.319716]tcp_v6_rcv+0x914/0x1060 [ 105.319828]ip6_input_finish+0x291/0xba0 [ 105.319950]ip6_input+0xb2/0x380 [ 105.320059]ip6_rcv_finish+0x103/0x350 [ 105.320180]ipv6_rcv+0x93f/0xff0 [ 105.320291]__netif_receive_skb_core+0x13ef/0x1900 [ 105.320436]netif_receive_skb_internal+0xea/0x4c0 [ 105.320579]napi_gro_receive+0x28e/0x320 [ 105.320705]e1000_clean_rx_irq+0x3e9/0x6f0 [ 105.320838]e1000e_poll+0x14e/0x570 [ 105.320954]net_rx_action+0x4db/0xc80 [ 105.321075]__do_softirq+0x1ca/0x7bf [ 105.321194]irq_exit+0x104/0x110 [ 105.321303]do_IRQ+0xb2/0x130 [ 105.321407]ret_from_intr+0x0/0x19 [ 105.321523]cpuidle_enter_state+0x223/0x5b0 [ 105.321655]do_idle+0x110/0x1b0 [ 105.321766]cpu_startup_entry+0xdb/0xe0 [ 105.321891]start_secondary+0x2e9/0x360 [ 105.322014]verify_cpu+0x0/0xf1 [ 105.322121] -> #0 [ 105.322215] ( [ 105.322276] slock-AF_INET6 [ 105.322359] ){+.-.} [ 105.322428] : [ 105.322487]lock_acquire+0x12e/0x350 [ 105.322602]_raw_spin_lock+0x30/0x70 [ 105.322722]tcp_delack_timer+0x26/0x130 [ 105.322846]call_timer_fn+0x188/0x5e0 [ 105.322966]__run_timers+0x54d/0x670 [ 105.323084]run_timer_softirq+0x2a/0x50 [ 105.323208]__do_softirq+0x1ca/0x7bf [ 105.323325]irq_exit+0x104/0x110 [ 105.323435]smp_apic_timer_interrupt+0x14b/0x510 [ 105.323576]apic_timer_interrupt+0x9a/0xa0 [ 105.323705]cpuidle_enter_state+0x223/0x5b0 [ 105.323836]do_idle+0x110/0x1b0 [ 105.323944]cpu_startup_entry+0xdb/0xe0 [ 105.324067]start_secondary+0x2e9/0x360 [ 105.324189]verify_cpu+0x0/0xf1 [ 105.324295] other info that might help us debug this: [ 105.324489] Possible unsafe locking scenario: [ 105.324644]CPU0CPU1 [ 105.324767] [ 105.324890] lock( [ 105.324963] (timer) [ 105.325033] #5 [ 105.325093] ); [ 105.325152]lock( [ 105.325278] slock-AF_INET6 [ 105.325360] ); [ 105.325419]lock( [ 105.325544] (timer) [ 105.325612] #5 [ 105.325670] ); [ 105.325729] lock( [ 105.325797] slock-AF_INET6 [ 105.325879] ); [ 105.325938] *** DEADLOCK *** [ 105.326086] 1 lock held by swapper/2/0: [ 105.326193] #0: [ 105.326257] ( [ 105.331697] (timer) [ 105.337038] #5 [ 105.342339] ){+.-.} [ 105.347620] , at: [] call_timer_fn+0x5/0x5e0 [ 105.353021] stack backtrace: [ 105.363515] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.0-rc7-think+ #1 [ 105.368886] Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015 [ 105.374330] Call Trace: [ 105.379697] [ 105.384997] dump_stack+0xbc/0x145 [ 105.390339] ? dma_virt_map_sg+0xfb/0xfb [ 105.395733] ? call_timer_fn+0x5/0x5e0 [ 105.401076] ? print_lock+0x54/0x68 [ 105.406344] print_circular_bug.isra.42+0x283/0x2bc [ 105.411695] ? print_circular_bug_header+0xda/0xda [ 105.417054] ? 
graph_lock+0x8d/0x100 [ 105.422419] ? check_noncircular+0x20/0x20 [ 105.427857] ? sched_clock_cpu+0x14/0xf0 [ 105.433309] __lock_acquire+0x1f4a/0x2050 [ 105.438725] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.444160] ? __lock_acquire+0x6b3/0x2050 [ 105.449580] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.455015] ? sched_clock_cpu+0x14/0xf0 [ 105.460514] ? __lock_acquire+0x6b3/0x2050 [ 105.465984] ? cyc2ns_read_end+0x10/0x10 [ 105.471395] ? debug_check_no_locks_freed+0x1a0/0x1a0 [ 105.476934] ? mark_lock+0x16f/0x9b0 [ 105.482507] ? print_irqtrace_events+0x110/0x110 [ 105.488150] ?
[4.14rc6] __tcp_select_window divide by zero.
divide error: [#1] SMP KASAN CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1 task: 8803c0d08040 task.stack: 8803df548000 RIP: 0010:__tcp_select_window+0x21f/0x400 RSP: 0018:8803df54f418 EFLAGS: 00010246 RAX: RBX: 880458fd3140 RCX: 82120ea5 RDX: RSI: dc00 RDI: 880458fd3a88 RBP: R08: 0001 R09: R10: R11: R12: 00098968 R13: 11007bea9e87 R14: R15: FS: 7f76da1db700() GS:88046ae0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0003f67cd002 CR4: 001606f0 DR0: 7f76d819f000 DR1: 7f75a29f5000 DR2: DR3: DR6: 0ff0 DR7: 0600 Call Trace: ? tcp_schedule_loss_probe+0x270/0x270 ? lock_acquire+0x12e/0x350 ? tcp_recvmsg+0x124/0x1430 ? lock_release+0x890/0x890 ? do_raw_spin_trylock+0x100/0x100 ? do_raw_spin_trylock+0x40/0x100 tcp_cleanup_rbuf+0x27d/0x2a0 ? tcp_recv_skb+0x180/0x180 ? mark_held_locks+0x70/0xa0 ? __local_bh_enable_ip+0x60/0x90 tcp_recvmsg+0x7a9/0x1430 ? tcp_recv_timestamp+0x250/0x250 ? __free_insn_slot+0x390/0x390 ? rcu_is_watching+0x88/0xd0 ? entry_SYSCALL64_slow_path+0x25/0x25 ? is_bpf_text_address+0x86/0xf0 ? kernel_text_address+0xec/0x100 ? __kernel_text_address+0xe/0x30 ? unwind_get_return_address+0x2f/0x50 ? __save_stack_trace+0x92/0x100 ? memcmp+0x45/0x70 ? match_held_lock+0x93/0x410 ? save_trace+0x1c0/0x1c0 ? save_stack+0x89/0xb0 ? save_stack+0x32/0xb0 ? kasan_kmalloc+0xa0/0xd0 ? native_sched_clock+0xf9/0x1a0 ? rw_copy_check_uvector+0x15e/0x180 inet_recvmsg+0x10b/0x360 ? inet_create+0x770/0x770 ? sched_clock_cpu+0x14/0xf0 ? sched_clock_cpu+0x14/0xf0 sock_read_iter+0x19d/0x240 ? sock_recvmsg+0x60/0x60 do_iter_readv_writev+0x2e4/0x320 ? vfs_dedupe_file_range+0x3e0/0x3e0 do_iter_read+0x149/0x280 vfs_readv+0x107/0x180 ? compat_rw_copy_check_uvector+0x1d0/0x1d0 ? fget_raw+0x10/0x10 ? __lock_is_held+0x2e/0xd0 ? do_preadv+0xf0/0xf0 ? __fdget_pos+0x82/0x110 ? __fdget_raw+0x10/0x10 ? do_readv+0xc0/0x1b0 do_readv+0xc0/0x1b0 ? vfs_readv+0x180/0x180 ? mark_held_locks+0x1b/0xa0 ? do_syscall_64+0xae/0x400 ? do_preadv+0xf0/0xf0 do_syscall_64+0x182/0x400 ? syscall_return_slowpath+0x270/0x270 ? rcu_read_lock_sched_held+0x90/0xa0 ? __context_tracking_exit.part.4+0x223/0x290 ? mark_held_locks+0x1b/0xa0 ? return_from_SYSCALL_64+0x2d/0x7a ? trace_hardirqs_on_caller+0x17a/0x250 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f76d9b05219 RSP: 002b:7ffd41fd30d8 EFLAGS: 0246 ORIG_RAX: 0013 RAX: ffda RBX: 0013 RCX: 7f76d9b05219 RDX: 0016 RSI: 5611ca731c70 RDI: 0179 RBP: 7ffd41fd3180 R08: 00a07395 R09: 000a10d65a68 R10: 0001 R11: 0246 R12: 0002 R13: 7f76da180058 R14: 7f76da1db698 R15: 7f76da18 Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 89 ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 29 d0 eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8 RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418 window = rounddown(free_space, mss); 45ec: 89 e8 mov%ebp,%eax 45ee: 99 cltd 45ef: 41 f7 feidiv %r14d 45f2: 89 e8 mov%ebp,%eax 45f4: 29 d0 sub%edx,%eax 45f6: eb 8c jmp4584 <__tcp_select_window+0x1b4> 45f8: 41 f7 dfneg%r15d
Stuck TX Using Iperf
Hello Cavium Ethernet Driver Maintainers,

I'm working on a custom board using a Cavium OcteonTX CN80XX CPU running a
mainline 4.12.7 kernel, and I've run into a problem where the TX of my BGX0
configured for SGMII becomes stuck.

After boot I'm able to bring the interface up, run dhclient, and ping a
host machine with no issues. However, every time that I try to run an iperf
TCP test I get a stuck TX queue. Bringing the interface down and then up
does not resolve the problem, but physically reconnecting the cable
connected to the interface does. Also, after I reconnect the cable I am no
longer able to reproduce the stuck TX queue until reboot.

I receive no kernel driver messages of any sort during my iperf test; it
just stalls until I kill it.

Any help would be appreciated.

Regards,

Robert Jones - Software Engineer
Gateworks Corporation
BUG_ON(sg->sg_magic != SG_MAGIC) on tls socket.
kernel BUG at ./include/linux/scatterlist.h:189! invalid opcode: [#1] SMP KASAN CPU: 3 PID: 20890 Comm: trinity-c51 Not tainted 4.13.0-rc4-think+ #5 task: 88036e3d1cc0 task.stack: 88033e9d8000 RIP: 0010:tls_push_record+0x675/0x680 RSP: 0018:88033e9df630 EFLAGS: 00010287 RAX: RBX: 8802ee3b8968 RCX: 82226754 RDX: dc00 RSI: dc00 RDI: 8802ee3b8c10 RBP: 88033e9df6d0 R08: R09: ed005d107004 R10: 0004 R11: ed005d107003 R12: 880341b30668 R13: 8802ee3b8c10 R14: 8802ee3b8c38 R15: 87654321 FS: 7f465ced2700() GS:88046b60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0029cbf021f8 CR3: 00045abd CR4: 001406e0 DR0: 7f8f17bd2000 DR1: 7f8c27d0f000 DR2: DR3: DR6: 0ff0 DR7: 0600 Call Trace: ? copy_page_to_iter+0x6c0/0x6c0 tls_sw_sendmsg+0x6d8/0x9c0 ? alloc_sg+0x510/0x510 ? cyc2ns_read_end+0x10/0x10 ? import_iovec+0xa8/0x1f0 ? do_syscall_64+0x1bc/0x3e0 ? entry_SYSCALL64_slow_path+0x25/0x25 inet_sendmsg+0xce/0x310 ? inet_recvmsg+0x3a0/0x3a0 ? inet_recvmsg+0x3a0/0x3a0 sock_write_iter+0x1b0/0x280 ? kernel_sendmsg+0x70/0x70 ? __might_sleep+0x72/0xe0 do_iter_readv_writev+0x29a/0x370 ? vfs_dedupe_file_range+0x3f0/0x3f0 ? rw_verify_area+0x65/0x150 do_iter_write+0xd7/0x2a0 ? __hrtimer_run_queues+0x980/0x980 vfs_writev+0x142/0x220 ? __fget_light+0x1ae/0x230 ? vfs_iter_write+0x70/0x70 ? syscall_exit_register+0x3f0/0x3f0 ? rcutorture_record_progress+0x20/0x20 ? __fdget_pos+0x88/0x120 ? __fdget_raw+0x20/0x20 do_writev+0xd2/0x1c0 ? do_writev+0xd2/0x1c0 ? vfs_writev+0x220/0x220 ? mark_held_locks+0x23/0xb0 ? do_syscall_64+0xc0/0x3e0 ? SyS_readv+0x20/0x20 SyS_writev+0x10/0x20 do_syscall_64+0x1bc/0x3e0 ? syscall_return_slowpath+0x240/0x240 ? __context_tracking_exit.part.5+0x23d/0x2a0 ? cpumask_check.part.2+0x10/0x10 ? mark_held_locks+0x23/0xb0 ? return_from_SYSCALL_64+0x2d/0x7a ? trace_hardirqs_on_caller+0x182/0x260 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f465c7fd219 RSP: 002b:7ffda332a238 EFLAGS: 0246 ORIG_RAX: 0014 RAX: ffda RBX: 0014 RCX: 7f465c7fd219 RDX: 0047 RSI: 0029cbef1b50 RDI: 0137 RBP: 7ffda332a2e0 R08: 0100 R09: fff8 R10: fff9 R11: 0246 R12: 0002 R13: 7f465cd66058 R14: 7f465ced2698 R15: 7f465cd66000 Code: 8d bb 58 04 00 00 e8 3b d5 20 ff 48 8b 83 58 04 00 00 f0 80 48 08 04 48 83 c4 78 44 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55 ba 17 00 00 RIP: tls_push_record+0x675/0x680 RSP: 88033e9df630 186 static inline void sg_mark_end(struct scatterlist *sg) 187 { 188 #ifdef CONFIG_DEBUG_SG 189 BUG_ON(sg->sg_magic != SG_MAGIC); 190 #endif
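For readers unfamiliar with the check being hit: sg_magic is only written
by the scatterlist init helpers, so under CONFIG_DEBUG_SG the BUG_ON quoted
above fires for any entry that was never initialized or whose memory has
been overwritten. A generic usage sketch of the expected init pattern --
this is not the tls_sw code:

#include <linux/kernel.h>
#include <linux/scatterlist.h>

static void sg_init_sketch(void *buf, unsigned int len)
{
	struct scatterlist sg[2];

	sg_init_table(sg, ARRAY_SIZE(sg));	/* sets SG_MAGIC on each entry */
	sg_set_buf(&sg[0], buf, len);
	sg_mark_end(&sg[1]);			/* safe: the entry was initialized */
}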
KASAN: slab-out-of-bounds from net_namespace.c:ops_init
== BUG: KASAN: slab-out-of-bounds in ops_init+0x201/0x330 Write of size 8 at addr 88045744c448 by task trinity-c4/1499 CPU: 2 PID: 1499 Comm: trinity-c4 Not tainted 4.13.0-rc4-think+ #5 Call Trace: dump_stack+0xc5/0x151 ? dma_virt_map_sg+0xff/0xff ? show_regs_print_info+0x41/0x41 print_address_description+0xd9/0x260 kasan_report+0x27a/0x370 ? ops_init+0x201/0x330 __asan_store8+0x57/0x90 ops_init+0x201/0x330 ? net_alloc_generic+0x50/0x50 ? __raw_spin_lock_init+0x21/0x80 ? trace_hardirqs_on_caller+0x182/0x260 ? lockdep_init_map+0xb2/0x2b0 setup_net+0x208/0x400 ? ops_init+0x330/0x330 ? copy_net_ns+0x151/0x390 ? can_nice.part.81+0x20/0x20 ? rcu_is_watching+0x8d/0xd0 ? __lock_is_held+0x30/0xd0 ? rcutorture_record_progress+0x20/0x20 ? copy_net_ns+0x151/0x390 copy_net_ns+0x200/0x390 ? net_drop_ns+0x20/0x20 ? do_mount+0x19d0/0x19d0 ? create_new_namespaces+0x97/0x450 ? rcu_read_lock_sched_held+0x96/0xa0 ? kmem_cache_alloc+0x28a/0x2f0 create_new_namespaces+0x317/0x450 ? sys_ni_syscall+0x20/0x20 ? cap_capable+0x7f/0xf0 unshare_nsproxy_namespaces+0x77/0xf0 SyS_unshare+0x573/0xbb0 ? walk_process_tree+0x2a0/0x2a0 ? lock_release+0x920/0x920 ? lock_release+0x920/0x920 ? mntput_no_expire+0x117/0x620 ? rcu_is_watching+0x8d/0xd0 ? exit_to_usermode_loop+0x1b0/0x1b0 ? rcu_read_lock_sched_held+0x96/0xa0 ? __context_tracking_exit.part.5+0x23d/0x2a0 ? cpumask_check.part.2+0x10/0x10 ? context_tracking_user_exit+0x30/0x30 ? __f_unlock_pos+0x15/0x20 ? SyS_read+0x146/0x160 ? do_syscall_64+0xc0/0x3e0 ? walk_process_tree+0x2a0/0x2a0 do_syscall_64+0x1bc/0x3e0 ? syscall_return_slowpath+0x240/0x240 ? mark_held_locks+0x23/0xb0 ? return_from_SYSCALL_64+0x2d/0x7a ? trace_hardirqs_on_caller+0x182/0x260 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f9e1c454219 RSP: 002b:7fff180f9c88 EFLAGS: 0246 ORIG_RAX: 0110 RAX: ffda RBX: 0110 RCX: 7f9e1c454219 RDX: 00c4 RSI: 800ff000 RDI: 74060700 RBP: 7fff180f9d30 R08: 0002 R09: 2fa420810090095e R10: 880ffb40 R11: 0246 R12: 0002 R13: 7f9e1cb06058 R14: 7f9e1cb29698 R15: 7f9e1cb06000 Allocated by task 1499: save_stack_trace+0x1b/0x20 save_stack+0x43/0xd0 kasan_kmalloc+0xad/0xe0 __kmalloc+0x14b/0x370 net_alloc_generic+0x25/0x50 copy_net_ns+0x130/0x390 create_new_namespaces+0x317/0x450 unshare_nsproxy_namespaces+0x77/0xf0 SyS_unshare+0x573/0xbb0 do_syscall_64+0x1bc/0x3e0 return_from_SYSCALL_64+0x0/0x7a Freed by task 504: save_stack_trace+0x1b/0x20 save_stack+0x43/0xd0 kasan_slab_free+0x72/0xc0 kfree+0xe1/0x2f0 rcu_process_callbacks+0x5a6/0x1dc0 __do_softirq+0x1e7/0x817 The buggy address belongs to the object at 88045744c3c8 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes to the right of 128-byte region [88045744c3c8, 88045744c448) The buggy address belongs to the page: page:ea00115d1300 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 flags: 0x80008100(slab|head) raw: 80008100 000100110011 raw: ea00113f2b20 ea0011328a20 880467c0f140 page dumped because: kasan: bad access detected Memory state around the buggy address: 88045744c300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 88045744c380: fc fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 >88045744c400: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc ^ 88045744c480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 88045744c500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ==
Re: sctp refcount bug.
On Thu, Jul 13, 2017 at 11:38:34AM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Jul 13, 2017 at 10:36:39AM -0400, Dave Jones wrote:
> > Hit this on Linus' current tree.
> >
> >
> > refcount_t: underflow; use-after-free.
>
> Any tips on how to reproduce this?

Only seen it once so far. Will see if I can narrow it down if it
reproduces. It took ~12 hours of fuzzing to find overnight.

	Dave
sctp refcount bug.
Hit this on Linus' current tree. refcount_t: underflow; use-after-free. [ cut here ] WARNING: CPU: 2 PID: 14455 at lib/refcount.c:186 refcount_sub_and_test+0x45/0x50 CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G D 4.12.0-think+ #11 task: 8804fc71b8c0 task.stack: c90002328000 RIP: 0010:refcount_sub_and_test+0x45/0x50 RSP: 0018:c9000232ba58 EFLAGS: 00010282 RAX: 0026 RBX: 88001db1d1c0 RCX: RDX: RSI: 88050a3ccca8 RDI: 88050a3ccca8 RBP: c9000232ba58 R08: R09: 0001 R10: c9000232ba88 R11: R12: 88000d3f9b40 R13: 880456948008 R14: 880456948870 R15: c9000232bd10 FS: 7f79b1032700() GS:88050a20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 008436726348 CR3: 00022cc87000 CR4: 001406e0 DR0: 7f731068f000 DR1: 7f2d83eb9000 DR2: 7f302340e000 DR3: DR6: 0ff0 DR7: 0600 Call Trace: sctp_wfree+0x5d/0x190 [sctp] skb_release_head_state+0x64/0xc0 skb_release_all+0x12/0x30 consume_skb+0x50/0x170 sctp_chunk_put+0x59/0x80 [sctp] sctp_chunk_free+0x26/0x30 [sctp] __sctp_outq_teardown+0x1d8/0x270 [sctp] sctp_outq_free+0xe/0x10 [sctp] sctp_association_free+0x92/0x220 [sctp] sctp_do_sm+0x12a6/0x1920 [sctp] ? __get_user_4+0x18/0x20 ? no_context+0x3f/0x360 ? lock_acquire+0xe7/0x1e0 ? skb_dequeue+0x1d/0x70 sctp_primitive_SHUTDOWN+0x33/0x40 [sctp] sctp_close+0x26e/0x2a0 [sctp] inet_release+0x3c/0x60 sock_release+0x1f/0x80 sock_close+0x12/0x20 __fput+0xf8/0x200 fput+0xe/0x10 task_work_run+0x85/0xc0 exit_to_usermode_loop+0xa8/0xb0 do_syscall_64+0x151/0x190 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f79b095b1e9 RSP: 002b:7ffc5eca3088 EFLAGS: 0246 ORIG_RAX: 0120 RAX: fff2 RBX: 0120 RCX: 7f79b095b1e9 RDX: 006e RSI: 008436738120 RDI: 0130 RBP: 7ffc5eca3130 R08: R09: 0ff0 R10: 00080800 R11: 0246 R12: 0002 R13: 7f79b0ee9058 R14: 7f79b1032698 R15: 7f79b0ee9000 Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d ce 95 bc 00 00 75 f4 55 48 c7 c7 00 d9 ee 81 48 89 e5 c6 05 ba 95 bc 00 01 e8 fc 2c c0 ff <0f> ff 31 c0 5d c3 0f 1f 44 00 00 55 48 89 fe bf 01 00 00 00 48 ---[ end trace 19b7bd878c0f56fd ]--- [ cut here ] WARNING: CPU: 2 PID: 14455 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x1b8/0x1f0 CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G D W 4.12.0-think+ #11 task: 8804fc71b8c0 task.stack: c90002328000 RIP: 0010:inet_sock_destruct+0x1b8/0x1f0 RSP: 0018:c9000232bcf8 EFLAGS: 00010286 RAX: RBX: 88000d3f9b40 RCX: RDX: fd00 RSI: 0300 RDI: 88000d3f9ca8 RBP: c9000232bd08 R08: R09: R10: R11: R12: 88000d3f9ca8 R13: 88000d3f9b40 R14: 88000d3f9bc8 R15: 8801836e21d0 FS: 7f79b1032700() GS:88050a20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 3b9ab732 CR3: 00022cc87000 CR4: 001406e0 DR0: 7f731068f000 DR1: 7f2d83eb9000 DR2: 7f302340e000 DR3: DR6: 0ff0 DR7: 0600 Call Trace: sctp_destruct_sock+0x25/0x30 [sctp] __sk_destruct+0x28/0x230 sk_destruct+0x20/0x30 __sk_free+0x43/0xa0 sk_free+0x25/0x30 sctp_close+0x218/0x2a0 [sctp] inet_release+0x3c/0x60 sock_release+0x1f/0x80 sock_close+0x12/0x20 __fput+0xf8/0x200 fput+0xe/0x10 task_work_run+0x85/0xc0 exit_to_usermode_loop+0xa8/0xb0 do_syscall_64+0x151/0x190 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f79b095b1e9 RSP: 002b:7ffc5eca3088 EFLAGS: 0246 ORIG_RAX: 0120 RAX: fff2 RBX: 0120 RCX: 7f79b095b1e9 RDX: 006e RSI: 008436738120 RDI: 0130 RBP: 7ffc5eca3130 R08: R09: 0ff0 R10: 00080800 R11: 0246 R12: 0002 R13: 7f79b0ee9058 R14: 7f79b1032698 R15: 7f79b0ee9000 Code: df e8 bd 5f f4 ff e9 07 ff ff ff 0f ff 8b 83 8c 02 00 00 85 c0 0f 84 2d ff ff ff 0f ff 8b 93 88 02 00 00 85 d2 0f 84 2b ff ff ff <0f> ff 8b 83 40 02 00 00 85 c0 0f 84 29 ff ff ff 0f ff e9 22 ff ---[ end trace 19b7bd878c0f56fe ]--- [ cut here ] 
WARNING: CPU: 2 PID: 14455 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c8/0x1f0 CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G D W 4.12.0-think+ #11 task: 8804fc71b8c0 task.stack: c90002328000 RIP: 0010:inet_sock_destruct+0x1c8/0x1f0 RSP: 0018:c9000232bcf8 EFLAGS: 00010206 RAX: 0300 RBX: 88000d3f9b40 RCX: RDX: fd00 RSI: 0300 RDI:
netconsole refcount warning
The new refcount debugging code spews this twice during boot on my router.. refcount_t: increment on 0; use-after-free. [ cut here ] WARNING: CPU: 1 PID: 17 at lib/refcount.c:152 refcount_inc+0x2b/0x30 CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.12.0-firewall+ #8 task: 8801d4441ac0 task.stack: 8801d445 RIP: 0010:refcount_inc+0x2b/0x30 RSP: 0018:8801d4456da8 EFLAGS: 00010046 RAX: 002c RBX: 8801d4c3cf40 RCX: RDX: 002c RSI: 0003 RDI: ed003a88adab RBP: 8801d4456da8 R08: 0003 R09: fbfff4afcb57 R10: R11: fbfff4afcb58 R12: 8801d4c3c540 R13: 0082 R14: 8801ce9c7ff8 R15: 8801ce9c8aa0 FS: () GS:8801d6a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fa2b803156e CR3: 0001c405d000 CR4: 000406e0 Call Trace: zap_completion_queue+0xad/0x1a0 netpoll_poll_dev+0x16f/0x3f0 netpoll_send_skb_on_dev+0x25a/0x360 netpoll_send_udp+0x526/0x850 write_ext_msg+0x212/0x230 ? _raw_spin_unlock_irqrestore+0x43/0x70 ? write_msg+0x11f/0x130 console_unlock+0x3ea/0x6e0 vprintk_emit+0x298/0x3a0 vprintk_default+0x1f/0x30 vprintk_func+0x34/0xb0 printk+0x95/0xb2 ? show_regs_print_info+0x45/0x45 ? nf_log_buf_open+0x2c/0x70 ? nf_log_buf_close+0x26/0x70 nf_log_buf_close+0x3c/0x70 nf_log_ip_packet+0x111/0x250 nf_log_packet+0x19e/0x330 ? nf_logger_find_get+0x1c0/0x1c0 ? debug_show_all_locks+0x1e0/0x1e0 ? __local_bh_enable_ip+0x64/0xb0 ? debug_smp_processor_id+0x17/0x20 log_tg+0x13d/0x170 ? log_tg_check+0x70/0x70 ? trace_hardirqs_on+0xe/0x10 ? __local_bh_enable_ip+0x64/0xb0 ? _raw_spin_unlock_bh+0x35/0x40 ipt_do_table+0x770/0xbb0 ? mark_lock+0xb7/0x7d0 ? sched_clock_cpu+0x1c/0x130 ? ipt_alloc_initial_table+0x2d0/0x2d0 ? debug_smp_processor_id+0x17/0x20 ? __lock_is_held+0x55/0x110 ? ipt_unregister_table+0x50/0x50 iptable_filter_hook+0x53/0xd0 nf_hook_slow+0x4a/0x120 ip_local_deliver+0x1ba/0x2c0 ? ip_local_deliver+0x100/0x2c0 ? ip_call_ra_chain+0x270/0x270 ? inet_del_offload+0x40/0x40 ip_rcv_finish+0x2b9/0x880 ip_rcv+0x51f/0x8a0 ? ip_rcv+0x5ae/0x8a0 ? ip_local_deliver+0x2c0/0x2c0 ? ip_local_deliver_finish+0x4d0/0x4d0 ? ip_local_deliver+0x2c0/0x2c0 __netif_receive_skb_core+0xd4b/0x1210 ? enqueue_to_backlog+0x620/0x620 ? ktime_get_with_offset+0x11d/0x290 __netif_receive_skb+0x27/0xc0 ? debug_smp_processor_id+0x17/0x20 netif_receive_skb_internal+0x3e3/0xc90 ? netif_receive_skb_internal+0x90/0xc90 ? __build_skb+0x2f/0x140 ? __dev_queue_xmit+0xd30/0xd30 ? debug_dma_sync_single_for_device+0xb7/0xc0 ? debug_dma_sync_single_for_cpu+0xc0/0xc0 ? dev_gro_receive+0x90/0x9b0 ? __lock_is_held+0x30/0x110 ? __asan_loadN+0x10/0x20 ? skb_gro_reset_offset+0x93/0x140 napi_gro_receive+0x1d1/0x270 rtl8169_poll+0x49b/0xb30 net_rx_action+0x4c4/0x7d0 ? napi_complete_done+0x1b0/0x1b0 ? __lock_is_held+0x30/0x110 __do_softirq+0x113/0x611 run_ksoftirqd+0x22/0x90 smpboot_thread_fn+0x348/0x4f0 ? __local_bh_enable_ip+0xb0/0xb0 ? sort_range+0x30/0x30 ? schedule+0x6c/0xe0 ? __kthread_parkme+0xf2/0x110 kthread+0x1ab/0x200 ? sort_range+0x30/0x30 ? __kthread_create_on_node+0x340/0x340 ret_from_fork+0x27/0x40 Code: 55 48 89 e5 e8 97 ff ff ff 84 c0 74 02 5d c3 80 3d 5d 3e 06 01 00 75 f5 48 c7 c7 20 69 f1 a4 c6 05 4d 3e 06 01 01 e8 ca 41 bc ff <0f> ff 5d c3 90 55 48 89 e5 41 54 44 8d 27 48 8d 3e 53 48 8d 1e ---[ end trace a9116b75ea217b54 ]---
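For context on the message itself: refcount_t treats an increment from zero
as a use-after-free, because a zero count means the object is already being
torn down. A generic sketch of the distinction, with made-up names -- this
is not the netpoll/netconsole code:

#include <linux/refcount.h>

struct obj {
	refcount_t ref;
};

static struct obj *obj_get(struct obj *o)
{
	/* refcount_inc() warns exactly as above when ref is already 0;
	 * lookups that can race with the final put use the _not_zero
	 * variant and treat failure as "object is going away".
	 */
	if (!refcount_inc_not_zero(&o->ref))
		return NULL;
	return o;
}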
Re: [PATCH v2 1/3] mfd: max8998: Remove CONFIG_OF around max8998_dt_match
On Tue, 11 Apr 2017, Florian Fainelli wrote:

> A subsequent patch is going to make of_match_node() an inline stub when
> CONFIG_OF is disabled which will properly take care of having the compiler
> eliminate the variable. To avoid more #ifdef/#else, just always make the
> match table available.
>
> Signed-off-by: Florian Fainelli <f.faine...@gmail.com>
> ---
>  drivers/mfd/max8998.c | 2 --
>  1 file changed, 2 deletions(-)

If it works, great!

For my own reference:

Acked-for-MFD-by: Lee Jones <lee.jo...@linaro.org>

> diff --git a/drivers/mfd/max8998.c b/drivers/mfd/max8998.c
> index 4c33b8063bc3..372f681ec1bb 100644
> --- a/drivers/mfd/max8998.c
> +++ b/drivers/mfd/max8998.c
> @@ -129,14 +129,12 @@ int max8998_update_reg(struct i2c_client *i2c, u8 reg, u8 val, u8 mask)
>  }
>  EXPORT_SYMBOL(max8998_update_reg);
>
> -#ifdef CONFIG_OF
>  static const struct of_device_id max8998_dt_match[] = {
>  	{ .compatible = "maxim,max8998", .data = (void *)TYPE_MAX8998 },
>  	{ .compatible = "national,lp3974", .data = (void *)TYPE_LP3974 },
>  	{ .compatible = "ti,lp3974", .data = (void *)TYPE_LP3974 },
>  	{},
>  };
> -#endif
>
>  /*
>   * Only the common platform data elements for max8998 are parsed here from
>   * the

--
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog
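A sketch of the mechanism the commit message refers to (not necessarily the
exact upstream stub): with CONFIG_OF disabled, of_match_node() becomes a
static inline that always fails, so callers still compile and the
now-trivially-dead reference lets the compiler drop the match table instead
of needing an #ifdef around it.

#include <linux/mod_devicetable.h>	/* struct of_device_id */

struct device_node;

#ifndef CONFIG_OF
/* Illustrative stub: always "no match" when there is no device tree. */
static inline const struct of_device_id *of_match_node(
		const struct of_device_id *matches,
		const struct device_node *node)
{
	return NULL;
}
#endif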
Re: af_packet: use after free in prb_retire_rx_blk_timer_expired
On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com wrote: > Hi all, > > I seem to be hitting this use-after-free on a -next kernel using trinity: > > [ 531.036054] BUG: KASAN: use-after-free in prb_retire_rx_blk_timer_expired > (net/packet/af_packet.c:688) > [ 531.036961] Read of size 8 at addr 88038c1fb0e8 by task > swapper/1/0 > [ 531.037727] > > [ 531.037928] CPU: 1 PID: 0 Comm: swapper/1 Not > tainted 4.11.0-rc5-next-20170407-dirty #24 Funny, I was just going over my old pending bugs, and found this one from January that looks like what happens with the same bug, but without kasan.. context: PID: 0 TASK: 881ff2fa5100 CPU: 5 COMMAND: "swapper/5" panic: general protection fault: [#1] netversion: 2.2-1 (Feb 2014) Backtrace: #0 [881fffaa3c00] machine_kexec at 81044af8 #1 [881fffaa3c60] __crash_kexec at 810ec755 #2 [881fffaa3d28] crash_kexec at 810ec81f #3 [881fffaa3d40] oops_end at 8101e348 #4 [881fffaa3d68] die at 8101e76b #5 [881fffaa3d98] do_general_protection at 8101be76 #6 [881fffaa3dc0] general_protection at 817fe5a2 [exception RIP: prb_retire_rx_blk_timer_expired+65] RIP: 817e6e41 RSP: 881fffaa3e78 RFLAGS: 00010246 RAX: RBX: 881fd7075800 RCX: RDX: 883ff0a16bb0 RSI: 0074636361757063 RDI: 881fd70758bc RBP: 881fffaa3e88 R8: 0001 R9: 0005 R10: R11: R12: 881fd7075b78 R13: 0100 R14: 817e6e00 R15: 881fd7075800 ORIG_RAX: CS: 0010 SS: 0018 #7 [881fffaa3e90] call_timer_fn at 810cec35 #8 [881fffaa3ec8] run_timer_softirq at 810cf01c #9 [881fffaa3f28] __softirqentry_text_start at 817ff05c #10 [881fffaa3f88] irq_exit at 8107d5fc #11 [881fffaa3f98] smp_apic_timer_interrupt at 817feea2 #12 [881fffaa3fb0] apic_timer_interrupt at 817fd56f --- --- #13 [881ff2fbfdd0] apic_timer_interrupt at 817fd56f RIP: 0018 RSP: RFLAGS: 81ebbb60 RAX: e8e0002a0400 RBX: 0067b502e95f RCX: 0006 RDX: 002e RSI: 0034 RDI: 0001 RBP: 81150540 R8: 881ff2fbfee0 R9: 0001 R10: 0005 R11: 81ebbb60 R12: 881ff2fbfe48 R13: 881ff2fa5110 R14: R15: 881ff2fa5100 ORIG_RAX: 881fffab5340 CS: 20c49ba5e353f7cf SS: ff10 WARNING: possibly bogus exception frame Dmesg: Code: 00 00 48 8b 93 10 03 00 00 80 bb 21 03 00 00 00 44 0f b6 83 20 03 00 00 0f b7 c8 48 8b 34 ca 75 57 <44> 8b 5e 0c 45 85 db 74 1d 8b 93 68 03 00 00 85 d2 74 13 f3 90 RIP [] prb_retire_rx_blk_timer_expired+0x41/0x120 RSP [ cut here ]
Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function
On Thu, Mar 23, 2017 at 06:42:01PM +0100, Daniel Lezcano wrote:

> diff --git a/drivers/clocksource/timer-nps.c b/drivers/clocksource/timer-nps.c
> index da1f798..dbdb622 100644
> --- a/drivers/clocksource/timer-nps.c
> +++ b/drivers/clocksource/timer-nps.c
> @@ -256,7 +256,7 @@ static int __init nps_setup_clockevent(struct device_node *node)
>  		return ret;
>
>  	/* Needs apriori irq_set_percpu_devid() done in intc map function */
> -	ret = request_percpu_irq(nps_timer0_irq, timer_irq_handler,
> +	ret = request_percpu_irq(nps_timer0_irq, timer_irq_handler, IRQF_TIMER,
>  				 "Timer0 (per-cpu-tick)",
>  				 &nps_clockevent_device);

Wrong parameter order here.

drew
Re: run_timer_softirq gpf. [smc]
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote: > > I just hit this while fuzzing.. > > > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 > > task: 88017f0ed440 task.stack: c9094000 > > RIP: 0010:run_timer_softirq+0x15f/0x700 > > RSP: 0018:880507c03ec8 EFLAGS: 00010086 > > RAX: dead0200 RBX: 880507dd0d00 RCX: 0002 > > RDX: 880507c03ed0 RSI: RDI: 8204b3a0 > > RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8 > > R10: 880507dd0db0 R11: R12: 8215cc38 > > R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8 > > FS: () GS:880507c0() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 0004 CR3: 05011000 CR4: 001406e0 > > Call Trace: > > > > ? clockevents_program_event+0x47/0x120 > > __do_softirq+0xbf/0x5b1 > > irq_exit+0xb5/0xc0 > > smp_apic_timer_interrupt+0x3d/0x50 > > apic_timer_interrupt+0x97/0xa0 > > RIP: 0010:cpuidle_enter_state+0x12e/0x400 > > RSP: 0018:c9097e40 EFLAGS: 0202 > > [CONT START] ORIG_RAX: ff10 > > RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001 > > RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440 > > RBP: c9097e80 R08: R09: 0008 > > R10: R11: R12: 0005 > > R13: 820b9338 R14: 0005 R15: 820b9320 > > > > cpuidle_enter+0x17/0x20 > > call_cpuidle+0x23/0x40 > > do_idle+0xfb/0x200 > > cpu_startup_entry+0x71/0x80 > > start_secondary+0x16a/0x210 > > start_cpu+0x14/0x14 > > Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f > > 82 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 > > 41 f6 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 > > The timer which expires has timer->entry.next == POISON2 ! > > it's a classic list corruption. The > bad news is that there is no trace of the culprit because that happens when > some other timer expires after some random amount of time. > > If that is reproducible, then please enable debugobjects. That should > pinpoint the culprit. It's net/smc. This recently had a similar bug with workqueues. (https://marc.info/?l=linux-kernel=148821582909541) fixed by 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d so it's probably unsurprising that there are similar issues. WARNING: CPU: 0 PID: 2430 at lib/debugobjects.c:289 debug_print_object+0x87/0xb0 ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 CPU: 0 PID: 2430 Comm: trinity-c4 Not tainted 4.11.0-rc3-think+ #3 Call Trace: dump_stack+0x68/0x93 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 ? debug_check_no_obj_freed+0xd9/0x260 debug_print_object+0x87/0xb0 ? work_on_cpu+0xd0/0xd0 debug_check_no_obj_freed+0x219/0x260 ? 
__sk_destruct+0x10d/0x1c0 kmem_cache_free+0x9f/0x370 __sk_destruct+0x10d/0x1c0 sk_destruct+0x20/0x30 __sk_free+0x43/0xa0 sk_free+0x18/0x20 smc_release+0x100/0x1a0 [smc] sock_release+0x1f/0x80 sock_close+0x12/0x20 __fput+0xf3/0x200 fput+0xe/0x10 task_work_run+0x85/0xb0 do_exit+0x331/0xd70 __secure_computing+0x9c/0xa0 syscall_trace_enter+0xd1/0x3d0 do_syscall_64+0x15f/0x1d0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f535f4b19e7 RSP: 002b:7fff1a0f40e8 EFLAGS: 0246 ORIG_RAX: 0008 RAX: ffda RBX: 0004 RCX: 7f535f4b19e7 RDX: RSI: RDI: 0004 RBP: R08: 7f535fb8b000 R09: 00c17c2740a303e4 R10: R11: 0246 R12: 7fff1a0f40f5 R13: 7f535fb60048 R14: 7f535fb83ad8 R15: 7f535fb6 ---[ end trace ee67155de15508db ]--- == [ INFO: possible circular locking dependency detected ] 4.11.0-rc3-think+ #3 Not tainted --- trinity-c4/2430 is trying to acquire lock: ( (console_sem).lock ){-.-...} , at: [] down_trylock+0x14/0x40 but task is already holding lock: ( _hash[i].lock ){-.-.-.} , at: [] debug_check_no_obj_freed+0xd9/0x260 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 ( _hash[i].lock ){-.-.-.} : lock_acquire+0x102/0x260 _raw_spin_lock_irqsave+0x4c/0x90 __debug_object_init+0x79/0x460 debug_object_init+0x16/0x20 hrtimer_init+0x25/0x1d0 init_dl_task_timer+0x20/0x30 __sched_fork.isra.91+0x9c/0x140 init_idle+0x51/0x240 sched_init+0x4cd/0x547 start_kernel+0x246/0x45d x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x178/0x18b
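For context on the class of bug being pointed at here -- an object freed
while its embedded delayed work (and therefore its timer) can still fire --
a generic teardown sketch with made-up names; this is not the actual smc
fix:

#include <linux/workqueue.h>
#include <linux/slab.h>

struct conn {			/* made-up structure for illustration */
	struct delayed_work dwork;
};

static void conn_destroy(struct conn *c)
{
	/* Make sure the work and its timer can no longer fire; otherwise
	 * the timer wheel is left pointing at poisoned memory, as in the
	 * GPF above.
	 */
	cancel_delayed_work_sync(&c->dwork);
	kfree(c);
}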
Re: [4.10+] sctp lockdep trace
On Tue, Mar 14, 2017 at 11:35:33AM +0800, Xin Long wrote: > >> > [ 245.416594] ( > >> > [ 245.424928] sk_lock-AF_INET > >> > [ 245.433279] ){+.+.+.} > >> > [ 245.441889] , at: [] sctp_sendmsg+0x330/0xfe0 > >> > [sctp] > >> > [ 245.450167] > >> >stack backtrace: > >> > [ 245.466352] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted > >> > 4.10.0-think+ #7 > >> > [ 245.482894] Call Trace: > >> > [ 245.491096] dump_stack+0x68/0x93 > >> > [ 245.499314] lockdep_rcu_suspicious+0xce/0xf0 > >> > [ 245.507610] sctp_hash_transport+0x6c0/0x7e0 [sctp] > >> > [ 245.515972] ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp] > >> > [ 245.524366] sctp_assoc_add_peer+0x290/0x3c0 [sctp] > >> > [ 245.532736] sctp_sendmsg+0x8f7/0xfe0 [sctp] > >> > [ 245.541040] ? rw_copy_check_uvector+0x8e/0x190 > >> > [ 245.549402] ? import_iovec+0x3a/0xe0 > >> > [ 245.557679] inet_sendmsg+0x49/0x1e0 > >> > [ 245.565887] ___sys_sendmsg+0x2d4/0x300 > >> > [ 245.574092] ? debug_smp_processor_id+0x17/0x20 > >> > [ 245.582342] ? debug_smp_processor_id+0x17/0x20 > >> > [ 245.590508] ? get_lock_stats+0x19/0x50 > >> > [ 245.598641] __sys_sendmsg+0x54/0x90 > >> > [ 245.606745] SyS_sendmsg+0x12/0x20 > >> > [ 245.614784] do_syscall_64+0x66/0x1d0 > >> > [ 245.622828] entry_SYSCALL64_slow_path+0x25/0x25 > >> > [ 245.630894] RIP: 0033:0x7fe095fcb0f9 > >> > [ 245.638962] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246 > >> > [ 245.647071] ORIG_RAX: 002e > >> > [ 245.655186] RAX: ffda RBX: 002e RCX: > >> > 7fe095fcb0f9 > >> > [ 245.663435] RDX: 0080 RSI: 5592de12ddc0 RDI: > >> > 012d > >> > [ 245.671776] RBP: 7fe0965c8000 R08: c000 R09: > >> > 00dc > >> > [ 245.680111] R10: 000302120088 R11: 0246 R12: > >> > 0002 > >> > [ 245.688460] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: > >> > 7fe0965c8000 > >> > > >> > >> Cc'ing Xin and linux-sctp@ mailing list. > > > > Seems the same as Andrey Konovalov had reported? > > > I would think so, this patch has fixed it: > > commit 5179b26694c92373275e4933f5d0ff32d585c675 > Author: Xin Long> Date: Tue Feb 28 12:41:29 2017 +0800 > > sctp: call rcu_read_lock before checking for duplicate transport nodes > > not sure which commit your tests are based on, Dave, can you > check if this fix has been in your test kernel? Haven't seen this in a while. Let's call it fixed. Dave
[4.10+] sctp lockdep trace
[ 244.251557] === [ 244.263321] [ ERR: suspicious RCU usage. ] [ 244.274982] 4.10.0-think+ #7 Not tainted [ 244.286511] --- [ 244.298008] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage! [ 244.309665] other info that might help us debug this: [ 244.344629] rcu_scheduler_active = 2, debug_locks = 1 [ 244.367839] 1 lock held by trinity-c30/1781: [ 244.379481] #0: [ 244.390848] ( [ 244.402372] sk_lock-AF_INET [ 244.413825] ){+.+.+.} [ 244.425231] , at: [] sctp_sendmsg+0x330/0xfe0 [sctp] [ 244.436774] stack backtrace: [ 244.459620] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 [ 244.482790] Call Trace: [ 244.494201] dump_stack+0x68/0x93 [ 244.505598] lockdep_rcu_suspicious+0xce/0xf0 [ 244.516924] sctp_hash_transport+0x406/0x7e0 [sctp] [ 244.528137] ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp] [ 244.539243] sctp_assoc_add_peer+0x290/0x3c0 [sctp] [ 244.550291] sctp_sendmsg+0x8f7/0xfe0 [sctp] [ 244.561258] ? rw_copy_check_uvector+0x8e/0x190 [ 244.572308] ? import_iovec+0x3a/0xe0 [ 244.583232] inet_sendmsg+0x49/0x1e0 [ 244.594150] ___sys_sendmsg+0x2d4/0x300 [ 244.605002] ? debug_smp_processor_id+0x17/0x20 [ 244.615844] ? debug_smp_processor_id+0x17/0x20 [ 244.626533] ? get_lock_stats+0x19/0x50 [ 244.637141] __sys_sendmsg+0x54/0x90 [ 244.647817] SyS_sendmsg+0x12/0x20 [ 244.658400] do_syscall_64+0x66/0x1d0 [ 244.668990] entry_SYSCALL64_slow_path+0x25/0x25 [ 244.679582] RIP: 0033:0x7fe095fcb0f9 [ 244.690079] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246 [ 244.700704] ORIG_RAX: 002e [ 244.711248] RAX: ffda RBX: 002e RCX: 7fe095fcb0f9 [ 244.721818] RDX: 0080 RSI: 5592de12ddc0 RDI: 012d [ 244.732282] RBP: 7fe0965c8000 R08: c000 R09: 00dc [ 244.742576] R10: 000302120088 R11: 0246 R12: 0002 [ 244.752804] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 7fe0965c8000 [ 244.775549] === [ 244.785875] [ ERR: suspicious RCU usage. ] [ 244.796951] 4.10.0-think+ #7 Not tainted [ 244.807185] --- [ 244.819213] ./include/linux/rhashtable.h:605 suspicious rcu_dereference_check() usage! [ 244.829420] other info that might help us debug this: [ 244.859963] rcu_scheduler_active = 2, debug_locks = 1 [ 244.879766] 1 lock held by trinity-c30/1781: [ 244.889953] #0: [ 244.90] ( [ 244.909854] sk_lock-AF_INET [ 244.919645] ){+.+.+.} [ 244.929238] , at: [] sctp_sendmsg+0x330/0xfe0 [sctp] [ 244.939167] stack backtrace: [ 244.958506] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 [ 244.978102] Call Trace: [ 244.987735] dump_stack+0x68/0x93 [ 244.997112] lockdep_rcu_suspicious+0xce/0xf0 [ 245.006588] sctp_hash_transport+0x4ca/0x7e0 [sctp] [ 245.016264] ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp] [ 245.025797] sctp_assoc_add_peer+0x290/0x3c0 [sctp] [ 245.035380] sctp_sendmsg+0x8f7/0xfe0 [sctp] [ 245.044883] ? rw_copy_check_uvector+0x8e/0x190 [ 245.054464] ? import_iovec+0x3a/0xe0 [ 245.064016] inet_sendmsg+0x49/0x1e0 [ 245.073516] ___sys_sendmsg+0x2d4/0x300 [ 245.082967] ? debug_smp_processor_id+0x17/0x20 [ 245.092448] ? debug_smp_processor_id+0x17/0x20 [ 245.101850] ? 
get_lock_stats+0x19/0x50 [ 245.70] __sys_sendmsg+0x54/0x90 [ 245.120451] SyS_sendmsg+0x12/0x20 [ 245.129649] do_syscall_64+0x66/0x1d0 [ 245.138783] entry_SYSCALL64_slow_path+0x25/0x25 [ 245.147678] RIP: 0033:0x7fe095fcb0f9 [ 245.156588] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246 [ 245.165503] ORIG_RAX: 002e [ 245.174601] RAX: ffda RBX: 002e RCX: 7fe095fcb0f9 [ 245.183861] RDX: 0080 RSI: 5592de12ddc0 RDI: 012d [ 245.193038] RBP: 7fe0965c8000 R08: c000 R09: 00dc [ 245.202214] R10: 000302120088 R11: 0246 R12: 0002 [ 245.211261] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 7fe0965c8000 [ 245.308216] === [ 245.317295] [ ERR: suspicious RCU usage. ] [ 245.327876] 4.10.0-think+ #7 Not tainted [ 245.337065] --- [ 245.345840] ./include/linux/rhashtable.h:616 suspicious rcu_dereference_check() usage! [ 245.356501] other info that might help us debug this: [ 245.382185] rcu_scheduler_active = 2, debug_locks = 1 [ 245.399415] 1 lock held by trinity-c30/1781: [ 245.408138] #0: [ 245.416594] ( [ 245.424928] sk_lock-AF_INET [ 245.433279] ){+.+.+.} [ 245.441889] , at: [] sctp_sendmsg+0x330/0xfe0 [sctp] [ 245.450167] stack backtrace: [ 245.466352] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 [
Re: [PATCH v2 net-next] liquidio: improve UDP TX performance
On 02/21/2017 01:09 PM, Felix Manlunas wrote:

From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
  gather lists with one large consistent DMA allocation per ring

Netperf benchmark numbers before and after patch:

PF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.52    |    0.93    |  +78.9  |
|   1    |  1024  |    1.62    |    2.84    |  +75.3  |
|        |  1518  |    2.44    |    4.21    |  +72.5  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.45    |    1.59    | +253.3  |
|   4    |  1024  |    1.34    |    5.48    | +308.9  |
|        |  1518  |    2.27    |    8.31    | +266.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.40    |    1.61    | +302.5  |
|   8    |  1024  |    1.64    |    4.24    | +158.5  |
|        |  1518  |    2.87    |    6.52    | +127.2  |
+--------+--------+------------+------------+---------+

VF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    1.28    |    1.49    |  +16.4  |
|   1    |  1024  |    4.44    |    4.39    |   -1.1  |
|        |  1518  |    6.08    |    6.51    |   +7.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    2.35    |    2.35    |    0.0  |
|   4    |  1024  |    6.41    |    8.07    |  +25.9  |
|        |  1518  |    9.56    |    9.54    |   -0.2  |
+--------+--------+------------+------------+---------+
|        |   360  |    3.41    |    3.65    |   +7.0  |
|   8    |  1024  |    9.35    |    9.34    |   -0.1  |
|        |  1518  |    9.56    |    9.57    |   +0.1  |
+--------+--------+------------+------------+---------+

Some good looking numbers there.  As one approaches the wire limit for
bitrate, the likes of a netperf service demand can be used to demonstrate
the performance change - though there isn't an easy way to do that for
parallel flows.

happy benchmarking,

rick jones
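The second bullet in the patch description - replacing per-packet streaming DMA mappings with one consistent allocation per ring - corresponds to a pattern like the hedged sketch below. All names and sizes here are illustrative, not taken from the liquidio driver.

/* Generic sketch of "one consistent DMA allocation per ring". */
#include <linux/dma-mapping.h>

#define RING_SIZE	512
#define INFO_SZ		32	/* per-descriptor info buffer, illustrative */

struct ring_info {
	void *info_base;	/* CPU address of the coherent block */
	dma_addr_t info_dma;	/* bus address of the same block */
};

static int ring_alloc_info(struct device *dev, struct ring_info *r)
{
	/* One allocation for the whole ring, instead of RING_SIZE streaming
	 * mappings that would otherwise be set up and torn down per packet.
	 */
	r->info_base = dma_alloc_coherent(dev, RING_SIZE * INFO_SZ,
					  &r->info_dma, GFP_KERNEL);
	return r->info_base ? 0 : -ENOMEM;
}

/* Descriptor i simply uses a fixed offset into the shared block. */
static dma_addr_t ring_info_dma(struct ring_info *r, unsigned int i)
{
	return r->info_dma + i * INFO_SZ;
}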
Re: [PATCH net-next] liquidio: improve UDP TX performance
On 02/16/2017 10:38 AM, Felix Manlunas wrote: From: VSR Burru <veerasenareddy.bu...@cavium.com> Improve UDP TX performance by: * reducing the ring size from 2K to 512 * replacing the numerous streaming DMA allocations for info buffers and gather lists with one large consistent DMA allocation per ring By how much was UDP TX performance improved? happy benchmarking, rick jones
Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs
On 02/03/2017 10:31 AM, Willem de Bruijn wrote: Configuring interrupts and xps from userspace at boot is more robust, as device driver defaults can change. But especially for customers who are unaware of these settings, choosing sane defaults won't hurt. The devil is in finding the sane defaults. For example, the issues we've seen with VMs sending traffic getting reordered when the driver took it upon itself to enable xps. rick jones
Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs
On 02/03/2017 10:22 AM, Benjamin Serebrin wrote: Thanks, Michael, I'll put this text in the commit log: XPS settings aren't write-able from userspace, so the only way I know to fix XPS is in the driver. ?? root@np-cp1-c0-m1-mgmt:/home/stack# cat /sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus ,0001 root@np-cp1-c0-m1-mgmt:/home/stack# echo 0 > /sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus root@np-cp1-c0-m1-mgmt:/home/stack# cat /sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus ,
prb_retire_rx_blk_timer_expired use-after-free
RSI looks kinda like slab poison here, so re-using a free'd ptr ? general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc4-think+ #2 task: 81e16500 task.stack: 81e0 RIP: 0010:prb_retire_rx_blk_timer_expired+0x42/0x130 RSP: 0018:880507803e30 EFLAGS: 00010246 RAX: 81e16500 RBX: 8804bc751158 RCX: RDX: 8804fb6e8008 RSI: a56b6b6b6b6b6b6b RDI: 0001 RBP: 880507803e48 R08: R09: 0001 R10: 61f74469 R11: 0054 R12: 8804bc751338 R13: 8804bc7516d8 R14: 818ab6a0 R15: 8804bc751158 FS: () GS:88050780() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 5578f64a0130 CR3: 03e11000 CR4: 001406f0 DR0: 7f539ba38000 DR1: DR2: DR3: DR6: 0ff0 DR7: 0600 Call Trace: call_timer_fn+0xd2/0x340 ? call_timer_fn+0x5/0x340 ? prb_retire_current_block+0x100/0x100 run_timer_softirq+0x284/0x650 ? 0xa035c077 ? run_timer_softirq+0x5/0x650 ? lapic_next_deadline+0x5/0x40 __do_softirq+0x143/0x431 irq_exit+0xa5/0xb0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x8d/0xa0 RIP: 0010:cpuidle_enter_state+0x129/0x360 RSP: 0018:81e03db8 EFLAGS: 0246 ORIG_RAX: ff10 RAX: RBX: e8603cc8 RCX: 001f RDX: 20c49ba5e353f7cf RSI: 81c5e743 RDI: 81c48102 RBP: 81e03df8 R08: cccd R09: 0018 R10: 022e R11: 0a2c R12: 0005 R13: 81eaf918 R14: 0005 R15: 81eaf900 ? cpuidle_enter_state+0x113/0x360 cpuidle_enter+0x17/0x20 call_cpuidle+0x23/0x40 do_idle+0xf6/0x1f0 cpu_startup_entry+0x71/0x80 rest_init+0xb8/0xc0 start_kernel+0x432/0x453 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x178/0x18b start_cpu+0x14/0x14 ? start_cpu+0x14/0x14 Code: fb 4c 89 e7 e8 b0 f1 01 00 0f b7 8b 2a 05 00 00 48 8b 93 18 05 00 00 80 bb 29 05 00 00 00 0f b6 bb 28 05 00 00 48 8b 34 ca 75 58 <8b> 56 0c 48 89 c8 85 d2 74 1d 8b 93 70 05 00 00 85 d2 74 13 f3 All code 0: fb sti 1: 4c 89 e7mov%r12,%rdi 4: e8 b0 f1 01 00 callq 0x1f1b9 9: 0f b7 8b 2a 05 00 00movzwl 0x52a(%rbx),%ecx 10: 48 8b 93 18 05 00 00mov0x518(%rbx),%rdx 17: 80 bb 29 05 00 00 00cmpb $0x0,0x529(%rbx) 1e: 0f b6 bb 28 05 00 00movzbl 0x528(%rbx),%edi 25: 48 8b 34 ca mov(%rdx,%rcx,8),%rsi 29: 75 58 jne0x83 2b:* 8b 56 0cmov0xc(%rsi),%edx <-- trapping instruction 2e: 48 89 c8mov%rcx,%rax 31: 85 d2 test %edx,%edx 33: 74 1d je 0x52 35: 8b 93 70 05 00 00 mov0x570(%rbx),%edx 3b: 85 d2 test %edx,%edx 3d: 74 13 je 0x52 3f: f3 repz Code starting with the faulting instruction === 0: 8b 56 0cmov0xc(%rsi),%edx 3: 48 89 c8mov%rcx,%rax 6: 85 d2 test %edx,%edx 8: 74 1d je 0x27 a: 8b 93 70 05 00 00 mov0x570(%rbx),%edx 10: 85 d2 test %edx,%edx 12: 74 13 je 0x27 14: f3 repz That code is the BLOCK_NUM_PKTS line here.. 677 spin_lock(>sk.sk_receive_queue.lock); 678 679 frozen = prb_queue_frozen(pkc); 680 pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); 681 682 if (unlikely(pkc->delete_blk_timer)) 683 goto out; 684 685 /* We only need to plug the race when the block is partially filled. 686 * tpacket_rcv: 687 * lock(); increment BLOCK_NUM_PKTS; unlock() 688 * copy_bits() is in progress ... 689 * timer fires on other cpu: 690 * we can't retire the current block because copy_bits 691 * is in progress. 692 * 693 */ 694 if (BLOCK_NUM_PKTS(pbd)) {
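Whatever the exact trigger turns out to be, the symptom is the classic one of a timer callback running after the object it dereferences has been freed. A generic illustration of the teardown ordering that avoids this class of bug (not the af_packet fix itself) is:

/* Generic sketch: before freeing anything the timer callback can reach,
 * the teardown path must stop the timer and wait for a handler that may
 * already be running on another CPU.
 */
#include <linux/timer.h>
#include <linux/slab.h>

struct thing {
	struct timer_list retire_timer;
	void *state;			/* what the callback dereferences */
};

static void thing_destroy(struct thing *t)
{
	del_timer_sync(&t->retire_timer);	/* wait for in-flight callbacks */
	kfree(t->state);			/* now safe to free */
	kfree(t);
}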
Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/17/2017 11:13 AM, Eric Dumazet wrote: On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jon...@hpe.com> wrote: Drifting a bit, and it doesn't change the value of dealing with it, but out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side applications reacting to the read return of zero triggered by the arrival of the FIN? Even if the application reacts, and calls close(fd), kernel will still try to push the data that was queued into socket write queue prior to receiving the FIN. By allowing this RST, we can flush the whole data and react much faster, avoiding locking memory in the kernel for very long time. Understood. I was just wondering if there is also an application bug here. happy benchmarking, rick jones
Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/17/2017 10:37 AM, Jason Baron wrote: From: Jason Baron <jba...@akamai.com> Using a Mac OSX box as a client connecting to a Linux server, we have found that when certain applications (such as 'ab'), are abruptly terminated (via ^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is accepted by the Linux stack but the RST is sent with the same sequence number as the FIN, and Linux responds with a challenge ACK per RFC 5961. The OSX client then sometimes (they are rate-limited) does not reply with any RST as would be expected on a closed socket. This results in sockets accumulating on the Linux server left mostly in the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This sequence of events can tie up a lot of resources on the Linux server since there may be a lot of data in write buffers at the time of the RST. Accepting a RST equal to rcv_nxt - 1, after we have already successfully processed a FIN, has made a significant difference for us in practice, by freeing up unneeded resources in a more expedient fashion. Drifting a bit, and it doesn't change the value of dealing with it, but out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side applications reacting to the read return of zero triggered by the arrival of the FIN? happy benchmarking, rick jones
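A hedged sketch of the acceptance rule Jason describes, simplified and not the literal patch: once the FIN has been processed (so rcv_nxt already accounts for it), a RST carrying the pre-FIN sequence number is also treated as valid.

/* Simplified illustration only; uses kernel TCP types but is not the
 * submitted change.
 */
static bool rst_seq_ok(const struct tcp_sock *tp, const struct sock *sk,
		       u32 seq)
{
	if (seq == tp->rcv_nxt)
		return true;			/* exact in-order RST */

	/* FIN already consumed a sequence number; accept the RST the peer
	 * sends with the pre-FIN value, as the OSX client does here.
	 */
	return seq == tp->rcv_nxt - 1 &&
	       (sk->sk_state == TCP_CLOSE_WAIT ||
		sk->sk_state == TCP_CLOSING   ||
		sk->sk_state == TCP_LAST_ACK);
}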
Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout
On 01/13/2017 02:56 PM, Tom Herbert wrote: On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed what configuration are you running ? what traffic ? Nothing fancy. 8 queues and 20 concurrent netperf TCP_STREAMs trips it. Not a lot of them, but I don't think we really should ever see these errors. Straight-up defaults with netperf, or do you use specific -s/S or -m/M options? happy benchmarking, rick jones
ipv6: remove unnecessary inet6_sk check
np is already assigned in the variable declaration of ping_v6_sendmsg. At this point, we have already dereferenced np several times, so the NULL check is also redundant.

Suggested-by: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index e1f8b34d7a2e..9b522fa90e6d 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -126,12 +126,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) return PTR_ERR(dst); rt = (struct rt6_info *) dst; - np = inet6_sk(sk); - if (!np) { - err = -EBADF; - goto dst_err_out; - } - if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr)) fl6.flowi6_oif = np->mcast_oif; else if (!fl6.flowi6_oif) @@ -166,7 +160,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) } release_sock(sk); -dst_err_out: dst_release(dst); if (err)
sunrpc: Illegal context switch in RCU read-side critical section!
Just noticed this on 4.9. Will try and repro on 4.10rc1 later, but hitting unrelated boot problems on that machine right now. === [ INFO: suspicious RCU usage. ] 4.9.0-backup-debug+ #1 Not tainted --- ./include/linux/rcupdate.h:557 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 5 locks held by kworker/4:1/66: #0: ("%s"("ipv6_addrconf")){.+.+..}, at: [] process_one_work+0x184/0x790 #1: ((addr_chk_work).work){+.+...}, at: [] process_one_work+0x184/0x790 #2: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20 #3: (rcu_read_lock_bh){..}, at: [] addrconf_verify_rtnl+0x23/0x500 #4: (rcu_read_lock){..}, at: [] atomic_notifier_call_chain+0x5/0x110 stack backtrace: CPU: 4 PID: 66 Comm: kworker/4:1 Not tainted 4.9.0-backup-debug+ #1 Workqueue: ipv6_addrconf addrconf_verify_work c9273a28 8e5b4ca5 88042ae19780 0001 c9273a58 8e0d530e 8efcc659 09a7 8804180b8580 c9273a80 8e0ad2b7 Call Trace: [] dump_stack+0x68/0x93 [] lockdep_rcu_suspicious+0xce/0xf0 [] ___might_sleep.part.103+0xa7/0x230 [] __might_sleep+0x4b/0x90 [] lock_sock_nested+0x32/0xb0 [] sock_setsockopt+0x8b/0xa50 [] ? __local_bh_enable_ip+0x65/0xb0 [] kernel_setsockopt+0x49/0x50 [] svc_tcp_kill_temp_xprt+0x4a/0x60 [] svc_age_temp_xprts_now+0x12f/0x1b0 [] nfsd_inet6addr_event+0x192/0x1f0 [] ? nfsd_inet6addr_event+0x5/0x1f0 [] notifier_call_chain+0x39/0xa0 [] atomic_notifier_call_chain+0x6e/0x110 [] ? atomic_notifier_call_chain+0x5/0x110 [] inet6addr_notifier_call_chain+0x1b/0x20 [] ipv6_del_addr+0x12c/0x200 [] addrconf_verify_rtnl+0x417/0x500 [] ? addrconf_verify_rtnl+0x23/0x500 [] addrconf_verify_work+0x13/0x20 [] process_one_work+0x20b/0x790 [] ? process_one_work+0x184/0x790 [] worker_thread+0x4e/0x490 [] ? process_one_work+0x790/0x790 [] ? process_one_work+0x790/0x790 [] kthread+0xff/0x120 [] ? kthread_worker_fn+0x140/0x140 [] ret_from_fork+0x27/0x40
ipv6: handle -EFAULT from skb_copy_bits
By setting certain socket options on ipv6 raw sockets, we can confuse the length calculation in rawv6_push_pending_frames, triggering a BUG_ON that looks like:

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40 RSP: 0018:881f6c4a7c18 EFLAGS: 00010282 RAX: fff2 RBX: 881f6c681680 RCX: 0002 RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00 RBP: 881f6c4a7da8 R08: R09: 0009 R10: 881fed0f6a00 R11: 0009 R12: 0030 R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80

Call Trace: [] ? unmap_page_range+0x693/0x830 [] inet_sendmsg+0x67/0xa0 [] sock_sendmsg+0x38/0x50 [] SYSC_sendto+0xef/0x170 [] SyS_sendto+0xe/0x10 [] do_syscall_64+0x50/0xa0 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle this by jumping to the failure path if skb_copy_bits returns -EFAULT.

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define LEN 504

int main(int argc, char* argv[])
{
	int fd;
	int zero = 0;
	char buf[LEN];

	memset(buf, 0, LEN);

	fd = socket(AF_INET6, SOCK_RAW, 7);

	setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
	setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

	sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
}

Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..ea89073c8247 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,11 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 
 	offset += skb_transport_offset(skb);
-	BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+	err = skb_copy_bits(skb, offset, &csum, 2);
+	if (err < 0) {
+		ip6_flush_pending_frames(sk);
+		goto out;
+	}
 
 	/* in case cksum was not initialized */
 	if (unlikely(csum))
Re: ipv6: handle -EFAULT from skb_copy_bits
On Wed, Dec 21, 2016 at 10:33:20PM +0100, Hannes Frederic Sowa wrote: > > Given all of this, I think the best thing to do is validate the offset > > after the queue walks, which is pretty much what Dave Jones's original > > patch was doing. > > I think both approaches protect against the bug reasonably well, but > Dave's patch has a bug: we must either call ip6_flush_pending_frames to > clear the socket write queue with the buggy send request. I can fix that up and resubmit, or we can go with your approach. DaveM ? Dave
Re: ipv6: handle -EFAULT from skb_copy_bits
On Tue, Dec 20, 2016 at 11:31:38AM -0800, Cong Wang wrote: > On Tue, Dec 20, 2016 at 10:17 AM, Dave Jones <da...@codemonkey.org.uk> wrote: > > On Mon, Dec 19, 2016 at 08:36:23PM -0500, David Miller wrote: > > > From: Dave Jones <da...@codemonkey.org.uk> > > > Date: Mon, 19 Dec 2016 19:40:13 -0500 > > > > > > > On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote: > > > > > > > > > Unfortunately, this made no difference. I spent some time today > > trying > > > > > to make a better reproducer, but failed. I'll revisit again > > tomorrow. > > > > > > > > > > Maybe I need >1 process/thread to trigger this. That would > > explain why > > > > > I can trigger it with Trinity. > > > > > > > > scratch that last part, I finally just repro'd it with a single > > process. > > > > > > Thanks for the info, I'll try to think about this some more. > > > > I threw in some debug printks right before that BUG_ON. > > it's always this: > > > > skb->len=31 skb->data_len=0 offset:30 total_len:9 > > Clearly we fail because 30 > 31 - 2, seems 'offset' is not correct here, > off-by-one? Ok, I finally made a messy, albeit good enough reproducer. #include #include #include #include #include #include #include #define LEN 504 int main(int argc, char* argv[]) { int fd; int zero = 0; char buf[LEN]; memset(buf, 0, LEN); fd = socket(AF_INET6, SOCK_RAW, 7); setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, , 4); setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, , LEN); sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110); }
Re: ipv6: handle -EFAULT from skb_copy_bits
On Tue, Dec 20, 2016 at 01:28:13PM -0500, David Miller wrote: > This has to do with the SKB buffer layout and geometry, not whether > the packet is "fragmented" in the protocol sense. > > So no, this isn't a criteria for packets being filtered out by this > point. > > Can you try to capture what sk->sk_socket->type and > inet_sk(sk)->hdrincl are set to at the time of the crash? > type:3 hdrincl:0 Dave
Re: ipv6: handle -EFAULT from skb_copy_bits
On Mon, Dec 19, 2016 at 08:36:23PM -0500, David Miller wrote: > From: Dave Jones <da...@codemonkey.org.uk> > Date: Mon, 19 Dec 2016 19:40:13 -0500 > > > On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote: > > > > > Unfortunately, this made no difference. I spent some time today trying > > > to make a better reproducer, but failed. I'll revisit again tomorrow. > > > > > > Maybe I need >1 process/thread to trigger this. That would explain why > > > I can trigger it with Trinity. > > > > scratch that last part, I finally just repro'd it with a single process. > > Thanks for the info, I'll try to think about this some more. I threw in some debug printks right before that BUG_ON. it's always this: skb->len=31 skb->data_len=0 offset:30 total_len:9 Shouldn't we have kicked out data_len=0 skb's somewhere before we got this far ? Dave
Re: [PATCH 1/3] NFC: trf7970a: add device tree option for 27MHz clock
On 2016-12-20 17:16, Geoff Lansberry wrote: > From: Geoff Lansberry> > The TRF7970A has configuration options to support hardware designs > which use a 27.12MHz clock. This commit adds a device tree option > 'clock-frequency' to support configuring the this chip for default > 13.56MHz clock or the optional 27.12MHz clock. > --- > .../devicetree/bindings/net/nfc/trf7970a.txt | 4 ++ > drivers/nfc/trf7970a.c | 50 > +- > 2 files changed, 43 insertions(+), 11 deletions(-) > > diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt > b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt > index 32b35a0..e262ac1 100644 > --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt > +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt > @@ -21,6 +21,8 @@ Optional SoC Specific Properties: > - t5t-rmb-extra-byte-quirk: Specify that the trf7970a has the erratum >where an extra byte is returned by Read Multiple Block commands issued >to Type 5 tags. > +- clock-frequency: Set to specify that the input frequency to the trf7970a > is 1356Hz or 2712Hz > + You're adding an empty line here that is removed in the next patch. > > Example (for ARM-based BeagleBone with TRF7970A on SPI1): > > @@ -43,6 +45,8 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1): > irq-status-read-quirk; > en2-rf-quirk; > t5t-rmb-extra-byte-quirk; > + vdd_io_1v8; This does not belong here, and so no need to remove in the next patch. > + clock-frequency = <2712>; > status = "okay"; > }; > }; > diff --git a/drivers/nfc/trf7970a.c b/drivers/nfc/trf7970a.c > index 26c9dbb..4e051e9 100644 > --- a/drivers/nfc/trf7970a.c > +++ b/drivers/nfc/trf7970a.c > @@ -124,6 +124,9 @@ >NFC_PROTO_ISO15693_MASK | NFC_PROTO_NFC_DEP_MASK) > > #define TRF7970A_AUTOSUSPEND_DELAY 3 /* 30 seconds */ > +#define TRF7970A_13MHZ_CLOCK_FREQUENCY 1356 > +#define TRF7970A_27MHZ_CLOCK_FREQUENCY 2712 > + > > #define TRF7970A_RX_SKB_ALLOC_SIZE 256 > > @@ -1056,12 +1059,11 @@ static int trf7970a_init(struct trf7970a *trf) > > trf->chip_status_ctrl &= ~TRF7970A_CHIP_STATUS_RF_ON; > > - ret = trf7970a_write(trf, TRF7970A_MODULATOR_SYS_CLK_CTRL, 0); > + ret = trf7970a_write(trf, TRF7970A_MODULATOR_SYS_CLK_CTRL, > + trf->modulator_sys_clk_ctrl); > if (ret) > goto err_out; > > - trf->modulator_sys_clk_ctrl = 0; > - > ret = trf7970a_write(trf, TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS, > TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS_WLH_96 | > TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS_WLL_32); > @@ -1181,27 +1183,37 @@ static int trf7970a_in_config_rf_tech(struct trf7970a > *trf, int tech) > switch (tech) { > case NFC_DIGITAL_RF_TECH_106A: > trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_14443A_106; > - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_OOK; > + trf->modulator_sys_clk_ctrl = > + (trf->modulator_sys_clk_ctrl & 0xF8) | > + TRF7970A_MODULATOR_DEPTH_OOK; > trf->guard_time = TRF7970A_GUARD_TIME_NFCA; > break; > case NFC_DIGITAL_RF_TECH_106B: > trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_14443B_106; > - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10; > + trf->modulator_sys_clk_ctrl = > + (trf->modulator_sys_clk_ctrl & 0xF8) | > + TRF7970A_MODULATOR_DEPTH_ASK10; > trf->guard_time = TRF7970A_GUARD_TIME_NFCB; > break; > case NFC_DIGITAL_RF_TECH_212F: > trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_FELICA_212; > - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10; > + trf->modulator_sys_clk_ctrl = > + (trf->modulator_sys_clk_ctrl & 0xF8) | > + TRF7970A_MODULATOR_DEPTH_ASK10; > trf->guard_time = TRF7970A_GUARD_TIME_NFCF; > break; > case 
NFC_DIGITAL_RF_TECH_424F: > trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_FELICA_424; > - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10; > + trf->modulator_sys_clk_ctrl = > + (trf->modulator_sys_clk_ctrl & 0xF8) | > + TRF7970A_MODULATOR_DEPTH_ASK10; > trf->guard_time = TRF7970A_GUARD_TIME_NFCF; > break; > case NFC_DIGITAL_RF_TECH_ISO15693: > trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_15693_SGL_1OF4_2648; > - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_OOK; > + trf->modulator_sys_clk_ctrl = > + (trf->modulator_sys_clk_ctrl &
Re: ipv6: handle -EFAULT from skb_copy_bits
On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote: > Unfortunately, this made no difference. I spent some time today trying > to make a better reproducer, but failed. I'll revisit again tomorrow. > > Maybe I need >1 process/thread to trigger this. That would explain why > I can trigger it with Trinity. scratch that last part, I finally just repro'd it with a single process. Dave
Re: ipv6: handle -EFAULT from skb_copy_bits
On Mon, Dec 19, 2016 at 02:48:48PM -0500, David Miller wrote: > One thing that's interesting is that if the user picks "IPPROTO_RAW" > as the value of 'protocol' we set inet->hdrincl to 1. > > The user can also set inet->hdrincl to 1 or 0 via setsockopt(). > > I think this is part of the problem. The test above means to check > for "RAW socket with hdrincl set" and is trying to do this more simply. > But because setsockopt() can set this arbitrarily, testing sk_protocol > alone isn't enough. > > So changing: > > sk->sk_protocol == IPPROTO_RAW > > into something like: > > (sk->sk_socket->type == SOCK_RAW && inet_sk(sk)->hdrincl) > > should correct the test. > .. > > You can test if the change I suggest above works. Unfortunately, this made no difference. I spent some time today trying to make a better reproducer, but failed. I'll revisit again tomorrow. Maybe I need >1 process/thread to trigger this. That would explain why I can trigger it with Trinity. Dave
Re: ipv6: handle -EFAULT from skb_copy_bits
On Sat, Dec 17, 2016 at 10:41:20AM -0500, David Miller wrote: > > It seems to be possible to craft a packet for sendmsg that triggers > > the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like: > > > > RIP: 0010:[] [] > > rawv6_sendmsg+0xc30/0xc40 > > RSP: 0018:881f6c4a7c18 EFLAGS: 00010282 > > RAX: fff2 RBX: 881f6c681680 RCX: 0002 > > RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00 > > RBP: 881f6c4a7da8 R08: R09: 0009 > > R10: 881fed0f6a00 R11: 0009 R12: 0030 > > R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80 > > > > Call Trace: > > [] ? unmap_page_range+0x693/0x830 > > [] inet_sendmsg+0x67/0xa0 > > [] sock_sendmsg+0x38/0x50 > > [] SYSC_sendto+0xef/0x170 > > [] SyS_sendto+0xe/0x10 > > [] do_syscall_64+0x50/0xa0 > > [] entry_SYSCALL64_slow_path+0x25/0x25 > > > > Handle this in rawv6_push_pending_frames and jump to the failure path. > > > > Signed-off-by: Dave Jones <da...@codemonkey.org.uk> > > Hmmm, that's interesting. Becaue the code in __ip6_append_data(), which > sets up the ->cork.base.length value, seems to be defensively trying to > avoid this possibility. > > For example, it checks things like: > > if (cork->length + length > mtu - headersize && ipc6->dontfrag && > (sk->sk_protocol == IPPROTO_UDP || > sk->sk_protocol == IPPROTO_RAW)) { > > This is why the transport offset plus the length should never exceed > the total length for that skb_copy_bits() call. > > Perhaps this protocol check in the code above is incomplete? Do you > know what the sk->sk_protocol value was when that BUG triggered? That > might shine some light on what is really happening here. Hm. sk_protocol = 7, struct sock { __sk_common = { { skc_addrpair = 0, { skc_daddr = 0, skc_rcv_saddr = 0 } }, { skc_hash = 0, skc_u16hashes = {0, 0} }, { skc_portpair = 458752, { skc_dport = 0, skc_num = 7 } }, skc_family = 10, skc_state = 7 '\a', skc_reuse = 1 '\001', skc_reuseport = 0 '\000', skc_ipv6only = 0 '\000', skc_net_refcnt = 1 '\001', skc_bound_dev_if = 0, { skc_bind_node = { next = 0x0, pprev = 0x0 }, skc_portaddr_node = { next = 0x0, pprev = 0x0 } }, skc_prot = 0x81cf3bc0 , skc_net = { net = 0x81ce78c0 }, skc_v6_daddr = { in6_u = { u6_addr8 = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, u6_addr32 = {0, 0, 0, 0} } }, }, skc_v6_rcv_saddr = { in6_u = { u6_addr8 = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, u6_addr32 = {0, 0, 0, 0} } }, skc_cookie = { counter = 0 }, { skc_flags = 256, skc_listener = 0x100, skc_tw_dr = 0x100 }, skc_dontcopy_begin = 0x881fd1ce9b68, { skc_node = { next = 0x0, pprev = 0x0 }, skc_nulls_node = { next = 0x0, pprev = 0x0 } }, skc_tx_queue_mapping = -1, { skc_incoming_cpu = -1, skc_rcv_wnd = 4294967295, skc_tw_rcv_nxt = 4294967295 }, skc_refcnt = { counter = 1 }, skc_dontcopy_end = 0x881fd1ce9b84, { skc_rxhash = 0, skc_window_clamp = 0, skc_tw_snd_nxt = 0 } }, sk_lock = { slock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, owned = 1, wq = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0x881fd1ce9b98, prev = 0x881fd1ce9b98 } } }, sk_receive_queue = { next = 0x881fd1ce9ba8, prev = 0x881fd1ce9ba8, qlen = 0, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sk_backlog = { rmem_alloc = { counter = 0 }, len = 0, head = 0x0, tail = 0x0 }, sk_forward_alloc = 0, sk_txhash = 0, sk_napi_id = 0, sk_ll_usec = 0, sk_drops = { counter = 0 }, sk_rcvbuf = 1
Re: ipv6: handle -EFAULT from skb_copy_bits
On Sat, Dec 17, 2016 at 10:41:20AM -0500, David Miller wrote: > From: Dave Jones <da...@codemonkey.org.uk> > Date: Wed, 14 Dec 2016 10:47:29 -0500 > > > It seems to be possible to craft a packet for sendmsg that triggers > > the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like: > > > > RIP: 0010:[] [] > > rawv6_sendmsg+0xc30/0xc40 > > RSP: 0018:881f6c4a7c18 EFLAGS: 00010282 > > RAX: fff2 RBX: 881f6c681680 RCX: 0002 > > RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00 > > RBP: 881f6c4a7da8 R08: R09: 0009 > > R10: 881fed0f6a00 R11: 0009 R12: 0030 > > R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80 > > > > Call Trace: > > [] ? unmap_page_range+0x693/0x830 > > [] inet_sendmsg+0x67/0xa0 > > [] sock_sendmsg+0x38/0x50 > > [] SYSC_sendto+0xef/0x170 > > [] SyS_sendto+0xe/0x10 > > [] do_syscall_64+0x50/0xa0 > > [] entry_SYSCALL64_slow_path+0x25/0x25 > > > > Handle this in rawv6_push_pending_frames and jump to the failure path. > > > > Signed-off-by: Dave Jones <da...@codemonkey.org.uk> > > Hmmm, that's interesting. Becaue the code in __ip6_append_data(), which > sets up the ->cork.base.length value, seems to be defensively trying to > avoid this possibility. > > For example, it checks things like: > > if (cork->length + length > mtu - headersize && ipc6->dontfrag && > (sk->sk_protocol == IPPROTO_UDP || > sk->sk_protocol == IPPROTO_RAW)) { > > This is why the transport offset plus the length should never exceed > the total length for that skb_copy_bits() call. > > Perhaps this protocol check in the code above is incomplete? Do you > know what the sk->sk_protocol value was when that BUG triggered? That > might shine some light on what is really happening here. I'll see if I can craft up a reproducer next week. For some reason I've not hit this on my test setup at home, but it reproduces daily in our test setup at facebook. The only thing I can think of is that those fb boxes are ipv6 only, so I might be exercising v4 more at home. Dave
ipv6: handle -EFAULT from skb_copy_bits
It seems to be possible to craft a packet for sendmsg that triggers the -EFAULT path in skb_copy_bits, resulting in a BUG_ON that looks like:

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40 RSP: 0018:881f6c4a7c18 EFLAGS: 00010282 RAX: fff2 RBX: 881f6c681680 RCX: 0002 RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00 RBP: 881f6c4a7da8 R08: R09: 0009 R10: 881fed0f6a00 R11: 0009 R12: 0030 R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80

Call Trace: [] ? unmap_page_range+0x693/0x830 [] inet_sendmsg+0x67/0xa0 [] sock_sendmsg+0x38/0x50 [] SYSC_sendto+0xef/0x170 [] SyS_sendto+0xe/0x10 [] do_syscall_64+0x50/0xa0 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 
 	offset += skb_transport_offset(skb);
-	BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+	err = skb_copy_bits(skb, offset, &csum, 2);
+	if (err < 0)
+		goto out;
 
 	/* in case cksum was not initialized */
 	if (unlikely(csum))
netconsole: sleeping function called from invalid context
I think this has been around for a while, but for some reason I'm running into it a lot today. BUG: sleeping function called from invalid context at kernel/irq/manage.c:110 in_atomic(): 1, irqs_disabled(): 1, pid: 1839, name: modprobe no locks held by modprobe/1839. Preemption disabled at: [] write_ext_msg+0x73/0x2d0 CPU: 0 PID: 1839 Comm: modprobe Not tainted 4.9.0-rc8-think+ #5 880442287300 81651e19 8001 88044221d380 006e 880442287338 87c3 88044221d388 8207b940 006e Call Trace: [] dump_stack+0x6c/0x93 [] ___might_sleep+0x193/0x210 [] __might_sleep+0x71/0xe0 [] ? __synchronize_hardirq+0x94/0xa0 [] synchronize_irq+0xa8/0x170 [] ? set_irq_wake_real+0x90/0x90 [] ? synchronize_irq+0x5/0x170 [] ? disable_irq+0x5/0x30 [] disable_irq+0x28/0x30 [] e1000_netpoll+0x1c4/0x200 [] ? e1000_intr_msix_tx+0x190/0x190 [] netpoll_poll_dev+0xa0/0x3b0 [] ? preempt_count_sub+0x18/0xd0 [] netpoll_send_skb_on_dev+0x20d/0x3d0 [] netpoll_send_udp+0x535/0x8c0 [] write_ext_msg+0x286/0x2d0 [] ? check_preemption_disabled+0x3b/0x160 [] call_console_drivers.isra.20.constprop.26+0x165/0x310 [] console_unlock+0x3b6/0x840 [] vprintk_emit+0x4b5/0x6e0 [] vprintk_default+0x48/0x80 [] printk+0xbc/0xe7 [] ? printk_lock.constprop.1+0x102/0x102 [] ? printk+0x5/0xe7 [] ? bt_init+0x1/0xfa [bluetooth] [] bt_info+0xdd/0x110 [bluetooth] [] ? bt_to_errno+0x50/0x50 [bluetooth] [] ? bt_info+0x5/0x110 [bluetooth] [] sco_init+0xb0/0xc40 [bluetooth] [] ? 0xa099 [] bt_init+0x9d/0xfa [bluetooth] [] do_one_initcall+0x199/0x220 [] ? initcall_blacklisted+0x170/0x170 [] ? do_init_module+0xe3/0x2fd [] ? 0xa099 [] ? do_one_initcall+0x5/0x220 [] ? __asan_register_globals+0x7c/0xa0 [] do_init_module+0xf4/0x2fd [] load_module+0x3a79/0x4670 [] ? disable_ro_nx+0x80/0x80 [] ? module_frob_arch_sections+0x20/0x20 [] ? __buffer_unlock_commit+0x4a/0x90 [] ? trace_function+0x9c/0xc0 [] ? function_trace_call+0xea/0x290 [] ? SYSC_finit_module+0x181/0x1c0 [] ? module_frob_arch_sections+0x20/0x20 [] ? get_user_arg_ptr.isra.26+0xa0/0xa0 [] ? load_module+0x5/0x4670 [] SYSC_finit_module+0x181/0x1c0 [] ? SYSC_init_module+0x220/0x220 [] ? function_trace_call+0xea/0x290 [] ? SyS_init_module+0x10/0x10 [] ? SyS_init_module+0x10/0x10 [] ? SyS_finit_module+0x5/0x10 [] ? __this_cpu_preempt_check+0x1c/0x20 [] ? SyS_init_module+0x10/0x10 [] SyS_finit_module+0xe/0x10 [] do_syscall_64+0x100/0x2b0 [] entry_SYSCALL64_slow_path+0x25/0x25
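The complaint in the trace is that e1000_netpoll() ends up in disable_irq(), which implies synchronize_irq() and can sleep, while the netpoll path runs with IRQs disabled. A generic sketch of the non-sleeping alternative follows; the adapter/handler names are illustrative and this is not a tested e1000 patch.

/* disable_irq() waits for in-flight handlers by sleeping, so it cannot be
 * used from the atomic netpoll path.  disable_hardirq() waits for hard-IRQ
 * handlers without sleeping and exists for exactly this kind of caller.
 */
static void example_netpoll(struct net_device *netdev)
{
	struct example_adapter *adapter = netdev_priv(netdev);

	if (disable_hardirq(adapter->irq))
		example_interrupt(adapter->irq, netdev);	/* run the handler once */
	enable_irq(adapter->irq);
}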
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On 12/08/2016 07:30 AM, Eric Dumazet wrote: On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote: Hmmm... I'm not thrilled to have such heuristics, that change memory behavior when half of the queue size (sk->sk_rcvbuf) is reached. Well, copybreak drivers do that unconditionally, even under no stress at all, you really should complain then. Isn't that behaviour based (in part?) on the observation/belief that it is fewer cycles to copy the small packet into a small buffer than to send the larger buffer up the stack and have to allocate and map a replacement? rick jones
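For reference, the copybreak behaviour mentioned above looks roughly like the sketch below in a receive path; the threshold and names are illustrative and not lifted from any particular driver.

/* RX copybreak: for frames below some threshold, copy into a small freshly
 * allocated skb and recycle the large RX buffer, on the theory that the copy
 * is cheaper than handing the big buffer (and its DMA mapping) up the stack.
 */
#define RX_COPYBREAK	256	/* illustrative threshold */

static struct sk_buff *rx_copybreak(struct napi_struct *napi,
				    const void *data, unsigned int len)
{
	struct sk_buff *skb;

	if (len > RX_COPYBREAK)
		return NULL;		/* caller passes up the original buffer */

	skb = napi_alloc_skb(napi, len);
	if (skb)
		memcpy(skb_put(skb, len), data, len);
	return skb;
}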
Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
On 12/02/2016 03:23 PM, Martin KaFai Lau wrote: When XDP prog is attached, it is currently limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514 in x86. AFAICT, since mlx4 is doing one page per packet for XDP, we can at least raise the MTU limitation up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is doing. It will be useful in the next patch which allows XDP program to extend the packet by adding new header(s). Is mlx4 the only driver doing page-per-packet? rick jones
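The arithmetic in the patch description, spelled out: ETH_HLEN is 14 and VLAN_HLEN is 4 in the kernel headers, so with 4096-byte pages the limit moves from 1514 to 4074. The function name below is made up for illustration.

#define ETH_HLEN	14
#define VLAN_HLEN	4

/* Before the patch the first argument is the mlx4 frag size (1536 on x86);
 * after the patch it is PAGE_SIZE, since XDP uses one page per packet anyway.
 */
static unsigned int xdp_max_mtu(unsigned int frag_or_page_size)
{
	return frag_or_page_size - ETH_HLEN - 2 * VLAN_HLEN;
}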
Re: Initial thoughts on TXDP
On 12/01/2016 02:12 PM, Tom Herbert wrote: We have consider both request size and response side in RPC. Presumably, something like a memcache server is most serving data as opposed to reading it, we are looking to receiving much smaller packets than being sent. Requests are going to be quite small say 100 bytes and unless we are doing significant amount of pipelining on connections GRO would rarely kick-in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe 100b request size, 8K, 16K, 64K response sizes might be good for test. No worries on the specific sizes, it is a classic "How long is a piece of string?" sort of question. Not surprisingly, as the size of what is being received grows, so too the delta between GRO on and off. stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; done; done MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Result Tag,Throughput,Local Service Demand "8K gro on",9899.84,35.947 "8K gro off",7299.54,61.097 "16K gro on",8119.38,58.367 "16K gro off",5176.87,95.317 "64K gro on",4429.57,110.629 "64K gro off",2128.58,289.913 "1M gro on",887.85,918.447 "1M gro off",335.97,3427.587 So that gives a feel for by how much this alternative mechanism would have to reduce path-length to maintain the CPU overhead, were the mechanism to preclude GRO. rick
Re: Initial thoughts on TXDP
On 12/01/2016 12:18 PM, Tom Herbert wrote: On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jon...@hpe.com> wrote: Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput: For plain in order TCP packets I believe we should be able process each packet at nearly same speed as GRO. Most of the protocol processing we do between GRO and the stack are the same, the differences are that we need to do a connection lookup in the stack path (note we now do this is UDP GRO and that hasn't show up as a major hit). We also need to consider enqueue/dequeue on the socket which is a major reason to try for lockless sockets in this instance. So waving hands a bit, and taking the service demand for the GRO-on receive test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including ACK generation which unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc etc. Sure, but trying running something emulates a more realistic workload than a TCP stream, like RR test with relative small payload and many connections. That is a good point, which of course is why the RR tests are there in netperf :) Don't get me wrong, I *like* seeing path-length reductions. What would you posit is a relatively small payload? The promotion of IR10 suggests that perhaps 14KB or so is a sufficiently common so I'll grasp at that as the length of a piece of string: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPUCPUS.dem S.dem Send Recv SizeSize TimeRate local remote local remote bytes bytes bytes bytes secs. per sec % S% Uus/Tr us/Tr 16384 87380 128 14336 10.00 8118.31 1.57 -1.00 46.410 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPUCPUS.dem S.dem Send Recv SizeSize TimeRate local remote local remote bytes bytes bytes bytes secs. per sec % S% Uus/Tr us/Tr 16384 87380 128 14336 10.00 5837.35 2.20 -1.00 90.628 -1.000 16384 87380 So, losing GRO doubled the service demand. I suppose I could see cutting path-length in half based on the things you listed which would be bypassed? I'm sure mileage will vary with different NICs and CPUs. The ones used here happened to be to hand. happy benchmarking, rick Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly goes to more than doubling, and halving to 7K narrows the delta: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPUCPUS.dem S.dem Send Recv SizeSize TimeRate local remote local remote bytes bytes bytes bytes secs. 
per sec % S% Uus/Tr us/Tr 16384 87380 128 28672 10.00 6732.32 1.79 -1.00 63.819 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPUCPUS.dem S.dem Send Recv SizeSize TimeRate local remote local remote bytes bytes bytes bytes secs. per sec % S% Uus/Tr us/Tr 16384 87380 128 28672 10.00 3780.47 2.32 -1.00 147.280 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,7K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPUCPUS.dem S.dem Send Recv SizeSize TimeRate local remote local remote bytes bytes bytes bytes secs. per sec % S
Re: Initial thoughts on TXDP
On 12/01/2016 11:05 AM, Tom Herbert wrote: For the GSO and GRO the rationale is that performing the extra SW processing to do the offloads is significantly less expensive than running each packet through the full stack. This is true in a multi-layered generalized stack. In TXDP, however, we should be able to optimize the stack data path such that that would no longer be true. For instance, if we can process the packets received on a connection quickly enough so that it's about the same or just a little more costly than GRO processing then we might bypass GRO entirely. TSO is probably still relevant in TXDP since it reduces overheads processing TX in the device itself. Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv SendRecv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % U us/KB us/KB 87380 16384 1638410.00 9260.24 2.02 -1.000.428 -1.000 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Send Recv SendRecv Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % U us/KB us/KB 87380 16384 1638410.00 5621.82 4.25 -1.001.486 -1.000 And that is still with the stretch-ACKs induced by GRO at the receiver. Losing GRO has quite similar results: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Recv Send RecvSend Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % U us/KB us/KB 87380 16384 1638410.00 9154.02 4.00 -1.000.860 -1.000 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo Recv SendSend Utilization Service Demand Socket Socket Message Elapsed Recv Send RecvSend Size SizeSize Time Throughput localremote local remote bytes bytes bytessecs.10^6bits/s % S % U us/KB us/KB 87380 16384 1638410.00 4212.06 5.36 -1.002.502 -1.000 I'm sure there is a very non-trivial "it depends" component here - netperf will get the peak benefit from *SO and so one will see the peak difference in service demands - but even if one gets only 6 segments per *SO that is a lot of path-length to make-up. 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz And even if one does have the CPU cycles to burn so to speak, the effect on power consumption needs to be included in the calculus. happy benchmarking, rick jones
Re: Netperf UDP issue with connected sockets
On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote: Notice the "fib_lookup" cost is still present, even when I use option "-- -n -N" to create a connected socket. As Eric taught us, this is because we should use syscalls "send" or "write" on a connected socket. In theory, once the data socket is connected, the send_data() call in src/nettest_omni.c is supposed to use send() rather than sendto(). And indeed, based on a quick check, send() is what is being called, though it becomes it seems a sendto() system call - with the destination information NJULL: write(1, "send\n", 5) = 5 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024 write(1, "send\n", 5) = 5 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024 So I'm not sure what might be going-on there. You can get netperf to use write() instead of send() by adding a test-specific -I option. happy benchmarking, rick My udp_flood tool[1] cycle through the different syscalls: taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2 ns/pkt pps cycles/pkt send473.08 2113816.28 1891 sendto 558.58 1790265.84 2233 sendmsg 587.24 1702873.80 2348 sendMmsg/32 547.57 1826265.90 2189 write 518.36 1929175.52 2072 Using "send" seems to be the fastest option. Some notes on test: I've forced TX completions to happen on another CPU0 and pinned the udp_flood program (to CPU2) as I want to avoid the CPU scheduler to move udp_flood around as this cause fluctuations in the results (as it stress the memory allocations more). My udp_flood --pmtu option is documented in the --help usage text (see below signature)
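For completeness, the connected-socket pattern being discussed, as a tiny standalone example; the address, port, payload size, and loop count are placeholders, and error handling is omitted.

/* connect() the datagram socket once so the kernel can cache the route,
 * then use send() instead of sendto() with a destination on every call.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char payload[1024];
	struct sockaddr_in dst = { .sin_family = AF_INET,
				   .sin_port = htons(12867) };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	inet_pton(AF_INET, "198.18.50.1", &dst.sin_addr);
	memset(payload, 'x', sizeof(payload));

	/* One route lookup at connect() time instead of one per packet. */
	connect(fd, (struct sockaddr *)&dst, sizeof(dst));

	for (int i = 0; i < 1000000; i++)
		send(fd, payload, sizeof(payload), 0);

	close(fd);
	return 0;
}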
Re: Netperf UDP issue with connected sockets
On 11/28/2016 10:33 AM, Rick Jones wrote: On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF). Jesper - Top of trunk has a change adding an omni, test-specific -f option which will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket. Is that sufficient to your needs? Usage examples: raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 1472 -f MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET Socket Message Elapsed Messages SizeSize Time Okay Errors Throughput bytes bytessecs# # 10^6bits/sec 2129921472 1.0077495 0 912.35 212992 1.0077495912.35 [1]+ Doneemacs nettest_omni.c raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 14720 -f MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET send_data: data send error: Message too long (errno 90) netperf: send_omni: send_data failed: Message too long happy benchmarking, rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF). Jesper - Top of trunk has a change adding an omni, test-specific -f option which will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket. Is that sufficient to your needs? happy benchmarking, rick
Re: Netperf UDP issue with connected sockets
On 11/17/2016 04:37 PM, Julian Anastasov wrote: On Thu, 17 Nov 2016, Rick Jones wrote: raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472 ... socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0 connected socket can benefit from dst cached in socket but not if SO_DONTROUTE is set. If we do not want to send packets via gateway this -l 1 should help but I don't see IP_TTL setsockopt in your first example with connect() to 127.0.0.1. Also, may be there can be another default, if -l is used to specify TTL then SO_DONTROUTE should not be set. I.e. we should avoid SO_DONTROUTE, if possible. The global -l option specifies the duration of the test. It doesn't specify the TTL of the IP datagrams being generated by the actions of the test. I resisted setting SO_DONTROUTE for a number of years after the first instance of UDP_STREAM being used in link up/down testing took-out a company's network (including security camera feeds to galactic HQ) but at this point I'm likely to keep it in there because there ended-up being a second such incident. It is set only for UDP_STREAM. It isn't set for UDP_RR or TCP_*. And for UDP_STREAM it can be overridden by the test-specific -R option. happy benchmarking, rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 01:44 PM, Eric Dumazet wrote: because netperf sends the same message over and over... Well, sort of, by default. That can be altered to a degree. The global -F option should cause netperf to fill the buffers in its send ring with data from the specified file. The number of buffers in the send ring can be controlled via the global -W option. The number of elements in the ring will default to one more than the initial SO_SNDBUF size divided by the send size. raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472 ... socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0 setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0 open("src/nettest_omni.c", O_RDONLY)= 5 fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000 read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096 read(5, "_INTEGER *intvl_two_ptr = "..., 4096) = 4096 read(5, "interval_count = interval_burst;"..., 4096) = 4096 read(5, ";\n\n/* these will control the wid"..., 4096) = 4096 read(5, "\n LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096 read(5, " , \n "..., 4096) = 4096 ... rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0 rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0 alarm(1)= 0 sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, " used\\n\\\n-m local,remote S"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, " do here but clear the legacy fl"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, "e before we scan the test-specif"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 Of course, it will continue to send the same messages from the send_ring over and over instead of putting different data into the buffers each time, but if one has a sufficiently large -W option specified... happy benchmarking, rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF). Well, I suppose adding another setsockopt() to the data socket creation wouldn't be too difficult, along with another command-line option to cause it to happen. Could we leave things as "make sure you don't need fragmentation when you use this" or would netperf have to start processing ICMP messages? happy benchmarking, rick jones
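The setsockopt() in question is small; a sketch of what it would look like on the data socket is below. The fd is assumed to be the already-created UDP data socket, error handling is up to the caller, and this only covers setting DF - it does not address the ICMP-processing question above.

#include <netinet/in.h>
#include <sys/socket.h>

/* Mark the UDP data socket "don't fragment", so oversized sends fail with
 * EMSGSIZE instead of being fragmented on the wire.
 */
static int set_dont_fragment(int fd)
{
	int val = IP_PMTUDISC_DO;

	return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}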
Re: Netperf UDP issue with connected sockets
On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote: On Wed, 16 Nov 2016 09:46:37 -0800 Rick Jones <rick.jon...@hpe.com> wrote: It is a wild guess, but does setting SO_DONTROUTE affect whether or not a connect() would have the desired effect? That is there to protect people from themselves (long story about people using UDP_STREAM to stress improperly air-gapped systems during link up/down testing) It can be disabled with a test-specific -R 1 option, so your netperf command would become: netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1 Using -R 1 does not seem to help remove __ip_select_ident() Bummer. It was a wild guess anyway, since I was seeing a connect() call on the data socket. Samples: 56K of event 'cycles', Event count (approx.): 78628132661 Overhead CommandShared ObjectSymbol +9.11% netperf[kernel.vmlinux] [k] __ip_select_ident +6.98% netperf[kernel.vmlinux] [k] _raw_spin_lock +6.21% swapper[mlx5_core] [k] mlx5e_poll_tx_cq +5.03% netperf[kernel.vmlinux] [k] copy_user_enhanced_fast_string +4.69% netperf[kernel.vmlinux] [k] __ip_make_skb +4.63% netperf[kernel.vmlinux] [k] skb_set_owner_w +4.15% swapper[kernel.vmlinux] [k] __slab_free +3.80% netperf[mlx5_core] [k] mlx5e_sq_xmit +2.00% swapper[kernel.vmlinux] [k] sock_wfree +1.94% netperfnetperf [.] send_data +1.92% netperfnetperf [.] send_omni_inner Well, the next step I suppose is to have you try a quick netperf UDP_STREAM under strace to see if your netperf binary does what mine did: strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 1 -- -m 1472 -n -N -R 1 And see if you see the connect() I saw. (Note, I make the runtime 1 second) rick
Re: Netperf UDP issue with connected sockets
On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote: [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM - https://www.spinics.net/lists/netdev/msg294752.html Not fixed in version 2.7.0. - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz Used extra netperf configure compile options: ./configure --enable-histogram --enable-demo It seems like some fix attempts exists in the SVN repository:: svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn svn log -r709 # A quick stab at getting remote connect going for UDP_STREAM svn diff -r708:709 Testing with SVN version, still show __ip_select_ident() in top#1. Indeed, there was a fix for getting the remote side connect()ed. Looking at what I have for the top of trunk I do though see a connect() call being made at the local end: socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0 setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0 brk(0xe53000) = 0xe53000 getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0 sendto(3, "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 656, 0, NULL, 0) = 656 select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630}) recvfrom(3, "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 656, 0, NULL, NULL) = 656 write(1, "need to connect is 1\n", 21) = 21 rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0 rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0 alarm(1)= 0 connect(4, {sa_family=AF_INET, sin_port=htons(34832), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024 the only difference there with top of trunk is that "need to connect" write/printf I just put in the code to be a nice marker in the system call trace. It is a wild guess, but does setting SO_DONTROUTE affect whether or not a connect() would have the desired effect? That is there to protect people from themselves (long story about people using UDP_STREAM to stress improperly air-gapped systems during link up/down testing) It can be disabled with a test-specific -R 1 option, so your netperf command would become: netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1 (p.s. is netperf ever going to be converted from SVN to git?) Well my git-fu could use some work (gentle, offlinetaps with a clueful tutorial bat would be welcome), and at least in the past, going to git was held back because there were a bunch of netperf users on Windows and there wasn't (at the time) support for git under Windows. But I am not against the idea in principle. happy benchmarking, rick jones PS - rick.jo...@hp.com no longer works. rick.jon...@hpe.com should be used instead.
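For anyone wanting to poke at this outside of netperf, the essence of the connect()ed-UDP-sender approach in the trace above is simply connecting the data socket before sending, so the kernel can cache the destination/route state on the socket rather than resolving it on every send. A minimal sketch (my own names, not netperf's):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of a connect()ed UDP sender: after connect(), plain send() can
 * be used and the socket carries the cached destination state. */
static ssize_t send_connected(const char *dst_ip, uint16_t port,
                              const char *buf, size_t len)
{
        struct sockaddr_in dst;
        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        if (fd < 0)
                return -1;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(port);
        inet_pton(AF_INET, dst_ip, &dst.sin_addr);

        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                close(fd);
                return -1;
        }

        return send(fd, buf, len, 0);
}

In a real test the socket would of course be created and connected once and then reused for the duration rather than per message.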
Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.
Lets change the example so others don't propagate the problem further. Signed-off-by David Wilder <dwil...@us.ibm.com> --- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800 +++ man7/netlink.7 2016-11-14 13:30:51.002086354 -0800 @@ -511,7 +511,7 @@ .in +4n .nf int len; -char buf[4096]; +char buf[8192]; Since there doesn't seem to be a define one could use in the user space linux/netlink.h (?), but there are comments in the example code in the manpage, how about also including a brief comment to the effect that using 8192 bytes will avoid message truncation problems on platforms with a large PAGE_SIZE? /* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */ or something like that. rick jones
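To make the suggestion concrete, the declaration in the example might end up looking roughly like the fragment below (a sketch in the style of the manpage example, not the exact text that went in; it assumes the example's existing netlink socket fd and the usual <sys/socket.h> / <linux/netlink.h> includes):

int len;
/* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */
char buf[8192];
struct iovec iov = { buf, sizeof(buf) };
struct sockaddr_nl sa;
struct msghdr msg = { &sa, sizeof(sa), &iov, 1, NULL, 0, 0 };

len = recvmsg(fd, &msg, 0);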
Re: [PATCH RFC 0/2] ethtool: Add actual port speed reporting
And besides, one can argue that in the SR-IOV scenario the VF has no business knowing the physical port speed. Good point, but there are more use-cases we should consider. For example, when using Multi-Host/Flex-10/Multi-PF each PF should be able to query both physical port speed and actual speed. Despite my email address, I'm not fully versed on VC/Flex, but I have always been under the impression that the flexnics created were, conceptually, "distinct" NICs considered independently of the physical port over which they operated. Tossing another worm or three into the can, while "back in the day" (when some of the first ethtool changes to report speeds other than the "normal" ones went in) the speed of a flexnic was fixed, today, it can actually operate in a range. From a minimum guarantee to an "if there is bandwidth available" cap. rick jones
Re: [bnx2] [Regression 4.8] Driver loading fails without firmware
On 10/25/2016 08:31 AM, Paul Menzel wrote: To my knowledge, the firmware files haven’t changed since years [1]. Indeed - it looks like I read "bnx2" and thought "bnx2x" Must remember to hold-off on replying until after the morning orange juice is consumed :) rick
Re: [bnx2] [Regression 4.8] Driver loading fails without firmware
On 10/25/2016 07:33 AM, Paul Menzel wrote: Dear Linux folks, A server with the Broadcom devices below, fails to load the drivers because of missing firmware. I have run into the same sort of issue from time to time when going to a newer kernel. A newer version of the driver wants a newer version of the firmware. Usually, finding a package "out there" with the newer version of the firmware, and installing it onto the system is sufficient. happy benchmarking, rick jones
Re: Accelerated receive flow steering (aRFS) for UDP
On 10/10/2016 09:08 AM, Rick Jones wrote: On 10/09/2016 03:33 PM, Eric Dumazet wrote: OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf bug, not a kernel one. I believe I already mentioned fact that "UDP_STREAM -- -N" was not doing a connect() on the receiver side. I can confirm that the receive side of the netperf omni path isn't trying to connect UDP datagrams. I will see what I can put together. I've put something together and pushed it to the netperf top of trunk. It seems to have been successful on a quick loopback UDP_STREAM test. happy benchmarking, rick jones
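For the curious, the essence of connecting the receive side is just a connect() back to the source of the first datagram, which gives the socket the full four-tuple that flow steering wants. A hedged sketch (my illustration, not the actual omni code):

#include <sys/socket.h>

/* Sketch: after the first datagram arrives, connect() the receiving UDP
 * socket to its source so the flow is associated with the socket, which
 * is what RFS/aRFS need in order to steer it. The first datagram's
 * payload is simply consumed here. */
static int connect_udp_receiver(int fd)
{
        struct sockaddr_storage peer;
        socklen_t peer_len = sizeof(peer);
        char first[2048];

        if (recvfrom(fd, first, sizeof(first), 0,
                     (struct sockaddr *)&peer, &peer_len) < 0)
                return -1;

        return connect(fd, (struct sockaddr *)&peer, peer_len);
}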
Re: Accelerated receive flow steering (aRFS) for UDP
On 10/09/2016 03:33 PM, Eric Dumazet wrote: OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf bug, not a kernel one. I believe I already mentioned fact that "UDP_STREAM -- -N" was not doing a connect() on the receiver side. I can confirm that the receive side of the netperf omni path isn't trying to connect UDP datagrams. I will see what I can put together. happy benchmarking, rick jones rick.jon...@hpe.com
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On 09/29/2016 06:18 AM, Eric Dumazet wrote: Well, then what this patch series is solving ? You have a producer of packets running on 8 vcpus in a VM. Packets are exiting the VM and need to be queued on a mq NIC in the hypervisor. Flow X can be scheduled on any of these 8 vcpus, so XPS is currently selecting different TXQ. Just for completeness, in my testing, the VMs were single-vCPU. rick jones
Re: [PATCH RFC 0/4] xfs: Transmit flow steering
Here is a quick look at performance tests for the result of trying the prototype fix for the packet reordering problem with VMs sending over an XPS-configured NIC. In particular, the Emulex/Avago/Broadcom Skyhawk. The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After:  8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors to baremetal E5-2640 with a BE3. Physical MTU was 1500, the VM's vNIC's MTU was 1400. Systems were HPE ProLiants in OS Control Mode for power management, with the "performance" frequency governor loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well. XPS was also enabled on the ConnectX3-Pro. It was not enabled on the 82599ES (a function of the kernel being used, which had it disabled from the first reports of XPS negatively affecting VM traffic at the beginning of the year).

Average Mbit/s From NIC type To Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    9224     9271
BE3, E5-2640                 9016     9022
82599, E5-2640               9192     9003
BCM57840, E5-2640            9213     9153
Skyhawk, E5-2640             3884     8897

For completeness: Average Mbit/s To NIC type from Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    9322     9144
BE3, E5-2640                 9074     9017
82599, E5-2640               8670     8564
BCM57840, E5-2640            2468 *   7979
Skyhawk, E5-2640             8897     9269

* This is the Busted bnx2x NIC FW GRO implementation issue. It was not visible in the "After" because the system was setup to disable the NIC FW GRO by the time it booted on the fix kernel.

Average Transactions/s Between NIC type and Bare Metal BE3:

NIC Type, CPU on VM Host     Before   After
ConnectX-3 Pro, E5-2670v3    12421    12612
BE3, E5-2640                 8178     8484
82599, E5-2640               8499     8549
BCM57840, E5-2640            8544     8560
Skyhawk, E5-2640             8537     8701

happy benchmarking,

Drew Balliet
Jeurg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M  - BE3
544+M - ConnectX-3 Pro
560M  - 82599ES
630M  - BCM57840
650M  - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)

Bulk To (South) and From (North) VM, Before:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored.

Bulk To (South) and From (North) VM, After:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45
South,554
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On 09/19/2016 02:10 PM, Eric Dumazet wrote: On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger <step...@networkplumber.org> wrote: Looks good, but could I suggest a simple optimization. All these parameters are immutable in the version of BBR you are submitting. Why not make the values const? And eliminate the always true long-term bw estimate variable? We could do that. We used to have variables (aka module params) while BBR was cooking in our kernels ;) Are there better than epsilon odds of someone perhaps wanting to poke those values as it gets exposure beyond Google? happy benchmarking, rick jones
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On 09/09/2016 02:53 PM, Mahesh Bandewar wrote: @@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves will not receive nor can send multicast / broadcast traffic. +4.3 L3S mode: + This is very similar to the L3 mode except that iptables conn-tracking +works in this mode and that is why L3-symsetric (L3s) from iptables perspective. +This will have slightly less performance but that shouldn't matter since you +are choosing this mode over plain-L3 mode to make conn-tracking work. What is that first sentence trying to say? It appears to be incomplete, and is that supposed to be "L3-symmetric?" happy benchmarking, rick jones
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 09/08/2016 11:16 AM, Tom Herbert wrote: On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer <bro...@redhat.com> wrote: On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert <t...@herbertland.com> wrote: Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account. The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell cost around 70ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small packets budget is 67.2ns, this with fixed overhead per packet of 70ns we can never reach 10G wirespeed. But you should be able to do this with BQL and it is more accurate. BQL tells how many bytes need to be sent and that can be used to create a bulk of packets to send with one doorbell. With small packets and the "default" ring size for this NIC/driver combination, is the BQL large enough that the ring fills before one hits the BQL? rick jones
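For reference, the doorbell batching being discussed usually takes roughly the following shape in a driver's xmit path, with the tailptr/doorbell write deferred while the stack says more packets are coming and BQL has not stopped the queue. This is a generic sketch against the ~4.8-era API, not any particular driver; ring_doorbell() stands in for the device-specific MMIO write:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Stand-in for the device-specific tailptr/doorbell MMIO write. */
static void ring_doorbell(struct net_device *dev) { }

static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq = skb_get_tx_queue(dev, skb);

        /* ... post skb's descriptors to the hardware ring here ... */

        netdev_tx_sent_queue(txq, skb->len);            /* feed BQL */

        /* Only ring the doorbell when no further skbs are pending, or
         * when BQL / a full ring has stopped the queue. */
        if (!skb->xmit_more || netif_xmit_stopped(txq))
                ring_doorbell(dev);

        return NETDEV_TX_OK;
}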
Re: ipv6: release dst in ping_v6_sendmsg
On Tue, Sep 06, 2016 at 10:52:43AM -0700, Eric Dumazet wrote: > > > @@ -126,8 +126,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct > > > msghdr *msg, size_t len) > > > rt = (struct rt6_info *) dst; > > > > > > np = inet6_sk(sk); > > > -if (!np) > > > -return -EBADF; > > > +if (!np) { > > > +err = -EBADF; > > > +goto dst_err_out; > > > +} > > > > > > if (!fl6.flowi6_oif && ipv6_addr_is_multicast()) > > > fl6.flowi6_oif = np->mcast_oif; > > > @@ -163,6 +165,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct > > > msghdr *msg, size_t len) > > > } > > > release_sock(sk); > > > > > > +dst_err_out: > > > +dst_release(dst); > > > + > > > if (err) > > > return err; > > > > > > > Acked-by: Martin KaFai Lau> > This really does not make sense to me. > > If np was NULL, we should have a crash before. In the case where I was seeing the traces, we were taking the 'success' path through the function, so sk was non-null. > So we should remove this test, since it is absolutely useless. Looking closer, it seems the assignment of np is duplicated also, so that can also go. This is orthogonal to the dst leak though. I'll submit a follow-up cleaning that up. Dave
ipv6: release dst in ping_v6_sendmsg
Neither the failure nor the success paths of ping_v6_sendmsg release the dst it acquires. This leads to a flood of warnings from "net/core/dst.c:288 dst_release" on older kernels that don't have 8bf4ada2e21378816b28205427ee6b0e1ca4c5f1 backported. That patch optimistically hoped this had been fixed post 3.10, but it seems at least one case wasn't, where I've seen this triggered a lot from machines doing unprivileged icmp sockets. Cc: Martin Lau <ka...@fb.com> Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 0900352c924c..0e983b694ee8 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -126,8 +126,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	rt = (struct rt6_info *) dst;
 
 	np = inet6_sk(sk);
-	if (!np)
-		return -EBADF;
+	if (!np) {
+		err = -EBADF;
+		goto dst_err_out;
+	}
 
 	if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
 		fl6.flowi6_oif = np->mcast_oif;
@@ -163,6 +165,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 	release_sock(sk);
 
+dst_err_out:
+	dst_release(dst);
+
 	if (err)
 		return err;
Re: [PATCH] softirq: let ksoftirqd do its job
On 08/31/2016 04:11 PM, Eric Dumazet wrote: On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote: With regard to drops, are both of you sure you're using the same socket buffer sizes? Does it really matter ? At least at points in the past I have seen different drop counts at the SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis I was operating under at the time was that this dealt with those situations where the netserver was held-off from running for "a little while" from time to time. It didn't change things for a sustained overload situation though. In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? TCP_RR is driven by the network latency, we do not drop packets in the socket itself. I've been of the opinion it (single stream) is driven by path length. Sometimes by NIC latency. But then I'm almost always measuring in the LAN rather than across the WAN. happy benchmarking, rick
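One quick way to check whether both setups really are running with the same receive buffer is to ask the kernel directly; note that Linux reports back twice the requested value (to cover bookkeeping overhead) and caps the request at net.core.rmem_max:

#include <stdio.h>
#include <sys/socket.h>

/* Request a receive buffer size and report what the kernel granted. */
static void show_rcvbuf(int fd, int requested)
{
        int granted;
        socklen_t len = sizeof(granted);

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
        printf("asked for %d, kernel granted %d\n", requested, granted);
}

(netperf itself will do this when given the test-specific -s/-S options, which is probably the easier way to pin the sizes for a comparison.)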
Re: [PATCH] softirq: let ksoftirqd do its job
With regard to drops, are both of you sure you're using the same socket buffer sizes? In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? happy benchmarking, rick jones
Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
On 08/27/2016 12:41 PM, Tom Herbert wrote: On Fri, Aug 26, 2016 at 9:35 PM, David Miller <da...@davemloft.net> wrote: From: Tom Herbert <t...@herbertland.com> Date: Thu, 25 Aug 2016 16:43:35 -0700 This seems like it will only confuse users even more. You've clearly identified an issue, let's figure out how to fix it. I kinda feel the same way about this situation. I'm working on XFS (as the transmit analogue to RFS). We'll track flows enough so that we should know when it's safe to move them. Is the XFS you are working on going to subsume XPS or will the two continue to exist in parallel a la RPS and RFS? rick jones
[PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
From: Rick Jones <rick.jon...@hpe.com> Since XPS was first introduced two things have happened. Some drivers have started enabling XPS on their own initiative, and it has been found that when a VM is sending data through a host interface with XPS enabled, that traffic can end-up seriously out of order. Signed-off-by: Rick Jones <rick.jon...@hpe.com> Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com> --- diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index 59f4db2..50cc888 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay appropriately. TCP, for instance, sets the flag when all data for a connection has been acknowledged. +When the traffic source is a VM running on the host, there is no +socket structure known to the host. In this case, unless the VM is +itself CPU-pinned, the traffic being sent from it can end-up queued to +multiple transmit queues and end-up being transmitted out of order. + +In some cases this can result in a considerable loss of performance. + +In such situations, XPS should not be enabled at runtime, or +explicitly disabled if the NIC driver(s) in question enable it on +their own. Otherwise, if possible, the VMs should be CPU pinned. + XPS Configuration -XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by -default for SMP). The functionality remains disabled until explicitly -configured. To enable XPS, the bitmap of CPUs that may use a transmit -queue is configured using the sysfs file entry: +XPS is available only if the kconfig symbol CONFIG_XPS is enabled +prior to building the kernel. It is enabled by default for SMP kernel +configurations. In many cases the functionality remains disabled at +runtime until explicitly configured by the system administrator. To +enable XPS, the bitmap of CPUs that may use a transmit queue is +configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus +However, some NIC drivers will configure XPS at runtime for the +interfaces they drive, via a call to netif_set_xps_queue. + == Suggested Configuration For a network device with a single transmission queue, XPS configuration
[PATCH net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
From: Rick Jones <rick.jon...@hpe.com> Since XPS was first introduced two things have happened. Some drivers have started enabling XPS on their own initiative, and it has been found that when a VM is sending data through a host interface with XPS enabled, that traffic can end-up seriously out of order. Signed-off-by: Rick Jones <rick.jon...@hpe.com> --- diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index 59f4db2..50cc888 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay appropriately. TCP, for instance, sets the flag when all data for a connection has been acknowledged. +When the traffic source is a VM running on the host, there is no +socket structure known to the host. In this case, unless the VM is +itself CPU-pinned, the traffic being sent from it can end-up queued to +multiple transmit queues and end-up being transmitted out of order. + +In some cases this can result in a considerable loss of performance. + +In such situations, XPS should not be enabled at runtime, or +explicitly disabled if the NIC driver(s) in question enable it on +their own. Othersise, if possible, the VMs should be CPU pinned. + XPS Configuration -XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by -default for SMP). The functionality remains disabled until explicitly -configured. To enable XPS, the bitmap of CPUs that may use a transmit -queue is configured using the sysfs file entry: +XPS is available only if the kconfig symbol CONFIG_XPS is enabled +prior to building the kernel. It is enabled by default for SMP kernel +configurations. In many cases the functionality remains disabled at +runtime until explicitly configured by the system administrator. To +enable XPS, the bitmap of CPUs that may use a transmit queue is +configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus +However, some NIC drivers will configure XPS at runtime for the +interfaces they drive, via a call to netif_set_xps_queue. + == Suggested Configuration For a network device with a single transmission queue, XPS configuration
Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping
On 08/25/2016 02:08 PM, Eric Dumazet wrote: When XPS was submitted, it was _not_ enabled by default and 'magic' Some NIC vendors decided it was a good thing, you should complain to them ;) I kindasorta am with the emails I've been sending to netdev :) And also hopefully precluding others going down that path. happy benchmarking, rick
Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping
On 08/25/2016 12:49 PM, Eric Dumazet wrote: On Thu, 2016-08-25 at 12:23 -0700, Alexander Duyck wrote: A simpler approach is provided with this patch. With it we disable XPS any time a socket is not present for a given flow. By doing this we can avoid using XPS for any routing or bridging situations in which XPS is likely more of a hinderance than a help. Yes, but this will destroy isolation for people properly doing VM cpu pining. Why not simply stop enabling XPS by default. Treat it like RPS and RFS (unless I've missed a patch...). The people who are already doing the extra steps to pin VMs can enable XPS in that case. It isn't clear that one should always pin VMs - for example if a (public) cloud needed to oversubscribe the cores. happy benchmarking, rick jones
Re: A second case of XPS considerably reducing single-stream performance
On 08/25/2016 12:19 PM, Alexander Duyck wrote: The problem is that there is no socket associated with the guest from the host's perspective. This is resulting in the traffic bouncing between queues because there is no saved socket to lock the interface onto. I was looking into this recently as well and had considered a couple of options. The first is to fall back to just using skb_tx_hash() when skb->sk is null for a given buffer. I have a patch I have been toying around with but I haven't submitted it yet. If you would like I can submit it as an RFC to get your thoughts. The second option is to enforce the use of RPS for any interfaces that do not perform Rx in NAPI context. The correct solution for this is probably some combination of the two as you have to have all queueing done in order at every stage of the packet processing. I don't know with interfaces would be hit, but just in general, I'm not sure that requiring RPS be enabled is a good solution - picking where traffic is processed based on its addressing is fine in a benchmarking situation, but I think it is better to have the process/thread scheduler decide where something should run and not the addressing of the connections that thread/process is servicing. I would be interested in seeing the RFC patch you propose. Apart from that, given the prevalence of VMs these days I wonder if perhaps simply not enabling XPS by default isn't a viable alternative. I've not played with containers to know if they would exhibit this too. Drifting ever so slightly, if drivers are going to continue to enable XPS by default, Documentation/networking/scaling.txt might use a tweak: diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/sca index 59f4db2..8b5537c 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -402,10 +402,12 @@ acknowledged. XPS Configuration -XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by -default for SMP). The functionality remains disabled until explicitly -configured. To enable XPS, the bitmap of CPUs that may use a transmit -queue is configured using the sysfs file entry: +XPS is available only when the kconfig symbol CONFIG_XPS is enabled +(on by default for SMP). The drivers for some NICs will enable the +functionality by default. For others the functionality remains +disabled until explicitly configured. To enable XPS, the bitmap of +CPUs that may use a transmit queue is configured using the sysfs file +entry: /sys/class/net//queues/tx-/xps_cpus The original wording leaves the impression that XPS is not enabled by default. rick
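Purely as an illustration of the first option described above (falling back when there is no socket), and not Alexander's actual patch, the transmit queue selection might be gated roughly as follows; get_xps_queue() stands in for the existing XPS lookup in net/core/dev.c:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Stand-in for the XPS map lookup; returns -1 when XPS has no opinion. */
static int get_xps_queue(struct net_device *dev, struct sk_buff *skb);

/* Sketch: only honour XPS when the skb has a socket that can record the
 * queue mapping (and set ooo_okay); otherwise fall back to the flow hash
 * so a given flow stays on one queue. */
static u16 sketch_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
{
        int queue_index = -1;

        if (skb->sk)
                queue_index = get_xps_queue(dev, skb);

        if (queue_index < 0)
                queue_index = skb_tx_hash(dev, skb);

        return queue_index;
}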
Re: A second case of XPS considerably reducing single-stream performance
Also, while it doesn't seem to have the same massive effect on throughput, I can also see out of order behaviour happening when the sending VM is on a node with a ConnectX-3 Pro NIC. Its driver is also enabling XPS it would seem. I'm not *certain* but looking at the traces it appears that with the ConnectX-3 Pro there is more interleaving of the out-of-order traffic than there is with the Skyhawk. The ConnectX-3 Pro happens to be in a newer generation server with a newer processor than the other systems where I've seen this. I do not see the out-of-order behaviour when the NIC at the sending end is a BCM57840. It does not appear that the bnx2x driver in the 4.4 kernel is enabling XPS. So, it would seem that there are three cases of enabling XPS resulting in out-of-order traffic, two of which result in a non-trivial loss of performance. happy benchmarking, rick jones
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On 08/24/2016 10:23 AM, Eric Dumazet wrote: From: Eric Dumazet <eduma...@google.com> per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++; Is it possible it is non-trivially slower on other architectures? rick jones Signed-off-by: Eric Dumazet <eduma...@google.com> --- include/net/sch_generic.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch) static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch) { - qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats)); + this_cpu_inc(sch->cpu_qstats->drops); } static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
A second case of XPS considerably reducing single-stream performance
Back in February of this year, I reported some performance issues with the ixgbe driver enabling XPS by default and instance network performance in OpenStack:

http://www.spinics.net/lists/netdev/msg362915.html

I've now seen the same thing with be2net and Skyhawk. In this case, the magnitude of the delta is even greater. Disabling XPS increased the netperf single-stream performance out of the instance by 116%, from an average of 4108 Mbit/s.

Should drivers really be enabling XPS by default?

Instance To Outside World Single-stream netperf
~30 Samples for Each Statistic
Mbit/s

            Skyhawk             BE3 #1              BE3 #2
            XPS On    XPS Off   XPS On    XPS Off   XPS On    XPS Off
Median      4192      8883      8930      8853      8917      8695
Average     4108                8940      8859      8885      8671

happy benchmarking,

rick jones

The sample counts below may not fully support the additional statistics but for the curious:

raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 6 waxon_performance.log -f 2
Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8758.850,8811.600,8930.900,8940.555,9096.470,9175.839,9183.690,31
be3-2,8588.450,8736.967,8917.075,8885.322,9017.914,9075.735,9094.620,32
skyhawk,3326.760,3536.008,4192.780,4108.513,4651.164,4723.322,4724.320,27
0 too-short lines ignored.

raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 6 waxoff_performance.log -f 2
Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8461.080,8634.690,8853.260,8859.870,9064.480,9247.770,9253.050,31
be3-2,7519.130,8368.564,8695.140,8671.241,9068.588,9200.719,9241.500,27
skyhawk,8071.180,8651.587,8883.340,.411,9135.603,9141.229,9142.010,32
0 too-short lines ignored.

"waxon" is with XPS enabled, "waxoff" is with XPS disabled. The servers are the same models/config as in February.

stack@np-cp1-comp0013-mgmt:~$ sudo ethtool -i hed3
driver: be2net
version: 10.6.0.3
firmware-version: 10.7.110.45
e1000: __pskb_pull_tail failed
My NFS server running 4.8-rc1 is getting flooded with this message:

e1000e 0000:00:19.0 eth0: __pskb_pull_tail failed.

Never saw it happen with 4.7 or earlier. That device is this onboard NIC:

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V

Dave
Re: [PATCH net 1/2] tg3: Fix for diasllow rx coalescing time to be 0
On 08/02/2016 09:13 PM, skallam wrote: From: Satish Baddipadige <satish.baddipad...@broadcom.com> When the rx coalescing time is 0, interrupts are not generated from the controller and rx path hangs. To avoid this rx hang, updating the driver to not allow rx coalescing time to be 0. Signed-off-by: Satish Baddipadige <satish.baddipad...@broadcom.com> Signed-off-by: Siva Reddy Kallam <siva.kal...@broadcom.com> Signed-off-by: Michael Chan <michael.c...@broadcom.com> --- drivers/net/ethernet/broadcom/tg3.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c index ff300f7..f3c6c91 100644 --- a/drivers/net/ethernet/broadcom/tg3.c +++ b/drivers/net/ethernet/broadcom/tg3.c @@ -14014,6 +14014,7 @@ static int tg3_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec) } if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) || + (!ec->rx_coalesce_usecs) || (ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) || (ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) || (ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) || Should anything then happen with: /* No rx interrupts will be generated if both are zero */ if ((ec->rx_coalesce_usecs == 0) && (ec->rx_max_coalesced_frames == 0)) return -EINVAL; which is the next block of code? The logic there seems to suggest that it was intended to be able to have an rx_coalesce_usecs of 0 and rely on packet arrival to trigger an interrupt. Presumably setting rx_max_coalesced_frames to 1 to disable interrupt coalescing. happy benchmarking, rick jones
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/08/2016 01:01 AM, Nicolas Dichtel wrote: Those 300 routers will each have at least one namespace along with the dhcp namespaces. Depending on the nature of the routers (Distributed versus Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed to be "HA" there can be more than one namespace for a given router. 300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s). Thank you for the details. Do you have a script or something else to easily reproduce this problem? Do you mean for my much older, slightly different stuff done in HP Public Cloud, or for what Phil (?) is doing presently? I believe Phil posted something several messages back in the thread. happy benchmarking, rick jones
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/07/2016 09:34 AM, Eric W. Biederman wrote: Rick Jones <rick.jon...@hpe.com> writes: 300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s). To clarify processes for these routers and dhcp servers are created with "ip netns exec"? I believe so, but it would be good to have someone else confirm that, and speak to your paragraph below. If that is the case and you are using this feature as effectively a lightweight container and not lots vrfs in a single network stack then I suspect much larger gains can be had by creating a variant of ip netns exec avoids the mount propagation. ... * Didn't want to go much higher than that because each router had a port on a common linux bridge and getting to > 1024 would be an unpleasant day. * I would have thought all you have to do is bump of the size of the linux neighbour cache. echo $BIGNUM > /proc/sys/net/ipv4/neigh/default/gc_thresh3 We didn't want to hit the 1024 port limit of a (then?) Linux bridge. rick Having a bit of deja vu but I suspect things like commit 0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same wavelength, just my brain seeing "namespaces" and "performance" and lighting-up :)
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/07/2016 08:48 AM, Phil Sutter wrote: On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote: Le 07/07/2016 13:17, Phil Sutter a écrit : [snip] The issue came up during OpenStack Neutron testing, see this ticket for reference: https://bugzilla.redhat.com/show_bug.cgi?id=1310795 Access to this ticket is not public :( *Sigh* OK, here are a few quotes: "OpenStack Neutron controller nodes, when undergoing testing, are locking up specifically during creation and mounting of namespaces. They appear to be blocking behind vfsmount_lock, and contention for the namespace_sem" "During the scale testing, we have 300 routers, 600 dhcp namespaces spread across four neutron network nodes. When then start as one set of standard Openstack Rally benchmark test cycle against neutron. An example scenario is creating 10x networks, list them, delete them and repeat 10x times. The second set performs an L3 benchmark test between two instances." Those 300 routers will each have at least one namespace along with the dhcp namespaces. Depending on the nature of the routers (Distributed versus Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed to be "HA" there can be more than one namespace for a given router. 300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s). happy benchmarking, rick jones * Didn't want to go much higher than that because each router had a port on a common linux bridge and getting to > 1024 would be an unpleasant day.
Re: strange Mac OSX RST behavior
On 07/01/2016 08:10 AM, Jason Baron wrote: I'm wondering if anybody else has run into this... On Mac OSX 10.11.5 (latest version), we have found that when tcp connections are abruptly terminated (via ^C), a FIN is sent followed by an RST packet. That just seems, well, silly. If the client application wants to use abortive close (sigh..) it should do so, there shouldn't be this little-bit-pregnant, correct close initiation (FIN) followed by a RST. The RST is sent with the same sequence number as the FIN, and thus dropped since the stack only accepts RST packets matching rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with an RST on the closed socket, but it appears that it does not. The workaround here is then to reset the connection, if the RST is is equal to rcv_nxt - 1, if we have already received a FIN. The RST attack surface is limited b/c we only accept the RST after we've accepted a FIN and have not previously sent a FIN and received back the corresponding ACK. In other words RST is only accepted in the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING. I'm interested if anybody else has run into this issue. Its problematic since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT. Isn't the server application expected to act on the read return of zero (which is supposed to be) triggered by the receipt of the FIN segment? rick jones We are also in the process of contacting Apple to see what can be done here...workaround patch is below.
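A heavily hedged sketch of the acceptance check described above: only honour an RST sitting at rcv_nxt - 1 once the peer's FIN has already been accepted, and only in the listed states (CLOSE_WAIT, LAST_ACK, CLOSING). Field and helper names here are illustrative, not the actual workaround patch:

#include <net/tcp.h>

static bool rst_after_fin_acceptable(const struct sock *sk,
                                     struct sk_buff *skb)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        if (sk->sk_state != TCP_CLOSE_WAIT &&
            sk->sk_state != TCP_LAST_ACK &&
            sk->sk_state != TCP_CLOSING)
                return false;

        return TCP_SKB_CB(skb)->seq == tp->rcv_nxt - 1;
}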
Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets
On 06/28/2016 02:59 AM, Dexuan Cui wrote: The idea here is: IMO the syscalls sys_read()/write() shoudn't return -ENOMEM, so I have to make sure the buffer allocation succeeds? I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in in mm/page_alloc.c: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); What error code do you think I should return? EAGAIN, ERESTARTSYS, or something else? May I have your suggestion? Thanks! What happens as far as errno is concerned when an application makes a read() call against a (say TCP) socket associated with a connection which has been reset? Is it limited to those errno values listed in the read() manpage, or does it end-up getting an errno value from those listed in the recv() manpage? Or, perhaps even one not (presently) listed in either? rick jones
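On the TCP side of that question, at least on Linux a read() on a socket whose connection has been reset fails with ECONNRESET even though that value is documented under recv(2) rather than read(2), so callers already have to cope with the union of the two lists. A trivial illustration (hedged; exactly what is seen depends on timing and whether data was pending):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read from a connected socket fd and report what came back. */
static void read_and_report(int fd)
{
        char buf[128];
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n < 0)
                printf("read: %s (errno %d)\n", strerror(errno), errno);
        else if (n == 0)
                printf("read: EOF (orderly shutdown)\n");
        else
                printf("read: %zd bytes\n", n);
}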
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 04:43 PM, Tom Herbert wrote: Here's Christoph's slides on TFO in the wild which presents a good summary of the middlebox problem. There is one significant difference in that ECN needs network support whereas TFO didn't. Given that experience, I'm doubtful other new features at L4 could ever be productively use (like EDO or maybe TCP-ENO). https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Perhaps I am being overly optimistic, but my takeaway from those slides is Apple were able to come-up with ways to deal with the middleboxes and so could indeed productively use TCP FastOpen. "Overall, very good success-rate" though tempered by "But... middleboxes were a big issue in some ISPs..." Though it doesn't get into how big (some connections, many, most, all?) and how many ISPs. rick jones Just an anecdote... Not that I am a "power user" of my iPhone running 9.3.2 (13F69) nor that I know that anything I am using is the Apple Service stated as using TFO (mostly Safari, Mail and Messages) but if it is, I cannot say that any troubles under the covers have been noticed by me.
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 02:46 PM, Tom Herbert wrote: On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jon...@hpe.com> wrote: How would you define "severely?" Has it actually been more severe than for say ECN? Or it was for say SACK or PAWS? ECN is probably even a bigger disappointment in terms of seeing deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf: "Even though ECN was standardized in 2001, and it is widely implemented in end-systems, it is barely deployed. This is due to a history of problems with severely broken middleboxes shortly after standardization, which led to connectivity failure and guidance to leave ECN disabled." SACK and PAWS seemed to have faired a little better I believe. The conclusion of that (rather interesting) paper reads: "Our analysis therefore indicates that enabling ECN by default would lead to connections to about five websites per thousand to suffer additional setup latency with RFC 3168 fallback. This represents an order of magnitude fewer than the about forty per thousand which experience transient or permanent connection failure due to other operational issues" Doesn't that then suggest that not enabling ECN is basically a matter of FUD more than remaining assumed broken middleboxes? My main point is that in the past at least, trouble with broken middleboxes didn't lead us to start wrapping all our TCP/transport traffic in UDP to try to hide it from them. We've managed to get SACK and PAWS universal without having to resort to that, and it would seem we could get ECN universal if we could overcome our FUD. Why would TFO for instance be any different? There was an equally interesting second paragraph in the conclusion: "As not all websites are equally popular, failures on five per thousand websites does not by any means imply that five per thousand connection attempts will fail. While estimation of connection attempt rate by rank is out of scope of this work, we note that the highest ranked website exhibiting stable connection failure has rank 596, and only 13 such sites appear in the top 5000" rick jones
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 02:12 PM, Tom Herbert wrote: The client OS side is only part of the story. Middlebox intrusion at L4 is also a major issue we need to address. The "failure" of TFO is a good case study. Both the upgrade issues on clients and the tendency for some middleboxes to drop SYN packets with data have together severely hindered what otherwise should have been straightforward and useful feature to deploy. How would you define "severely?" Has it actually been more severe than for say ECN? Or it was for say SACK or PAWS? rick jones
[4.6] kernel BUG at net/ipv6/raw.c:592
Found this logs after a Trinity run. kernel BUG at net/ipv6/raw.c:592! [ cut here ] invalid opcode: [#1] SMP Modules linked in: udp_diag dccp_ipv6 dccp_ipv4 dccp sctp af_key tcp_diag inet_diag ip6table_filter xt_NFLOG nfnetlink_log xt_comment xt_statistic iptable_filter nfsv3 nfs_acl nfs fscache lockd grace autofs4 i2c_piix4 rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc loop dummy ipmi_devintf iTCO_wdt iTCO_vendor_support acpi_cpufreq efivars ipmi_si ipmi_msghandler i2c_i801 i2c_core sg lpc_ich mfd_core button CPU: 2 PID: 28854 Comm: trinity-c23 Not tainted 4.6.0 #1 Hardware name: Quanta Leopard-DDR3/Leopard-DDR3, BIOS F06_3A14.DDR3 05/13/2015 task: 880459cab600 ti: 880747bc4000 task.ti: 880747bc4000 RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40 RSP: 0018:880747bc7bf8 EFLAGS: 00010282 RAX: fff2 RBX: 88080c6f2d00 RCX: 0002 RDX: 880747bc7cd8 RSI: 0030 RDI: 8803de801500 RBP: 880747bc7d90 R08: 002d R09: 0009 R10: 8803de801500 R11: 0009 R12: 0030 R13: 8803de801500 R14: 88086d67e000 R15: 88046bdac480 FS: 7fe29c566700() GS:88046fa4() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 01f0f2c0 CR3: 00080b99d000 CR4: 001406e0 Stack: 88086d67e000 880747bc7d18 88046bdac480 8804 880747bc7c68 88086d67e000 8808002d 88080009 0001 Call Trace: [] ? page_fault+0x22/0x30 [] ? bad_to_user+0x6a/0x6fa [] inet_sendmsg+0x67/0xa0 [] sock_sendmsg+0x38/0x50 [] sock_write_iter+0x78/0xd0 [] __vfs_write+0xaa/0xe0 [] vfs_write+0xa2/0x1a0 [] SyS_write+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x13/0x8f Code: 23 f7 ff ff f7 d0 41 01 c0 41 83 d0 00 e9 ac fd ff ff 48 8b 44 24 48 48 8b 80 c0 01 00 00 65 48 ff 40 28 8b 51 78 d0 41 01 c0 41 83 d0 00 e9 ac fd ff ff 48 8b 44 24 48 48 8b 80 c0 01 00 00 65 48 ff 40 28 8b 51 78 e9 64 fe ff ff <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RIP [] rawv6_sendmsg+0xc30/0xc40 RSP 590 591 offset += skb_transport_offset(skb); 592 BUG_ON(skb_copy_bits(skb, offset, , 2)); 593
Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
On 06/22/2016 04:10 PM, Rick Jones wrote: My systems are presently in the midst of an install but I should be able to demonstrate it in the morning (US Pacific time, modulo the shuttle service of a car repair place) The installs finished sooner than I thought. So, receiver: root@np-cp1-comp0001-mgmt:/home/stack# uname -a Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon May 23 15:39:22 UTC 2016 x86_64 GNU/Linux root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2 driver: bnx2x version: 1.712.30-0 firmware-version: bc 7.10.10 bus-info: :05:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840: 05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 10/20-Gigabit Ethernet (rev 11) Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M Adapter (The pci.ids entry being from before that 10 GbE IP was purchased from Broadcom by QLogic...) Verify that LRO is disabled (IIRC it is enabled by default): root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large large-receive-offload: off Verify that disable_tpa is not set: root@np-cp1-comp0001-mgmt:/home/stack# cat /sys/module/bnx2x/parameters/disable_tpa 0 So this means we will see NIC-firmware GRO. Start a tcpdump on the receiver: root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 200 -i hed2 -w foo.pcap port 12867 tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 bytes Start a netperf test targeting that system, specifying a smaller MSS: stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 1400 -P 12867 -O throughput,transport_mss MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-comp0001-guest () port 12867 AF_INET : demo Throughput Transport MSS bytes 3372.821388 Come back to the receiver and post-process the tcpdump capture to get the average segment size for the data segments: 200 packets captured 2000916 packets received by filter 0 packets dropped by kernel root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v "length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}' reading from file foo.pcap, link-type EN10MB (Ethernet) Average: 2741.93 and finally a snippet of the capture: 00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 0,nop,wscale 7], length 0 00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS val 1499053 ecr 1491134,nop,wscale 7], length 0 00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0 00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0 00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0 00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 5553:8329, ack 1, win 219, 
options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0 00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0 00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0 00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776 00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 19433:22209, ack 1, win 219, options [nop,nop,TS val 149