bond: take rcu lock in netpoll_send_skb_on_dev

2018-09-28 Thread Dave Jones
The bonding driver lacks the rcu lock when it calls down into
netdev_lower_get_next_private_rcu from bond_poll_controller, which
results in a trace like:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

We're also doing rcu dereferences a layer up in netpoll_send_skb_on_dev
before we call down into netpoll_poll_dev, so just take the lock there.

Suggested-by: Cong Wang 
Signed-off-by: Dave Jones 

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..692367d7c280 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
/* It is up to the caller to keep npinfo alive. */
struct netpoll_info *npinfo;
 
+   rcu_read_lock_bh();
lockdep_assert_irqs_disabled();
 
npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
skb_queue_tail(&npinfo->txq, skb);
schedule_delayed_work(&npinfo->tx_work,0);
}
+   rcu_read_unlock_bh();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);
 


Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones
On Fri, Sep 28, 2018 at 12:03:22PM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 12:02 PM Cong Wang  wrote:
 > >
 > > On Fri, Sep 28, 2018 at 11:26 AM Dave Jones  
 > > wrote:
 > > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
 > > > index 3219a2932463..4f9494381635 100644
 > > > --- a/net/core/netpoll.c
 > > > +++ b/net/core/netpoll.c
 > > > @@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, 
 > > > struct sk_buff *skb,
 > > > /* It is up to the caller to keep npinfo alive. */
 > > > struct netpoll_info *npinfo;
 > > >
 > > > +   rcu_read_lock();
 > > > lockdep_assert_irqs_disabled();
 > > >
 > > > npinfo = rcu_dereference_bh(np->dev->npinfo);
 > >
 > > I think you probably need rcu_read_lock_bh() to satisfy
 > > rcu_deference_bh()...
 > 
 > But irq is disabled here, so not sure if rcu_read_lock_bh()
 > could cause trouble... Interesting...

I was wondering for a moment why I never got a warning, then I
remembered I disabled lockdep for that machine because nfs spews stuff.

I'll double-check and post v4. lol, this looked like a two-minute fix at first.

Dave


bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like the one below is shown:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Suggested-by: Cong Wang 
Signed-off-by: Dave Jones 

-- 
v3: Do this in netpoll_send_skb_on_dev as Cong suggests.

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463..4f9494381635 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -330,6 +330,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
/* It is up to the caller to keep npinfo alive. */
struct netpoll_info *npinfo;
 
+   rcu_read_lock();
lockdep_assert_irqs_disabled();
 
npinfo = rcu_dereference_bh(np->dev->npinfo);
@@ -374,6 +375,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
skb_queue_tail(&npinfo->txq, skb);
schedule_delayed_work(&npinfo->tx_work,0);
}
+   rcu_read_unlock();
 }
 EXPORT_SYMBOL(netpoll_send_skb_on_dev);
 


Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones
On Fri, Sep 28, 2018 at 10:31:39AM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 10:25 AM Dave Jones  wrote:
 > >
 > > On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
 > >  > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  
 > > wrote:
 > >  > >
 > >  > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
 > >  > > otherwise a trace like below is shown
 > >  >
 > >  > So why not take rcu read lock in netpoll_send_skb_on_dev() where
 > >  > RCU is also assumed?
 > >
 > > that does seem to solve the backtrace spew I saw too, so if that's
 > > preferable I can respin the patch.
 > 
 > 
 > From my observations, netpoll_send_skb_on_dev() does not take
 > RCU read lock _and_ it relies on rcu read lock because it calls
 > rcu_dereference_bh().
 > 
 > If my observation is correct, you should catch a RCU warning like
 > this but within netpoll_send_skb_on_dev().
 >
 > >  > As I said, I can't explain why you didn't trigger the RCU warning in
 > >  > netpoll_send_skb_on_dev()...
 > >
 > > netpoll_send_skb_on_dev takes the rcu lock itself.
 > 
 > Could you please point me where exactly is the rcu lock here?
 > 
 > I am too stupid to see it. :)

No, I'm the stupid one. I looked at the tree I had just edited to try your
proposed change. 

Now that I've untangled myself, I'll repost with your suggested change.

Dave



Re: bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones
On Fri, Sep 28, 2018 at 09:55:52AM -0700, Cong Wang wrote:
 > On Fri, Sep 28, 2018 at 9:18 AM Dave Jones  wrote:
 > >
 > > Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
 > > otherwise a trace like below is shown
 > 
 > So why not take rcu read lock in netpoll_send_skb_on_dev() where
 > RCU is also assumed?

that does seem to solve the backtrace spew I saw too, so if that's
preferable I can respin the patch.

 > As I said, I can't explain why you didn't trigger the RCU warning in
 > netpoll_send_skb_on_dev()...

netpoll_send_skb_on_dev takes the rcu lock itself.

Dave



bond: take rcu lock in bond_poll_controller

2018-09-28 Thread Dave Jones
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like the one below is shown:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Signed-off-by: Dave Jones 

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c05c01a00755..77a3607a7099 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -977,6 +977,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
if (bond_3ad_get_active_agg_info(bond, &ad_info))
return;
 
+   rcu_read_lock();
bond_for_each_slave_rcu(bond, slave, iter) {
if (!bond_slave_is_up(slave))
continue;
@@ -992,6 +993,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
 
netpoll_poll_dev(slave->dev);
}
+   rcu_read_unlock();
 }
 
 static void bond_netpoll_cleanup(struct net_device *bond_dev)



bond: take rcu lock in bond_poll_controller

2018-09-24 Thread Dave Jones
Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
otherwise a trace like the one below is shown:

WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40
CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1
Workqueue: bond0 bond_mii_monitor
RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40
Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 
8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 
0f 1f 40 00 0f 1f 44 00 00 48 8>
RSP: 0018:c987fa68 EFLAGS: 00010046
RAX:  RBX: 880429614560 RCX: 
RDX: 0001 RSI:  RDI: a184ada0
RBP: c987fa80 R08: 0001 R09: 
R10: c987f9f0 R11: 880429798040 R12: 8804289d5980
R13: a1511f60 R14: 00c8 R15: 
FS:  () GS:88042f88() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f4b78fce180 CR3: 00018180f006 CR4: 001606e0
Call Trace:
 bond_poll_controller+0x52/0x170
 netpoll_poll_dev+0x79/0x290
 netpoll_send_skb_on_dev+0x158/0x2c0
 netpoll_send_udp+0x2d5/0x430
 write_ext_msg+0x1e0/0x210
 console_unlock+0x3c4/0x630
 vprintk_emit+0xfa/0x2f0
 printk+0x52/0x6e
 ? __netdev_printk+0x12b/0x220
 netdev_info+0x64/0x80
 ? bond_3ad_set_carrier+0xe9/0x180
 bond_select_active_slave+0x1fc/0x310
 bond_mii_monitor+0x709/0x9b0
 process_one_work+0x221/0x5e0
 worker_thread+0x4f/0x3b0
 kthread+0x100/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x24/0x30

Signed-off-by: Dave Jones 

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a764a83f99da..519968d4513b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -978,6 +978,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
if (bond_3ad_get_active_agg_info(bond, &ad_info))
return;
 
+   rcu_read_lock();
bond_for_each_slave_rcu(bond, slave, iter) {
ops = slave->dev->netdev_ops;
if (!bond_slave_is_up(slave) || !ops->ndo_poll_controller)
@@ -998,6 +999,7 @@ static void bond_poll_controller(struct net_device *bond_dev)
ops->ndo_poll_controller(slave->dev);
up(&ni->dev_lock);
}
+   rcu_read_unlock();
 }
 
 static void bond_netpoll_cleanup(struct net_device *bond_dev)


ipset: suspicious RCU usage

2018-02-28 Thread Dave Jones

=
WARNING: suspicious RCU usage
4.16.0-rc3-firewall+ #1 Not tainted
-
net/netfilter/ipset/ip_set_core.c:1354 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by ipset/16323:
 #0: (nlk_cb_mutex-NETFILTER){+.+.}, at: [<5e54683c>] netlink_dump+0x1c/0x2b0
 #1: (ip_set_ref_lock){++..}, at: [<89f25f26>] ip_set_dump_start+0x133/0x7a0

stack backtrace:
CPU: 0 PID: 16323 Comm: ipset Not tainted 4.16.0-rc3-firewall+ #1 
Call Trace:
 dump_stack+0x67/0x8e
 ip_set_dump_start+0x5f0/0x7a0
 ? ip_set_dump_start+0x5/0x7a0
 ? __kmalloc_reserve.isra.38+0x29/0x70
 ? ksize+0x10/0xa0
 ? __alloc_skb+0x90/0x1b0
 netlink_dump+0x106/0x2b0
 netlink_recvmsg+0x337/0x380
 ? copy_msghdr_from_user+0xdb/0x150
 ___sys_recvmsg+0xc6/0x160
 ? netlink_sendmsg+0x129/0x420
 ? SYSC_sendto+0x11b/0x180
 __sys_recvmsg+0x51/0x90
 do_syscall_64+0x84/0x735
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7fa78e00ca57
RSP: 002b:7fff61cfad78 EFLAGS: 0246
ORIG_RAX: 002f
RAX: ffda RBX: 55f22c7c06f0 RCX: 7fa78e00ca57
RDX:  RSI: 7fff61cfada0 RDI: 0003
RBP: 55f22c7bf4b8 R08: 7fa78dd1fbe0 R09: 000c
R10:  R11: 0246 R12: 1000
R13: 55f22c7c04d0 R14:  R15: 55f22bf13908




Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-28 Thread Dave Jones
On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
 > Dave, would you try below patch?
 > 
 > From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001
 > From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
 > Date: Sun, 28 Jan 2018 14:17:14 +0900
 > Subject: [PATCH] lockdep: Fix fs_reclaim warning.


Seems to suppress the warning for me.

Tested-by: Dave Jones <da...@codemonkey.org.uk>



Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-27 Thread Dave Jones
On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
 > Just triggered this on a server I was rsync'ing to.

Actually, I can trigger this really easily, even with an rsync from one
disk to another.  Though that also smells a little like networking in
the traces. Maybe netdev has ideas.

 
The first instance:

 > 
 > WARNING: possible recursive locking detected
 > 4.15.0-rc9-backup-debug+ #1 Not tainted
 > 
 > sshd/24800 is trying to acquire lock:
 >  (fs_reclaim){+.+.}, at: [<84f438c2>] 
 > fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > but task is already holding lock:
 >  (fs_reclaim){+.+.}, at: [<84f438c2>] 
 > fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > other info that might help us debug this:
 >  Possible unsafe locking scenario:
 > 
 >CPU0
 >
 >   lock(fs_reclaim);
 >   lock(fs_reclaim);
 > 
 >  *** DEADLOCK ***
 > 
 >  May be due to missing lock nesting notation
 > 
 > 2 locks held by sshd/24800:
 >  #0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] 
 > tcp_sendmsg+0x19/0x40
 >  #1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
 > fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > stack backtrace:
 > CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
 > Call Trace:
 >  dump_stack+0xbc/0x13f
 >  ? _atomic_dec_and_lock+0x101/0x101
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  ? print_lock+0x54/0x68
 >  __lock_acquire+0xa09/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? mutex_destroy+0x120/0x120
 >  ? hlock_class+0xa0/0xa0
 >  ? kernel_text_address+0x5c/0x90
 >  ? __kernel_text_address+0xe/0x30
 >  ? unwind_get_return_address+0x2f/0x50
 >  ? __save_stack_trace+0x92/0x100
 >  ? graph_lock+0x8d/0x100
 >  ? check_noncircular+0x20/0x20
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? active_load_balance_cpu_stop+0x7b0/0x7b0
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? mark_lock+0x1b1/0xa00
 >  ? lock_acquire+0x12e/0x350
 >  lock_acquire+0x12e/0x350
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  ? lockdep_rcu_suspicious+0x100/0x100
 >  ? set_next_entity+0x20e/0x10d0
 >  ? mark_lock+0x1b1/0xa00
 >  ? match_held_lock+0x8d/0x440
 >  ? mark_lock+0x1b1/0xa00
 >  ? save_trace+0x1e0/0x1e0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? alloc_extent_state+0xa7/0x410
 >  fs_reclaim_acquire.part.102+0x29/0x30
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  kmem_cache_alloc+0x3d/0x2c0
 >  ? rb_erase+0xe63/0x1240
 >  alloc_extent_state+0xa7/0x410
 >  ? lock_extent_buffer_for_io+0x3f0/0x3f0
 >  ? find_held_lock+0x6d/0xd0
 >  ? test_range_bit+0x197/0x210
 >  ? lock_acquire+0x350/0x350
 >  ? do_raw_spin_unlock+0x147/0x220
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? iotree_fs_info+0x30/0x30
 >  __clear_extent_bit+0x3ea/0x570
 >  ? clear_state_bit+0x270/0x270
 >  ? count_range_bits+0x2f0/0x2f0
 >  ? lock_acquire+0x350/0x350
 >  ? rb_prev+0x21/0x90
 >  try_release_extent_mapping+0x21a/0x260
 >  __btrfs_releasepage+0xb0/0x1c0
 >  ? btrfs_submit_direct+0xca0/0xca0
 >  ? check_new_page_bad+0x1f0/0x1f0
 >  ? match_held_lock+0xa5/0x440
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  btrfs_releasepage+0x161/0x170
 >  ? __btrfs_releasepage+0x1c0/0x1c0
 >  ? page_rmapping+0xd0/0xd0
 >  ? rmap_walk+0x100/0x100
 >  try_to_release_page+0x162/0x1c0
 >  ? generic_file_write_iter+0x3c0/0x3c0
 >  ? page_evictable+0xcc/0x110
 >  ? lookup_address_in_pgd+0x107/0x190
 >  shrink_page_list+0x1d5a/0x2fb0
 >  ? putback_lru_page+0x3f0/0x3f0
 >  ? save_trace+0x1e0/0x1e0
 >  ? _lookup_address_cpa.isra.13+0x40/0x60
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? kmem_cache_free+0x8c/0x280
 >  ? free_extent_state+0x1c8/0x3b0
 >  ? mark_lock+0x1b1/0xa00
 >  ? page_rmapping+0xd0/0xd0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? shrink_node_memcg.constprop.88+0x4c9/0x5e0
 >  ? shrink_node+0x12d/0x260
 >  ? try_to_free_pages+0x418/0xaf0
 >  ? __alloc_pages_slowpath+0x976/0x1790
 >  ? __alloc_pages_nodemask+0x52c/0x5c0
 >  ? delete_node+0x28d/0x5c0
 >  ? find_held_lock+0x6d/0xd0
 >  ? free_pcppages_bulk+0x381/0x570
 >  ? lock_acquire+0x350/0x350
 >  ? do_raw_spin_unlock+0x147/0x220
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? __lock_is_held+0x51/0xc0
 >  ? _raw_spin_unlock+0x24/0x30
 >  ? free_pcppages_bulk+0x381/0x570
 >  ? mark_lock+0x1b1/0xa00
 >  ? free_compound_page+0x30/0x30
 >  ? print_irqtrace_events+0x110/0x110
 >  ? __kernel_map_pages+0x2c9/0x310
 >  ? mark_lock+0

ipset related DEBUG_VIRTUAL crash.

2017-11-04 Thread Dave Jones
I have a script that hourly replaces an ipset list. This has been in
place for a year or so, but last night it triggered this on 4.14-rc7

[455951.731181] kernel BUG at arch/x86/mm/physaddr.c:26!
[455951.737016] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[455951.742525] CPU: 0 PID: 3850 Comm: ipset Not tainted 4.14.0-rc7-firewall+ 
#1 
[455951.753293] task: 88013033cfc0 task.stack: 8801c3d48000
[455951.758567] RIP: 0010:__phys_addr+0x5b/0x80
[455951.763742] RSP: 0018:8801c3d4f528 EFLAGS: 00010287
[455951.768838] RAX: 7800849b62b6 RBX: 849b62b6 RCX: 
9f072a5d
[455951.773881] RDX: dc00 RSI: dc00 RDI: 
a06917e0
[455951.778844] RBP: 7800049b62b6 R08: 0002 R09: 

[455951.783729] R10:  R11:  R12: 
9fca8b05
[455951.788524] R13: 8801ce844268 R14: 049b62b6 R15: 
8801ce8442ea
[455951.793239] FS:  7fb44e656c80() GS:8801d320() 
knlGS:
[455951.797904] CS:  0010 DS:  ES:  CR0: 80050033
[455951.802479] CR2: 7ffeeafd70a8 CR3: 0001b6cd2001 CR4: 
000606f0
[455951.806998] Call Trace:
[455951.811404]  kfree+0x4c/0x310
[455951.815714]  hash_ip4_ahash_destroy+0x85/0xd0
[455951.819944]  hash_ip4_destroy+0x64/0x90
[455951.824069]  ip_set_destroy+0x4f0/0x500
[455951.828098]  ? ip_set_destroy+0x5/0x500
[455951.832029]  ? __rcu_read_unlock+0xd3/0x190
[455951.835867]  ? ip_set_utest+0x560/0x560
[455951.839610]  ? ip_set_utest+0x560/0x560
[455951.843239]  nfnetlink_rcv_msg+0x73e/0x770
[455951.846780]  ? nfnetlink_rcv_msg+0x352/0x770
[455951.850229]  ? nfnetlink_rcv+0xe90/0xe90
[455951.853571]  ? native_sched_clock+0xe8/0x190
[455951.856822]  ? lock_release+0x5d3/0x7d0
[455951.859976]  netlink_rcv_skb+0x121/0x230
[455951.863037]  ? nfnetlink_rcv+0xe90/0xe90
[455951.865999]  ? netlink_ack+0x4c0/0x4c0
[455951.868866]  ? ns_capable_common+0x68/0xc0
[455951.871638]  nfnetlink_rcv+0x1ad/0xe90
[455951.874312]  ? lock_acquire+0x380/0x380
[455951.876891]  ? __rcu_read_unlock+0xd3/0x190
[455951.879378]  ? __rcu_read_lock+0x30/0x30
[455951.881764]  ? rcu_is_watching+0xa4/0xf0
[455951.884048]  ? netlink_connect+0x1e0/0x1e0
[455951.886236]  ? nfnl_err_reset+0x180/0x180
[455951.888329]  ? netlink_deliver_tap+0x128/0x560
[455951.890333]  ? netlink_deliver_tap+0x5/0x560
[455951.892229]  ? iov_iter_advance+0x172/0x7f0
[455951.894029]  ? netlink_getname+0x150/0x150
[455951.895736]  ? can_nice.part.77+0x20/0x20
[455951.897342]  ? iov_iter_copy_from_user_atomic+0x7d0/0x7d0
[455951.898877]  ? netlink_trim+0x111/0x1b0
[455951.900394]  ? netlink_skb_destructor+0xf0/0xf0
[455951.901908]  netlink_unicast+0x2b1/0x340
[455951.903397]  ? netlink_detachskb+0x30/0x30
[455951.904862]  ? lock_acquire+0x380/0x380
[455951.906299]  ? lockdep_rcu_suspicious+0x100/0x100
[455951.907729]  netlink_sendmsg+0x4f2/0x650
[455951.909141]  ? netlink_broadcast_filtered+0x9e0/0x9e0
[455951.910565]  ? _copy_from_user+0x86/0xc0
[455951.911964]  ? netlink_broadcast_filtered+0x9e0/0x9e0
[455951.913364]  SYSC_sendto+0x2f0/0x3c0
[455951.914741]  ? SYSC_connect+0x210/0x210
[455951.916111]  ? bad_area_access_error+0x230/0x230
[455951.917479]  ? ___sys_recvmsg+0x320/0x320
[455951.918811]  ? sock_wake_async+0xc0/0xc0
[455951.920112]  ? SyS_brk+0x3ae/0x3d0
[455951.921381]  ? prepare_exit_to_usermode+0xde/0x230
[455951.922642]  ? enter_from_user_mode+0x30/0x30
[455951.923913]  ? mark_held_locks+0x1b/0xa0
[455951.925179]  ? entry_SYSCALL_64_fastpath+0x5/0xad
[455951.926459]  ? trace_hardirqs_on_caller+0x185/0x260
[455951.927747]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[455951.929031]  entry_SYSCALL_64_fastpath+0x18/0xad
[455951.930314] RIP: 0033:0x7fb44df4ac53
[455951.931592] RSP: 002b:7ffeeafb6a08 EFLAGS: 0246
[455951.932914]  ORIG_RAX: 002c
[455951.934231] RAX: ffda RBX: 55b8f35d26d0 RCX: 
7fb44df4ac53
[455951.935603] RDX: 002c RSI: 55b8f35d14b8 RDI: 
0003
[455951.936991] RBP: 55b8f35cf010 R08: 7fb44dc5dbe0 R09: 
000c
[455951.938387] R10:  R11: 0246 R12: 
7fb44e43b020
[455951.939795] R13: 7ffeeafb6acc R14:  R15: 
55b8f1ca68e0
[455951.941208] Code: 80 48 39 eb 72 25 48 c7 c7 09 d6 a4 a0 e8 3e 28 2c 00 0f 
b6 0d 80 ab 9d 01 48 8d 45 00 48 d3 e8 48 85 c0 75 06 5b 48 89 e8 5d c3 <0f> 0b 
48 c7 c7 10 c0 62 a0 e8 a7 2a 2c 00 48 8b 2d 60 95 5b 01 
[455951.993251] RIP: __phys_addr+0x5b/0x80 RSP: 8801c3d4f528
[455982.040898] ---[ end trace dfb8a0f07b7c5316 ]---
[459428.674105] 
==
[459428.679829] BUG: KASAN: use-after-free in __mutex_lock+0x26c/0xf30
[459428.685463] Read of size 4 at addr 88013033d020 by task ipset/4611
[459428.696474] CPU: 0 PID: 4611 Comm: ipset Tainted: G  D 
4.14.0-rc7-firewall+ #1 
[459428.707271] Call Trace:
[459428.712489]  

Re: [4.14rc6] __tcp_select_window divide by zero.

2017-11-03 Thread Dave Jones
On Tue, Oct 24, 2017 at 09:00:30AM -0400, Dave Jones wrote:
 > divide error:  [#1] SMP KASAN
 > CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1 
 > RIP: 0010:__tcp_select_window+0x21f/0x400
 > Call Trace:
 >  tcp_cleanup_rbuf+0x27d/0x2a0
 >  tcp_recvmsg+0x7a9/0x1430
 >  inet_recvmsg+0x10b/0x360
 >  sock_read_iter+0x19d/0x240
 >  do_iter_readv_writev+0x2e4/0x320
 >  do_iter_read+0x149/0x280
 >  vfs_readv+0x107/0x180
 >  do_readv+0xc0/0x1b0
 >  do_syscall_64+0x182/0x400
 >  entry_SYSCALL64_slow_path+0x25/0x25
 > Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 
 > 89 ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 
 > 29 d0 eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8 
 > RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418
 > 
 >
 >if (window <= free_space - mss || window > free_space)
 >window = rounddown(free_space, mss);

I'm still hitting this fairly often, so I threw in a debug patch, and
when this happens:

[53182.361210] window: 0 free_space: 0 mss: 0

Any suggestions on what we should default the window size to in this
situation to avoid the rounddown?


Dave



[net-next] tcp_delack_timer circular locking dependency

2017-10-30 Thread Dave Jones
[  105.316650] ==
[  105.316818] WARNING: possible circular locking dependency detected
[  105.316986] 4.14.0-rc7-think+ #1 Not tainted
[  105.317108] --
[  105.317273] swapper/2/0 is trying to acquire lock:
[  105.317407]  (
[  105.317476] slock-AF_INET6
[  105.317564] ){+.-.}
[  105.317642] , at: [] tcp_delack_timer+0x26/0x130
[  105.317807] 
   but task is already holding lock:
[  105.317961]  (
[  105.318024] (timer)
[  105.318097] #5
[  105.318168] ){+.-.}
[  105.318241] , at: [] call_timer_fn+0x5/0x5e0
[  105.318393] 
   which lock already depends on the new lock.

[  105.318594] 
   the existing dependency chain (in reverse order) is:
[  105.318781] 
   -> #1
[  105.318879]  (
[  105.318939] (timer)
[  105.319009] #5
[  105.319068] ){+.-.}
[  105.319137] :
[  105.319195]del_timer_sync+0x3c/0xb0
[  105.319313]inet_csk_reqsk_queue_drop+0x26c/0x4e0
[  105.319459]inet_csk_complete_hashdance+0x1e/0x90
[  105.319598]tcp_check_req+0x787/0x9a0
[  105.319716]tcp_v6_rcv+0x914/0x1060
[  105.319828]ip6_input_finish+0x291/0xba0
[  105.319950]ip6_input+0xb2/0x380
[  105.320059]ip6_rcv_finish+0x103/0x350
[  105.320180]ipv6_rcv+0x93f/0xff0
[  105.320291]__netif_receive_skb_core+0x13ef/0x1900
[  105.320436]netif_receive_skb_internal+0xea/0x4c0
[  105.320579]napi_gro_receive+0x28e/0x320
[  105.320705]e1000_clean_rx_irq+0x3e9/0x6f0
[  105.320838]e1000e_poll+0x14e/0x570
[  105.320954]net_rx_action+0x4db/0xc80
[  105.321075]__do_softirq+0x1ca/0x7bf
[  105.321194]irq_exit+0x104/0x110
[  105.321303]do_IRQ+0xb2/0x130
[  105.321407]ret_from_intr+0x0/0x19
[  105.321523]cpuidle_enter_state+0x223/0x5b0
[  105.321655]do_idle+0x110/0x1b0
[  105.321766]cpu_startup_entry+0xdb/0xe0
[  105.321891]start_secondary+0x2e9/0x360
[  105.322014]verify_cpu+0x0/0xf1
[  105.322121] 
   -> #0
[  105.322215]  (
[  105.322276] slock-AF_INET6
[  105.322359] ){+.-.}
[  105.322428] :
[  105.322487]lock_acquire+0x12e/0x350
[  105.322602]_raw_spin_lock+0x30/0x70
[  105.322722]tcp_delack_timer+0x26/0x130
[  105.322846]call_timer_fn+0x188/0x5e0
[  105.322966]__run_timers+0x54d/0x670
[  105.323084]run_timer_softirq+0x2a/0x50
[  105.323208]__do_softirq+0x1ca/0x7bf
[  105.323325]irq_exit+0x104/0x110
[  105.323435]smp_apic_timer_interrupt+0x14b/0x510
[  105.323576]apic_timer_interrupt+0x9a/0xa0
[  105.323705]cpuidle_enter_state+0x223/0x5b0
[  105.323836]do_idle+0x110/0x1b0
[  105.323944]cpu_startup_entry+0xdb/0xe0
[  105.324067]start_secondary+0x2e9/0x360
[  105.324189]verify_cpu+0x0/0xf1
[  105.324295] 
   other info that might help us debug this:

[  105.324489]  Possible unsafe locking scenario:

[  105.324644]CPU0CPU1
[  105.324767]
[  105.324890]   lock(
[  105.324963] (timer)
[  105.325033] #5
[  105.325093] );
[  105.325152]lock(
[  105.325278] slock-AF_INET6
[  105.325360] );
[  105.325419]lock(
[  105.325544] (timer)
[  105.325612] #5
[  105.325670] );
[  105.325729]   lock(
[  105.325797] slock-AF_INET6
[  105.325879] );
[  105.325938] 
*** DEADLOCK ***

[  105.326086] 1 lock held by swapper/2/0:
[  105.326193]  #0: 
[  105.326257]  (
[  105.331697] (timer)
[  105.337038] #5
[  105.342339] ){+.-.}
[  105.347620] , at: [] call_timer_fn+0x5/0x5e0
[  105.353021] 
   stack backtrace:
[  105.363515] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.0-rc7-think+ #1
[  105.368886] Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, BIOS 
FBKTB3AUS 06/16/2015
[  105.374330] Call Trace:
[  105.379697]  
[  105.384997]  dump_stack+0xbc/0x145
[  105.390339]  ? dma_virt_map_sg+0xfb/0xfb
[  105.395733]  ? call_timer_fn+0x5/0x5e0
[  105.401076]  ? print_lock+0x54/0x68
[  105.406344]  print_circular_bug.isra.42+0x283/0x2bc
[  105.411695]  ? print_circular_bug_header+0xda/0xda
[  105.417054]  ? graph_lock+0x8d/0x100
[  105.422419]  ? check_noncircular+0x20/0x20
[  105.427857]  ? sched_clock_cpu+0x14/0xf0
[  105.433309]  __lock_acquire+0x1f4a/0x2050
[  105.438725]  ? debug_check_no_locks_freed+0x1a0/0x1a0
[  105.444160]  ? __lock_acquire+0x6b3/0x2050
[  105.449580]  ? debug_check_no_locks_freed+0x1a0/0x1a0
[  105.455015]  ? sched_clock_cpu+0x14/0xf0
[  105.460514]  ? __lock_acquire+0x6b3/0x2050
[  105.465984]  ? cyc2ns_read_end+0x10/0x10
[  105.471395]  ? debug_check_no_locks_freed+0x1a0/0x1a0
[  105.476934]  ? mark_lock+0x16f/0x9b0
[  105.482507]  ? print_irqtrace_events+0x110/0x110
[  105.488150]  ? 

[4.14rc6] __tcp_select_window divide by zero.

2017-10-24 Thread Dave Jones
divide error:  [#1] SMP KASAN
CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1 
task: 8803c0d08040 task.stack: 8803df548000
RIP: 0010:__tcp_select_window+0x21f/0x400
RSP: 0018:8803df54f418 EFLAGS: 00010246
RAX:  RBX: 880458fd3140 RCX: 82120ea5
RDX:  RSI: dc00 RDI: 880458fd3a88
RBP:  R08: 0001 R09: 
R10:  R11:  R12: 00098968
R13: 11007bea9e87 R14:  R15: 
FS:  7f76da1db700() GS:88046ae0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 0003f67cd002 CR4: 001606f0
DR0: 7f76d819f000 DR1: 7f75a29f5000 DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 ? tcp_schedule_loss_probe+0x270/0x270
 ? lock_acquire+0x12e/0x350
 ? tcp_recvmsg+0x124/0x1430
 ? lock_release+0x890/0x890
 ? do_raw_spin_trylock+0x100/0x100
 ? do_raw_spin_trylock+0x40/0x100
 tcp_cleanup_rbuf+0x27d/0x2a0
 ? tcp_recv_skb+0x180/0x180
 ? mark_held_locks+0x70/0xa0
 ? __local_bh_enable_ip+0x60/0x90
 tcp_recvmsg+0x7a9/0x1430
 ? tcp_recv_timestamp+0x250/0x250
 ? __free_insn_slot+0x390/0x390
 ? rcu_is_watching+0x88/0xd0
 ? entry_SYSCALL64_slow_path+0x25/0x25
 ? is_bpf_text_address+0x86/0xf0
 ? kernel_text_address+0xec/0x100
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? memcmp+0x45/0x70
 ? match_held_lock+0x93/0x410
 ? save_trace+0x1c0/0x1c0
 ? save_stack+0x89/0xb0
 ? save_stack+0x32/0xb0
 ? kasan_kmalloc+0xa0/0xd0
 ? native_sched_clock+0xf9/0x1a0
 ? rw_copy_check_uvector+0x15e/0x180
 inet_recvmsg+0x10b/0x360
 ? inet_create+0x770/0x770
 ? sched_clock_cpu+0x14/0xf0
 ? sched_clock_cpu+0x14/0xf0
 sock_read_iter+0x19d/0x240
 ? sock_recvmsg+0x60/0x60
 do_iter_readv_writev+0x2e4/0x320
 ? vfs_dedupe_file_range+0x3e0/0x3e0
 do_iter_read+0x149/0x280
 vfs_readv+0x107/0x180
 ? compat_rw_copy_check_uvector+0x1d0/0x1d0
 ? fget_raw+0x10/0x10
 ? __lock_is_held+0x2e/0xd0
 ? do_preadv+0xf0/0xf0
 ? __fdget_pos+0x82/0x110
 ? __fdget_raw+0x10/0x10
 ? do_readv+0xc0/0x1b0
 do_readv+0xc0/0x1b0
 ? vfs_readv+0x180/0x180
 ? mark_held_locks+0x1b/0xa0
 ? do_syscall_64+0xae/0x400
 ? do_preadv+0xf0/0xf0
 do_syscall_64+0x182/0x400
 ? syscall_return_slowpath+0x270/0x270
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.4+0x223/0x290
 ? mark_held_locks+0x1b/0xa0
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f76d9b05219
RSP: 002b:7ffd41fd30d8 EFLAGS: 0246  ORIG_RAX: 0013
RAX: ffda RBX: 0013 RCX: 7f76d9b05219
RDX: 0016 RSI: 5611ca731c70 RDI: 0179
RBP: 7ffd41fd3180 R08: 00a07395 R09: 000a10d65a68
R10: 0001 R11: 0246 R12: 0002
R13: 7f76da180058 R14: 7f76da1db698 R15: 7f76da18
Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 89 
ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 29 d0 
eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8 
RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418



window = rounddown(free_space, mss);
45ec:   89 e8   mov%ebp,%eax
45ee:   99  cltd   
45ef:   41 f7 feidiv   %r14d
45f2:   89 e8   mov%ebp,%eax
45f4:   29 d0   sub%edx,%eax
45f6:   eb 8c   jmp4584 <__tcp_select_window+0x1b4>
45f8:   41 f7 dfneg%r15d




Stuck TX Using Iperf

2017-08-16 Thread Robert Jones
Hello Cavium Ethernet Driver Maintainers,

I'm working on a custom board using a Cavium OcteonTX CN80XX CPU
running a mainline 4.12.7 kernel, and I've run into a problem where the
TX of my BGX0 interface, configured for SGMII, becomes stuck.

After boot I'm able to bring the interface up, run dhclient, and ping
a host machine with no issues. However, every time I try to run an
iperf TCP test I get a stuck TX queue. Bringing the interface down
and then up does not resolve the problem, but physically reconnecting
the cable connected to the interface does. Also, after I reconnect the
cable I am no longer able to reproduce the stuck TX queue until
reboot. I receive no kernel driver messages of any sort during the
iperf test; it just stalls until I kill it.

Any help would be appreciated.

Regards,

Robert Jones - Software Engineer
Gateworks Corporation


BUG_ON(sg->sg_magic != SG_MAGIC) on tls socket.

2017-08-11 Thread Dave Jones
kernel BUG at ./include/linux/scatterlist.h:189!
invalid opcode:  [#1] SMP KASAN
CPU: 3 PID: 20890 Comm: trinity-c51 Not tainted 4.13.0-rc4-think+ #5 
task: 88036e3d1cc0 task.stack: 88033e9d8000
RIP: 0010:tls_push_record+0x675/0x680
RSP: 0018:88033e9df630 EFLAGS: 00010287
RAX:  RBX: 8802ee3b8968 RCX: 82226754
RDX: dc00 RSI: dc00 RDI: 8802ee3b8c10
RBP: 88033e9df6d0 R08:  R09: ed005d107004
R10: 0004 R11: ed005d107003 R12: 880341b30668
R13: 8802ee3b8c10 R14: 8802ee3b8c38 R15: 87654321
FS:  7f465ced2700() GS:88046b60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0029cbf021f8 CR3: 00045abd CR4: 001406e0
DR0: 7f8f17bd2000 DR1: 7f8c27d0f000 DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 ? copy_page_to_iter+0x6c0/0x6c0
 tls_sw_sendmsg+0x6d8/0x9c0
 ? alloc_sg+0x510/0x510
 ? cyc2ns_read_end+0x10/0x10
 ? import_iovec+0xa8/0x1f0
 ? do_syscall_64+0x1bc/0x3e0
 ? entry_SYSCALL64_slow_path+0x25/0x25
 inet_sendmsg+0xce/0x310
 ? inet_recvmsg+0x3a0/0x3a0
 ? inet_recvmsg+0x3a0/0x3a0
 sock_write_iter+0x1b0/0x280
 ? kernel_sendmsg+0x70/0x70
 ? __might_sleep+0x72/0xe0
 do_iter_readv_writev+0x29a/0x370
 ? vfs_dedupe_file_range+0x3f0/0x3f0
 ? rw_verify_area+0x65/0x150
 do_iter_write+0xd7/0x2a0
 ? __hrtimer_run_queues+0x980/0x980
 vfs_writev+0x142/0x220
 ? __fget_light+0x1ae/0x230
 ? vfs_iter_write+0x70/0x70
 ? syscall_exit_register+0x3f0/0x3f0
 ? rcutorture_record_progress+0x20/0x20
 ? __fdget_pos+0x88/0x120
 ? __fdget_raw+0x20/0x20
 do_writev+0xd2/0x1c0
 ? do_writev+0xd2/0x1c0
 ? vfs_writev+0x220/0x220
 ? mark_held_locks+0x23/0xb0
 ? do_syscall_64+0xc0/0x3e0
 ? SyS_readv+0x20/0x20
 SyS_writev+0x10/0x20
 do_syscall_64+0x1bc/0x3e0
 ? syscall_return_slowpath+0x240/0x240
 ? __context_tracking_exit.part.5+0x23d/0x2a0
 ? cpumask_check.part.2+0x10/0x10
 ? mark_held_locks+0x23/0xb0
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x182/0x260
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f465c7fd219
RSP: 002b:7ffda332a238 EFLAGS: 0246
 ORIG_RAX: 0014
RAX: ffda RBX: 0014 RCX: 7f465c7fd219
RDX: 0047 RSI: 0029cbef1b50 RDI: 0137
RBP: 7ffda332a2e0 R08: 0100 R09: fff8
R10: fff9 R11: 0246 R12: 0002
R13: 7f465cd66058 R14: 7f465ced2698 R15: 7f465cd66000
Code: 8d bb 58 04 00 00 e8 3b d5 20 ff 48 8b 83 58 04 00 00 f0 80 48 08 04 48 
83 c4 78 44 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 
1f 44 00 00 0f 1f 44 00 00 55 ba 17 00 00 
RIP: tls_push_record+0x675/0x680 RSP: 88033e9df630



186 static inline void sg_mark_end(struct scatterlist *sg)
187 {
188 #ifdef CONFIG_DEBUG_SG
189 BUG_ON(sg->sg_magic != SG_MAGIC);
190 #endif





KASAN: slab-out-of-bounds from net_namespace.c:ops_init

2017-08-11 Thread Dave Jones
==
BUG: KASAN: slab-out-of-bounds in ops_init+0x201/0x330
Write of size 8 at addr 88045744c448 by task trinity-c4/1499

CPU: 2 PID: 1499 Comm: trinity-c4 Not tainted 4.13.0-rc4-think+ #5 
Call Trace:
 dump_stack+0xc5/0x151
 ? dma_virt_map_sg+0xff/0xff
 ? show_regs_print_info+0x41/0x41
 print_address_description+0xd9/0x260
 kasan_report+0x27a/0x370
 ? ops_init+0x201/0x330
 __asan_store8+0x57/0x90
 ops_init+0x201/0x330
 ? net_alloc_generic+0x50/0x50
 ? __raw_spin_lock_init+0x21/0x80
 ? trace_hardirqs_on_caller+0x182/0x260
 ? lockdep_init_map+0xb2/0x2b0
 setup_net+0x208/0x400
 ? ops_init+0x330/0x330
 ? copy_net_ns+0x151/0x390
 ? can_nice.part.81+0x20/0x20
 ? rcu_is_watching+0x8d/0xd0
 ? __lock_is_held+0x30/0xd0
 ? rcutorture_record_progress+0x20/0x20
 ? copy_net_ns+0x151/0x390
 copy_net_ns+0x200/0x390
 ? net_drop_ns+0x20/0x20
 ? do_mount+0x19d0/0x19d0
 ? create_new_namespaces+0x97/0x450
 ? rcu_read_lock_sched_held+0x96/0xa0
 ? kmem_cache_alloc+0x28a/0x2f0
 create_new_namespaces+0x317/0x450
 ? sys_ni_syscall+0x20/0x20
 ? cap_capable+0x7f/0xf0
 unshare_nsproxy_namespaces+0x77/0xf0
 SyS_unshare+0x573/0xbb0
 ? walk_process_tree+0x2a0/0x2a0
 ? lock_release+0x920/0x920
 ? lock_release+0x920/0x920
 ? mntput_no_expire+0x117/0x620
 ? rcu_is_watching+0x8d/0xd0
 ? exit_to_usermode_loop+0x1b0/0x1b0
 ? rcu_read_lock_sched_held+0x96/0xa0
 ? __context_tracking_exit.part.5+0x23d/0x2a0
 ? cpumask_check.part.2+0x10/0x10
 ? context_tracking_user_exit+0x30/0x30
 ? __f_unlock_pos+0x15/0x20
 ? SyS_read+0x146/0x160
 ? do_syscall_64+0xc0/0x3e0
 ? walk_process_tree+0x2a0/0x2a0
 do_syscall_64+0x1bc/0x3e0
 ? syscall_return_slowpath+0x240/0x240
 ? mark_held_locks+0x23/0xb0
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x182/0x260
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f9e1c454219
RSP: 002b:7fff180f9c88 EFLAGS: 0246
 ORIG_RAX: 0110
RAX: ffda RBX: 0110 RCX: 7f9e1c454219
RDX: 00c4 RSI: 800ff000 RDI: 74060700
RBP: 7fff180f9d30 R08: 0002 R09: 2fa420810090095e
R10: 880ffb40 R11: 0246 R12: 0002
R13: 7f9e1cb06058 R14: 7f9e1cb29698 R15: 7f9e1cb06000

Allocated by task 1499:
 save_stack_trace+0x1b/0x20
 save_stack+0x43/0xd0
 kasan_kmalloc+0xad/0xe0
 __kmalloc+0x14b/0x370
 net_alloc_generic+0x25/0x50
 copy_net_ns+0x130/0x390
 create_new_namespaces+0x317/0x450
 unshare_nsproxy_namespaces+0x77/0xf0
 SyS_unshare+0x573/0xbb0
 do_syscall_64+0x1bc/0x3e0
 return_from_SYSCALL_64+0x0/0x7a

Freed by task 504:
 save_stack_trace+0x1b/0x20
 save_stack+0x43/0xd0
 kasan_slab_free+0x72/0xc0
 kfree+0xe1/0x2f0
 rcu_process_callbacks+0x5a6/0x1dc0
 __do_softirq+0x1e7/0x817

The buggy address belongs to the object at 88045744c3c8
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes to the right of
 128-byte region [88045744c3c8, 88045744c448)
The buggy address belongs to the page:
page:ea00115d1300 count:1 mapcount:0 mapping:  (null) index:0x0
 compound_mapcount: 0
flags: 0x80008100(slab|head)
raw: 80008100   000100110011
raw: ea00113f2b20 ea0011328a20 880467c0f140 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88045744c300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88045744c380: fc fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00
>88045744c400: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
  ^
 88045744c480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88045744c500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==



Re: sctp refcount bug.

2017-07-13 Thread Dave Jones
On Thu, Jul 13, 2017 at 11:38:34AM -0300, Marcelo Ricardo Leitner wrote:
 > On Thu, Jul 13, 2017 at 10:36:39AM -0400, Dave Jones wrote:
 > > Hit this on Linus' current tree.
 > > 
 > > 
 > > refcount_t: underflow; use-after-free.
 > 
 > Any tips on how to reproduce this?

Only seen it once so far; it took ~12 hours of overnight fuzzing to
find. I'll see if I can narrow it down if it reproduces.

Dave



sctp refcount bug.

2017-07-13 Thread Dave Jones
Hit this on Linus' current tree.


refcount_t: underflow; use-after-free.
[ cut here ]
WARNING: CPU: 2 PID: 14455 at lib/refcount.c:186 refcount_sub_and_test+0x45/0x50
CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G  D 4.12.0-think+ #11 
task: 8804fc71b8c0 task.stack: c90002328000
RIP: 0010:refcount_sub_and_test+0x45/0x50
RSP: 0018:c9000232ba58 EFLAGS: 00010282
RAX: 0026 RBX: 88001db1d1c0 RCX: 
RDX:  RSI: 88050a3ccca8 RDI: 88050a3ccca8
RBP: c9000232ba58 R08:  R09: 0001
R10: c9000232ba88 R11:  R12: 88000d3f9b40
R13: 880456948008 R14: 880456948870 R15: c9000232bd10
FS:  7f79b1032700() GS:88050a20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 008436726348 CR3: 00022cc87000 CR4: 001406e0
DR0: 7f731068f000 DR1: 7f2d83eb9000 DR2: 7f302340e000
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 sctp_wfree+0x5d/0x190 [sctp]
 skb_release_head_state+0x64/0xc0
 skb_release_all+0x12/0x30
 consume_skb+0x50/0x170
 sctp_chunk_put+0x59/0x80 [sctp]
 sctp_chunk_free+0x26/0x30 [sctp]
 __sctp_outq_teardown+0x1d8/0x270 [sctp]
 sctp_outq_free+0xe/0x10 [sctp]
 sctp_association_free+0x92/0x220 [sctp]
 sctp_do_sm+0x12a6/0x1920 [sctp]
 ? __get_user_4+0x18/0x20
 ? no_context+0x3f/0x360
 ? lock_acquire+0xe7/0x1e0
 ? skb_dequeue+0x1d/0x70
 sctp_primitive_SHUTDOWN+0x33/0x40 [sctp]
 sctp_close+0x26e/0x2a0 [sctp]
 inet_release+0x3c/0x60
 sock_release+0x1f/0x80
 sock_close+0x12/0x20
 __fput+0xf8/0x200
 fput+0xe/0x10
 task_work_run+0x85/0xc0
 exit_to_usermode_loop+0xa8/0xb0
 do_syscall_64+0x151/0x190
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f79b095b1e9
RSP: 002b:7ffc5eca3088 EFLAGS: 0246
 ORIG_RAX: 0120
RAX: fff2 RBX: 0120 RCX: 7f79b095b1e9
RDX: 006e RSI: 008436738120 RDI: 0130
RBP: 7ffc5eca3130 R08:  R09: 0ff0
R10: 00080800 R11: 0246 R12: 0002
R13: 7f79b0ee9058 R14: 7f79b1032698 R15: 7f79b0ee9000
Code: 75 e6 85 d2 0f 94 c0 c3 31 c0 c3 80 3d ce 95 bc 00 00 75 f4 55 48 c7 c7 
00 d9 ee 81 48 89 e5 c6 05 ba 95 bc 00 01 e8 fc 2c c0 ff <0f> ff 31 c0 5d c3 0f 
1f 44 00 00 55 48 89 fe bf 01 00 00 00 48 
---[ end trace 19b7bd878c0f56fd ]---
[ cut here ]
WARNING: CPU: 2 PID: 14455 at net/ipv4/af_inet.c:154 
inet_sock_destruct+0x1b8/0x1f0
CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G  D W   4.12.0-think+ #11 
task: 8804fc71b8c0 task.stack: c90002328000
RIP: 0010:inet_sock_destruct+0x1b8/0x1f0
RSP: 0018:c9000232bcf8 EFLAGS: 00010286
RAX:  RBX: 88000d3f9b40 RCX: 
RDX: fd00 RSI: 0300 RDI: 88000d3f9ca8
RBP: c9000232bd08 R08:  R09: 
R10:  R11:  R12: 88000d3f9ca8
R13: 88000d3f9b40 R14: 88000d3f9bc8 R15: 8801836e21d0
FS:  7f79b1032700() GS:88050a20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 3b9ab732 CR3: 00022cc87000 CR4: 001406e0
DR0: 7f731068f000 DR1: 7f2d83eb9000 DR2: 7f302340e000
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 sctp_destruct_sock+0x25/0x30 [sctp]
 __sk_destruct+0x28/0x230
 sk_destruct+0x20/0x30
 __sk_free+0x43/0xa0
 sk_free+0x25/0x30
 sctp_close+0x218/0x2a0 [sctp]
 inet_release+0x3c/0x60
 sock_release+0x1f/0x80
 sock_close+0x12/0x20
 __fput+0xf8/0x200
 fput+0xe/0x10
 task_work_run+0x85/0xc0
 exit_to_usermode_loop+0xa8/0xb0
 do_syscall_64+0x151/0x190
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f79b095b1e9
RSP: 002b:7ffc5eca3088 EFLAGS: 0246
 ORIG_RAX: 0120
RAX: fff2 RBX: 0120 RCX: 7f79b095b1e9
RDX: 006e RSI: 008436738120 RDI: 0130
RBP: 7ffc5eca3130 R08:  R09: 0ff0
R10: 00080800 R11: 0246 R12: 0002
R13: 7f79b0ee9058 R14: 7f79b1032698 R15: 7f79b0ee9000
Code: df e8 bd 5f f4 ff e9 07 ff ff ff 0f ff 8b 83 8c 02 00 00 85 c0 0f 84 2d 
ff ff ff 0f ff 8b 93 88 02 00 00 85 d2 0f 84 2b ff ff ff <0f> ff 8b 83 40 02 00 
00 85 c0 0f 84 29 ff ff ff 0f ff e9 22 ff 
---[ end trace 19b7bd878c0f56fe ]---
[ cut here ]
WARNING: CPU: 2 PID: 14455 at net/ipv4/af_inet.c:155 
inet_sock_destruct+0x1c8/0x1f0
CPU: 2 PID: 14455 Comm: trinity-c46 Tainted: G  D W   4.12.0-think+ #11 
task: 8804fc71b8c0 task.stack: c90002328000
RIP: 0010:inet_sock_destruct+0x1c8/0x1f0
RSP: 0018:c9000232bcf8 EFLAGS: 00010206
RAX: 0300 RBX: 88000d3f9b40 RCX: 
RDX: fd00 RSI: 0300 RDI: 

netconsole refcount warning

2017-07-09 Thread Dave Jones
The new refcount debugging code spews this twice during boot on my router..


refcount_t: increment on 0; use-after-free.
[ cut here ]
WARNING: CPU: 1 PID: 17 at lib/refcount.c:152 refcount_inc+0x2b/0x30
CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.12.0-firewall+ #8 
task: 8801d4441ac0 task.stack: 8801d445
RIP: 0010:refcount_inc+0x2b/0x30
RSP: 0018:8801d4456da8 EFLAGS: 00010046
RAX: 002c RBX: 8801d4c3cf40 RCX: 
RDX: 002c RSI: 0003 RDI: ed003a88adab
RBP: 8801d4456da8 R08: 0003 R09: fbfff4afcb57
R10:  R11: fbfff4afcb58 R12: 8801d4c3c540
R13: 0082 R14: 8801ce9c7ff8 R15: 8801ce9c8aa0
FS:  () GS:8801d6a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fa2b803156e CR3: 0001c405d000 CR4: 000406e0
Call Trace:
 zap_completion_queue+0xad/0x1a0
 netpoll_poll_dev+0x16f/0x3f0
 netpoll_send_skb_on_dev+0x25a/0x360
 netpoll_send_udp+0x526/0x850
 write_ext_msg+0x212/0x230
 ? _raw_spin_unlock_irqrestore+0x43/0x70
 ? write_msg+0x11f/0x130
 console_unlock+0x3ea/0x6e0
 vprintk_emit+0x298/0x3a0
 vprintk_default+0x1f/0x30
 vprintk_func+0x34/0xb0
 printk+0x95/0xb2
 ? show_regs_print_info+0x45/0x45
 ? nf_log_buf_open+0x2c/0x70
 ? nf_log_buf_close+0x26/0x70
 nf_log_buf_close+0x3c/0x70
 nf_log_ip_packet+0x111/0x250
 nf_log_packet+0x19e/0x330
 ? nf_logger_find_get+0x1c0/0x1c0
 ? debug_show_all_locks+0x1e0/0x1e0
 ? __local_bh_enable_ip+0x64/0xb0
 ? debug_smp_processor_id+0x17/0x20
 log_tg+0x13d/0x170
 ? log_tg_check+0x70/0x70
 ? trace_hardirqs_on+0xe/0x10
 ? __local_bh_enable_ip+0x64/0xb0
 ? _raw_spin_unlock_bh+0x35/0x40
 ipt_do_table+0x770/0xbb0
 ? mark_lock+0xb7/0x7d0
 ? sched_clock_cpu+0x1c/0x130
 ? ipt_alloc_initial_table+0x2d0/0x2d0
 ? debug_smp_processor_id+0x17/0x20
 ? __lock_is_held+0x55/0x110
 ? ipt_unregister_table+0x50/0x50
 iptable_filter_hook+0x53/0xd0
 nf_hook_slow+0x4a/0x120
 ip_local_deliver+0x1ba/0x2c0
 ? ip_local_deliver+0x100/0x2c0
 ? ip_call_ra_chain+0x270/0x270
 ? inet_del_offload+0x40/0x40
 ip_rcv_finish+0x2b9/0x880
 ip_rcv+0x51f/0x8a0
 ? ip_rcv+0x5ae/0x8a0
 ? ip_local_deliver+0x2c0/0x2c0
 ? ip_local_deliver_finish+0x4d0/0x4d0
 ? ip_local_deliver+0x2c0/0x2c0
 __netif_receive_skb_core+0xd4b/0x1210
 ? enqueue_to_backlog+0x620/0x620
 ? ktime_get_with_offset+0x11d/0x290
 __netif_receive_skb+0x27/0xc0
 ? debug_smp_processor_id+0x17/0x20
 netif_receive_skb_internal+0x3e3/0xc90
 ? netif_receive_skb_internal+0x90/0xc90
 ? __build_skb+0x2f/0x140
 ? __dev_queue_xmit+0xd30/0xd30
 ? debug_dma_sync_single_for_device+0xb7/0xc0
 ? debug_dma_sync_single_for_cpu+0xc0/0xc0
 ? dev_gro_receive+0x90/0x9b0
 ? __lock_is_held+0x30/0x110
 ? __asan_loadN+0x10/0x20
 ? skb_gro_reset_offset+0x93/0x140
 napi_gro_receive+0x1d1/0x270
 rtl8169_poll+0x49b/0xb30
 net_rx_action+0x4c4/0x7d0
 ? napi_complete_done+0x1b0/0x1b0
 ? __lock_is_held+0x30/0x110
 __do_softirq+0x113/0x611
 run_ksoftirqd+0x22/0x90
 smpboot_thread_fn+0x348/0x4f0
 ? __local_bh_enable_ip+0xb0/0xb0
 ? sort_range+0x30/0x30
 ? schedule+0x6c/0xe0
 ? __kthread_parkme+0xf2/0x110
 kthread+0x1ab/0x200
 ? sort_range+0x30/0x30
 ? __kthread_create_on_node+0x340/0x340
 ret_from_fork+0x27/0x40
Code: 55 48 89 e5 e8 97 ff ff ff 84 c0 74 02 5d c3 80 3d 5d 3e 06 01 00 75 f5 
48 c7 c7 20 69 f1 a4 c6 05 4d 3e 06 01 01 e8 ca 41 bc ff <0f> ff 5d c3 90 55 48 
89 e5 41 54 44 8d 27 48 8d 3e 53 48 8d 1e 
---[ end trace a9116b75ea217b54 ]---




Re: [PATCH v2 1/3] mfd: max8998: Remove CONFIG_OF around max8998_dt_match

2017-04-12 Thread Lee Jones
On Tue, 11 Apr 2017, Florian Fainelli wrote:

> A subsequent patch is going to make of_match_node() an inline stub when
> CONFIG_OF is disabled which will properly take care of having the compiler
> eliminate the variable. To avoid more #ifdef/#else, just always make the match
> table available.
> 
> Signed-off-by: Florian Fainelli <f.faine...@gmail.com>
> ---
>  drivers/mfd/max8998.c | 2 --
>  1 file changed, 2 deletions(-)

If it works, great!

For my own reference:
  Acked-for-MFD-by: Lee Jones <lee.jo...@linaro.org>
  
> diff --git a/drivers/mfd/max8998.c b/drivers/mfd/max8998.c
> index 4c33b8063bc3..372f681ec1bb 100644
> --- a/drivers/mfd/max8998.c
> +++ b/drivers/mfd/max8998.c
> @@ -129,14 +129,12 @@ int max8998_update_reg(struct i2c_client *i2c, u8 reg, 
> u8 val, u8 mask)
>  }
>  EXPORT_SYMBOL(max8998_update_reg);
>  
> -#ifdef CONFIG_OF
>  static const struct of_device_id max8998_dt_match[] = {
>   { .compatible = "maxim,max8998", .data = (void *)TYPE_MAX8998 },
>   { .compatible = "national,lp3974", .data = (void *)TYPE_LP3974 },
>   { .compatible = "ti,lp3974", .data = (void *)TYPE_LP3974 },
>   {},
>  };
> -#endif
>  
>  /*
>   * Only the common platform data elements for max8998 are parsed here from 
> the

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-04-10 Thread Dave Jones
On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com wrote:
 > Hi all,
 > 
 > I seem to be hitting this use-after-free on a -next kernel using trinity:
 >
 > [  531.036054] BUG: KASAN: use-after-free in prb_retire_rx_blk_timer_expired 
 > (net/packet/af_packet.c:688) 
 > [  531.036961] Read of size 8 at addr 88038c1fb0e8 by task 
 > swapper/1/0  
 >   [  531.037727] 
 >  
 >   [  531.037928] CPU: 1 PID: 0 Comm: swapper/1 Not 
 > tainted 4.11.0-rc5-next-20170407-dirty #24

Funny, I was just going over my old pending bugs, and found this one
from January that looks like the same bug, but without KASAN:

context: PID: 0  TASK: 881ff2fa5100  CPU: 5   COMMAND: "swapper/5"
panic: general protection fault:  [#1]
netversion: 2.2-1 (Feb 2014)
Backtrace:
 #0 [881fffaa3c00] machine_kexec at 81044af8
 #1 [881fffaa3c60] __crash_kexec at 810ec755
 #2 [881fffaa3d28] crash_kexec at 810ec81f
 #3 [881fffaa3d40] oops_end at 8101e348
 #4 [881fffaa3d68] die at 8101e76b
 #5 [881fffaa3d98] do_general_protection at 8101be76
 #6 [881fffaa3dc0] general_protection at 817fe5a2
[exception RIP: prb_retire_rx_blk_timer_expired+65]
RIP: 817e6e41  RSP: 881fffaa3e78  RFLAGS: 00010246
RAX:   RBX: 881fd7075800  RCX: 
RDX: 883ff0a16bb0  RSI: 0074636361757063  RDI: 881fd70758bc
RBP: 881fffaa3e88   R8: 0001   R9: 0005
R10:   R11:   R12: 881fd7075b78
R13: 0100  R14: 817e6e00  R15: 881fd7075800
ORIG_RAX:   CS: 0010  SS: 0018
 #7 [881fffaa3e90] call_timer_fn at 810cec35
 #8 [881fffaa3ec8] run_timer_softirq at 810cf01c
 #9 [881fffaa3f28] __softirqentry_text_start at 817ff05c
#10 [881fffaa3f88] irq_exit at 8107d5fc
#11 [881fffaa3f98] smp_apic_timer_interrupt at 817feea2
#12 [881fffaa3fb0] apic_timer_interrupt at 817fd56f
---  ---
#13 [881ff2fbfdd0] apic_timer_interrupt at 817fd56f
RIP: 0018  RSP:   RFLAGS: 81ebbb60
RAX: e8e0002a0400  RBX: 0067b502e95f  RCX: 0006
RDX: 002e  RSI: 0034  RDI: 0001
RBP: 81150540   R8: 881ff2fbfee0   R9: 0001
R10: 0005  R11: 81ebbb60  R12: 881ff2fbfe48
R13: 881ff2fa5110  R14:   R15: 881ff2fa5100
ORIG_RAX: 881fffab5340  CS: 20c49ba5e353f7cf  SS: ff10
WARNING: possibly bogus exception frame
Dmesg:
Code: 00 00 48 8b 93 10 03 00 00 80 bb 21 03 00 00 00 44 0f b6 83 20 03 00 00 
0f b7 c8 48 8b 34 ca 75 57 <44> 8b 5e 0c 45 85 db 74 1d 8b 93 68 03 00 00 85 d2 
74 13 f3 90 

RIP 
 [] prb_retire_rx_blk_timer_expired+0x41/0x120
 RSP 
[ cut here ]



Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-27 Thread Andrew Jones
On Thu, Mar 23, 2017 at 06:42:01PM +0100, Daniel Lezcano wrote:
> diff --git a/drivers/clocksource/timer-nps.c b/drivers/clocksource/timer-nps.c
> index da1f798..dbdb622 100644
> --- a/drivers/clocksource/timer-nps.c
> +++ b/drivers/clocksource/timer-nps.c
> @@ -256,7 +256,7 @@ static int __init nps_setup_clockevent(struct device_node 
> *node)
>   return ret;
>  
>   /* Needs apriori irq_set_percpu_devid() done in intc map function */
> - ret = request_percpu_irq(nps_timer0_irq, timer_irq_handler,
> + ret = request_percpu_irq(nps_timer0_irq, timer_irq_handler, IRQF_TIMER,
>"Timer0 (per-cpu-tick)",
>_clockevent_device);

Wrong parameter order here.

drew


Re: run_timer_softirq gpf. [smc]

2017-03-21 Thread Dave Jones
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote:
 
 > > I just hit this while fuzzing..
 > > 
 > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
 > > task: 88017f0ed440 task.stack: c9094000
 > > RIP: 0010:run_timer_softirq+0x15f/0x700
 > > RSP: 0018:880507c03ec8 EFLAGS: 00010086
 > > RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
 > > RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
 > > RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
 > > R10: 880507dd0db0 R11:  R12: 8215cc38
 > > R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
 > > FS:  () GS:880507c0() 
 > > knlGS:
 > > CS:  0010 DS:  ES:  CR0: 80050033
 > > CR2: 0004 CR3: 05011000 CR4: 001406e0
 > > Call Trace:
 > >  
 > >  ? clockevents_program_event+0x47/0x120
 > >  __do_softirq+0xbf/0x5b1
 > >  irq_exit+0xb5/0xc0
 > >  smp_apic_timer_interrupt+0x3d/0x50
 > >  apic_timer_interrupt+0x97/0xa0
 > > RIP: 0010:cpuidle_enter_state+0x12e/0x400
 > > RSP: 0018:c9097e40 EFLAGS: 0202
 > > [CONT START]  ORIG_RAX: ff10
 > > RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
 > > RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
 > > RBP: c9097e80 R08:  R09: 0008
 > > R10:  R11:  R12: 0005
 > > R13: 820b9338 R14: 0005 R15: 820b9320
 > >  
 > >  cpuidle_enter+0x17/0x20
 > >  call_cpuidle+0x23/0x40
 > >  do_idle+0xfb/0x200
 > >  cpu_startup_entry+0x71/0x80
 > >  start_secondary+0x16a/0x210
 > >  start_cpu+0x14/0x14
 > > Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 
 > > 82 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 
 > > 41 f6 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 
 > 
 > The timer which expires has timer->entry.next == POISON2 !
 > 
 > it's a classic list corruption.  The
 > bad news is that there is no trace of the culprit because that happens when
 > some other timer expires after some random amount of time.
 > 
 > If that is reproducible, then please enable debugobjects. That should
 > pinpoint the culprit.

It's net/smc. This recently had a similar bug with workqueues
(https://marc.info/?l=linux-kernel=148821582909541), fixed by commit
637fdbae60d6cb9f6e963c1079d7e0445c86ff7d, so it's probably unsurprising
that there are similar issues.


WARNING: CPU: 0 PID: 2430 at lib/debugobjects.c:289 debug_print_object+0x87/0xb0
ODEBUG: free active (active state 0) object type: timer_list hint: 
delayed_work_timer_fn+0x0/0x20
CPU: 0 PID: 2430 Comm: trinity-c4 Not tainted 4.11.0-rc3-think+ #3 
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xcb/0xf0
 warn_slowpath_fmt+0x5f/0x80
 ? debug_check_no_obj_freed+0xd9/0x260
 debug_print_object+0x87/0xb0
 ? work_on_cpu+0xd0/0xd0
 debug_check_no_obj_freed+0x219/0x260
 ? __sk_destruct+0x10d/0x1c0
 kmem_cache_free+0x9f/0x370
 __sk_destruct+0x10d/0x1c0
 sk_destruct+0x20/0x30
 __sk_free+0x43/0xa0
 sk_free+0x18/0x20
 smc_release+0x100/0x1a0 [smc]
 sock_release+0x1f/0x80
 sock_close+0x12/0x20
 __fput+0xf3/0x200
 fput+0xe/0x10
 task_work_run+0x85/0xb0
 do_exit+0x331/0xd70
 __secure_computing+0x9c/0xa0
 syscall_trace_enter+0xd1/0x3d0
 do_syscall_64+0x15f/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f535f4b19e7
RSP: 002b:7fff1a0f40e8 EFLAGS: 0246
 ORIG_RAX: 0008
RAX: ffda RBX: 0004 RCX: 7f535f4b19e7
RDX:  RSI:  RDI: 0004
RBP:  R08: 7f535fb8b000 R09: 00c17c2740a303e4
R10:  R11: 0246 R12: 7fff1a0f40f5
R13: 7f535fb60048 R14: 7f535fb83ad8 R15: 7f535fb6
---[ end trace ee67155de15508db ]---

==
[ INFO: possible circular locking dependency detected ]
4.11.0-rc3-think+ #3 Not tainted
---
trinity-c4/2430 is trying to acquire lock:
 (
(console_sem).lock
){-.-...}
, at: [] down_trylock+0x14/0x40

but task is already holding lock:
 (
_hash[i].lock
){-.-.-.}
, at: [] debug_check_no_obj_freed+0xd9/0x260

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3
 (
_hash[i].lock
){-.-.-.}
:
   lock_acquire+0x102/0x260
   _raw_spin_lock_irqsave+0x4c/0x90
   __debug_object_init+0x79/0x460
   debug_object_init+0x16/0x20
   hrtimer_init+0x25/0x1d0
   init_dl_task_timer+0x20/0x30
   __sched_fork.isra.91+0x9c/0x140
   init_idle+0x51/0x240
   sched_init+0x4cd/0x547
   start_kernel+0x246/0x45d
   x86_64_start_reservations+0x2a/0x2c
   x86_64_start_kernel+0x178/0x18b
   

Re: [4.10+] sctp lockdep trace

2017-03-14 Thread Dave Jones
On Tue, Mar 14, 2017 at 11:35:33AM +0800, Xin Long wrote:
 > >> > [  245.416594]  (
 > >> > [  245.424928] sk_lock-AF_INET
 > >> > [  245.433279] ){+.+.+.}
 > >> > [  245.441889] , at: [] sctp_sendmsg+0x330/0xfe0 
 > >> > [sctp]
 > >> > [  245.450167]
 > >> >stack backtrace:
 > >> > [  245.466352] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 
 > >> > 4.10.0-think+ #7
 > >> > [  245.482894] Call Trace:
 > >> > [  245.491096]  dump_stack+0x68/0x93
 > >> > [  245.499314]  lockdep_rcu_suspicious+0xce/0xf0
 > >> > [  245.507610]  sctp_hash_transport+0x6c0/0x7e0 [sctp]
 > >> > [  245.515972]  ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp]
 > >> > [  245.524366]  sctp_assoc_add_peer+0x290/0x3c0 [sctp]
 > >> > [  245.532736]  sctp_sendmsg+0x8f7/0xfe0 [sctp]
 > >> > [  245.541040]  ? rw_copy_check_uvector+0x8e/0x190
 > >> > [  245.549402]  ? import_iovec+0x3a/0xe0
 > >> > [  245.557679]  inet_sendmsg+0x49/0x1e0
 > >> > [  245.565887]  ___sys_sendmsg+0x2d4/0x300
 > >> > [  245.574092]  ? debug_smp_processor_id+0x17/0x20
 > >> > [  245.582342]  ? debug_smp_processor_id+0x17/0x20
 > >> > [  245.590508]  ? get_lock_stats+0x19/0x50
 > >> > [  245.598641]  __sys_sendmsg+0x54/0x90
 > >> > [  245.606745]  SyS_sendmsg+0x12/0x20
 > >> > [  245.614784]  do_syscall_64+0x66/0x1d0
 > >> > [  245.622828]  entry_SYSCALL64_slow_path+0x25/0x25
 > >> > [  245.630894] RIP: 0033:0x7fe095fcb0f9
 > >> > [  245.638962] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246
 > >> > [  245.647071]  ORIG_RAX: 002e
 > >> > [  245.655186] RAX: ffda RBX: 002e RCX: 
 > >> > 7fe095fcb0f9
 > >> > [  245.663435] RDX: 0080 RSI: 5592de12ddc0 RDI: 
 > >> > 012d
 > >> > [  245.671776] RBP: 7fe0965c8000 R08: c000 R09: 
 > >> > 00dc
 > >> > [  245.680111] R10: 000302120088 R11: 0246 R12: 
 > >> > 0002
 > >> > [  245.688460] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 
 > >> > 7fe0965c8000
 > >> >
 > >>
 > >> Cc'ing Xin and linux-sctp@ mailing list.
 > >
 > > Seems the same as Andrey Konovalov had reported?
 > >
 > I would think so, this patch has fixed it:
 > 
 > commit 5179b26694c92373275e4933f5d0ff32d585c675
 > Author: Xin Long 
 > Date:   Tue Feb 28 12:41:29 2017 +0800
 > 
 > sctp: call rcu_read_lock before checking for duplicate transport nodes
 > 
 > not sure which commit your tests are based on, Dave, can you
 > check if this fix has been in your test kernel?

Haven't seen this in a while. Let's call it fixed.

Dave


[4.10+] sctp lockdep trace

2017-02-24 Thread Dave Jones
[  244.251557] ===
[  244.263321] [ ERR: suspicious RCU usage.  ]
[  244.274982] 4.10.0-think+ #7 Not tainted
[  244.286511] ---
[  244.298008] ./include/linux/rhashtable.h:602 suspicious 
rcu_dereference_check() usage!
[  244.309665] 
   other info that might help us debug this:

[  244.344629] 
   rcu_scheduler_active = 2, debug_locks = 1
[  244.367839] 1 lock held by trinity-c30/1781:
[  244.379481]  #0: 
[  244.390848]  (
[  244.402372] sk_lock-AF_INET
[  244.413825] ){+.+.+.}
[  244.425231] , at: [] sctp_sendmsg+0x330/0xfe0 [sctp]
[  244.436774] 
   stack backtrace:
[  244.459620] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 
[  244.482790] Call Trace:
[  244.494201]  dump_stack+0x68/0x93
[  244.505598]  lockdep_rcu_suspicious+0xce/0xf0
[  244.516924]  sctp_hash_transport+0x406/0x7e0 [sctp]
[  244.528137]  ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp]
[  244.539243]  sctp_assoc_add_peer+0x290/0x3c0 [sctp]
[  244.550291]  sctp_sendmsg+0x8f7/0xfe0 [sctp]
[  244.561258]  ? rw_copy_check_uvector+0x8e/0x190
[  244.572308]  ? import_iovec+0x3a/0xe0
[  244.583232]  inet_sendmsg+0x49/0x1e0
[  244.594150]  ___sys_sendmsg+0x2d4/0x300
[  244.605002]  ? debug_smp_processor_id+0x17/0x20
[  244.615844]  ? debug_smp_processor_id+0x17/0x20
[  244.626533]  ? get_lock_stats+0x19/0x50
[  244.637141]  __sys_sendmsg+0x54/0x90
[  244.647817]  SyS_sendmsg+0x12/0x20
[  244.658400]  do_syscall_64+0x66/0x1d0
[  244.668990]  entry_SYSCALL64_slow_path+0x25/0x25
[  244.679582] RIP: 0033:0x7fe095fcb0f9
[  244.690079] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246
[  244.700704]  ORIG_RAX: 002e
[  244.711248] RAX: ffda RBX: 002e RCX: 7fe095fcb0f9
[  244.721818] RDX: 0080 RSI: 5592de12ddc0 RDI: 012d
[  244.732282] RBP: 7fe0965c8000 R08: c000 R09: 00dc
[  244.742576] R10: 000302120088 R11: 0246 R12: 0002
[  244.752804] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 7fe0965c8000

[  244.775549] ===
[  244.785875] [ ERR: suspicious RCU usage.  ]
[  244.796951] 4.10.0-think+ #7 Not tainted
[  244.807185] ---
[  244.819213] ./include/linux/rhashtable.h:605 suspicious 
rcu_dereference_check() usage!
[  244.829420] 
   other info that might help us debug this:

[  244.859963] 
   rcu_scheduler_active = 2, debug_locks = 1
[  244.879766] 1 lock held by trinity-c30/1781:
[  244.889953]  #0: 
[  244.90]  (
[  244.909854] sk_lock-AF_INET
[  244.919645] ){+.+.+.}
[  244.929238] , at: [] sctp_sendmsg+0x330/0xfe0 [sctp]
[  244.939167] 
   stack backtrace:
[  244.958506] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 
[  244.978102] Call Trace:
[  244.987735]  dump_stack+0x68/0x93
[  244.997112]  lockdep_rcu_suspicious+0xce/0xf0
[  245.006588]  sctp_hash_transport+0x4ca/0x7e0 [sctp]
[  245.016264]  ? sctp_endpoint_bh_rcv+0x171/0x290 [sctp]
[  245.025797]  sctp_assoc_add_peer+0x290/0x3c0 [sctp]
[  245.035380]  sctp_sendmsg+0x8f7/0xfe0 [sctp]
[  245.044883]  ? rw_copy_check_uvector+0x8e/0x190
[  245.054464]  ? import_iovec+0x3a/0xe0
[  245.064016]  inet_sendmsg+0x49/0x1e0
[  245.073516]  ___sys_sendmsg+0x2d4/0x300
[  245.082967]  ? debug_smp_processor_id+0x17/0x20
[  245.092448]  ? debug_smp_processor_id+0x17/0x20
[  245.101850]  ? get_lock_stats+0x19/0x50
[  245.70]  __sys_sendmsg+0x54/0x90
[  245.120451]  SyS_sendmsg+0x12/0x20
[  245.129649]  do_syscall_64+0x66/0x1d0
[  245.138783]  entry_SYSCALL64_slow_path+0x25/0x25
[  245.147678] RIP: 0033:0x7fe095fcb0f9
[  245.156588] RSP: 002b:7ffc5601b1d8 EFLAGS: 0246
[  245.165503]  ORIG_RAX: 002e
[  245.174601] RAX: ffda RBX: 002e RCX: 7fe095fcb0f9
[  245.183861] RDX: 0080 RSI: 5592de12ddc0 RDI: 012d
[  245.193038] RBP: 7fe0965c8000 R08: c000 R09: 00dc
[  245.202214] R10: 000302120088 R11: 0246 R12: 0002
[  245.211261] R13: 7fe0965c8048 R14: 7fe0966a1ad8 R15: 7fe0965c8000

[  245.308216] ===
[  245.317295] [ ERR: suspicious RCU usage.  ]
[  245.327876] 4.10.0-think+ #7 Not tainted
[  245.337065] ---
[  245.345840] ./include/linux/rhashtable.h:616 suspicious rcu_dereference_check() usage!
[  245.356501] 
   other info that might help us debug this:

[  245.382185] 
   rcu_scheduler_active = 2, debug_locks = 1
[  245.399415] 1 lock held by trinity-c30/1781:
[  245.408138]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [] sctp_sendmsg+0x330/0xfe0 [sctp]
[  245.450167] 
   stack backtrace:
[  245.466352] CPU: 3 PID: 1781 Comm: trinity-c30 Not tainted 4.10.0-think+ #7 

Re: [PATCH v2 net-next] liquidio: improve UDP TX performance

2017-02-21 Thread Rick Jones

On 02/21/2017 01:09 PM, Felix Manlunas wrote:

From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
  gather lists with one large consistent DMA allocation per ring
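The two bullet points above can be sketched as simple arithmetic: one contiguous per-ring allocation is carved into fixed-size slots, instead of doing a streaming allocation for each packet's info buffer and gather list. The sizes below are made-up placeholders, not the liquidio driver's real values:

```python
# Model of the allocation change: instead of allocating an info buffer and a
# gather list per packet (many small streaming DMA mappings), carve both out
# of one large per-ring block that is allocated once at ring setup.

INFO_SZ = 16        # hypothetical per-descriptor info-buffer size
GLIST_SZ = 128      # hypothetical per-descriptor gather-list size
RING_SIZE = 512     # the new, smaller ring size from the patch

def slot_offsets(ring_size=RING_SIZE, info_sz=INFO_SZ, glist_sz=GLIST_SZ):
    """Return (info_offset, glist_offset) per descriptor index inside one
    contiguous allocation of ring_size * (info_sz + glist_sz) bytes."""
    stride = info_sz + glist_sz
    return [(i * stride, i * stride + info_sz) for i in range(ring_size)]

offsets = slot_offsets()
total_bytes = RING_SIZE * (INFO_SZ + GLIST_SZ)
print(total_bytes, offsets[0], offsets[1])
```

Every descriptor's buffers are then found by offset into the one block, so the per-packet map/unmap cost disappears from the hot path.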

Netperf benchmark numbers before and after patch:

PF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.52    |    0.93    |  +78.9  |
|   1    |  1024  |    1.62    |    2.84    |  +75.3  |
|        |  1518  |    2.44    |    4.21    |  +72.5  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.45    |    1.59    | +253.3  |
|   4    |  1024  |    1.34    |    5.48    | +308.9  |
|        |  1518  |    2.27    |    8.31    | +266.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.40    |    1.61    | +302.5  |
|   8    |  1024  |    1.64    |    4.24    | +158.5  |
|        |  1518  |    2.87    |    6.52    | +127.2  |
+--------+--------+------------+------------+---------+


VF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    1.28    |    1.49    |  +16.4  |
|   1    |  1024  |    4.44    |    4.39    |   -1.1  |
|        |  1518  |    6.08    |    6.51    |   +7.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    2.35    |    2.35    |    0.0  |
|   4    |  1024  |    6.41    |    8.07    |  +25.9  |
|        |  1518  |    9.56    |    9.54    |   -0.2  |
+--------+--------+------------+------------+---------+
|        |   360  |    3.41    |    3.65    |   +7.0  |
|   8    |  1024  |    9.35    |    9.34    |   -0.1  |
|        |  1518  |    9.56    |    9.57    |   +0.1  |
+--------+--------+------------+------------+---------+


Some good-looking numbers there.  As one approaches the wire limit for
bitrate, a metric like netperf's service demand can be used to
demonstrate the performance change - though there isn't an easy way to
do that for parallel flows.
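For reference, netperf's service demand is CPU time consumed per unit of data moved (usec/KB), so it keeps showing efficiency gains even once throughput saturates the wire. A rough sketch of the arithmetic, using hypothetical utilization figures and the 1-flow PF throughputs from the table above:

```python
# Back-of-the-envelope netperf-style service demand: microseconds of CPU
# consumed per KB of data transferred.  The 25%-of-one-CPU utilization is a
# hypothetical number for illustration only.

def service_demand_usec_per_kb(gbps, cpu_util_frac, ncpus=1):
    kb_per_sec = gbps * 1e9 / 8 / 1024            # payload KB moved per second
    cpu_usec_per_sec = cpu_util_frac * ncpus * 1e6  # CPU-usec burned per second
    return cpu_usec_per_sec / kb_per_sec

# Same hypothetical utilization before and after: more throughput for the
# same CPU means lower service demand, i.e. better efficiency.
before = service_demand_usec_per_kb(0.52, 0.25)
after = service_demand_usec_per_kb(0.93, 0.25)
print(round(before, 3), round(after, 3))
```

In practice netperf computes this itself when run with the `-c`/`-C` CPU-measurement options.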


happy benchmarking,

rick jones



Re: [PATCH net-next] liquidio: improve UDP TX performance

2017-02-16 Thread Rick Jones

On 02/16/2017 10:38 AM, Felix Manlunas wrote:

From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
  gather lists with one large consistent DMA allocation per ring


By how much was UDP TX performance improved?

happy benchmarking,

rick jones



Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-03 Thread Rick Jones

On 02/03/2017 10:31 AM, Willem de Bruijn wrote:

Configuring interrupts and xps from userspace at boot is more robust,
as device driver defaults can change. But especially for customers who
are unaware of these settings, choosing sane defaults won't hurt.


The devil is in finding the sane defaults.  For example, the issues 
we've seen with VMs sending traffic getting reordered when the driver 
took it upon itself to enable xps.


rick jones


Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-03 Thread Rick Jones

On 02/03/2017 10:22 AM, Benjamin Serebrin wrote:

Thanks, Michael, I'll put this text in the commit log:

XPS settings aren't write-able from userspace, so the only way I know
to fix XPS is in the driver.


??

root@np-cp1-c0-m1-mgmt:/home/stack# cat 
/sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus

,0001
root@np-cp1-c0-m1-mgmt:/home/stack# echo 0 > 
/sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus
root@np-cp1-c0-m1-mgmt:/home/stack# cat 
/sys/devices/pci:00/:00:02.0/:04:00.0/net/hed0/queues/tx-0/xps_cpus

,



prb_retire_rx_blk_timer_expired use-after-free

2017-01-18 Thread Dave Jones
RSI looks kinda like slab poison here, so re-using a free'd ptr ?
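For context on the slab-poison read: with slab debugging enabled, freed objects are filled with POISON_FREE (0x6b) and capped with POISON_END (0xa5), so the RSI value in the dump below is the classic signature of a pointer loaded from freed memory. A small sketch of that pattern check (the constants are from include/linux/poison.h; the heuristic itself is mine):

```python
# Recognizing slab poison in a register dump: a value whose bytes are almost
# all 0x6b (POISON_FREE), possibly with an 0xa5 (POISON_END), was very likely
# read out of a freed, poisoned slab object.

POISON_FREE = 0x6B
POISON_END = 0xA5

def looks_like_slab_poison(value, width=8):
    bs = value.to_bytes(width, "little")
    poisoned = sum(b in (POISON_FREE, POISON_END) for b in bs)
    return poisoned >= width - 1   # tolerate one stray byte

print(looks_like_slab_poison(0xA56B6B6B6B6B6B6B))   # RSI from the oops below
print(looks_like_slab_poison(0x00008804BC751158))   # an ordinary kernel pointer
```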

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc4-think+ #2 
task: 81e16500 task.stack: 81e0
RIP: 0010:prb_retire_rx_blk_timer_expired+0x42/0x130
RSP: 0018:880507803e30 EFLAGS: 00010246
RAX: 81e16500 RBX: 8804bc751158 RCX: 
RDX: 8804fb6e8008 RSI: a56b6b6b6b6b6b6b RDI: 0001
RBP: 880507803e48 R08:  R09: 0001
R10: 61f74469 R11: 0054 R12: 8804bc751338
R13: 8804bc7516d8 R14: 818ab6a0 R15: 8804bc751158
FS:  () GS:88050780() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 5578f64a0130 CR3: 03e11000 CR4: 001406f0
DR0: 7f539ba38000 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Call Trace:
 
 call_timer_fn+0xd2/0x340
 ? call_timer_fn+0x5/0x340
 ? prb_retire_current_block+0x100/0x100
 run_timer_softirq+0x284/0x650
 ? 0xa035c077
 ? run_timer_softirq+0x5/0x650
 ? lapic_next_deadline+0x5/0x40
 __do_softirq+0x143/0x431
 irq_exit+0xa5/0xb0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x8d/0xa0
RIP: 0010:cpuidle_enter_state+0x129/0x360
RSP: 0018:81e03db8 EFLAGS: 0246
  ORIG_RAX: ff10
RAX:  RBX: e8603cc8 RCX: 001f
RDX: 20c49ba5e353f7cf RSI: 81c5e743 RDI: 81c48102
RBP: 81e03df8 R08: cccd R09: 0018
R10: 022e R11: 0a2c R12: 0005
R13: 81eaf918 R14: 0005 R15: 81eaf900
 
 ? cpuidle_enter_state+0x113/0x360
 cpuidle_enter+0x17/0x20
 call_cpuidle+0x23/0x40
 do_idle+0xf6/0x1f0
 cpu_startup_entry+0x71/0x80
 rest_init+0xb8/0xc0
 start_kernel+0x432/0x453
 x86_64_start_reservations+0x2a/0x2c
 x86_64_start_kernel+0x178/0x18b
 start_cpu+0x14/0x14
 ? start_cpu+0x14/0x14
Code: fb 4c 89 e7 e8 b0 f1 01 00 0f b7 8b 2a 05 00 00 48 8b 93 18 05 00 00 80 
bb 29 05 00 00 00 0f b6 bb 28 05 00 00 48 8b 34 ca 75 58 <8b> 56 0c 48 89 c8 85 
d2 74 1d 8b 93 70 05 00 00 85 d2 74 13 f3 

All code

   0:   fb                      sti
   1:   4c 89 e7                mov    %r12,%rdi
   4:   e8 b0 f1 01 00          callq  0x1f1b9
   9:   0f b7 8b 2a 05 00 00    movzwl 0x52a(%rbx),%ecx
  10:   48 8b 93 18 05 00 00    mov    0x518(%rbx),%rdx
  17:   80 bb 29 05 00 00 00    cmpb   $0x0,0x529(%rbx)
  1e:   0f b6 bb 28 05 00 00    movzbl 0x528(%rbx),%edi
  25:   48 8b 34 ca             mov    (%rdx,%rcx,8),%rsi
  29:   75 58                   jne    0x83
  2b:*  8b 56 0c                mov    0xc(%rsi),%edx    <-- trapping instruction
  2e:   48 89 c8                mov    %rcx,%rax
  31:   85 d2                   test   %edx,%edx
  33:   74 1d                   je     0x52
  35:   8b 93 70 05 00 00       mov    0x570(%rbx),%edx
  3b:   85 d2                   test   %edx,%edx
  3d:   74 13                   je     0x52
  3f:   f3                      repz

Code starting with the faulting instruction
===
   0:   8b 56 0c                mov    0xc(%rsi),%edx
   3:   48 89 c8                mov    %rcx,%rax
   6:   85 d2                   test   %edx,%edx
   8:   74 1d                   je     0x27
   a:   8b 93 70 05 00 00       mov    0x570(%rbx),%edx
  10:   85 d2                   test   %edx,%edx
  12:   74 13                   je     0x27
  14:   f3                      repz

That code is the BLOCK_NUM_PKTS line here..

 677 spin_lock(&po->sk.sk_receive_queue.lock);
 678 
 679 frozen = prb_queue_frozen(pkc);
 680 pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
 681 
 682 if (unlikely(pkc->delete_blk_timer))
 683 goto out;
 684 
 685 /* We only need to plug the race when the block is partially filled.
 686  * tpacket_rcv:
 687  *  lock(); increment BLOCK_NUM_PKTS; unlock()
 688  *  copy_bits() is in progress ...
 689  *  timer fires on other cpu:
 690  *  we can't retire the current block because copy_bits
 691  *  is in progress.
 692  *
 693  */
 694 if (BLOCK_NUM_PKTS(pbd)) {
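The race that comment describes can be modeled in a few lines: the writer bumps the block's packet count under the queue lock before copying data in outside it, and the timer declines to retire a block whose count says a copy may still be in flight. This is a toy illustration of the plugging scheme, not the real af_packet state machine:

```python
# Toy model of the retire-timer race plug: tpacket_rcv takes the lock,
# increments the block's packet count, drops the lock, then copies payload.
# The timer, firing on another CPU, must not retire a partially filled block.

import threading

class Block:
    def __init__(self):
        self.lock = threading.Lock()
        self.num_pkts = 0
        self.retired = False

def writer_begin(blk):
    with blk.lock:
        blk.num_pkts += 1      # lock(); increment BLOCK_NUM_PKTS; unlock()
    # ... copy_bits() would run here, outside the lock ...

def timer_fire(blk):
    with blk.lock:
        if blk.num_pkts:       # partially filled: a copy may be in progress
            return False       # defer retirement, re-arm the timer
        blk.retired = True     # empty block: safe to retire now
        return True

blk = Block()
writer_begin(blk)
print(timer_fire(blk), blk.retired)
```

The use-after-free above suggests the timer dereferenced block state that had already been torn down, i.e. this guard (or the timer's lifetime) was bypassed somewhere.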




Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-17 Thread Rick Jones

On 01/17/2017 11:13 AM, Eric Dumazet wrote:

On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Drifting a bit, and it doesn't change the value of dealing with it, but out
of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side
applications reacting to the read return of zero triggered by the arrival of
the FIN?


Even if the application reacts, and calls close(fd), kernel will still
try to push the data that was queued into socket write queue prior to
receiving the FIN.

By allowing this RST, we can flush the whole data and react much
faster, avoiding locking memory in the kernel for very long time.


Understood.  I was just wondering if there is also an application bug here.

happy benchmarking,

rick jones


Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-17 Thread Rick Jones

On 01/17/2017 10:37 AM, Jason Baron wrote:

From: Jason Baron <jba...@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.


Drifting a bit, and it doesn't change the value of dealing with it, but 
out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the 
server-side applications reacting to the read return of zero triggered 
by the arrival of the FIN?
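For anyone following along, the "read return of zero" is the standard EOF signal that a FIN has arrived; the application's cue to close() and leave CLOSE_WAIT. A minimal sketch, using a Unix socketpair to stand in for a TCP connection:

```python
# What the server side should see when the peer sends a FIN: recv() drains
# any buffered data, then returns b"" (EOF).  shutdown(SHUT_WR) on a
# socketpair simulates the peer's FIN.

import socket

a, b = socket.socketpair()
a.sendall(b"last bytes")
a.shutdown(socket.SHUT_WR)       # peer half-closes: our side sees a FIN

chunks = []
while True:
    data = b.recv(4096)
    if not data:                 # EOF: the FIN has been consumed
        break
    chunks.append(data)
b.close()                        # reacting to EOF gets us out of CLOSE_WAIT
a.close()
print(b"".join(chunks))
```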


happy benchmarking,

rick jones


Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout

2017-01-13 Thread Rick Jones

On 01/13/2017 02:56 PM, Tom Herbert wrote:

On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed

what configuration are you running ? what traffic ?


Nothing fancy. 8 queues and 20 concurrent netperf TCP_STREAMs trips
it. Not a lot of them, but I don't think we really should ever see
these errors.


Straight-up defaults with netperf, or do you use specific -s/S or -m/M 
options?


happy benchmarking,

rick jones



ipv6: remove unnecessary inet6_sk check

2016-12-28 Thread Dave Jones
np is already assigned in the variable declaration of ping_v6_sendmsg.
At this point, we have already dereferenced np several times, so the
NULL check is also redundant.

Suggested-by: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index e1f8b34d7a2e..9b522fa90e6d 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -126,12 +126,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
 
-   np = inet6_sk(sk);
-   if (!np) {
-   err = -EBADF;
-   goto dst_err_out;
-   }
-
if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
fl6.flowi6_oif = np->mcast_oif;
else if (!fl6.flowi6_oif)
@@ -166,7 +160,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
release_sock(sk);
 
-dst_err_out:
dst_release(dst);
 
if (err)


sunrpc: Illegal context switch in RCU read-side critical section!

2016-12-27 Thread Dave Jones
Just noticed this on 4.9. Will try and repro on 4.10rc1 later, but hitting
unrelated boot problems on that machine right now.

===
[ INFO: suspicious RCU usage. ]
4.9.0-backup-debug+ #1 Not tainted
---
./include/linux/rcupdate.h:557 Illegal context switch in RCU read-side critical section!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
5 locks held by kworker/4:1/66:
 #0:  ("%s"("ipv6_addrconf")){.+.+..}, at: [] process_one_work+0x184/0x790
 #1:  ((addr_chk_work).work){+.+...}, at: [] process_one_work+0x184/0x790
 #2:  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
 #3:  (rcu_read_lock_bh){..}, at: [] addrconf_verify_rtnl+0x23/0x500
 #4:  (rcu_read_lock){..}, at: [] atomic_notifier_call_chain+0x5/0x110

stack backtrace:
CPU: 4 PID: 66 Comm: kworker/4:1 Not tainted 4.9.0-backup-debug+ #1
Workqueue: ipv6_addrconf addrconf_verify_work
 c9273a28 8e5b4ca5 88042ae19780 0001
 c9273a58 8e0d530e  8efcc659
 09a7 8804180b8580 c9273a80 8e0ad2b7
Call Trace:
 [] dump_stack+0x68/0x93
 [] lockdep_rcu_suspicious+0xce/0xf0
 [] ___might_sleep.part.103+0xa7/0x230
 [] __might_sleep+0x4b/0x90
 [] lock_sock_nested+0x32/0xb0
 [] sock_setsockopt+0x8b/0xa50
 [] ? __local_bh_enable_ip+0x65/0xb0
 [] kernel_setsockopt+0x49/0x50
 [] svc_tcp_kill_temp_xprt+0x4a/0x60
 [] svc_age_temp_xprts_now+0x12f/0x1b0
 [] nfsd_inet6addr_event+0x192/0x1f0
 [] ? nfsd_inet6addr_event+0x5/0x1f0
 [] notifier_call_chain+0x39/0xa0
 [] atomic_notifier_call_chain+0x6e/0x110
 [] ? atomic_notifier_call_chain+0x5/0x110
 [] inet6addr_notifier_call_chain+0x1b/0x20
 [] ipv6_del_addr+0x12c/0x200
 [] addrconf_verify_rtnl+0x417/0x500
 [] ? addrconf_verify_rtnl+0x23/0x500
 [] addrconf_verify_work+0x13/0x20
 [] process_one_work+0x20b/0x790
 [] ? process_one_work+0x184/0x790
 [] worker_thread+0x4e/0x490
 [] ? process_one_work+0x790/0x790
 [] ? process_one_work+0x790/0x790
 [] kthread+0xff/0x120
 [] ? kthread_worker_fn+0x140/0x140
 [] ret_from_fork+0x27/0x40



ipv6: handle -EFAULT from skb_copy_bits

2016-12-22 Thread Dave Jones
By setting certain socket options on ipv6 raw sockets, we can confuse the
length calculation in rawv6_push_pending_frames triggering a BUG_ON.

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
RAX: fff2 RBX: 881f6c681680 RCX: 0002
RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
RBP: 881f6c4a7da8 R08:  R09: 0009
R10: 881fed0f6a00 R11: 0009 R12: 0030
R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80

Call Trace:
 [] ? unmap_page_range+0x693/0x830
 [] inet_sendmsg+0x67/0xa0
 [] sock_sendmsg+0x38/0x50
 [] SYSC_sendto+0xef/0x170
 [] SyS_sendto+0xe/0x10
 [] do_syscall_64+0x50/0xa0
 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define LEN 504

int main(int argc, char* argv[])
{
int fd;
int zero = 0;
char buf[LEN];

memset(buf, 0, LEN);

fd = socket(AF_INET6, SOCK_RAW, 7);

setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
}

Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..ea89073c8247 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,11 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
}
 
offset += skb_transport_offset(skb);
-   BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+   err = skb_copy_bits(skb, offset, &csum, 2);
+   if (err < 0) {
+   ip6_flush_pending_frames(sk);
+   goto out;
+   }
 
/* in case cksum was not initialized */
if (unlikely(csum))


Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-21 Thread Dave Jones
On Wed, Dec 21, 2016 at 10:33:20PM +0100, Hannes Frederic Sowa wrote:

 > > Given all of this, I think the best thing to do is validate the offset
 > > after the queue walks, which is pretty much what Dave Jones's original
 > > patch was doing.
 > 
 > I think both approaches protect against the bug reasonably well, but
 > Dave's patch has a bug: we must either call ip6_flush_pending_frames to
 > clear the socket write queue with the buggy send request.

I can fix that up and resubmit, or we can go with your approach.
DaveM ?

Dave



Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-20 Thread Dave Jones
On Tue, Dec 20, 2016 at 11:31:38AM -0800, Cong Wang wrote:
 > On Tue, Dec 20, 2016 at 10:17 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
 > > On Mon, Dec 19, 2016 at 08:36:23PM -0500, David Miller wrote:
 > >  > From: Dave Jones <da...@codemonkey.org.uk>
 > >  > Date: Mon, 19 Dec 2016 19:40:13 -0500
 > >  >
 > >  > > On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:
 > >  > >
 > >  > >  > Unfortunately, this made no difference.  I spent some time today 
 > > trying
 > >  > >  > to make a better reproducer, but failed. I'll revisit again 
 > > tomorrow.
 > >  > >  >
 > >  > >  > Maybe I need >1 process/thread to trigger this.  That would 
 > > explain why
 > >  > >  > I can trigger it with Trinity.
 > >  > >
 > >  > > scratch that last part, I finally just repro'd it with a single 
 > > process.
 > >  >
 > >  > Thanks for the info, I'll try to think about this some more.
 > >
 > > I threw in some debug printks right before that BUG_ON.
 > > it's always this:
 > >
 > > skb->len=31 skb->data_len=0 offset:30 total_len:9
 > 
 > Clearly we fail because 30 > 31 - 2, seems 'offset' is not correct here,
 > off-by-one?

Ok, I finally made a messy, albeit good enough reproducer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define LEN 504

int main(int argc, char* argv[])
{
int fd;
int zero = 0;
char buf[LEN];

memset(buf, 0, LEN);

fd = socket(AF_INET6, SOCK_RAW, 7);

setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
}



Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-20 Thread Dave Jones
On Tue, Dec 20, 2016 at 01:28:13PM -0500, David Miller wrote:
 
 > This has to do with the SKB buffer layout and geometry, not whether
 > the packet is "fragmented" in the protocol sense.
 > 
 > So no, this isn't a criteria for packets being filtered out by this
 > point.
 > 
 > Can you try to capture what sk->sk_socket->type and
 > inet_sk(sk)->hdrincl are set to at the time of the crash?
 > 

type:3 hdrincl:0

Dave



Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-20 Thread Dave Jones
On Mon, Dec 19, 2016 at 08:36:23PM -0500, David Miller wrote:
 > From: Dave Jones <da...@codemonkey.org.uk>
 > Date: Mon, 19 Dec 2016 19:40:13 -0500
 > 
 > > On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:
 > > 
 > >  > Unfortunately, this made no difference.  I spent some time today trying
 > >  > to make a better reproducer, but failed. I'll revisit again tomorrow.
 > >  > 
 > >  > Maybe I need >1 process/thread to trigger this.  That would explain why
 > >  > I can trigger it with Trinity.
 > > 
 > > scratch that last part, I finally just repro'd it with a single process.
 > 
 > Thanks for the info, I'll try to think about this some more.

I threw in some debug printks right before that BUG_ON.
it's always this:

skb->len=31 skb->data_len=0 offset:30 total_len:9

Shouldn't we have kicked out data_len=0 skb's somewhere before we got this far ?

Dave



Re: [PATCH 1/3] NFC: trf7970a: add device tree option for 27MHz clock

2016-12-20 Thread Jones Desougi
On 2016-12-20 17:16, Geoff Lansberry wrote:
> From: Geoff Lansberry 
> 
> The TRF7970A has configuration options to support hardware designs
> which use a 27.12MHz clock. This commit adds a device tree option
> 'clock-frequency' to support configuring this chip for the default
> 13.56MHz clock or the optional 27.12MHz clock.
> ---
>  .../devicetree/bindings/net/nfc/trf7970a.txt   |  4 ++
>  drivers/nfc/trf7970a.c | 50 
> +-
>  2 files changed, 43 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
> index 32b35a0..e262ac1 100644
> --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
> +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt
> @@ -21,6 +21,8 @@ Optional SoC Specific Properties:
>  - t5t-rmb-extra-byte-quirk: Specify that the trf7970a has the erratum
>where an extra byte is returned by Read Multiple Block commands issued
>to Type 5 tags.
> +- clock-frequency: Set to specify that the input frequency to the trf7970a is 13560000Hz or 27120000Hz
> +
You're adding an empty line here that is removed in the next patch.

>  
>  Example (for ARM-based BeagleBone with TRF7970A on SPI1):
>  
> @@ -43,6 +45,8 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1):
>   irq-status-read-quirk;
>   en2-rf-quirk;
>   t5t-rmb-extra-byte-quirk;
> + vdd_io_1v8;
This does not belong here, and so no need to remove in the next patch.

> + clock-frequency = <27120000>;
>   status = "okay";
>   };
>  };
> diff --git a/drivers/nfc/trf7970a.c b/drivers/nfc/trf7970a.c
> index 26c9dbb..4e051e9 100644
> --- a/drivers/nfc/trf7970a.c
> +++ b/drivers/nfc/trf7970a.c
> @@ -124,6 +124,9 @@
>NFC_PROTO_ISO15693_MASK | NFC_PROTO_NFC_DEP_MASK)
>  
>  #define TRF7970A_AUTOSUSPEND_DELAY   30000 /* 30 seconds */
> +#define TRF7970A_13MHZ_CLOCK_FREQUENCY   13560000
> +#define TRF7970A_27MHZ_CLOCK_FREQUENCY   27120000
> +
>  
>  #define TRF7970A_RX_SKB_ALLOC_SIZE   256
>  
> @@ -1056,12 +1059,11 @@ static int trf7970a_init(struct trf7970a *trf)
>  
>   trf->chip_status_ctrl &= ~TRF7970A_CHIP_STATUS_RF_ON;
>  
> - ret = trf7970a_write(trf, TRF7970A_MODULATOR_SYS_CLK_CTRL, 0);
> + ret = trf7970a_write(trf, TRF7970A_MODULATOR_SYS_CLK_CTRL,
> + trf->modulator_sys_clk_ctrl);
>   if (ret)
>   goto err_out;
>  
> - trf->modulator_sys_clk_ctrl = 0;
> -
>   ret = trf7970a_write(trf, TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS,
>   TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS_WLH_96 |
>   TRF7970A_ADJUTABLE_FIFO_IRQ_LEVELS_WLL_32);
> @@ -1181,27 +1183,37 @@ static int trf7970a_in_config_rf_tech(struct trf7970a *trf, int tech)
>   switch (tech) {
>   case NFC_DIGITAL_RF_TECH_106A:
>   trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_14443A_106;
> - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_OOK;
> + trf->modulator_sys_clk_ctrl =
> + (trf->modulator_sys_clk_ctrl & 0xF8) |
> + TRF7970A_MODULATOR_DEPTH_OOK;
>   trf->guard_time = TRF7970A_GUARD_TIME_NFCA;
>   break;
>   case NFC_DIGITAL_RF_TECH_106B:
>   trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_14443B_106;
> - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10;
> + trf->modulator_sys_clk_ctrl =
> + (trf->modulator_sys_clk_ctrl & 0xF8) |
> + TRF7970A_MODULATOR_DEPTH_ASK10;
>   trf->guard_time = TRF7970A_GUARD_TIME_NFCB;
>   break;
>   case NFC_DIGITAL_RF_TECH_212F:
>   trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_FELICA_212;
> - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10;
> + trf->modulator_sys_clk_ctrl =
> + (trf->modulator_sys_clk_ctrl & 0xF8) |
> + TRF7970A_MODULATOR_DEPTH_ASK10;
>   trf->guard_time = TRF7970A_GUARD_TIME_NFCF;
>   break;
>   case NFC_DIGITAL_RF_TECH_424F:
>   trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_FELICA_424;
> - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_ASK10;
> + trf->modulator_sys_clk_ctrl =
> + (trf->modulator_sys_clk_ctrl & 0xF8) |
> + TRF7970A_MODULATOR_DEPTH_ASK10;
>   trf->guard_time = TRF7970A_GUARD_TIME_NFCF;
>   break;
>   case NFC_DIGITAL_RF_TECH_ISO15693:
>   trf->iso_ctrl_tech = TRF7970A_ISO_CTRL_15693_SGL_1OF4_2648;
> - trf->modulator_sys_clk_ctrl = TRF7970A_MODULATOR_DEPTH_OOK;
> + trf->modulator_sys_clk_ctrl =
> + (trf->modulator_sys_clk_ctrl & 

Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-19 Thread Dave Jones
On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:

 > Unfortunately, this made no difference.  I spent some time today trying
 > to make a better reproducer, but failed. I'll revisit again tomorrow.
 > 
 > Maybe I need >1 process/thread to trigger this.  That would explain why
 > I can trigger it with Trinity.

scratch that last part, I finally just repro'd it with a single process.

Dave



Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-19 Thread Dave Jones
On Mon, Dec 19, 2016 at 02:48:48PM -0500, David Miller wrote:

 > One thing that's interesting is that if the user picks "IPPROTO_RAW"
 > as the value of 'protocol' we set inet->hdrincl to 1.
 > 
 > The user can also set inet->hdrincl to 1 or 0 via setsockopt().
 > 
 > I think this is part of the problem.  The test above means to check
 > for "RAW socket with hdrincl set" and is trying to do this more simply.
 > But because setsockopt() can set this arbitrarily, testing sk_protocol
 > alone isn't enough.
 > 
 > So changing:
 > 
 >  sk->sk_protocol == IPPROTO_RAW
 > 
 > into something like:
 > 
 >  (sk->sk_socket->type == SOCK_RAW && inet_sk(sk)->hdrincl)
 > 
 > should correct the test.
 >  ..
 > 
 > You can test if the change I suggest above works.

Unfortunately, this made no difference.  I spent some time today trying
to make a better reproducer, but failed. I'll revisit again tomorrow.

Maybe I need >1 process/thread to trigger this.  That would explain why
I can trigger it with Trinity.

Dave



Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-19 Thread Dave Jones
On Sat, Dec 17, 2016 at 10:41:20AM -0500, David Miller wrote:

 > > It seems to be possible to craft a packet for sendmsg that triggers
 > > the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:
 > > 
 > > RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
 > > RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
 > > RAX: fff2 RBX: 881f6c681680 RCX: 0002
 > > RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
 > > RBP: 881f6c4a7da8 R08:  R09: 0009
 > > R10: 881fed0f6a00 R11: 0009 R12: 0030
 > > R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80
 > > 
 > > Call Trace:
 > >  [] ? unmap_page_range+0x693/0x830
 > >  [] inet_sendmsg+0x67/0xa0
 > >  [] sock_sendmsg+0x38/0x50
 > >  [] SYSC_sendto+0xef/0x170
 > >  [] SyS_sendto+0xe/0x10
 > >  [] do_syscall_64+0x50/0xa0
 > >  [] entry_SYSCALL64_slow_path+0x25/0x25
 > > 
 > > Handle this in rawv6_push_pending_frames and jump to the failure path.
 > > 
 > > Signed-off-by: Dave Jones <da...@codemonkey.org.uk>
 > 
 > Hmmm, that's interesting.  Becaue the code in __ip6_append_data(), which
 > sets up the ->cork.base.length value, seems to be defensively trying to
 > avoid this possibility.
 > 
 > For example, it checks things like:
 > 
 >  if (cork->length + length > mtu - headersize && ipc6->dontfrag &&
 >  (sk->sk_protocol == IPPROTO_UDP ||
 >   sk->sk_protocol == IPPROTO_RAW)) {
 > 
 > This is why the transport offset plus the length should never exceed
 > the total length for that skb_copy_bits() call.
 > 
 > Perhaps this protocol check in the code above is incomplete?  Do you
 > know what the sk->sk_protocol value was when that BUG triggered?  That
 > might shine some light on what is really happening here.

Hm.
  sk_protocol = 7, 






struct sock {
  __sk_common = {
{
  skc_addrpair = 0, 
  {
skc_daddr = 0, 
skc_rcv_saddr = 0
  }
}, 
{
  skc_hash = 0, 
  skc_u16hashes = {0, 0}
}, 
{
  skc_portpair = 458752, 
  {
skc_dport = 0, 
skc_num = 7
  }
}, 
skc_family = 10, 
skc_state = 7 '\a', 
skc_reuse = 1 '\001', 
skc_reuseport = 0 '\000', 
skc_ipv6only = 0 '\000', 
skc_net_refcnt = 1 '\001', 
skc_bound_dev_if = 0, 
{
  skc_bind_node = {
next = 0x0, 
pprev = 0x0
  }, 
  skc_portaddr_node = {
next = 0x0, 
pprev = 0x0
  }
}, 
skc_prot = 0x81cf3bc0 , 
skc_net = {
  net = 0x81ce78c0 
}, 
skc_v6_daddr = {
  in6_u = {
u6_addr8 = 
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
u6_addr32 = {0, 0, 0, 0}
  }
}, 
}, 
skc_v6_rcv_saddr = {
  in6_u = {
u6_addr8 = 
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
u6_addr32 = {0, 0, 0, 0}
  }
}, 
skc_cookie = {
  counter = 0
}, 
{
  skc_flags = 256, 
  skc_listener = 0x100, 
  skc_tw_dr = 0x100
}, 
skc_dontcopy_begin = 0x881fd1ce9b68, 
{
  skc_node = {
next = 0x0, 
pprev = 0x0
  }, 
  skc_nulls_node = {
next = 0x0, 
pprev = 0x0
  }
}, 
skc_tx_queue_mapping = -1, 
{
  skc_incoming_cpu = -1, 
  skc_rcv_wnd = 4294967295, 
  skc_tw_rcv_nxt = 4294967295
}, 
skc_refcnt = {
  counter = 1
}, 
skc_dontcopy_end = 0x881fd1ce9b84, 
{
  skc_rxhash = 0, 
  skc_window_clamp = 0, 
  skc_tw_snd_nxt = 0
}
  }, 
  sk_lock = {
slock = {
  {
rlock = {
  raw_lock = {
val = {
  counter = 0
}
  }
}
  }
}, 
owned = 1, 
wq = {
  lock = {
{
  rlock = {
raw_lock = {
  val = {
counter = 0
  }
}
  }
}
  }, 
  task_list = {
next = 0x881fd1ce9b98, 
prev = 0x881fd1ce9b98
  }
}
  }, 
  sk_receive_queue = {
next = 0x881fd1ce9ba8, 
prev = 0x881fd1ce9ba8, 
qlen = 0, 
lock = {
  {
rlock = {
  raw_lock = {
val = {
  counter = 0
}
  }
}
  }
}
  }, 
  sk_backlog = {
rmem_alloc = {
  counter = 0
}, 
len = 0, 
head = 0x0, 
tail = 0x0
  }, 
  sk_forward_alloc = 0, 
  sk_txhash = 0, 
  sk_napi_id = 0, 
  sk_ll_usec = 0, 
  sk_drops = {
counter = 0
  }, 
  sk_rcvbuf = 1

Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-17 Thread Dave Jones
On Sat, Dec 17, 2016 at 10:41:20AM -0500, David Miller wrote:
 > From: Dave Jones <da...@codemonkey.org.uk>
 > Date: Wed, 14 Dec 2016 10:47:29 -0500
 > 
 > > It seems to be possible to craft a packet for sendmsg that triggers
 > > the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:
 > > 
 > > RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
 > > RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
 > > RAX: fff2 RBX: 881f6c681680 RCX: 0002
 > > RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
 > > RBP: 881f6c4a7da8 R08:  R09: 0009
 > > R10: 881fed0f6a00 R11: 0009 R12: 0030
 > > R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80
 > > 
 > > Call Trace:
 > >  [] ? unmap_page_range+0x693/0x830
 > >  [] inet_sendmsg+0x67/0xa0
 > >  [] sock_sendmsg+0x38/0x50
 > >  [] SYSC_sendto+0xef/0x170
 > >  [] SyS_sendto+0xe/0x10
 > >  [] do_syscall_64+0x50/0xa0
 > >  [] entry_SYSCALL64_slow_path+0x25/0x25
 > > 
 > > Handle this in rawv6_push_pending_frames and jump to the failure path.
 > > 
 > > Signed-off-by: Dave Jones <da...@codemonkey.org.uk>
 > 
 > Hmmm, that's interesting.  Becaue the code in __ip6_append_data(), which
 > sets up the ->cork.base.length value, seems to be defensively trying to
 > avoid this possibility.
 > 
 > For example, it checks things like:
 > 
 >  if (cork->length + length > mtu - headersize && ipc6->dontfrag &&
 >  (sk->sk_protocol == IPPROTO_UDP ||
 >   sk->sk_protocol == IPPROTO_RAW)) {
 > 
 > This is why the transport offset plus the length should never exceed
 > the total length for that skb_copy_bits() call.
 > 
 > Perhaps this protocol check in the code above is incomplete?  Do you
 > know what the sk->sk_protocol value was when that BUG triggered?  That
 > might shine some light on what is really happening here.

I'll see if I can craft up a reproducer next week.
For some reason I've not hit this on my test setup at home, but it
reproduces daily in our test setup at facebook.  The only thing
I can think of is that those fb boxes are ipv6 only, so I might be
exercising v4 more at home.

Dave



ipv6: handle -EFAULT from skb_copy_bits

2016-12-14 Thread Dave Jones
It seems to be possible to craft a packet for sendmsg that triggers
the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
RAX: fff2 RBX: 881f6c681680 RCX: 0002
RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
RBP: 881f6c4a7da8 R08:  R09: 0009
R10: 881fed0f6a00 R11: 0009 R12: 0030
R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80

Call Trace:
 [] ? unmap_page_range+0x693/0x830
 [] inet_sendmsg+0x67/0xa0
 [] sock_sendmsg+0x38/0x50
 [] SYSC_sendto+0xef/0x170
 [] SyS_sendto+0xe/0x10
 [] do_syscall_64+0x50/0xa0
 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
}
 
offset += skb_transport_offset(skb);
-   BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+   err = skb_copy_bits(skb, offset, &csum, 2);
+   if (err < 0)
+   goto out;
 
/* in case cksum was not initialized */
if (unlikely(csum))



netconsole: sleeping function called from invalid context

2016-12-08 Thread Dave Jones
I think this has been around for a while, but for some reason I'm running into
it a lot today.


BUG: sleeping function called from invalid context at kernel/irq/manage.c:110
in_atomic(): 1, irqs_disabled(): 1, pid: 1839, name: modprobe
no locks held by modprobe/1839.
Preemption disabled at:
[] write_ext_msg+0x73/0x2d0
CPU: 0 PID: 1839 Comm: modprobe Not tainted 4.9.0-rc8-think+ #5 
 880442287300
 81651e19
 8001
 
 88044221d380
 006e
 880442287338
 87c3
 88044221d388
 8207b940
 006e
 
Call Trace:
 [] dump_stack+0x6c/0x93
 [] ___might_sleep+0x193/0x210
 [] __might_sleep+0x71/0xe0
 [] ? __synchronize_hardirq+0x94/0xa0
 [] synchronize_irq+0xa8/0x170
 [] ? set_irq_wake_real+0x90/0x90
 [] ? synchronize_irq+0x5/0x170
 [] ? disable_irq+0x5/0x30
 [] disable_irq+0x28/0x30
 [] e1000_netpoll+0x1c4/0x200
 [] ? e1000_intr_msix_tx+0x190/0x190
 [] netpoll_poll_dev+0xa0/0x3b0
 [] ? preempt_count_sub+0x18/0xd0
 [] netpoll_send_skb_on_dev+0x20d/0x3d0
 [] netpoll_send_udp+0x535/0x8c0
 [] write_ext_msg+0x286/0x2d0
 [] ? check_preemption_disabled+0x3b/0x160
 [] call_console_drivers.isra.20.constprop.26+0x165/0x310
 [] console_unlock+0x3b6/0x840
 [] vprintk_emit+0x4b5/0x6e0
 [] vprintk_default+0x48/0x80
 [] printk+0xbc/0xe7
 [] ? printk_lock.constprop.1+0x102/0x102
 [] ? printk+0x5/0xe7
 [] ? bt_init+0x1/0xfa [bluetooth]
 [] bt_info+0xdd/0x110 [bluetooth]
 [] ? bt_to_errno+0x50/0x50 [bluetooth]
 [] ? bt_info+0x5/0x110 [bluetooth]
 [] sco_init+0xb0/0xc40 [bluetooth]
 [] ? 0xa099
 [] bt_init+0x9d/0xfa [bluetooth]
 [] do_one_initcall+0x199/0x220
 [] ? initcall_blacklisted+0x170/0x170
 [] ? do_init_module+0xe3/0x2fd
 [] ? 0xa099
 [] ? do_one_initcall+0x5/0x220
 [] ? __asan_register_globals+0x7c/0xa0
 [] do_init_module+0xf4/0x2fd
 [] load_module+0x3a79/0x4670
 [] ? disable_ro_nx+0x80/0x80
 [] ? module_frob_arch_sections+0x20/0x20
 [] ? __buffer_unlock_commit+0x4a/0x90
 [] ? trace_function+0x9c/0xc0
 [] ? function_trace_call+0xea/0x290
 [] ? SYSC_finit_module+0x181/0x1c0
 [] ? module_frob_arch_sections+0x20/0x20
 [] ? get_user_arg_ptr.isra.26+0xa0/0xa0
 [] ? load_module+0x5/0x4670
 [] SYSC_finit_module+0x181/0x1c0
 [] ? SYSC_init_module+0x220/0x220
 [] ? function_trace_call+0xea/0x290
 [] ? SyS_init_module+0x10/0x10
 [] ? SyS_init_module+0x10/0x10
 [] ? SyS_finit_module+0x5/0x10
 [] ? __this_cpu_preempt_check+0x1c/0x20
 [] ? SyS_init_module+0x10/0x10
 [] SyS_finit_module+0xe/0x10
 [] do_syscall_64+0x100/0x2b0
 [] entry_SYSCALL64_slow_path+0x25/0x25



Re: [PATCH net-next] udp: under rx pressure, try to condense skbs

2016-12-08 Thread Rick Jones

On 12/08/2016 07:30 AM, Eric Dumazet wrote:

On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote:


Hmmm... I'm not thrilled to have such heuristics, that change memory
behavior when half of the queue size (sk->sk_rcvbuf) is reached.


Well, copybreak drivers do that unconditionally, even under no stress at
all, you really should complain then.


Isn't that behaviour based (in part?) on the observation/belief that it 
is fewer cycles to copy the small packet into a small buffer than to 
send the larger buffer up the stack and have to allocate and map a 
replacement?


rick jones



Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-02 Thread Rick Jones

On 12/02/2016 03:23 PM, Martin KaFai Lau wrote:

When XDP prog is attached, it is currently limiting
MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514
in x86.

AFAICT, since mlx4 is doing one page per packet for XDP,
we can at least raise the MTU limitation up to
PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is
doing.  It will be useful in the next patch which allows
XDP program to extend the packet by adding new header(s).


Is mlx4 the only driver doing page-per-packet?

rick jones



Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 02:12 PM, Tom Herbert wrote:

We have to consider both the request side and the response side in RPC.
Presumably, something like a memcache server is mostly serving data as
opposed to reading it, so we are looking at receiving much smaller
packets than we send. Requests are going to be quite small, say 100
bytes, and unless we are doing a significant amount of pipelining on
connections GRO would rarely kick in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific; this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size and 8K, 16K, 64K response sizes might be good for a test.


No worries on the specific sizes, it is a classic "How long is a piece 
of string?" sort of question.


Not surprisingly, as the size of what is being received grows, so too 
does the delta between GRO on and off.


stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; 
do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro 
$gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- 
-P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; 
done; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would 
have to reduce path-length to maintain the CPU overhead, were the 
mechanism to preclude GRO.


rick




Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 12:18 PM, Tom Herbert wrote:

On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Just how much per-packet path-length are you thinking will go away under the
likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
non-trivial things to effective overhead (service demand) and so throughput:


For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do between GRO and the stack is the same; the
differences are that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.


So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be ~ 
(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation which unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc 
etc.



Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and
many connections.


That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is sufficiently common, so I'll 
grasp at that as the length of a piece of string:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S% Uus/Tr   us/Tr

16384  87380  128 14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S% Uus/Tr   us/Tr

16384  87380  128 14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?


I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.


happy benchmarking,

rick

Just to get a crude feel for sensitivity: doubling the response to 28K 
unsurprisingly more than doubles the delta, and halving it to 7K narrows it:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S% Uus/Tr   us/Tr

16384  87380  128 28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S% Uus/Tr   us/Tr

16384  87380  128 28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S

Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 11:05 AM, Tom Herbert wrote:

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.


Just how much per-packet path-length are you thinking will go away under 
the likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO 
does some non-trivial things to effective overhead (service demand) and 
so throughput:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   SendSend  Utilization   Service 
Demand

Socket Socket  Message  Elapsed  Send Recv SendRecv
Size   SizeSize Time Throughput  localremote   local 
remote

bytes  bytes   bytessecs.10^6bits/s  % S  % U  us/KB   us/KB

 87380  16384  1638410.00  9260.24   2.02 -1.000.428 
-1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   SendSend  Utilization   Service 
Demand

Socket Socket  Message  Elapsed  Send Recv SendRecv
Size   SizeSize Time Throughput  localremote   local 
remote

bytes  bytes   bytessecs.10^6bits/s  % S  % U  us/KB   us/KB

 87380  16384  1638410.00  5621.82   4.25 -1.001.486 
-1.000


And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   SendSend  Utilization   Service 
Demand

Socket Socket  Message  Elapsed  Recv Send RecvSend
Size   SizeSize Time Throughput  localremote   local 
remote

bytes  bytes   bytessecs.10^6bits/s  % S  % U  us/KB   us/KB

 87380  16384  1638410.00  9154.02   4.00 -1.000.860 
-1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   SendSend  Utilization   Service 
Demand

Socket Socket  Message  Elapsed  Recv Send RecvSend
Size   SizeSize Time Throughput  localremote   local 
remote

bytes  bytes   bytessecs.10^6bits/s  % S  % U  us/KB   us/KB

 87380  16384  1638410.00  4212.06   5.36 -1.002.502 
-1.000


I'm sure there is a very non-trivial "it depends" component here - 
netperf will get the peak benefit from *SO and so one will see the peak 
difference in service demands - but even if one gets only 6 segments per 
*SO that is a lot of path-length to make-up.


4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn so to speak, the effect 
on power consumption needs to be included in the calculus.


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-30 Thread Rick Jones

On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote:

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.


In theory, once the data socket is connected, the send_data() call in 
src/nettest_omni.c is supposed to use send() rather than sendto().


And indeed, based on a quick check, send() is what is being called, 
though it seems to become a sendto() system call, with the destination 
information NULL:


write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024

write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024


So I'm not sure what might be going-on there.

You can get netperf to use write() instead of send() by adding a 
test-specific -I option.


happy benchmarking,

rick



My udp_flood tool[1] cycles through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) 
--pmtu 2
ns/pkt  pps cycles/pkt
send473.08  2113816.28  1891
sendto  558.58  1790265.84  2233
sendmsg 587.24  1702873.80  2348
sendMmsg/32 547.57  1826265.90  2189
write   518.36  1929175.52  2072

Using "send" seems to be the fastest option.

Some notes on the test: I've forced TX completions to happen on another CPU
(CPU0) and pinned the udp_flood program (to CPU2), as I want to keep the CPU
scheduler from moving udp_flood around, since that causes fluctuations in the
results (as it stresses the memory allocations more).

My udp_flood --pmtu option is documented in the --help usage text (see below 
signature)





Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/28/2016 10:33 AM, Rick Jones wrote:

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
sufficient to your needs?


Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET

Socket  Message  Elapsed  Messages
SizeSize Time Okay Errors   Throughput
bytes   bytessecs#  #   10^6bits/sec

2129921472   1.0077495  0 912.35
212992   1.0077495912.35

[1]+  Doneemacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET

send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which 
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
sufficient to your needs?


happy benchmarking,

rick



Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 04:37 PM, Julian Anastasov wrote:

On Thu, 17 Nov 2016, Rick Jones wrote:


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0


A connected socket can benefit from the dst cached in the socket,
but not if SO_DONTROUTE is set. If we do not want to send packets
via a gateway this -l 1 should help, but I don't see an IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

Also, maybe there can be another default: if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e., we should
avoid SO_DONTROUTE if possible.


The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.


I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took out a 
company's network (including security camera feeds to galactic HQ), but 
at this point I'm likely to keep it in there because there ended up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 01:44 PM, Eric Dumazet wrote:

because netperf sends the same message
over and over...


Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf 
-F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472


...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)= 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, 
st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, 
st_size=230027, st_atime=2016/11/16-09:49:29, 
st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f3099f62000

read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = "..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n  LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, "  ,  \n  "..., 4096) = 4096

...

rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0

alarm(1)= 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472


Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Well, I suppose adding another setsockopt() to the data socket creation 
wouldn't be too difficult, along with another command-line option to 
cause it to happen.


Could we leave things as "make sure you don't need fragmentation when 
you use this" or would netperf have to start processing ICMP messages?


happy benchmarking,

rick jones



Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote:

On Wed, 16 Nov 2016 09:46:37 -0800
Rick Jones <rick.jon...@hpe.com> wrote:

It is a wild guess, but does setting SO_DONTROUTE affect whether or not
a connect() would have the desired effect?  That is there to protect
people from themselves (long story about people using UDP_STREAM to
stress improperly air-gapped systems during link up/down testing)
It can be disabled with a test-specific -R 1 option, so your netperf
command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1


Using -R 1 does not seem to help remove __ip_select_ident()


Bummer.  It was a wild guess anyway, since I was seeing a connect() call 
on the data socket.



Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  CommandShared ObjectSymbol
+9.11%  netperf[kernel.vmlinux] [k] __ip_select_ident
+6.98%  netperf[kernel.vmlinux] [k] _raw_spin_lock
+6.21%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
+5.03%  netperf[kernel.vmlinux] [k] 
copy_user_enhanced_fast_string
+4.69%  netperf[kernel.vmlinux] [k] __ip_make_skb
+4.63%  netperf[kernel.vmlinux] [k] skb_set_owner_w
+4.15%  swapper[kernel.vmlinux] [k] __slab_free
+3.80%  netperf[mlx5_core]  [k] mlx5e_sq_xmit
+2.00%  swapper[kernel.vmlinux] [k] sock_wfree
+1.94%  netperfnetperf  [.] send_data
+1.92%  netperfnetperf  [.] send_omni_inner


Well, the next step I suppose is to have you try a quick netperf 
UDP_STREAM under strace to see if your netperf binary does what mine did:


strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 
1 -- -m 1472 -n -N -R 1


And see if you see the connect() I saw. (Note, I make the runtime 1 second)

rick


Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:

[1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
 - https://www.spinics.net/lists/netdev/msg294752.html

Not fixed in version 2.7.0.
 - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz

Used extra netperf configure compile options:
 ./configure  --enable-histogram --enable-demo

It seems like some fix attempts exist in the SVN repository:

 svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
 svn log -r709
 # A quick stab at getting remote connect going for UDP_STREAM
 svn diff -r708:709

Testing with SVN version, still show __ip_select_ident() in top#1.


Indeed, there was a fix for getting the remote side connect()ed. 
Looking at what I have for the top of trunk I do though see a connect() 
call being made at the local end:


socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000)   = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), 
sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3, 
"\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 
656, 0, NULL, 0) = 656

select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3, 
"\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 
656, 0, NULL, NULL) = 656

write(1, "need to connect is 1\n", 21)  = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f2824eb2cb0}, NULL, 8) = 0

alarm(1)= 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024


the only difference there from top of trunk is the "need to connect" 
write/printf I just put in the code to be a nice marker in the system 
call trace.


It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
a connect() would have the desired effect?  That is there to protect 
people from themselves (long story about people using UDP_STREAM to 
stress improperly air-gapped systems during link up/down testing) 
It can be disabled with a test-specific -R 1 option, so your netperf 
command would become:


netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1



(p.s. is netperf ever going to be converted from SVN to git?)



Well, my git-fu could use some work (gentle, offline taps with a 
clueful tutorial bat would be welcome), and at least in the past, going 
to git was held back because there were a bunch of netperf users on 
Windows and there wasn't (at the time) support for git under Windows.


But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jo...@hp.com no longer works.  rick.jon...@hpe.com should be 
used instead.


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2016-11-14 Thread Rick Jones

Lets change the example so others don't propagate the problem further.

Signed-off-by: David Wilder <dwil...@us.ibm.com>

--- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
+++ man7/netlink.7  2016-11-14 13:30:51.002086354 -0800
@@ -511,7 +511,7 @@
 .in +4n
 .nf
 int len;
-char buf[4096];
+char buf[8192];


Since there doesn't seem to be a define one could use in the user-space 
linux/netlink.h (?), and since there are already comments in the example 
code in the manpage, how about also including a brief comment to the 
effect that using 8192 bytes will avoid message truncation problems on 
platforms with a large PAGE_SIZE?


/* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */

or something like that.

rick jones


Re: [PATCH RFC 0/2] ethtool: Add actual port speed reporting

2016-11-03 Thread Rick Jones

And besides, one can argue that in the SR-IOV scenario the VF has no business
knowing the physical port speed.



Good point, but there are more use-cases we should consider.
For example, when using Multi-Host/Flex-10/Multi-PF each PF should
be able to query both physical port speed and actual speed.


Despite my email address, I'm not fully versed on VC/Flex, but I have 
always been under the impression that the flexnics created were, 
conceptually, "distinct" NICs considered independently of the physical 
port over which they operated.  Tossing another worm or three into the 
can, while "back in the day" (when some of the first ethtool changes to 
report speeds other than the "normal" ones went in) the speed of a 
flexnic was fixed, today, it can actually operate in a range.  From a 
minimum guarantee to an "if there is bandwidth available" cap.


rick jones



Re: [bnx2] [Regression 4.8] Driver loading fails without firmware

2016-10-25 Thread Rick Jones

On 10/25/2016 08:31 AM, Paul Menzel wrote:

To my knowledge, the firmware files haven’t changed in years [1].


Indeed - it looks like I read "bnx2" and thought "bnx2x".  Must remember 
to hold off on replying until after the morning orange juice is consumed :)


rick


Re: [bnx2] [Regression 4.8] Driver loading fails without firmware

2016-10-25 Thread Rick Jones

On 10/25/2016 07:33 AM, Paul Menzel wrote:

Dear Linux folks,

A server with the Broadcom devices below, fails to load the drivers
because of missing firmware.


I have run into the same sort of issue from time to time when going to a 
newer kernel.  A newer version of the driver wants a newer version of 
the firmware.  Usually, finding a package "out there" with the newer 
version of the firmware, and installing it onto the system is sufficient.


happy benchmarking,

rick jones


Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-10 Thread Rick Jones

On 10/10/2016 09:08 AM, Rick Jones wrote:

On 10/09/2016 03:33 PM, Eric Dumazet wrote:

OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf
bug, not a kernel one.

I believe I already mentioned the fact that "UDP_STREAM -- -N" was not doing
a connect() on the receiver side.


I can confirm that the receive side of the netperf omni path isn't
trying to connect UDP datagrams.  I will see what I can put together.


I've put something together and pushed it to the netperf top of trunk. 
It seems to have been successful on a quick loopback UDP_STREAM test.


happy benchmarking,

rick jones



Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-10 Thread Rick Jones

On 10/09/2016 03:33 PM, Eric Dumazet wrote:

OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf
bug, not a kernel one.

I believe I already mentioned the fact that "UDP_STREAM -- -N" was not doing
a connect() on the receiver side.


I can confirm that the receive side of the netperf omni path isn't 
trying to connect UDP datagrams.  I will see what I can put together.


happy benchmarking,

rick jones
rick.jon...@hpe.com



Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Rick Jones

On 09/29/2016 06:18 AM, Eric Dumazet wrote:

Well, then what this patch series is solving ?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.


Just for completeness, in my testing, the VMs were single-vCPU.

rick jones


Re: [PATCH RFC 0/4] xfs: Transmit flow steering

2016-09-28 Thread Rick Jones


Here is a quick look at performance tests for the result of trying the
prototype fix for the packet reordering problem with VMs sending over
an XPS-configured NIC.  In particular, the Emulex/Avago/Broadcom
Skyhawk.  The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After: 8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
for power management, with the "performance" frequency governor
loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well.  XPS was also
enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
function of the kernel being used, which had it disabled from the
first reports of XPS negatively affecting VM traffic at the beginning
of the year)

Average Mbit/s From NIC type To Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3       9224    9271
BE3, E5-2640                    9016    9022
82599, E5-2640                  9192    9003
BCM57840, E5-2640               9213    9153
Skyhawk, E5-2640                3884    8897

For completeness:
Average Mbit/s To NIC type from Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3       9322    9144
BE3, E5-2640                    9074    9017
82599, E5-2640                  8670    8564
BCM57840, E5-2640               2468 *  7979
Skyhawk, E5-2640                8897    9269

* This is the Busted bnx2x NIC FW GRO implementation issue.  It was
  not visible in the "After" because the system was setup to disable
  the NIC FW GRO by the time it booted on the fix kernel.

Average Transactions/s Between NIC type and Bare Metal BE3:

NIC Type, CPU on VM Host      Before   After
ConnectX-3 Pro, E5-2670v3      12421   12612
BE3, E5-2640                    8178    8484
82599, E5-2640                  8499    8549
BCM57840, E5-2640               8544    8560
Skyhawk, E5-2640                8537    8701
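A quick way to read the tables above is the relative change between the Before and After columns: the Skyhawk "From VM" case improves by roughly 129%, while the NICs that were not affected move by well under one percent (measurement noise). A trivial helper, mine rather than anything from the test rig:

```python
def pct_change(before, after):
    """Relative throughput change between the Before and After
    columns, in percent."""
    return (after - before) / before * 100.0

# Skyhawk "From NIC" case: 3884 -> 8897 Mbit/s, roughly +129%
# ConnectX-3 Pro:          9224 -> 9271 Mbit/s, well under +1%
```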

happy benchmarking,

Drew Balliet
Jeurg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M  - BE3
544+M - ConnectX-3 Pro
560M - 82599ES
630M - BCM57840
650M - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)
Bulk To (South) and From (North) VM, Before:
$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored.

Bulk To (South) and From (North) VM, After:

$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45
South,554

Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Rick Jones

On 09/19/2016 02:10 PM, Eric Dumazet wrote:

On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:


Looks good, but could I suggest a simple optimization.
All these parameters are immutable in the version of BBR you are submitting.
Why not make the values const? And eliminate the always-true long-term bw
estimate variable?



We could do that.

We used to have variables (aka module params) while BBR was cooking in
our kernels ;)


Are there better than epsilon odds of someone perhaps wanting to poke 
those values as it gets exposure beyond Google?


happy benchmarking,

rick jones




Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread Rick Jones

On 09/09/2016 02:53 PM, Mahesh Bandewar wrote:


@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.

+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables conn-tracking
+works in this mode and that is why L3-symsetric (L3s) from iptables 
perspective.
+This will have slightly less performance but that shouldn't matter since you
+are choosing this mode over plain-L3 mode to make conn-tracking work.


What is that first sentence trying to say?  It appears to be incomplete, 
and is that supposed to be "L3-symmetric?"


happy benchmarking,

rick jones


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Rick Jones

On 09/08/2016 11:16 AM, Tom Herbert wrote:

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<bro...@redhat.com> wrote:

On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <t...@herbertland.com> wrote:

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?


Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost.  The tailptr/doorbell write costs around 70ns.  (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps: (1/4.90 - 1/7.5)*1000 =
70.74.)  The 10G wirespeed small-packet budget is 67.2ns, so with a
fixed overhead per packet of 70ns we can never reach 10G wirespeed.
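Jesper's arithmetic can be reproduced directly: the 70ns figure is the difference in per-packet service time between the two measured rates, and the 67.2ns budget is simply the wire time of a minimum-size (84-byte-on-the-wire) frame at 10Gbit/s. Names below are mine, not from the patch:

```python
def doorbell_cost_ns(mpps_before, mpps_after):
    """Per-packet cost removed by batching the tailptr/doorbell
    write: difference of per-packet service times (rates in Mpps,
    so 1/rate is in microseconds; *1000 converts to ns)."""
    return (1.0 / mpps_before - 1.0 / mpps_after) * 1000.0

# Minimum-size Ethernet frame occupies 84 bytes on the wire
# (64B frame + preamble + inter-frame gap); at 10 Gbit/s:
wire_budget_84b_10g_ns = 84 * 8 / 10.0  # 67.2 ns per frame
```

Since 70.7ns of fixed per-packet overhead already exceeds the 67.2ns budget, amortizing the doorbell across a bulk of packets is the only way to approach wirespeed with small packets.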


But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.


With small packets and the "default" ring size for this NIC/driver 
combination, is the BQL large enough that the ring fills before one hits 
the BQL?


rick jones



Re: ipv6: release dst in ping_v6_sendmsg

2016-09-06 Thread Dave Jones
On Tue, Sep 06, 2016 at 10:52:43AM -0700, Eric Dumazet wrote:
 
 > > > @@ -126,8 +126,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct 
 > > > msghdr *msg, size_t len)
 > > >  rt = (struct rt6_info *) dst;
 > > >
 > > >  np = inet6_sk(sk);
 > > > -if (!np)
 > > > -return -EBADF;
 > > > +if (!np) {
 > > > +err = -EBADF;
 > > > +goto dst_err_out;
 > > > +}
 > > >
 > > >  if (!fl6.flowi6_oif && ipv6_addr_is_multicast())
 > > >  fl6.flowi6_oif = np->mcast_oif;
 > > > @@ -163,6 +165,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct 
 > > > msghdr *msg, size_t len)
 > > >  }
 > > >  release_sock(sk);
 > > >
 > > > +dst_err_out:
 > > > +dst_release(dst);
 > > > +
 > > >  if (err)
 > > >  return err;
 > > >
 > > 
 > > Acked-by: Martin KaFai Lau 
 > 
 > This really does not make sense to me.
 > 
 > If np was NULL, we should have a crash before.

In the case where I was seeing the traces, we were taking the 'success'
path through the function, so sk was non-null.

 > So we should remove this test, since it is absolutely useless.

Looking closer, it seems the assignment of np is duplicated also,
so that can also go.   This is orthogonal to the dst leak though.
I'll submit a follow-up cleaning that up.

Dave



ipv6: release dst in ping_v6_sendmsg

2016-09-02 Thread Dave Jones
Neither the failure or success paths of ping_v6_sendmsg release
the dst it acquires.  This leads to a flood of warnings from
"net/core/dst.c:288 dst_release" on older kernels that
don't have 8bf4ada2e21378816b28205427ee6b0e1ca4c5f1 backported.

That patch optimistically hoped this had been fixed post-3.10, but it
seems at least one case wasn't: I've seen this triggered a lot on
machines doing unprivileged icmp sockets.

Cc: Martin Lau <ka...@fb.com>
Signed-off-by: Dave Jones <da...@codemonkey.org.uk>

diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 0900352c924c..0e983b694ee8 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -126,8 +126,10 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
rt = (struct rt6_info *) dst;
 
np = inet6_sk(sk);
-   if (!np)
-   return -EBADF;
+   if (!np) {
+   err = -EBADF;
+   goto dst_err_out;
+   }
 
if (!fl6.flowi6_oif && ipv6_addr_is_multicast())
fl6.flowi6_oif = np->mcast_oif;
@@ -163,6 +165,9 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
}
release_sock(sk);
 
+dst_err_out:
+   dst_release(dst);
+
if (err)
return err;
 


Re: [PATCH] softirq: let ksoftirqd do its job

2016-08-31 Thread Rick Jones

On 08/31/2016 04:11 PM, Eric Dumazet wrote:

On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:

With regard to drops, are both of you sure you're using the same socket
buffer sizes?


Does it really matter ?


At least at points in the past I have seen different drop counts at the 
SO_RCVBUF based on using (sometimes much) larger sizes.  The hypothesis 
I was operating under at the time was that this dealt with those 
situations where the netserver was held-off from running for "a little 
while" from time to time.  It didn't change things for a sustained 
overload situation though.



In the meantime, is anything interesting happening with TCP_RR or
TCP_STREAM?


TCP_RR is driven by the network latency, we do not drop packets in the
socket itself.


I've been of the opinion it (single stream) is driven by path length. 
Sometimes by NIC latency.  But then I'm almost always measuring in the 
LAN rather than across the WAN.


happy benchmarking,

rick


Re: [PATCH] softirq: let ksoftirqd do its job

2016-08-31 Thread Rick Jones
With regard to drops, are both of you sure you're using the same socket 
buffer sizes?


In the meantime, is anything interesting happening with TCP_RR or 
TCP_STREAM?


happy benchmarking,

rick jones


Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-29 Thread Rick Jones

On 08/27/2016 12:41 PM, Tom Herbert wrote:

On Fri, Aug 26, 2016 at 9:35 PM, David Miller <da...@davemloft.net> wrote:

From: Tom Herbert <t...@herbertland.com>
Date: Thu, 25 Aug 2016 16:43:35 -0700


This seems like it will only confuse users even more. You've clearly
identified an issue, let's figure out how to fix it.


I kinda feel the same way about this situation.


I'm working on XFS (as the transmit analogue to RFS). We'll track
flows enough so that we should know when it's safe to move them.


Is the XFS you are working on going to subsume XPS or will the two 
continue to exist in parallel a la RPS and RFS?


rick jones



[PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-25 Thread Rick Jones
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened.  Some drivers
have started enabling XPS on their own initiative, and it has been
found that when a VM is sending data through a host interface with XPS
enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---

diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay 
appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host.  In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own.  Otherwise, if possible, the VMs should be CPU pinned.
+
  XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel.  It is enabled by default for SMP kernel
+configurations.  In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
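As an aside on the sysfs entry the patch describes: the value written to xps_cpus is a hexadecimal CPU bitmap. A minimal sketch of building one (the helper name is mine; machines with more than 32 CPUs use comma-separated 32-bit groups, which this sketch does not handle):

```python
def xps_cpus_mask(cpus):
    """Hex CPU bitmap of the kind written to the xps_cpus sysfs
    file: bit N set means CPU N may use the transmit queue."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

# e.g. steering tx-0 to CPUs 0 and 2:
#   echo 5 > /sys/class/net/<dev>/queues/tx-0/xps_cpus
```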


[PATCH net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-25 Thread Rick Jones
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened.  Some drivers
have started enabling XPS on their own initiative, and it has been
found that when a VM is sending data through a host interface with XPS
enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>

---

diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay 
appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host.  In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own.  Otherwise, if possible, the VMs should be CPU pinned.
+
  XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel.  It is enabled by default for SMP kernel
+configurations.  In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration


Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping

2016-08-25 Thread Rick Jones

On 08/25/2016 02:08 PM, Eric Dumazet wrote:

When XPS was submitted, it was _not_ enabled by default and 'magic'

Some NIC vendors decided it was a good thing, you should complain to
them ;)


I kindasorta am with the emails I've been sending to netdev :)  And also 
hopefully precluding others going down that path.


happy benchmarking,

rick



Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping

2016-08-25 Thread Rick Jones

On 08/25/2016 12:49 PM, Eric Dumazet wrote:

On Thu, 2016-08-25 at 12:23 -0700, Alexander Duyck wrote:

A simpler approach is provided with this patch.  With it we disable XPS any
time a socket is not present for a given flow.  By doing this we can avoid
using XPS for any routing or bridging situations in which XPS is likely
more of a hindrance than a help.


Yes, but this will destroy isolation for people properly doing VM cpu
pining.


Why not simply stop enabling XPS by default. Treat it like RPS and RFS 
(unless I've missed a patch...). The people who are already doing the 
extra steps to pin VMs can enable XPS in that case.  It isn't clear that 
one should always pin VMs - for example if a (public) cloud needed to 
oversubscribe the cores.


happy benchmarking,

rick jones


Re: A second case of XPS considerably reducing single-stream performance

2016-08-25 Thread Rick Jones

On 08/25/2016 12:19 PM, Alexander Duyck wrote:

The problem is that there is no socket associated with the guest from
the host's perspective.  This is resulting in the traffic bouncing
between queues because there is no saved socket  to lock the interface
onto.

I was looking into this recently as well and had considered a couple
of options.  The first is to fall back to just using skb_tx_hash()
when skb->sk is null for a given buffer.  I have a patch I have been
toying around with but I haven't submitted it yet.  If you would like
I can submit it as an RFC to get your thoughts.  The second option is
to enforce the use of RPS for any interfaces that do not perform Rx in
NAPI context.  The correct solution for this is probably some
combination of the two as you have to have all queueing done in order
at every stage of the packet processing.


I don't know which interfaces would be hit, but just in general, I'm not 
sure that requiring RPS be enabled is a good solution - picking where 
traffic is processed based on its addressing is fine in a benchmarking 
situation, but I think it is better to have the process/thread scheduler 
decide where something should run and not the addressing of the 
connections that thread/process is servicing.


I would be interested in seeing the RFC patch you propose.

Apart from that, given the prevalence of VMs these days I wonder if 
perhaps simply not enabling XPS by default isn't a viable alternative. 
I've not played with containers to know if they would exhibit this too.
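The fallback Alexander describes - hashing the flow when skb->sk is null - keeps packets in order because a stable hash of the flow's addressing always selects the same queue. A sketch of the idea (illustrative only: the kernel's skb_tx_hash() uses its own hash over the skb, not crc32):

```python
import zlib

def pick_txq(flow_key: bytes, num_txqs: int) -> int:
    """Hash-based transmit queue selection: with no socket to
    remember a queue, a deterministic hash of the flow keys maps
    a given flow to one and only one txq, avoiding reordering."""
    return zlib.crc32(flow_key) % num_txqs
```

The trade-off relative to XPS is locality: the chosen queue bears no relation to the CPU doing the transmit, but ordering within a flow is preserved.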


Drifting ever so slightly, if drivers are going to continue to enable 
XPS by default, Documentation/networking/scaling.txt might use a tweak:


diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/sca

index 59f4db2..8b5537c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -402,10 +402,12 @@ acknowledged.

  XPS Configuration

-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only when the kconfig symbol CONFIG_XPS is enabled
+(on by default for SMP). The drivers for some NICs will enable the
+functionality by default.  For others the functionality remains
+disabled until explicitly configured. To enable XPS, the bitmap of
+CPUs that may use a transmit queue is configured using the sysfs file
+entry:

 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus


The original wording leaves the impression that XPS is not enabled by 
default.


rick


Re: A second case of XPS considerably reducing single-stream performance

2016-08-24 Thread Rick Jones
Also, while it doesn't seem to have the same massive effect on 
throughput, I can also see out of order behaviour happening when the 
sending VM is on a node with a ConnectX-3 Pro NIC.  Its driver is also 
enabling XPS it would seem.  I'm not *certain* but looking at the traces 
it appears that with the ConnectX-3 Pro there is more interleaving of 
the out-of-order traffic than there is with the Skyhawk.  The ConnectX-3 
Pro happens to be in a newer generation server with a newer processor 
than the other systems where I've seen this.


I do not see the out-of-order behaviour when the NIC at the sending end 
is a BCM57840.  It does not appear that the bnx2x driver in the 4.4 
kernel is enabling XPS.


So, it would seem that there are three cases of enabling XPS resulting 
in out-of-order traffic, two of which result in a non-trivial loss of 
performance.


happy benchmarking,

rick jones


Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Rick Jones

On 08/24/2016 10:23 AM, Eric Dumazet wrote:

From: Eric Dumazet <eduma...@google.com>

per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;


Is it possible it is non-trivially slower on other architectures?

rick jones



Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/sch_generic.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 
0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5
 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)

 static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
 {
-   qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
+   this_cpu_inc(sch->cpu_qstats->drops);
 }

 static inline void qdisc_qstats_overlimit(struct Qdisc *sch)





A second case of XPS considerably reducing single-stream performance

2016-08-24 Thread Rick Jones
Back in February of this year, I reported some issues with instance 
network performance in OpenStack caused by the ixgbe driver enabling 
XPS by default:


http://www.spinics.net/lists/netdev/msg362915.html

I've now seen the same thing with be2net and Skyhawk.  In this case, the 
magnitude of the delta is even greater.  Disabling XPS increased the 
netperf single-stream performance out of the instance from an average of 
4108 Mbit/s to  Mbit/s or 116%.


Should drivers really be enabling XPS by default?

  Instance To Outside World
Single-stream netperf
~30 Samples for Each Statistic
  Mbit/s

          Skyhawk           BE3 #1            BE3 #2
          XPS On  XPS Off   XPS On  XPS Off   XPS On  XPS Off
Median      4192     8883     8930     8853     8917     8695
Average     4108              8940     8859     8885     8671

happy benchmarking,

rick jones

The sample counts below may not fully support the additional statistics 
but for the curious:


raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 
6 waxon_performance.log  -f 2

Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8758.850,8811.600,8930.900,8940.555,9096.470,9175.839,9183.690,31
be3-2,8588.450,8736.967,8917.075,8885.322,9017.914,9075.735,9094.620,32
skyhawk,3326.760,3536.008,4192.780,4108.513,4651.164,4723.322,4724.320,27
0 too-short lines ignored.
raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 
6 waxoff_performance.log  -f 2

Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8461.080,8634.690,8853.260,8859.870,9064.480,9247.770,9253.050,31
be3-2,7519.130,8368.564,8695.140,8671.241,9068.588,9200.719,9241.500,27
skyhawk,8071.180,8651.587,8883.340,.411,9135.603,9141.229,9142.010,32
0 too-short lines ignored.

"waxon" is with XPS enabled, "waxoff" is with XPS disabled.  The servers 
are the same models/config as in February.


stack@np-cp1-comp0013-mgmt:~$ sudo ethtool -i hed3
driver: be2net
version: 10.6.0.3
firmware-version: 10.7.110.45


e1000: __pskb_pull_tail failed

2016-08-09 Thread Dave Jones
My NFS server running 4.8-rc1 is getting flooded with this message:

e1000e :00:19.0 eth0: __pskb_pull_tail failed.

Never saw it happen with 4.7 or earlier.


That device is this onboard NIC: 

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (2) I218-V

Dave



Re: [PATCH net 1/2] tg3: Fix for diasllow rx coalescing time to be 0

2016-08-03 Thread Rick Jones

On 08/02/2016 09:13 PM, skallam wrote:

From: Satish Baddipadige <satish.baddipad...@broadcom.com>

When the rx coalescing time is 0, interrupts
are not generated from the controller and rx path hangs.
To avoid this rx hang, updating the driver to not allow
rx coalescing time to be 0.

Signed-off-by: Satish Baddipadige <satish.baddipad...@broadcom.com>
Signed-off-by: Siva Reddy Kallam <siva.kal...@broadcom.com>
Signed-off-by: Michael Chan <michael.c...@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c 
b/drivers/net/ethernet/broadcom/tg3.c
index ff300f7..f3c6c91 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -14014,6 +14014,7 @@ static int tg3_set_coalesce(struct net_device *dev, 
struct ethtool_coalesce *ec)
}

if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) ||
+   (!ec->rx_coalesce_usecs) ||
(ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) ||
(ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) ||
(ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) ||



Should anything then happen with:

/* No rx interrupts will be generated if both are zero */
if ((ec->rx_coalesce_usecs == 0) &&
(ec->rx_max_coalesced_frames == 0))
return -EINVAL;


which is the next block of code?  The logic there seems to suggest that 
it was intended to be able to have an rx_coalesce_usecs of 0 and rely on 
packet arrival to trigger an interrupt.  Presumably setting 
rx_max_coalesced_frames to 1 to disable interrupt coalescing.
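The interaction Rick is pointing at can be written out: the existing check already rejects the only combination that can never generate an rx interrupt, while rx_coalesce_usecs == 0 with rx_max_coalesced_frames >= 1 is a legitimate "interrupt on frame arrival" configuration that the proposed patch would now refuse. A simplified sketch of that existing guard, mine rather than driver code:

```python
def rx_coalesce_settings_ok(rx_usecs, rx_max_frames):
    """Mirror of the quoted tg3 check: only the combination where
    neither a timer nor a frame count can raise an rx interrupt
    is invalid.  rx_usecs == 0 with rx_max_frames >= 1 relies on
    frame arrival (e.g. rx_max_frames == 1 disables coalescing)."""
    if rx_usecs == 0 and rx_max_frames == 0:
        return False  # no rx interrupts would ever fire
    return True
```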


happy benchmarking,

rick jones


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-08 Thread Rick Jones

On 07/08/2016 01:01 AM, Nicolas Dichtel wrote:

Those 300 routers will each have at least one namespace along with the dhcp
namespaces.  Depending on the nature of the routers (Distributed versus
Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed
to be "HA" there can be more than one namespace for a given router.

300 routers is far from the upper limit/goal.  Back in HP Public Cloud, we were
running as many as 700 routers per network node (*), and more than four network
nodes. (back then it was just the one namespace per router and network). Mileage
will of course vary based on the "oomph" of one's network node(s).

Thank you for the details.

Do you have a script or something else to easily reproduce this problem?


Do you mean for my much older, slightly different stuff done in HP 
Public Cloud, or for what Phil (?) is doing presently?  I believe Phil 
posted something several messages back in the thread.


happy benchmarking,

rick jones


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-07 Thread Rick Jones

On 07/07/2016 09:34 AM, Eric W. Biederman wrote:

Rick Jones <rick.jon...@hpe.com> writes:

300 routers is far from the upper limit/goal.  Back in HP Public
Cloud, we were running as many as 700 routers per network node (*),
and more than four network nodes. (back then it was just the one
namespace per router and network). Mileage will of course vary based
on the "oomph" of one's network node(s).


To clarify, processes for these routers and dhcp servers are created
with "ip netns exec"?


I believe so, but it would be good to have someone else confirm that, 
and speak to your paragraph below.



If that is the case and you are using this feature as effectively a
lightweight container and not lots vrfs in a single network stack
then I suspect much larger gains can be had by creating a variant
of ip netns exec avoids the mount propagation.



...


* Didn't want to go much higher than that because each router had a
port on a common linux bridge and getting to > 1024 would be an
unpleasant day.


* I would have thought all you have to do is bump of the size
   of the linux neighbour cache.  echo $BIGNUM > 
/proc/sys/net/ipv4/neigh/default/gc_thresh3


We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

rick

Having a bit of deja vu, but I suspect things like commit 
0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same 
wavelength - just my brain seeing "namespaces" and "performance" and 
lighting up :)


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-07 Thread Rick Jones

On 07/07/2016 08:48 AM, Phil Sutter wrote:

On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:

Le 07/07/2016 13:17, Phil Sutter a écrit :
[snip]

The issue came up during OpenStack Neutron testing, see this ticket for
reference:

https://bugzilla.redhat.com/show_bug.cgi?id=1310795

Access to this ticket is not public :(


*Sigh* OK, here are a few quotes:

"OpenStack Neutron controller nodes, when undergoing testing, are
locking up specifically during creation and mounting of namespaces.
They appear to be blocking behind vfsmount_lock, and contention for the
namespace_sem"

"During the scale testing, we have 300 routers and 600 dhcp namespaces
spread across four neutron network nodes. We then start one set of
standard OpenStack Rally benchmark test cycles against neutron. An
example scenario is creating 10x networks, listing them, deleting them,
and repeating 10x times. The second set performs an L3 benchmark test
between two instances."



Those 300 routers will each have at least one namespace along with the 
dhcp namespaces.  Depending on the nature of the routers (Distributed 
versus Centralized Virtual Routers - DVR vs CVR) and whether the routers 
are supposed to be "HA" there can be more than one namespace for a given 
router.


300 routers is far from the upper limit/goal.  Back in HP Public Cloud, 
we were running as many as 700 routers per network node (*), and more 
than four network nodes. (back then it was just the one namespace per 
router and network). Mileage will of course vary based on the "oomph" of 
one's network node(s).


happy benchmarking,

rick jones

* Didn't want to go much higher than that because each router had a port 
on a common linux bridge and getting to > 1024 would be an unpleasant day.


Re: strange Mac OSX RST behavior

2016-07-01 Thread Rick Jones

On 07/01/2016 08:10 AM, Jason Baron wrote:

I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp
connections are abruptly terminated (via ^C), a FIN is sent followed
by an RST packet.


That just seems, well, silly.  If the client application wants to use 
abortive close (sigh..) it should do so; there shouldn't be this 
little-bit-pregnant, correct close initiation (FIN) followed by an RST.



The RST is sent with the same sequence number as the
FIN, and thus dropped since the stack only accepts RST packets matching
rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
an RST on the closed socket, but it appears that it does not.

The workaround here is then to reset the connection if the RST is
equal to rcv_nxt - 1 and we have already received a FIN.

The RST attack surface is limited b/c we only accept the RST after we've
accepted a FIN and have not previously sent a FIN and received back the
corresponding ACK. In other words RST is only accepted in the tcp
states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.
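
The acceptance rule described above can be modeled like this (a hypothetical Python sketch of the logic, not the actual kernel code; names are mine):

```python
# Model of the described RST-acceptance workaround.
# A FIN consumes one sequence number, so a peer that (incorrectly)
# reuses the FIN's sequence number for its RST shows up as rcv_nxt - 1.
CLOSE_WAIT, LAST_ACK, CLOSING = "CLOSE_WAIT", "LAST_ACK", "CLOSING"
RST_STATES = {CLOSE_WAIT, LAST_ACK, CLOSING}   # FIN received, no FIN+ACK of ours

def accept_rst(state, rst_seq, rcv_nxt, fin_received):
    """Return True if an incoming RST should reset the connection."""
    if rst_seq == rcv_nxt:
        return True                 # standard exact-match check (RFC 5961)
    # Workaround: also accept rcv_nxt - 1, but only in the states where
    # we have already received the peer's FIN.
    return (fin_received and
            state in RST_STATES and
            rst_seq == rcv_nxt - 1)
```

This keeps the extra attack surface limited to the three states named above.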

I'm interested if anybody else has run into this issue. Its problematic
since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT.


Isn't the server application expected to act on the read return of zero 
which is (supposed to be) triggered by the receipt of the FIN segment?
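
That is, the conventional read loop looks something like this (a minimal sketch using a Unix socket pair standing in for a TCP connection; the zero-byte read semantics are the same):

```python
# A read return of zero (b"") signals the peer's orderly shutdown (FIN).
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
b.sendall(b"last data")
b.close()                      # peer closes; analogous to sending a FIN

chunks = []
while True:
    data = a.recv(4096)
    if not data:               # zero-byte read: peer has shut down
        break                  # the server should close its side here
    chunks.append(data)
a.close()
```

A server that ignores the zero return and never closes its side is exactly what leaves sockets parked in TCP_CLOSE_WAIT.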


rick jones


We are also in the process of contacting Apple to see what can be done
here...workaround patch is below.


Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread Rick Jones

On 06/28/2016 02:59 AM, Dexuan Cui wrote:

The idea here is: IMO the syscalls sys_read()/write() shoudn't return
-ENOMEM, so I have to make sure the buffer allocation succeeds?

I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in
in mm/page_alloc.c:
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

What error code do you think I should return?
EAGAIN, ERESTARTSYS, or something else?

May I have your suggestion? Thanks!


What happens as far as errno is concerned when an application makes a 
read() call against a (say TCP) socket associated with a connection 
which has been reset?  Is it limited to those errno values listed in the 
read() manpage, or does it end-up getting an errno value from those 
listed in the recv() manpage?  Or, perhaps even one not (presently) 
listed in either?
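
The question can be answered empirically, at least on Linux: an abortive close (SO_LINGER with a zero timeout) makes close() send an RST, and the subsequent read reports ECONNRESET, which is listed in the recv() manpage but not the read() one. A sketch:

```python
# Force an RST with SO_LINGER(onoff=1, linger=0) and observe the
# reader's errno on a loopback TCP connection.
import errno
import socket
import struct
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# Abortive close: a linger timeout of zero makes close() send RST, not FIN.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.close()
time.sleep(0.2)                     # give the RST time to arrive

caught = None
try:
    cli.recv(4096)
except OSError as e:                # ConnectionResetError is a subclass
    caught = e.errno
cli.close()
srv.close()
```

On Linux `caught` ends up as errno.ECONNRESET, i.e. the read path surfaces an errno value from the recv() manpage's list.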


rick jones



Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 04:43 PM, Tom Herbert wrote:

Here's Christoph's slides on TFO in the wild which presents a good
summary of the middlebox problem. There is one significant difference
in that ECN needs network support whereas TFO didn't. Given that
experience, I'm doubtful other new features at L4 could ever be
productively used (like EDO or maybe TCP-ENO).

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf


Perhaps I am being overly optimistic, but my takeaway from those slides 
is Apple were able to come-up with ways to deal with the middleboxes and 
so could indeed productively use TCP FastOpen.


"Overall, very good success-rate"
though tempered by
"But... middleboxes were a big issue in some ISPs..."

Though it doesn't get into how big (some connections, many, most, all?) 
and how many ISPs.


rick jones

Just an anecdote...  Not that I am a "power user" of my iPhone running 
9.3.2 (13F69) nor that I know that anything I am using is the Apple 
Service stated as using TFO (mostly Safari, Mail and Messages) but if it 
is, I cannot say that any troubles under the covers have been noticed by me.


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 02:46 PM, Tom Herbert wrote:

On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jon...@hpe.com> wrote:

How would you define "severely?"  Has it actually been more severe than for
say ECN?  Or than it was for, say, SACK or PAWS?


ECN is probably even a bigger disappointment in terms of seeing
deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf:

"Even though ECN was standardized in 2001, and it is widely
implemented in end-systems, it is barely deployed. This is due to a
history of problems with severely broken middleboxes shortly after
standardization, which led to connectivity failure and guidance to
leave ECN disabled."

SACK and PAWS seem to have fared a little better, I believe.


The conclusion of that (rather interesting) paper reads:

"Our analysis therefore indicates that enabling ECN by default would
lead to connections to about five websites per thousand to suffer
additional setup latency with RFC 3168 fallback. This represents an
order of magnitude fewer than the about forty per thousand which
experience transient or permanent connection failure due to other
operational issues"

Doesn't that then suggest that not enabling ECN is now more a matter of 
FUD than of middleboxes still assumed to be broken?


My main point is that in the past at least, trouble with broken 
middleboxes didn't lead us to start wrapping all our TCP/transport 
traffic in UDP to try to hide it from them.  We've managed to get SACK 
and PAWS universal without having to resort to that, and it would seem 
we could get ECN universal if we could overcome our FUD.  Why would TFO 
for instance be any different?


There was an equally interesting second paragraph in the conclusion:

"As not all websites are equally popular, failures on five per thousand
websites does not by any means imply that five per thousand connection 
attempts will fail. While estimation of connection attempt rate by rank 
is out of scope of this work, we note that the highest ranked website 
exhibiting stable connection failure has rank 596, and only 13 such 
sites appear in the top 5000"


rick jones


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 02:12 PM, Tom Herbert wrote:

The client OS side is only part of the story. Middlebox intrusion at
L4 is also a major issue we need to address. The "failure" of TFO is a
good case study. Both the upgrade issues on clients and the tendency
for some middleboxes to drop SYN packets with data have together
severely hindered what otherwise should have been a straightforward and
useful feature to deploy.


How would you define "severely?"  Has it actually been more severe than 
for say ECN?  Or than it was for, say, SACK or PAWS?


rick jones



[4.6] kernel BUG at net/ipv6/raw.c:592

2016-06-23 Thread Dave Jones

Found this logs after a Trinity run.

kernel BUG at net/ipv6/raw.c:592!
[ cut here ]
invalid opcode:  [#1] SMP 

Modules linked in: udp_diag dccp_ipv6 dccp_ipv4 dccp sctp af_key tcp_diag 
inet_diag ip6table_filter xt_NFLOG nfnetlink_log xt_comment xt_statistic 
iptable_filter nfsv3 nfs_acl nfs fscache lockd grace autofs4 i2c_piix4 
rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc loop dummy ipmi_devintf 
iTCO_wdt iTCO_vendor_support acpi_cpufreq efivars ipmi_si ipmi_msghandler 
i2c_i801 i2c_core sg lpc_ich mfd_core button

CPU: 2 PID: 28854 Comm: trinity-c23 Not tainted 4.6.0 #1
Hardware name: Quanta Leopard-DDR3/Leopard-DDR3, BIOS F06_3A14.DDR3 05/13/2015
task: 880459cab600 ti: 880747bc4000 task.ti: 880747bc4000
RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:880747bc7bf8  EFLAGS: 00010282
RAX: fff2 RBX: 88080c6f2d00 RCX: 0002
RDX: 880747bc7cd8 RSI: 0030 RDI: 8803de801500
RBP: 880747bc7d90 R08: 002d R09: 0009
R10: 8803de801500 R11: 0009 R12: 0030
R13: 8803de801500 R14: 88086d67e000 R15: 88046bdac480
FS:  7fe29c566700() GS:88046fa4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 01f0f2c0 CR3: 00080b99d000 CR4: 001406e0
Stack:
  88086d67e000 880747bc7d18 88046bdac480
 8804  880747bc7c68 88086d67e000
 8808002d 88080009  0001
 
Call Trace:
 [] ? page_fault+0x22/0x30
 [] ? bad_to_user+0x6a/0x6fa
 [] inet_sendmsg+0x67/0xa0
 [] sock_sendmsg+0x38/0x50
 [] sock_write_iter+0x78/0xd0
 [] __vfs_write+0xaa/0xe0
 [] vfs_write+0xa2/0x1a0
 [] SyS_write+0x46/0xa0 
 [] entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 23 f7 ff ff f7 d0 41 01 c0 41 83 d0 00 e9 ac fd ff ff 48 8b 44 24 48 48 
8b 80 c0 01 00 00 65 48 ff 40 28 8b 51 78 d0 41 01 c0 41 83 d0 00 e9 ac fd ff 
ff 48 8b 44 24 48 48 8b 80 c0 01 00 00 65 48 ff 40 28 8b 51 78 e9 64 fe ff ff 
<0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 

RIP [] rawv6_sendmsg+0xc30/0xc40
 RSP 

 590 
 591 offset += skb_transport_offset(skb);
 592 BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
 593 
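
A toy illustration of the failing check (hypothetical Python, not the kernel code): skb_copy_bits() returns an error when asked to copy past the end of the packet, and the BUG_ON above converts that error into this crash. A raw-socket user can plausibly arrange an out-of-range offset, e.g. via the IPV6_CHECKSUM socket option, which is presumably what Trinity tripped over:

```python
# Toy model of skb_copy_bits()'s bounds behaviour.
def copy_bits(packet, offset, length):
    """Return `length` bytes at `offset`, or None if out of range
    (the kernel function returns -EFAULT in that case)."""
    if offset < 0 or offset + length > len(packet):
        return None
    return packet[offset:offset + length]

pkt = bytes(range(40))   # a 40-byte "packet"
ok = copy_bits(pkt, 38, 2)     # last two bytes: fine
bad = copy_bits(pkt, 39, 2)    # offset + 2 > len: the BUG_ON case
```

The fix direction, presumably, is to validate the offset and return an error to the caller instead of BUG_ON'ing.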



Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 04:10 PM, Rick Jones wrote:

My systems are presently in the midst of an install but I should be able
to demonstrate it in the morning (US Pacific time, modulo the shuttle
service of a car repair place)


The installs finished sooner than I thought.  So, receiver:


root@np-cp1-comp0001-mgmt:/home/stack# uname -a
Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon 
May 23 15:39:22 UTC 2016 x86_64 GNU/Linux

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2
driver: bnx2x
version: 1.712.30-0
firmware-version: bc 7.10.10
bus-info: :05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840:

05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 
10/20-Gigabit Ethernet (rev 11)

Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M 
Adapter

(The pci.ids entry being from before that 10 GbE IP was purchased from 
Broadcom by QLogic...)


Verify that LRO is disabled (IIRC it is enabled by default):

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large
large-receive-offload: off

Verify that disable_tpa is not set:

root@np-cp1-comp0001-mgmt:/home/stack# cat 
/sys/module/bnx2x/parameters/disable_tpa
0

So this means we will see NIC-firmware GRO.

Start a tcpdump on the receiver:
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 200 -i hed2 
-w foo.pcap port 12867
tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 
bytes


Start a netperf test targeting that system, specifying a smaller MSS:

stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 
1400 -P 12867 -O throughput,transport_mss
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-comp0001-guest () port 12867 AF_INET : demo

Throughput Transport
   MSS
   bytes

3372.82    1388

Come back to the receiver and post-process the tcpdump capture to get 
the average segment size for the data segments:


200 packets captured
2000916 packets received by filter
0 packets dropped by kernel
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v 
"length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'

reading from file foo.pcap, link-type EN10MB (Ethernet)
Average: 2741.93

and finally a snippet of the capture:

00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 
1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 
0,nop,wscale 7], length 0
00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], 
seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS 
val 1499053 ecr 1491134,nop,wscale 7], length 0
00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 
1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0
00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
5553:8329, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
19433:22209, ack 1, win 219, options [nop,nop,TS val 149
