date:20180105

Re: [PATCH net] ipv6: remove null_entry before adding default route

2018-01-05 Thread Martin KaFai Lau

On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
> From: Wei Wang 
> 
> In the current code, when creating a new fib6 table, tb6_root.leaf gets
> initialized to net->ipv6.ip6_null_entry.
> If a default route is being added with rt->rt6i_metric = 0x,
> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
> null_entry is shared, it could cause problem.
> 
> In order to fix it, set fn->leaf to NULL before calling
> fib6_add_rt2node() when trying to add the first default route.
> And reset fn->leaf to null_entry when adding fails or when deleting the
> last default route.
> 
> syzkaller reported the following issue which is fixed by this commit:
> =
> WARNING: suspicious RCU usage
> 4.15.0-rc5+ #171 Not tainted
> -
> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
> 
> other info that might help us debug this:
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 4 locks held by swapper/0/0:
>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
> lockdep_copy_map include/linux/lockdep.h:178 [inline]
>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
> spin_lock_bh include/linux/spinlock.h:315 [inline]
>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>  #2:  (rcu_read_lock){}, at: [<91db762d>] 
> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh 
> include/linux/spinlock.h:315 [inline]
>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
> 
> stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>  expire_timers kernel/time/timer.c:1357 [inline]
>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  invoke_softirq kernel/softirq.c:365 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>  
> 
> Reported-by: syzbot 
> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in 
> fib6_table")
> Signed-off-by: Wei Wang 
> ---
>  net/ipv6/ip6_fib.c | 45 +++--
>  1 file changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index d11a5578e4f8..37cb4ad1ea29 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>   if (!(fn->fn_flags & RTN_RTINFO)) {
>   RCU_INIT_POINTER(fn->leaf, NULL);
>   rt6_release(leaf);
> + /* remove null_entry in the root node */
> + } else if (fn->fn_flags & RTN_TL_ROOT &&
> +rcu_access_pointer(fn->leaf) ==
> +net->ipv6.ip6_null_entry) {
> + RCU_INIT_POINTER(fn->leaf, NULL);
It seems the reader side could see tb6_root.leaf == NULL after
this change and I think it should be fine?  If it is, instead
of switching between NULL and ip6_null_entry, would it be simpler
to always set tb6_root.leaf to NULL until a legit default route
is added?

>   }
>  
>   return fn;
> @@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
> *rt,
>   return err;
>  
>  failure:
> - /* fn->leaf could be NULL if fn is an intermediate node and we
> -  * failed to add the new route to it in both subtree creation
> -  * failure and fib6_add_rt2node() failure case.
> -  * In both cases, fib6_repair_tree() should be called

Re: [wireless-testsing2:master 1/4] drivers/net/netdevsim/bpf.c:130:14: sparse: incompatible types for 'case' statement

2018-01-05 Thread Fengguang Wu


On Wed, Jan 03, 2018 at 05:02:37PM -0800, Jakub Kicinski wrote:

On Thu, 4 Jan 2018 03:53:20 +0800, kbuild test robot wrote:

tree:   
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git 
master
head:   6b3b30d0c31ddb2f4d8208c90bc2b4adef47204d
commit: af2cae39f6ab9dc596616d6a28c7772e1dd55e91 [1/4] Merge remote-tracking 
branch 'wireless-drivers-next/master'
reproduce:
# apt-get install sparse
git checkout af2cae39f6ab9dc596616d6a28c7772e1dd55e91
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__



   drivers/net/netdevsim/bpf.c: In function 'nsim_bpf_setup_tc_block_cb':
>> drivers/net/netdevsim/bpf.c:130:7: error: 'TC_CLSBPF_REPLACE' undeclared 
(first use in this function); did you mean 'TC_RED_REPLACE'?
 case TC_CLSBPF_REPLACE:
  ^
  TC_RED_REPLACE


FWIW looks like the tree contains old net-next code and latest net
(linux/master) code.  Pulling from net-next will solve this.


:: TO: Jakub Kicinski 
:: CC: Daniel Borkmann 


Interestingly Daniel and I were not CCed on the report, is this
intentional?


Yeah the above ":: TO/CC" lines are for manual addition when
appropriate. They are the author/committer of the commit that last
modified the code line in question.

Thanks,
Fengguang

Re: KASAN: use-after-free Read in sctp_packet_transmit

2018-01-05 Thread Xin Long

On Sat, Jan 6, 2018 at 6:07 AM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 8a4816cad00bf14642f0ed6043b32d29a05006ce
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> Unfortunately, I don't have any reproducer for this bug yet.
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+5adcca18fca253b4c...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> ==
> BUG: KASAN: use-after-free in sctp_packet_transmit+0x3505/0x3750
> net/sctp/output.c:643
> Read of size 8 at addr 8801bda9fb80 by task modprobe/23740
>
> CPU: 1 PID: 23740 Comm: modprobe Not tainted 4.15.0-rc5+ #175
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  sctp_packet_transmit+0x3505/0x3750 net/sctp/output.c:643
>  sctp_outq_flush+0x121b/0x4060 net/sctp/outqueue.c:1197
>  sctp_outq_uncork+0x5a/0x70 net/sctp/outqueue.c:776
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1807 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1210 [inline]
>  sctp_do_sm+0x4e0/0x6ed0 net/sctp/sm_sideeffect.c:1181
>  sctp_generate_heartbeat_event+0x292/0x3f0 net/sctp/sm_sideeffect.c:406
>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>  expire_timers kernel/time/timer.c:1357 [inline]
>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  invoke_softirq kernel/softirq.c:365 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>  
> RIP: 0010:__preempt_count_add arch/x86/include/asm/preempt.h:76 [inline]
> RIP: 0010:__rcu_read_lock include/linux/rcupdate.h:83 [inline]
> RIP: 0010:rcu_read_lock include/linux/rcupdate.h:629 [inline]
> RIP: 0010:__is_insn_slot_addr+0x8f/0x330 kernel/kprobes.c:303
> RSP: 0018:8801d4937430 EFLAGS: 0283 ORIG_RAX: ff11
> RAX: 8801bf13c000 RBX: 8656dd00 RCX: 8170bd88
> RDX:  RSI:  RDI: 8656dd00
> RBP: 8801d4937518 R08:  R09: 11003a926e67
> R10: 8801d4937300 R11:  R12: 
> R13:  R14: 8801d49374f0 R15: 8801dae230c0
>  is_kprobe_insn_slot include/linux/kprobes.h:318 [inline]
>  kernel_text_address+0x132/0x140 kernel/extable.c:150
>  __kernel_text_address+0xd/0x40 kernel/extable.c:107
>  unwind_get_return_address+0x61/0xa0 arch/x86/kernel/unwind_frame.c:18
>  __save_stack_trace+0x7e/0xd0 arch/x86/kernel/stacktrace.c:45
>  save_stack_trace+0x1a/0x20 arch/x86/kernel/stacktrace.c:60
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>  kmem_cache_zalloc include/linux/slab.h:678 [inline]
>  file_alloc_security security/selinux/hooks.c:369 [inline]
>  selinux_file_alloc_security+0xae/0x190 security/selinux/hooks.c:3454
>  security_file_alloc+0x6d/0xa0 security/security.c:873
>  get_empty_filp+0x189/0x4f0 fs/file_table.c:129
>  path_openat+0xed/0x3530 fs/namei.c:3496
>  do_filp_open+0x25b/0x3b0 fs/namei.c:3554
>  do_sys_open+0x502/0x6d0 fs/open.c:1059
>  SYSC_open fs/open.c:1077 [inline]
>  SyS_open+0x2d/0x40 fs/open.c:1072
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x7efdff1bb120
> RSP: 002b:7ffde6213c08 EFLAGS: 0246 ORIG_RAX: 0002
> RAX: ffda RBX: 55c34fab4090 RCX: 7efdff1bb120
> RDX: 01b6 RSI: 0008 RDI: 7ffde6213d20
> RBP: 7ffde6214d90 R08: 0008 R09: 0001
> R10:  R11: 0246 R12: 55c34fab4090
> R13: 7ffde6215de0 R14:  R15: 
>
> Allocated by task 23739:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544
>  kmem_cache_zalloc include/linux/slab.h:678

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman  wrote:
> Dan Williams  writes:
>
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>
> Please expand this.
>
> It is not clear what the static analysis is looking for.  Have a clear
> description of what is being fixed is crucial for allowing any of these
> changes.
>
> For the details given in the change description what I read is magic
> changes because a magic process says this code is vulnerable.

Yes, that was my first reaction to the patches as well. I try below to
add some more background and guidance, but in the end these are static
analysis reports across a wide swath of sub-systems. It's going to
take some iteration with domain experts to improve the patch
descriptions, and that's the point of this series: to get the better
trained eyes of the actual sub-system owners to take a look at these
reports.

For example, I'm looking for feedback like what Srinivas gave, where he
identified that the report is bogus: the branch condition cannot be
seeded with bad values in that path. Be like Srinivas.

> Given the similarities in the code that is being patched to many other
> places in the kernel it is not at all clear that this small set of
> changes is sufficient for any purpose.

I find this assertion absurd. When in the past have we as kernel
developers ever been handed a static analysis report and then
questioned why the static analysis did not flag other call sites
before first reviewing the ones it did find?

>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> lets make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>
>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>
> It is also not clear that these changes fix anything, or are in any
> sense correct for the problem they are trying to fix as the problem
> is not clearly described.

I'll try my best. I don't have first hand knowledge of how the static
analyzer is doing this job, and I don't think it matters for
evaluating these reports. I'll give you my thoughts on how I would
handle one of these reports if it flagged one of the sub-systems I
maintain.

Start with the example from the Spectre paper:

    if (x < array1_size)
        y = array2[array1[x] * 256];

In all the patches 'x' and 'array1' are called out explicitly. For example:

net: mpls: prevent bounds-check bypass via speculative execution

Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array...

So the first thing to review is whether the analyzer got it wrong and
'x' is not arbitrarily controllable by userspace to cause speculation
outside of the checked bounds. Be like Srinivas. The next step is to
ask whether the code can be refactored so that 'x' is sanitized
earlier in the call stack, especially if the nospec_array_ptr() lands
in a hot path. The next aspect that I expect most would be tempted to
go check is whether 'array2[array1[x]]' occurs later in the code
stream, but with speculation windows being architecture dependent and
potentially large (~180 cycles in one case says the paper) I submit
that we should err on the side of caution and not guess if that second
dependent read has been emitted somewhere in the instruction stream.
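
To make the pattern under review concrete, here is a minimal userspace
sketch of the first load from the Spectre example with the index clamped
before use. It is not the nospec_array_ptr() implementation from this
series; index_mask() and load_checked() are hypothetical names invented
for illustration, and the mask arithmetic assumes both the index and the
size fit in the lower half of the size_t range.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns all ones when idx < size and zero otherwise, using plain
 * arithmetic instead of a conditional.
 * Assumption: size <= SIZE_MAX / 2 + 1.
 */
static inline size_t index_mask(size_t idx, size_t size)
{
	/* The top bit of v is set exactly when idx >= size. */
	size_t v = idx | (size - 1 - idx);
	size_t top = v >> (sizeof(size_t) * CHAR_BIT - 1);

	/* top == 0 gives 0 - 1 (all ones); top == 1 gives 0 - 0 (zero). */
	return (size_t)0 - (top ^ 1);
}

/*
 * Hypothetical wrapper around the 'array1[x]' load from the example.
 * The architectural bounds check stays; the mask additionally forces an
 * out-of-range x to 0 so that the load itself cannot use it.
 */
static inline int load_checked(const int *array1, size_t array1_size,
			       size_t x, int fallback)
{
	if (x < array1_size) {
		x &= index_mask(x, array1_size);
		return array1[x];
	}
	return fallback;
}
```

Note that nothing in the C language stops a compiler from turning the
mask arithmetic back into a conditional branch, which is the kind of
compiler guarantee the cover letter says is unclear. Treat this as a
model of the data flow to look for in a report, not as a mitigation.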

> In at least one place (mpls) you are patching a fast path.  Compile out
> or don't load mpls by all means.  But it is not

Re: [PATCH net-next 06/20] net: hns3: Modify the update period of packet statistics

2018-01-05 Thread lipeng (Y)




On 2018/1/5 22:54, Andrew Lunn wrote:

--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1126,6 +1126,7 @@ static int hns3_nic_set_features(struct net_device 
*netdev,
  {
struct hns3_nic_priv *priv = netdev_priv(netdev);
int queue_num = priv->ae_handle->kinfo.num_tqps;
+   struct hnae3_handle *handle = priv->ae_handle;
struct hns3_enet_ring *ring;
unsigned int start;
unsigned int idx;
@@ -1134,6 +1135,8 @@ static int hns3_nic_set_features(struct net_device 
*netdev,
u64 tx_pkts = 0;
u64 rx_pkts = 0;
  
+	handle->ae_algo->ops->update_stats(handle, >stats);

+
for (idx = 0; idx < queue_num; idx++) {
/* fetch the tx stats */
ring = priv->ring_data[idx].ring;

There is something odd going on with patch here. Notice how it says
hns3_nic_set_features(). This is not the function being patched, it is
actually the next one, hns3_nic_get_stats64(), which makes a lot more
sense.

Is it because the static void is on the previous line?

Yes, it is because the static void is on the previous line.

I can add a patch to fix the previous line, and this patch will then be
corrected automatically.


Does this need a V2 patchset, or should I push a new patch after this patchset?



It would be nice if the function was correctly reported. It makes it
easier to review the patch.

Andrew


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2018-01-05 Thread Pravin Shelar

On Fri, Jan 5, 2018 at 3:20 PM, Ed Swierk  wrote:
> On Fri, Jan 5, 2018 at 10:14 AM, Ed Swierk 
> wrote:
>> On Thu, Jan 4, 2018 at 7:36 PM, Pravin Shelar  wrote:
>>> OVS already pull all required headers in skb linear data, so no need
>>> to redo all of it. only check required is the ip-checksum validation.
>>> I think we could avoid it in most of cases by checking skb length to
>>> ipheader length before verifying the ip header-checksum.
>>
>> Shouldn't the IP header checksum be verified even earlier, like in
>> key_extract(), before actually using any of the fields in the IP
>> header?
>
> Something like this for verifying the IP header checksum (not tested):
>
AFAIU openflow does not need this verification, so it is not required
in flow extract.

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Dan Williams

On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
 wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  wrote:
>> From: Andi Kleen 
>>
>> When access_ok fails we should always stop speculating.
>> Add the required barriers to the x86 access_ok macro.
>
> Honestly, this seems completely bogus.
>
> The description is pure garbage afaik.
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
>
> That actually seems to be what the code does (it stops speculation
> when __range_not_ok() returns false, but access_ok() is
> !__range_not_ok()). But the explanation is crap, and dangerous.

Oh, bother, yes, good catch. It's been a long week.  I'll take a look
at moving this to uaccess_begin().
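
The inversion Linus points out can be modelled in a few lines of
userspace C. This is only a sketch: range_not_ok() and access_ok_model()
are hypothetical stand-ins, not the x86 macros, and the explicit 'limit'
parameter is an assumption made so the example is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a "range not ok" test: true when
 * [addr, addr + size) wraps around or extends past 'limit'.
 */
static inline bool range_not_ok(uintptr_t addr, uintptr_t size,
				uintptr_t limit)
{
	uintptr_t end = addr + size;

	return end < addr || end > limit;
}

/*
 * Hypothetical model of access_ok(): the negation of the test above.
 * The access is performed only when this returns true, so that is the
 * path on which speculation has to be stopped before the access.
 */
static inline bool access_ok_model(uintptr_t addr, uintptr_t size,
				   uintptr_t limit)
{
	return !range_not_ok(addr, size, limit);
}
```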

Re: [PATCH 01/18] asm-generic/barrier: add generic nospec helpers

2018-01-05 Thread Dan Williams

On Fri, Jan 5, 2018 at 6:55 PM, Linus Torvalds
 wrote:
> On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
>> +#ifndef nospec_ptr
>> +#define nospec_ptr(ptr, lo, hi) \
>
> Do we actually want this horrible interface?
>
> It just causes the compiler - or inline asm - to generate worse code,
> because it needs to compare against both high and low limits.
>
> Basically all users are arrays that are zero-based, and where a
> comparison against the high _index_ limit would be sufficient.
>
> But the way this is all designed, it's literally designed for bad code
> generation for the unusual case, and the usual array case is written
> in the form of the unusual and wrong non-array case. That really seems
> excessively stupid.

Yes, it appears we can kill nospec_ptr() and move nospec_array_ptr()
to assume 0 based arrays rather than use nospec_ptr.
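
For comparison, the two shapes of check look roughly like this in plain
C. Both functions are hypothetical and deliberately leave out any
speculation barrier or masking; they only show why the zero-based form
needs a single comparison. The [lo, hi) semantics of the two-sided form
are my reading of the helper and should be treated as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical two-sided check in the style of nospec_ptr(ptr, lo, hi):
 * two comparisons, one against each limit.
 */
static inline const int *ptr_checked(const int *ptr, const int *lo,
				     const int *hi)
{
	uintptr_t p = (uintptr_t)ptr;

	return (p >= (uintptr_t)lo && p < (uintptr_t)hi) ? ptr : NULL;
}

/*
 * Hypothetical zero-based form: the index is unsigned, so one
 * comparison against the element count covers both ends.
 */
static inline const int *array_ptr_checked(const int *arr, size_t idx,
					   size_t size)
{
	return (idx < size) ? arr + idx : NULL;
}
```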

RE: [Intel-wired-lan] [PATCH 09/27] igb: Use timecounter_initialize interface

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Sagar Arun Kamble
> Sent: Thursday, December 14, 2017 11:38 PM
> To: linux-ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH 09/27] igb: Use timecounter_initialize
> interface
> 
> With new interface timecounter_initialize we can initialize timecounter
> fields and underlying cyclecounter together. Update igb ptp timecounter
> init with this new function.
> 
> Signed-off-by: Sagar Arun Kamble 
> Cc: Richard Cochran 
> Cc: Jeff Kirsher 
> Cc: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/igb/igb.h |  4 
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 23 ++-
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 

Tested-by: Aaron Brown

RE: [Intel-wired-lan] [PATCH 22/27] ixgbe: Use timecounter_reset interface

2018-01-05 Thread Brown, Aaron F



> -Original Message-
> From: Brown, Aaron F
> Sent: Friday, January 5, 2018 8:34 PM
> To: 'Sagar Arun Kamble' ; linux-
> ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: RE: [Intel-wired-lan] [PATCH 22/27] ixgbe: Use timecounter_reset
> interface
> 
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> > Behalf Of Sagar Arun Kamble
> > Sent: Thursday, December 14, 2017 11:39 PM
> > To: linux-ker...@vger.kernel.org
> > Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> > ; Kamble, Sagar A
> > ; netdev@vger.kernel.org
> > Subject: [Intel-wired-lan] [PATCH 22/27] ixgbe: Use timecounter_reset
> > interface
> >
> > With new interface timecounter_reset we can update the start time for
> > timecounter. Update ixgbe_ptp_settime with this new function.
> >
> > Signed-off-by: Sagar Arun Kamble 
> > Cc: Richard Cochran 
> > Cc: Jeff Kirsher 
> > Cc: intel-wired-...@lists.osuosl.org
> > Cc: netdev@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> 
> Tested-by: Aaron Brown 
Strike my Tested-by: for this (ixgbe) instance.  It was meant for igb.

Re: [PATCH net v3 0/2] SCTP PMTU discovery fixes

2018-01-05 Thread Xin Long

On Fri, Jan 5, 2018 at 9:17 PM, Marcelo Ricardo Leitner
 wrote:
> This patchset fixes 2 issues with PMTU discovery that can lead to flood
> of retransmissions.
> The first patch fixes the issue for when PMTUD is disabled by the
> application, while the second fixes it for when its enabled.
>
> Please consider these to stable.
>
> Thanks,
>
> Marcelo Ricardo Leitner (2):
>   sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
>   sctp: fix the handling of ICMP Frag Needed for too small MTUs
>
>  include/net/sctp/structs.h |  2 +-
>  net/sctp/input.c   | 28 
>  net/sctp/transport.c   | 29 +++--
>  3 files changed, 36 insertions(+), 23 deletions(-)
>
> --
> 2.14.3
>
Reviewed-by: Xin Long

RE: [Intel-wired-lan] [PATCH 22/27] ixgbe: Use timecounter_reset interface

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Sagar Arun Kamble
> Sent: Thursday, December 14, 2017 11:39 PM
> To: linux-ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH 22/27] ixgbe: Use timecounter_reset
> interface
> 
> With new interface timecounter_reset we can update the start time for
> timecounter. Update ixgbe_ptp_settime with this new function.
> 
> Signed-off-by: Sagar Arun Kamble 
> Cc: Richard Cochran 
> Cc: Jeff Kirsher 
> Cc: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Tested-by: Aaron Brown

RE: [Intel-wired-lan] [PATCH 21/27] igb: Use timecounter_reset interface

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Sagar Arun Kamble
> Sent: Thursday, December 14, 2017 11:39 PM
> To: linux-ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH 21/27] igb: Use timecounter_reset
> interface
> 
> With new interface timecounter_reset we can update the start time for
> timecounter. Update igb_ptp_settime_82576 with this new function.
> 
> Signed-off-by: Sagar Arun Kamble 
> Cc: Richard Cochran 
> Cc: Jeff Kirsher 
> Cc: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Tested-by: Aaron Brown

RE: [Intel-wired-lan] [PATCH 08/27] e1000e: Use timecounter_initialize interface

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Sagar Arun Kamble
> Sent: Thursday, December 14, 2017 11:38 PM
> To: linux-ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH 08/27] e1000e: Use timecounter_initialize
> interface
> 
> With new interface timecounter_initialize we can initialize timecounter
> fields and underlying cyclecounter together. Update e1000e timecounter
> init with this new function.
> 
> Signed-off-by: Sagar Arun Kamble 
> Cc: Richard Cochran 
> Cc: Jeff Kirsher 
> Cc: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/e1000e/e1000.h  |  4 
>  drivers/net/ethernet/intel/e1000e/netdev.c | 31 +-
> 
>  2 files changed, 22 insertions(+), 13 deletions(-)
> 

Tested-by: Aaron Brown

RE: [Intel-wired-lan] [PATCH 20/27] e1000e: Use timecounter_reset interface

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Sagar Arun Kamble
> Sent: Thursday, December 14, 2017 11:39 PM
> To: linux-ker...@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Kamble, Sagar A
> ; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH 20/27] e1000e: Use timecounter_reset
> interface
> 
> With new interface timecounter_reset we can update the start time for
> timecounter. Update e1000e_phc_settime with this new function.
> 
> Signed-off-by: Sagar Arun Kamble 
> Cc: Richard Cochran 
> Cc: Jeff Kirsher 
> Cc: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/e1000e/ptp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
Tested-by: Aaron Brown

RE: [Intel-wired-lan] [PATCH 1/1] timecounter: Make cyclecounter struct part of timecounter struct

2018-01-05 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Jeff Kirsher
> Sent: Wednesday, December 6, 2017 8:25 AM
> To: Kamble, Sagar A ; linux-
> ker...@vger.kernel.org
> Cc: alsa-de...@alsa-project.org; linux-r...@vger.kernel.org;
> netdev@vger.kernel.org; Richard Cochran ;
> Stephen Boyd ; Chris Wilson ; John Stultz ;
> intel-wired-l...@lists.osuosl.org; Thomas Gleixner ;
> kvm...@lists.cs.columbia.edu; linux-arm-ker...@lists.infradead.org
> Subject: Re: [Intel-wired-lan] [PATCH 1/1] timecounter: Make cyclecounter
> struct part of timecounter struct
> 
> On Sat, 2017-12-02 at 10:01 +0530, Sagar Arun Kamble wrote:
> > There is no real need for the users of timecounters to define
> > cyclecounter
> > and timecounter variables separately. Since timecounter will always
> > be
> > based on cyclecounter, have cyclecounter struct as member of
> > timecounter
> > struct.
> >
> > Suggested-by: Chris Wilson 
> > Signed-off-by: Sagar Arun Kamble 
> > Cc: Chris Wilson 
> > Cc: Richard Cochran 
> > Cc: John Stultz 
> > Cc: Thomas Gleixner 
> > Cc: Stephen Boyd 
> > Cc: linux-ker...@vger.kernel.org
> > Cc: linux-arm-ker...@lists.infradead.org
> > Cc: netdev@vger.kernel.org
> > Cc: intel-wired-...@lists.osuosl.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: alsa-de...@alsa-project.org
> > Cc: kvm...@lists.cs.columbia.edu
> 
> Acked-by: Jeff Kirsher 
> 

Tested-by: Aaron Brown 

> For the changes to the Intel drivers.
> 
> > ---
> >  arch/microblaze/kernel/timer.c | 20 ++--
> >  drivers/clocksource/arm_arch_timer.c   | 19 ++--
> >  drivers/net/ethernet/amd/xgbe/xgbe-dev.c   |  3 +-
> >  drivers/net/ethernet/amd/xgbe/xgbe-ptp.c   |  9 +++---
> >  drivers/net/ethernet/amd/xgbe/xgbe.h   |  1 -
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h|  1 -
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   | 20 ++--
> >  drivers/net/ethernet/freescale/fec.h   |  1 -
> >  drivers/net/ethernet/freescale/fec_ptp.c   | 30 +---
> > --
> >  drivers/net/ethernet/intel/e1000e/e1000.h  |  1 -
> >  drivers/net/ethernet/intel/e1000e/netdev.c | 27 --
> > --
> >  drivers/net/ethernet/intel/e1000e/ptp.c|  2 +-
> >  drivers/net/ethernet/intel/igb/igb.h   |  1 -
> >  drivers/net/ethernet/intel/igb/igb_ptp.c   | 25 --
> > -
> >  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 -
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c   | 17 +-
> >  drivers/net/ethernet/mellanox/mlx4/en_clock.c  | 28 --
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
> >  .../net/ethernet/mellanox/mlx5/core/lib/clock.c| 34 ++
> > --
> >  drivers/net/ethernet/qlogic/qede/qede_ptp.c| 20 ++--
> >  drivers/net/ethernet/ti/cpts.c | 36
> > --
> >  drivers/net/ethernet/ti/cpts.h |  1 -
> >  include/linux/mlx5/driver.h|  1 -
> >  include/linux/timecounter.h|  4 +--
> >  include/sound/hdaudio.h|  1 -
> >  kernel/time/timecounter.c  | 28 --
> > ---
> >  sound/hda/hdac_stream.c|  7 +++--
> >  virt/kvm/arm/arch_timer.c  |  6 ++--
> >  28 files changed, 163 insertions(+), 182 deletions(-)

Re: [patch net-next v6 00/11] net: sched: allow qdiscs to share filter block instances

2018-01-05 Thread David Ahern

On 1/5/18 4:09 PM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Currently the filters added to qdiscs are independent. So for example if you
> have 2 netdevices and you create ingress qdisc on both and you want to add
> identical filter rules both, you need to add them twice. This patchset
> makes this easier and mainly saves resources allowing to share all filters
> within a qdisc - I call it a "filter block". Also this helps to save
> resources when we do offload to hw for example to expensive TCAM.
> 
> So back to the example. First, we create 2 qdiscs. Both will share
> block number 22. "22" is just an identification. If we don't pass any
> block number, a new one will be generated by kernel:
> 
> $ tc qdisc add dev ens7 ingress block 22
> 
> $ tc qdisc add dev ens8 ingress block 22
> 
> 
> Now if we list the qdiscs, we will see the block index in the output:
> 
> $ tc qdisc
> qdisc ingress : dev ens7 parent :fff1 block 22
> qdisc ingress : dev ens8 parent :fff1 block 22
> 
> 
> To make is more visual, the situation looks like this:
> 
>ens7 ingress qdisc ens7 ingress qdisc
>   |  |
>   |  |
>   +-->  block 22  <--+
> 
> Unlimited number of qdiscs may share the same block.
> 
> Now we can add filter using the block index:
> 
> $ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 
> action drop
> 
> 
> Note we cannot use the qdisc for filter manipulations for shared blocks:
> 
> $ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 
> 192.168.100.2 action drop
> Error: Cannot work with shared block, please use block index.
> 
> 
> We will see the same output if we list filters for ingress qdisc of
> ens7 and ens8, also for the block 22:
> 
> $ tc filter show block 22
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...
> 
> $ tc filter show dev ens7 ingress
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...
> 
> $ tc filter show dev ens8 ingress
> filter block 22 protocol ip pref 25 flower chain 0
> filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
> ...

I like the API and output shown here, but I am not getting that with the
patches.

In this example, I am using 42 for the block id:

$ tc qdisc show dev eth2
qdisc mq 0: root
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc ingress : parent :fff1 block 42

It allows me to add a filter using the device:
$ tc filter add dev eth2 ingress protocol ip pref 1 flower dst_ip 192.168.101.2 action drop
$  echo $?
0

And it modifies the shared block:
$  tc filter show block 42
filter pref 1 flower chain 0
filter pref 1 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.100.2
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1

filter pref 1 flower chain 0 handle 0x2
  eth_type ipv4
  dst_ip 192.168.101.2
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 3 ref 1 bind 1

filter pref 25 flower chain 0
filter pref 25 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.0.0/16
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1

Notice the header does not give the 'filter block N protocol' part. I
don't get that using the device either (tc filter show dev eth2 ingress).

Something else I noticed is that I do not get an error message if I pass
an invalid block id:

$ tc filter show block 22
$ echo $?
0
$  tc qdisc show | grep block
qdisc ingress : dev eth2 parent :fff1 block 42

[PATCH iproute2] Restore --no-print-directory option for silent builds

2018-01-05 Thread David Ahern

Commit 69fed534a533 ("change how Config is used in Makefile's") removed
Config from Makefile. Config had the checks to set VERBOSE based on user
request and VERBOSE is used to add the --no-print-directory argument.
Since Config is gone, add the relevant setup for VERBOSE to Makefile
to restore quieter builds by default.

Fixes: 69fed534a533 ("change how Config is used in Makefile's")
Signed-off-by: David Ahern 
---
 Makefile | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/Makefile b/Makefile
index 6a51e0db9107..32587db3be70 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 # Top level Makefile for iproute2
 
+ifeq ("$(origin V)", "command line")
+VERBOSE = $(V)
+endif
+ifndef VERBOSE
+VERBOSE = 0
+endif
+
 ifeq ($(VERBOSE),0)
 MAKEFLAGS += --no-print-directory
 endif
-- 
2.11.0

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Linus Torvalds

On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
 wrote:
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.

I also suspect we should probably do this entirely differently.

Maybe the whole lfence can be part of uaccess_begin() instead (ie
currently 'stac()'). That would fit the existing structure better, I
think. And it would avoid any confusion about the whole "when to stop
speculation".

 Linus

Re: [PATCH 01/18] asm-generic/barrier: add generic nospec helpers

2018-01-05 Thread Linus Torvalds

On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  wrote:
> +#ifndef nospec_ptr
> +#define nospec_ptr(ptr, lo, hi) \

Do we actually want this horrible interface?

It just causes the compiler - or inline asm - to generate worse code,
because it needs to compare against both high and low limits.

Basically all users are arrays that are zero-based, and where a
comparison against the high _index_ limit would be sufficient.

But the way this is all designed, it's literally designed for bad code
generation for the unusual case, and the usual array case is written
in the form of the unusual and wrong non-array case. That really seems
excessively stupid.

 Linus
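
For illustration only: the zero-based array case described above needs a single comparison of the index against the element count. The sketch below is a user-space model and not code from this series; index_mask() and clamp_index() are hypothetical names, and the branch-free masking trick is an assumption about how such a helper could be written.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch series): returns all ones when
 * idx < sz and zero otherwise, without a conditional branch in the source.
 * Assumes sz <= LONG_MAX so the top bit can act as the "out of range" flag.
 */
static unsigned long index_mask(unsigned long idx, unsigned long sz)
{
	const unsigned int top = sizeof(unsigned long) * CHAR_BIT - 1;
	unsigned long oob;

	assert(sz <= (unsigned long)LONG_MAX);

	/* The top bit of (idx | (sz - 1 - idx)) is set exactly when idx >= sz. */
	oob = (idx | (sz - 1UL - idx)) >> top;

	/* oob is 0 or 1; map 0 to ~0UL and 1 to 0UL. */
	return oob - 1UL;
}

/* Hypothetical wrapper: forces an out-of-range index to 0. */
static unsigned long clamp_index(unsigned long idx, unsigned long sz)
{
	return idx & index_mask(idx, sz);
}
```

Only the upper limit is tested here; there is no lower bound to compare against. A real kernel helper would also have to make sure the compiler does not turn the arithmetic back into a branch, which this model does not attempt.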

Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Linus Torvalds

On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams  wrote:
> From: Andi Kleen 
>
> When access_ok fails we should always stop speculating.
> Add the required barriers to the x86 access_ok macro.

Honestly, this seems completely bogus.

The description is pure garbage afaik.

The fact is, we have to stop speculating when access_ok() does *not*
fail - because that's when we'll actually do the access. And it's that
access that needs to be non-speculative.

That actually seems to be what the code does (it stops speculation
when __range_not_ok() returns false, but access_ok() is
!__range_not_ok()). But the explanation is crap, and dangerous.

 Linus
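
A minimal user-space sketch of the ordering described above, under stated assumptions: range_ok(), spec_barrier() and checked_read() are hypothetical names, and spec_barrier() is only a compiler barrier standing in for whatever instruction (lfence on x86 is assumed) a real implementation would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder: only a compiler barrier. A real x86 implementation is
 * assumed to issue a serializing instruction such as lfence here. */
static inline void spec_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Hypothetical range check modelling access_ok(): true when
 * [addr, addr + len) lies entirely below limit. */
static bool range_ok(uintptr_t addr, size_t len, uintptr_t limit)
{
	return len <= limit && addr <= limit - len;
}

/* Read one byte from a buffer that models the permitted address range. */
static int checked_read(const unsigned char *base, size_t limit,
			size_t off, unsigned char *out)
{
	assert(base != NULL && out != NULL);

	if (!range_ok(off, 1, limit))
		return -1;	/* check failed: no access follows on this path */

	/* Check succeeded: the access happens on this path, so this is
	 * where speculation has to be stopped before the load. */
	spec_barrier();
	*out = base[off];
	return 0;
}
```

The barrier sits on the path where the check passes, because that is the only path that performs the load.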

Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-05 Thread Eric W. Biederman

Dan Williams  writes:

> Quoting Mark's original RFC:
>
> "Recently, Google Project Zero discovered several classes of attack
> against speculative execution. One of these, known as variant-1, allows
> explicit bounds checks to be bypassed under speculation, providing an
> arbitrary read gadget. Further details can be found on the GPZ blog [1]
> and the Documentation patch in this series."
>
> This series incorporates Mark Rutland's latest api and adds the x86
> specific implementation of nospec_barrier. The
> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
> wide analysis performed by Elena Reshetova to address static analysis
> reports where speculative execution on a userspace controlled value
> could bypass a bounds check. The patches address a precondition for the
> attack discussed in the Spectre paper [2].

Please expand this.

It is not clear what the static analysis is looking for.  Having a clear
description of what is being fixed is crucial for allowing any of these
changes.

For the details given in the change description what I read is magic
changes because a magic process says this code is vulnerable.

Given the similarities in the code that is being patched to many other
places in the kernel it is not at all clear that this small set of
changes is sufficient for any purpose.

> A consideration worth noting for reviewing these patches is to weigh the
> dramatic cost of being wrong about whether a given report is exploitable
> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
> let's make the bar for applying these patches be "can you prove that the
> bounds check bypass is *not* exploitable". Consider that the Spectre
> paper reports one example of a speculation window being ~180 cycles.


> Note that there is also a proposal from Linus, array_access [3], that
> attempts to quash speculative execution past a bounds check without
> introducing an lfence instruction. That may be a future optimization
> possibility that is compatible with this api, but it would appear to
> need guarantees from the compiler that it is not clear the kernel can
> rely on at this point. It is also not clear that it would be a
> significant performance win vs lfence.

It is also not clear that these changes fix anything, or are in any
sense correct for the problem they are trying to fix as the problem
is not clearly described.

In at least one place (mpls) you are patching a fast path.  Compile out
or don't load mpls by all means.  But it is not acceptable to change the
fast path without even considering performance.

So because the description sucks, and the due diligence is not there.

Nacked-by: "Eric W. Biederman" 

to the series.


>
> These patches will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
>
> [1]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
> [2]: https://spectreattack.com/spectre.pdf
> [3]: https://marc.info/?l=linux-kernel&m=151510446027625&w=2
>
> ---
>
> Andi Kleen (1):
>   x86, barrier: stop speculation for failed access_ok
>
> Dan Williams (13):
>   x86: implement nospec_barrier()
>   [media] uvcvideo: prevent bounds-check bypass via speculative execution
>   carl9170: prevent bounds-check bypass via speculative execution
>   p54: prevent bounds-check bypass via speculative execution
>   qla2xxx: prevent bounds-check bypass via speculative execution
>   cw1200: prevent bounds-check bypass via speculative execution
>   Thermal/int340x: prevent bounds-check bypass via speculative execution
>   ipv6: prevent bounds-check bypass via speculative execution
>   ipv4: prevent bounds-check bypass via speculative execution
>   vfs, fdtable: prevent bounds-check bypass via speculative execution
>   net: mpls: prevent bounds-check bypass via speculative execution
>   udf: prevent bounds-check bypass via speculative execution
>   userns: prevent bounds-check bypass via speculative execution
>
> Mark Rutland (4):
>   asm-generic/barrier: add generic nospec helpers
>   Documentation: document nospec helpers
>   arm64: implement nospec_ptr()
>   arm: implement nospec_ptr()
>
>  Documentation/speculation.txt  |  166 
> 
>  arch/arm/include/asm/barrier.h |   75 +
>  arch/arm64/include/asm/barrier.h   |   55 +++
>  arch/x86/include/asm/barrier.h |6 +
>  arch/x86/include/asm/uaccess.h |   17 ++
>  drivers/media/usb/uvc/uvc_v4l2.c   |7 +
>  drivers/net/wireless/ath/carl9170/main.c   |6 -
>  drivers/net/wireless/intersil/p54/main.c   |8 +
>  drivers/net/wireless/st/cw1200/sta.c   |   10 +
>  drivers/net/wireless/st/cw1200/wsm.h   |

Re: [PATCH 12/18] Thermal/int340x: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

On Fri, Jan 5, 2018 at 5:53 PM, Srinivas Pandruvada
 wrote:
> On Fri, 2018-01-05 at 17:10 -0800, Dan Williams wrote:
>> Static analysis reports that 'trip' may be a user controlled value
>> that
>> is used as a data dependency to read '*temp' from the 'd->aux_trips'
>> array.  In order to avoid potential leaks of kernel memory values,
>> block
>> speculative execution of the instruction stream that could issue
>> reads
>> based on an invalid value of '*temp'.
>
> Not against the change as this is in a very slow path. But the trip is
> not an arbitrary value which user can enter.
>
> This trip value comes from one of the sysfs attributes in the thermal
> zone. For example
>
> # cd /sys/class/thermal/thermal_zone1
> # ls trip_point_?_temp
> trip_point_0_temp  trip_point_1_temp  trip_point_2_temp  trip_point_3_temp
> trip_point_4_temp  trip_point_5_temp  trip_point_6_temp
>
> Here the "trip" is one of the above trip_point_*_temp. So in this case
> it can be from 0 to 6 as user can't do
> # cat trip_point_7_temp
> as there is no sysfs attribute for trip_point_7_temp.
>
> The actual "trip" was obtained in thermal core via
>
>   if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
>           return -EINVAL;
>
> Thanks,
> Srinivas

Ah, great, thanks. So do we even need the bounds check at that point?
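
To make the point about the parse concrete, here is a small user-space sketch of extracting the index from the attribute name. parse_trip_index() is a hypothetical helper written for this illustration; only the format string is taken from the message above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper modelling the quoted thermal core parse: extract N
 * from an attribute name of the form "trip_point_N_temp".
 * Returns 0 on success, -1 if no integer could be parsed.
 */
static int parse_trip_index(const char *name, int *trip)
{
	assert(name != NULL && trip != NULL);

	if (sscanf(name, "trip_point_%d_temp", trip) != 1)
		return -1;
	return 0;
}
```

Note that the return value of sscanf() only counts the %d conversion, so the parse by itself does not bound the index or verify the "_temp" suffix. In the scenario described above, the range is limited by which attribute files exist, not by the parse.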

Re: [PATCH 12/18] Thermal/int340x: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Srinivas Pandruvada

On Fri, 2018-01-05 at 17:10 -0800, Dan Williams wrote:
> Static analysis reports that 'trip' may be a user controlled value
> that
> is used as a data dependency to read '*temp' from the 'd->aux_trips'
> array.  In order to avoid potential leaks of kernel memory values,
> block
> speculative execution of the instruction stream that could issue
> reads
> based on an invalid value of '*temp'.

Not against the change as this is in a very slow path. But the trip is
not an arbitrary value which user can enter.

This trip value comes from one of the sysfs attributes in the thermal
zone. For example

# cd /sys/class/thermal/thermal_zone1
# ls trip_point_?_temp
trip_point_0_temp  trip_point_1_temp  trip_point_2_temp  trip_point_3_temp
trip_point_4_temp  trip_point_5_temp  trip_point_6_temp

Here the "trip" is one of the above trip_point_*_temp. So in this case
it can be from 0 to 6 as user can't do
# cat trip_point_7_temp
as there is no sysfs attribute for trip_point_7_temp.

The actual "trip" was obtained in thermal core via

	if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
		return -EINVAL;

Thanks,
Srinivas



> 
> Based on an original patch by Elena Reshetova.
> 
> Cc: Srinivas Pandruvada 
> Cc: Zhang Rui 
> Cc: Eduardo Valentin 
> Signed-off-by: Elena Reshetova 
> Signed-off-by: Dan Williams 
> ---
>  .../thermal/int340x_thermal/int340x_thermal_zone.c |   14 
> --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> index 145a5c53ff5c..442a1d9bf7ad 100644
> --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "int340x_thermal_zone.h"
>  
>  static int int340x_thermal_get_zone_temp(struct thermal_zone_device
> *zone,
> @@ -52,20 +53,21 @@ static int int340x_thermal_get_trip_temp(struct
> thermal_zone_device *zone,
>    int trip, int *temp)
>  {
>   struct int34x_thermal_zone *d = zone->devdata;
> + unsigned long *elem;
>   int i;
>  
>   if (d->override_ops && d->override_ops->get_trip_temp)
>   return d->override_ops->get_trip_temp(zone, trip,
> temp);
>  
> - if (trip < d->aux_trip_nr)
> - *temp = d->aux_trips[trip];
> - else if (trip == d->crt_trip_id)
> + if ((elem = nospec_array_ptr(d->aux_trips, trip, d-
> >aux_trip_nr))) {
> + *temp = *elem;
> + } else if (trip == d->crt_trip_id) {
>   *temp = d->crt_temp;
> - else if (trip == d->psv_trip_id)
> + } else if (trip == d->psv_trip_id) {
>   *temp = d->psv_temp;
> - else if (trip == d->hot_trip_id)
> + } else if (trip == d->hot_trip_id) {
>   *temp = d->hot_temp;
> - else {
> + } else {
>   for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT;
> i++) {
>   if (d->act_trips[i].valid &&
>   d->act_trips[i].id == trip) {
>

Re: [PATCH bpf-next v4 07/11] bpf: Add support for reading sk_state and more

2018-01-05 Thread Daniel Borkmann

On 01/05/2018 12:55 AM, Lawrence Brakmo wrote:
> Add support for reading many more tcp_sock fields
> 
>   state,  same as sk->sk_state
>   rtt_min same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out
>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   bytes_received (__u64)
>   bytes_acked(__u64)
> 
> Signed-off-by: Lawrence Brakmo 
> ---
>  include/uapi/linux/bpf.h |  19 +
>  net/core/filter.c| 101 
> ++-
>  2 files changed, 119 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a1316c7..a8f4cf0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -963,6 +963,25 @@ struct bpf_sock_ops {
>   __u32 snd_cwnd;
>   __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
>   __u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
> + __u32 state;
> + __u32 rtt_min;
> + __u32 snd_ssthresh;
> + __u32 rcv_nxt;
> + __u32 snd_nxt;
> + __u32 snd_una;
> + __u32 mss_cache;
> + __u32 ecn_flags;
> + __u32 rate_delivered;
> + __u32 rate_interval_us;
> + __u32 packets_out;
> + __u32 retrans_out;
> + __u32 total_retrans;
> + __u32 segs_in;
> + __u32 data_segs_in;
> + __u32 segs_out;
> + __u32 data_segs_out;
> + __u64 bytes_received;
> + __u64 bytes_acked;
>  };
>  
>  /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 76fd6e9..d4c5c1a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3829,7 +3829,7 @@ static bool __is_valid_sock_ops_access(int off, int 
> size)
>   /* The verifier guarantees that size > 0. */
>   if (off % size != 0)
>   return false;
> - if (size != sizeof(__u32))
> + if (size != sizeof(__u32) && size != sizeof(__u64))
>   return false;

Doesn't this have the side effect that the is_valid_access() callback would
allow not narrower but wider access, e.g. BPF_DW on the u32 members? It would
still get ignored later on in the ctx rewrite, but it also means that we could
try BPF_W at the offset of the second half of a u64 member, which would then
return a 'verifier is misconfigured' error. I think it would be better to
enforce BPF_DW access only on the u64 types and BPF_W access only on the u32
types before this becomes uapi, instead of opening this up generically. E.g.
we take a similar approach in pe_prog_is_valid_access() for sample_period
(leaving the narrow access bits there aside).

Thanks,
Daniel
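
A rough sketch of the per-field rule suggested above, as a self-contained model. struct demo_ctx and demo_is_valid_access() are hypothetical and much smaller than the real bpf_sock_ops; the point is only that the permitted load size depends on which field the offset falls in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout: 32-bit fields followed by 64-bit fields. */
struct demo_ctx {
	uint32_t state;
	uint32_t rtt_min;
	uint64_t bytes_received;
	uint64_t bytes_acked;
};

/* Hypothetical validity check: the access size must match the field width. */
static bool demo_is_valid_access(int off, int size)
{
	assert(size > 0);

	if (off < 0 || (size_t)off + (size_t)size > sizeof(struct demo_ctx))
		return false;
	if (off % size != 0)
		return false;

	/* 64-bit fields: only 8-byte loads are accepted. */
	if ((size_t)off >= offsetof(struct demo_ctx, bytes_received))
		return size == (int)sizeof(uint64_t);

	/* 32-bit fields: only 4-byte loads are accepted. */
	return size == (int)sizeof(uint32_t);
}
```

With this shape, an 8-byte load spanning two u32 members and a 4-byte load at the second half of a u64 member are both rejected up front.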

[PATCH net] ipv6: remove null_entry before adding default route

2018-01-05 Thread Wei Wang

From: Wei Wang 

In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0x,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problems.

In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.

syzkaller reported the following issue which is fixed by this commit:
=
WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [] 
call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2:  (rcu_read_lock){}, at: [<91db762d>] 
__fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh 
include/linux/spinlock.h:315 [inline]
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
__fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 

Reported-by: syzbot 
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang 
---
 net/ipv6/ip6_fib.c | 45 +++--
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d11a5578e4f8..37cb4ad1ea29 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
if (!(fn->fn_flags & RTN_RTINFO)) {
RCU_INIT_POINTER(fn->leaf, NULL);
rt6_release(leaf);
+   /* remove null_entry in the root node */
+   } else if (fn->fn_flags & RTN_TL_ROOT &&
+  rcu_access_pointer(fn->leaf) ==
+  net->ipv6.ip6_null_entry) {
+   RCU_INIT_POINTER(fn->leaf, NULL);
}
 
return fn;
@@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
*rt,
return err;
 
 failure:
-   /* fn->leaf could be NULL if fn is an intermediate node and we
-* failed to add the new route to it in both subtree creation
-* failure and fib6_add_rt2node() failure case.
-* In both cases, fib6_repair_tree() should be called to fix
+   /* fn->leaf could be NULL if:
+* 1. fn is the root node in the table and we fail to add the default
+* route to it.
+* In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
+* the way the table was created.
+* 2. fn is an intermediate node and we failed to add the new
+* route to it in both subtree creation failure and fib6_add_rt2node()
+* failure case.
+* In this case, fib6_repair_tree() should be

[PATCH 00/18] prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [1]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest api and adds the x86
specific implementation of nospec_barrier. The
nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
wide analysis performed by Elena Reshetova to address static analysis
reports where speculative execution on a userspace controlled value
could bypass a bounds check. The patches address a precondition for the
attack discussed in the Spectre paper [2].

A consideration worth noting for reviewing these patches is to weigh the
dramatic cost of being wrong about whether a given report is exploitable
vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
let's make the bar for applying these patches be "can you prove that the
bounds check bypass is *not* exploitable". Consider that the Spectre
paper reports one example of a speculation window being ~180 cycles.

Note that there is also a proposal from Linus, array_access [3], that
attempts to quash speculative execution past a bounds check without
introducing an lfence instruction. That may be a future optimization
possibility that is compatible with this api, but it would appear to
need guarantees from the compiler that it is not clear the kernel can
rely on at this point. It is also not clear that it would be a
significant performance win vs lfence.

These patches will also be available via the 'nospec' git branch
here:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec

[1]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[2]: https://spectreattack.com/spectre.pdf
[3]: https://marc.info/?l=linux-kernel&m=151510446027625&w=2

---

Andi Kleen (1):
  x86, barrier: stop speculation for failed access_ok

Dan Williams (13):
  x86: implement nospec_barrier()
  [media] uvcvideo: prevent bounds-check bypass via speculative execution
  carl9170: prevent bounds-check bypass via speculative execution
  p54: prevent bounds-check bypass via speculative execution
  qla2xxx: prevent bounds-check bypass via speculative execution
  cw1200: prevent bounds-check bypass via speculative execution
  Thermal/int340x: prevent bounds-check bypass via speculative execution
  ipv6: prevent bounds-check bypass via speculative execution
  ipv4: prevent bounds-check bypass via speculative execution
  vfs, fdtable: prevent bounds-check bypass via speculative execution
  net: mpls: prevent bounds-check bypass via speculative execution
  udf: prevent bounds-check bypass via speculative execution
  userns: prevent bounds-check bypass via speculative execution

Mark Rutland (4):
  asm-generic/barrier: add generic nospec helpers
  Documentation: document nospec helpers
  arm64: implement nospec_ptr()
  arm: implement nospec_ptr()

 Documentation/speculation.txt  |  166 
 arch/arm/include/asm/barrier.h |   75 +
 arch/arm64/include/asm/barrier.h   |   55 +++
 arch/x86/include/asm/barrier.h |6 +
 arch/x86/include/asm/uaccess.h |   17 ++
 drivers/media/usb/uvc/uvc_v4l2.c   |7 +
 drivers/net/wireless/ath/carl9170/main.c   |6 -
 drivers/net/wireless/intersil/p54/main.c   |8 +
 drivers/net/wireless/st/cw1200/sta.c   |   10 +
 drivers/net/wireless/st/cw1200/wsm.h   |4 
 drivers/scsi/qla2xxx/qla_mr.c  |   15 +-
 .../thermal/int340x_thermal/int340x_thermal_zone.c |   14 +-
 fs/udf/misc.c  |   39 +++--
 include/asm-generic/barrier.h  |   68 
 include/linux/fdtable.h|5 -
 kernel/user_namespace.c|   10 -
 net/ipv4/raw.c |9 +
 net/ipv6/raw.c |9 +
 net/mpls/af_mpls.c |   12 +
 19 files changed, 466 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/speculation.txt

[PATCH 01/18] asm-generic/barrier: add generic nospec helpers

2018-01-05 Thread Dan Williams

From: Mark Rutland 

Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

This patch adds helpers which can be used to inhibit the use of
out-of-bounds pointers under speculation.

A generic implementation is provided for compatibility, but does not
guarantee safety under speculation. Architectures are expected to
override these helpers as necessary.

Signed-off-by: Mark Rutland 
Signed-off-by: Will Deacon 
Cc: Daniel Willams 
Cc: Peter Zijlstra 
Signed-off-by: Dan Williams 
---
 include/asm-generic/barrier.h |   68 +
 1 file changed, 68 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b599b0a..91c3071f49e5 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -54,6 +54,74 @@
 #define read_barrier_depends() do { } while (0)
 #endif
 
+/*
+ * Inhibit subsequent speculative memory accesses.
+ *
+ * Architectures with a suitable memory barrier should provide an
+ * implementation. This is non-portable, and generic code should use
+ * nospec_ptr().
+ */
+#ifndef __nospec_barrier
+#define __nospec_barrier() do { } while (0)
+#endif
+
+/**
+ * nospec_ptr() - Ensure a pointer is bounded, even under speculation.
+ *
+ * @ptr: the pointer to test
+ * @lo: the lower valid bound for @ptr, inclusive
+ * @hi: the upper valid bound for @ptr, exclusive
+ *
+ * If @ptr falls in the interval [@lo, @hi), returns @ptr, otherwise returns
+ * NULL.
+ *
+ * Architectures which do not provide __nospec_barrier() should override this
+ * to ensure that ptr falls in the [lo, hi) interval both under architectural
+ * execution and under speculation, preventing propagation of an out-of-bounds
+ * pointer to code which is speculatively executed.
+ */
+#ifndef nospec_ptr
+#define nospec_ptr(ptr, lo, hi) \
+({ \
+   typeof (ptr) __ret; \
+   typeof (ptr) __ptr = (ptr); \
+   typeof (ptr) __lo = (lo);   \
+   typeof (ptr) __hi = (hi);   \
+   \
+   __ret = (__lo <= __ptr && __ptr < __hi) ? __ptr : NULL; \
+   \
+   __nospec_barrier(); \
+   \
+   __ret;  \
+})
+#endif
+
+/**
+ * nospec_array_ptr - Generate a pointer to an array element, ensuring the
+ * pointer is bounded under speculation.
+ *
+ * @arr: the base of the array
+ * @idx: the index of the element
+ * @sz: the number of elements in the array
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to @arr[@idx],
+ * otherwise returns NULL.
+ *
+ * This is a wrapper around nospec_ptr(), provided for convenience.
+ * Architectures should implement nospec_ptr() to ensure this is the case
+ * under speculation.
+ */
+#define nospec_array_ptr(arr, idx, sz) \
+({ \
+   typeof(*(arr)) *__arr = (arr);  \
+   typeof(idx) __idx = (idx);  \
+   typeof(sz) __sz = (sz); \
+   \
+   nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
+})
+
+#undef __nospec_barrier
+
 #ifndef __smp_mb
 #define __smp_mb() mb()
 #endif
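
As a reading aid, the documented semantics of the two helpers can be modelled in plain user-space C. This is a sketch under stated assumptions, not the kernel macros: model_nospec_array_ptr() and load_array() are hypothetical names, the model is fixed to int arrays, and like the generic fallback in the patch it gives no protection under speculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of nospec_array_ptr() for an int array: returns &arr[idx] when
 * idx is in [0, sz), otherwise NULL. Addresses are compared as integers
 * to avoid forming an out-of-bounds pointer in C.
 * Note: idx * sizeof(*arr) can wrap for very large idx; this model does
 * not guard against that.
 */
static const int *model_nospec_array_ptr(const int *arr, size_t idx, size_t sz)
{
	uintptr_t base, p, end;

	assert(arr != NULL);

	base = (uintptr_t)arr;
	p = base + idx * sizeof(*arr);
	end = base + sz * sizeof(*arr);

	return (base <= p && p < end) ? (const int *)p : NULL;
}

/* Hypothetical caller: dereference only the returned pointer. */
static int load_array(const int *array, size_t idx, size_t nelems)
{
	const int *elem = model_nospec_array_ptr(array, idx, nelems);

	/* Never fall back to array[idx]; use elem or a default value. */
	return elem ? *elem : 0;
}
```

The caller tests the returned pointer and dereferences that pointer, which is the usage pattern the helper's kernel-doc comment implies.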

[PATCH 02/18] Documentation: document nospec helpers

2018-01-05 Thread Dan Williams

From: Mark Rutland 

Document the rationale and usage of the new nospec*() helpers.

Signed-off-by: Mark Rutland 
Signed-off-by: Will Deacon 
Cc: Dan Williams 
Cc: Jonathan Corbet 
Cc: Peter Zijlstra 
Signed-off-by: Dan Williams 
---
 Documentation/speculation.txt |  166 +
 1 file changed, 166 insertions(+)
 create mode 100644 Documentation/speculation.txt

diff --git a/Documentation/speculation.txt b/Documentation/speculation.txt
new file mode 100644
index ..748fcd4dcda4
--- /dev/null
+++ b/Documentation/speculation.txt
@@ -0,0 +1,166 @@
+This document explains potential effects of speculation, and how undesirable
+effects can be mitigated portably using common APIs.
+
+===========
+Speculation
+===========
+
+To improve performance and minimize average latencies, many contemporary CPUs
+employ speculative execution techniques such as branch prediction, performing
+work which may be discarded at a later stage.
+
+Typically speculative execution cannot be observed from architectural state,
+such as the contents of registers. However, in some cases it is possible to
+observe its impact on microarchitectural state, such as the presence or
+absence of data in caches. Such state may form side-channels which can be
+observed to extract secret information.
+
+For example, in the presence of branch prediction, it is possible for bounds
+checks to be ignored by code which is speculatively executed. Consider the
+following code:
+
+   int load_array(int *array, unsigned int idx) {
+   if (idx >= MAX_ARRAY_ELEMS)
+   return 0;
+   else
+   return array[idx];
+   }
+
+Which, on arm64, may be compiled to an assembly sequence such as:
+
+   CMP <idx>, #MAX_ARRAY_ELEMS
+   B.LT less
+   MOV <returnval>, #0
+   RET
+  less:
+   LDR <returnval>, [<array>, <idx>]
+   RET
+
+It is possible that a CPU mis-predicts the conditional branch, and
+speculatively loads array[idx], even if idx >= MAX_ARRAY_ELEMS. This value
+will subsequently be discarded, but the speculated load may affect
+microarchitectural state which can be subsequently measured.
+
+More complex sequences involving multiple dependent memory accesses may result
+in sensitive information being leaked. Consider the following code, building on
+the prior example:
+
+   int load_dependent_arrays(int *arr1, int *arr2, int idx) {
+   int val1, val2;
+
+   val1 = load_array(arr1, idx);
+   val2 = load_array(arr2, val1);
+
+   return val2;
+   }
+
+Under speculation, the first call to load_array() may return the value of an
+out-of-bounds address, while the second call will influence microarchitectural
+state dependent on this value. This may provide an arbitrary read primitive.
+
+====================================
+Mitigating speculation side-channels
+====================================
+
+The kernel provides a generic API to ensure that bounds checks are respected
+even under speculation. Architectures which are affected by speculation-based
+side-channels are expected to implement these primitives.
+
+The following helpers found in <asm/barrier.h> can be used to prevent
+information from being leaked via side-channels.
+
+* nospec_ptr(ptr, lo, hi)
+
+  Returns a sanitized pointer that is bounded by the [lo, hi) interval. When
+  ptr < lo, or ptr >= hi, NULL is returned. Prevents an out-of-bounds pointer
+  being propagated to code which is speculatively executed.
+
+  This is expected to be used by code which computes pointers to data
+  structures, where part of the address (such as an array index) may be
+  user-controlled.
+
+  This can be used to protect the earlier load_array() example:
+
+  int load_array(int *array, unsigned int idx)
+  {
+   int *elem;
+
+   if ((elem = nospec_ptr(array + idx, array, array + MAX_ARRAY_ELEMS)))
+   return *elem;
+   else
+   return 0;
+  }
+
+  This can also be used in situations where multiple fields on a structure are
+  accessed:
+
+   struct foo array[SIZE];
+   int a, b;
+
+   void do_thing(int idx)
+   {
+   struct foo *elem;
+
+   if ((elem = nospec_ptr(array + idx, array, array + SIZE))) {
+   a = elem->field_a;
+   b = elem->field_b;
+   }
+   }
+
+  It is imperative that the returned pointer is used. Pointers which are
+  generated separately are subject to a number of potential CPU and compiler
+  optimizations, and may still be used speculatively. For example, this means
+  that the following sequence is unsafe:
+
+   struct foo array[SIZE];
+   int a, b;
+
+   void do_thing(int idx)
+   {
+   if (nospec_ptr(array + idx, array, array +

[PATCH 04/18] arm: implement nospec_ptr()

2018-01-05 Thread Dan Williams

From: Mark Rutland 

This patch implements nospec_ptr() for arm, following the recommended
architectural sequences for the arm and thumb instruction sets.
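
As a value-level illustration only, the result nospec_ptr() is documented to
return (ptr when lo <= ptr < hi, NULL otherwise) can be modelled in portable
C. The name nospec_ptr_model() is hypothetical and not part of this series;
the sketch shows the return value contract and nothing else, since plain C
gives no control over speculation, which is why the patch uses inline
assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the nospec_ptr() return value:
 * ptr when lo <= ptr < hi, NULL otherwise. It does NOT constrain
 * speculation; only the architectural sequence in the patch does that.
 */
void *nospec_ptr_model(const void *ptr, const void *lo, const void *hi)
{
	uintptr_t p = (uintptr_t)ptr;

	assert((uintptr_t)lo <= (uintptr_t)hi);

	if (p >= (uintptr_t)lo && p < (uintptr_t)hi)
		return (void *)ptr;
	return NULL;
}

#define MAX_ARRAY_ELEMS 4

/*
 * The load_array() example from the documentation, written against the
 * model. Forming array + idx far beyond the array is outside what ISO C
 * defines; the kernel relies on a flat address space here.
 */
int load_array(int *array, unsigned int idx)
{
	int *elem = nospec_ptr_model(array + idx, array,
				     array + MAX_ARRAY_ELEMS);

	return elem ? *elem : 0;
}
```

Callers are expected to dereference only the returned pointer, never a
pointer recomputed from the original index.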

Signed-off-by: Mark Rutland 
Signed-off-by: Dan Williams 
---
 arch/arm/include/asm/barrier.h |   75 
 1 file changed, 75 insertions(+)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 40f5c410fd8c..6384c90e4b72 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -37,6 +37,81 @@
 #define dmb(x) __asm__ __volatile__ ("" : : : "memory")
 #endif
 
+#ifdef CONFIG_THUMB2_KERNEL
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz)\
+({ \
+   typeof(*ptr) __nln_val; \
+   typeof(*ptr) __failval =\
+   (typeof(*ptr))(unsigned long)(failval); \
+   \
+   asm volatile (  \
+   "   cmp %[c], %[l]\n"   \
+   "   it  hs\n"   \
+   "   cmphs   %[h], %[c]\n"   \
+   "   blo 1f\n"   \
+   "   ld" #sz " %[v], %[p]\n" \
+   "1: it  lo\n"   \
+   "   movlo   %[v], %[f]\n"   \
+   "   .inst 0xf3af8014 @ CSDB\n"  \
+   : [v] "=&r" (__nln_val) \
+   : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "r" (__failval), [c] "r" (cmpptr) \
+   : "cc");\
+   \
+   __nln_val;  \
+})
+#else
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz)\
+({ \
+   typeof(*ptr) __nln_val; \
+   typeof(*ptr) __failval =\
+   (typeof(*ptr))(unsigned long)(failval); \
+   \
+   asm volatile (  \
+   "   cmp %[c], %[l]\n"   \
+   "   cmphs   %[h], %[c]\n"   \
+   "   ldr" #sz "hi %[v], %[p]\n"  \
+   "   movls   %[v], %[f]\n"   \
+   "   .inst 0xe320f014 @ CSDB\n"  \
+   : [v] "=&r" (__nln_val) \
+   : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "r" (__failval), [c] "r" (cmpptr) \
+   : "cc");\
+   \
+   __nln_val;  \
+})
+#endif
+
+#define __load_no_speculate(ptr, lo, hi, failval, cmpptr)  \
+({ \
+   typeof(*(ptr)) __nl_val;\
+   \
+   switch (sizeof(__nl_val)) { \
+   case 1: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, b);\
+   break;  \
+   case 2: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, h);\
+   break;  \
+   case 4: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, ); \
+   break;  \
+   default:\
+   BUILD_BUG();\
+   }   \
+   \
+   __nl_val;   \
+})
+
+#define nospec_ptr(ptr, lo, hi)

[PATCH 03/18] arm64: implement nospec_ptr()

2018-01-05 Thread Dan Williams

From: Mark Rutland 

This patch implements nospec_ptr() for arm64, following the recommended
architectural sequence.

Signed-off-by: Mark Rutland 
Signed-off-by: Will Deacon 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Signed-off-by: Dan Williams 
---
 arch/arm64/include/asm/barrier.h |   55 ++
 1 file changed, 55 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 77651c49ef44..b4819f6a0e5c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -40,6 +40,61 @@
 #define dma_rmb()  dmb(oshld)
 #define dma_wmb()  dmb(oshst)
 
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, w, sz) \
+({ \
+   typeof(*ptr)__nln_val;  \
+   typeof(*ptr)__failval = \
+   (typeof(*ptr))(unsigned long)(failval); \
+   \
+   asm volatile (  \
+   "   cmp %[c], %[l]\n"   \
+   "   ccmp%[c], %[h], 2, cs\n"\
+   "   b.cs1f\n"   \
+   "   ldr" #sz " %" #w "[v], %[p]\n"  \
+   "1: csel%" #w "[v], %" #w "[v], %" #w "[f], cc\n"   \
+   "   hint#0x14 // CSDB\n"\
+   : [v] "=&r" (__nln_val) \
+   : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "rZ" (__failval), [c] "r" (cmpptr)\
+   : "cc");\
+   \
+   __nln_val;  \
+})
+
+#define __load_no_speculate(ptr, lo, hi, failval, cmpptr)  \
+({ \
+   typeof(*(ptr)) __nl_val;\
+   \
+   switch (sizeof(__nl_val)) { \
+   case 1: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, w, b); \
+   break;  \
+   case 2: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, w, h); \
+   break;  \
+   case 4: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, w, );  \
+   break;  \
+   case 8: \
+   __nl_val = __load_no_speculate_n(ptr, lo, hi, failval,  \
+cmpptr, x, );  \
+   break;  \
+   default:\
+   BUILD_BUG();\
+   }   \
+   \
+   __nl_val;   \
+})
+
+#define nospec_ptr(ptr, lo, hi)    \
+({ \
+   typeof(ptr) __np_ptr = (ptr);   \
+   __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr);\
+})
+
 #define __smp_mb() dmb(ish)
 #define __smp_rmb()dmb(ishld)
 #define __smp_wmb()dmb(ishst)

[PATCH 05/18] x86: implement nospec_barrier()

2018-01-05 Thread Dan Williams

The new speculative execution barrier, nospec_barrier(), ensures
that any userspace controllable speculation doesn't cross the boundary.

Any user-observable speculative activity on this CPU thread before this
point either completes, reaches a state where it can no longer cause
observable activity, or is aborted before instructions after the barrier
execute.

In the x86 case nospec_barrier() resolves to an lfence if
X86_FEATURE_LFENCE_RDTSC is present. Other architectures can define
their variants.

Note the expectation is that this barrier is never used directly, at
least outside of architecture specific code. It is implied by the
nospec_{array_ptr,ptr} macros.

x86, for now, depends on the barrier for protection while other
architectures place their speculation prevention in
nospec_{ptr,array_ptr} when a barrier instruction is not available or
too heavy-weight. In the x86 case lfence is not a fully serializing
instruction so it is not as expensive as other barriers.
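
To make the ordering concrete, here is a minimal user-space sketch of a
bounds check followed by a barrier before the dependent load. The names
nospec_barrier_sketch() and table_read() are hypothetical. Unlike the patch,
the sketch emits LFENCE unconditionally on x86-64 instead of patching it in
through alternative(), and it falls back to a compiler-only barrier on other
architectures. It assumes GCC or Clang inline assembly. Generic kernel code
is still expected to go through nospec_{array_ptr,ptr} and not call the
barrier directly.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for nospec_barrier(). On x86-64 it is an LFENCE;
 * elsewhere it is only a compiler barrier, which does NOT stop the CPU
 * from speculating.
 */
#if defined(__x86_64__)
#define nospec_barrier_sketch()	__asm__ __volatile__("lfence" ::: "memory")
#else
#define nospec_barrier_sketch()	__asm__ __volatile__("" ::: "memory")
#endif

#define TABLE_SIZE 8

/* Returns table[idx], or -1 when idx is out of range. */
int table_read(const int *table, unsigned int idx)
{
	if (idx >= TABLE_SIZE)
		return -1;
	/* The load below must not be issued until the check has resolved. */
	nospec_barrier_sketch();
	return table[idx];
}
```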

Suggested-by: Peter Zijlstra 
Suggested-by: Arjan van de Ven 
Suggested-by: Alan Cox 
Cc: Mark Rutland 
Cc: Greg KH 
Cc: Thomas Gleixner 
Cc: Alan Cox 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/barrier.h |6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7fb336210e1b..1148cd9f5ae7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,6 +24,12 @@
 #define wmb()  asm volatile("sfence" ::: "memory")
 #endif
 
+/*
+ * CPUs without LFENCE don't really speculate much. Possibly fall back to
+ * IRET-to-self.
+ */
+#define __nospec_barrier() alternative("", "lfence", X86_FEATURE_LFENCE_RDTSC)
+#define nospec_barrier __nospec_barrier
+
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()  rmb()
 #else

[PATCH 07/18] [media] uvcvideo: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency to read 'pin' from the
'selector->baSourceID' array. In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'pin'.

Based on an original patch by Elena Reshetova.

Cc: Laurent Pinchart 
Cc: Mauro Carvalho Chehab 
Cc: linux-me...@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/media/usb/uvc/uvc_v4l2.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..7442626dc20e 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
struct uvc_entity *iterm = NULL;
u32 index = input->index;
int pin = 0;
+   __u8 *elem;
 
if (selector == NULL ||
(chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
@@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
break;
}
pin = iterm->id;
-   } else if (index < selector->bNrInPins) {
-   pin = selector->baSourceID[index];
+   } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
+   selector->bNrInPins))) {
+   pin = *elem;
list_for_each_entry(iterm, &chain->entities, chain) {
if (!UVC_ENTITY_IS_ITERM(iterm))
continue;

[PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-05 Thread Dan Williams

From: Andi Kleen 

When access_ok fails we should always stop speculating.
Add the required barriers to the x86 access_ok macro.
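
The arbitrary-size path of the range check in the diff below can be sketched
in user space as follows. The function name chk_range_not_ok_sketch() is
hypothetical, and the point where the patch calls nospec_barrier() is only
marked with a comment, since this version carries no barrier.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when [addr, addr + size) wraps around the address space
 * or extends past limit, i.e. when the range is NOT ok.
 */
bool chk_range_not_ok_sketch(unsigned long addr, unsigned long size,
			     unsigned long limit)
{
	addr += size;	/* may wrap; unsigned wrap-around is well defined */
	if (addr < size || addr > limit)
		return true;
	/* kernel patch: nospec_barrier() sits here, on the success path */
	return false;
}
```

Folding both failure conditions into one test leaves a single success path,
which is where the barrier has to go.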

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnd Bergmann 
Cc: x...@kernel.org
Signed-off-by: Andi Kleen 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/uaccess.h |   17 +
 include/asm-generic/barrier.h  |6 +++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 574dff4d2913..9b6f20cfaeb9 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -43,6 +43,8 @@ static inline void set_fs(mm_segment_t fs)
 /*
  * Test whether a block of memory is a valid user space address.
  * Returns 0 if the range is valid, nonzero otherwise.
+ *
+ * We also disable speculation when a check fails.
  */
 static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, unsigned long limit)
 {
@@ -53,14 +55,19 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
 * important to subtract the size from the
 * limit, not add it to the address).
 */
-   if (__builtin_constant_p(size))
-   return unlikely(addr > limit - size);
+   if (__builtin_constant_p(size)) {
+   if (unlikely(addr > limit - size))
+   return true;
+   nospec_barrier();
+   return false;
+   }
 
/* Arbitrary sizes? Be careful about overflow */
addr += size;
-   if (unlikely(addr < size))
+   if (unlikely(addr < size || addr > limit))
return true;
-   return unlikely(addr > limit);
+   nospec_barrier();
+   return false;
 }
 
 #define __range_not_ok(addr, size, limit)  \
@@ -94,6 +101,8 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
  * Note that, depending on architecture, this function probably just
  * checks that the pointer is in the user space range - after calling
  * this function, memory access functions may still return -EFAULT.
+ *
+ * Stops speculation automatically
  */
 #define access_ok(type, addr, size)\
 ({ \
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 91c3071f49e5..a11765eba860 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -59,7 +59,9 @@
  *
  * Architectures with a suitable memory barrier should provide an
  * implementation. This is non-portable, and generic code should use
- * nospec_ptr().
+ * nospec_{array_ptr,ptr}. Arch-specific code should define and use
+ * nospec_barrier() for usages where nospec_{array_ptr,ptr} is
+ * unsuitable.
  */
 #ifndef __nospec_barrier
 #define __nospec_barrier() do { } while (0)
@@ -120,8 +122,6 @@
nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
 })
 
-#undef __nospec_barrier
-
 #ifndef __smp_mb
 #define __smp_mb() mb()
 #endif

[PATCH 08/18] carl9170: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'ar9170_qmap' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'ar9170_qmap[queue]'. In this case the
value of 'ar9170_qmap[queue]' is immediately reused as an index to the
'ar->edcf' array.
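
A minimal sketch of the dependent-lookup pattern described above, with
hypothetical stand-ins (qmap, edcf, conf_tx_sketch) for the driver's data
and a value-level model of nospec_array_ptr(). The model macro only
reproduces the return value (&arr[idx] or NULL); it does not constrain
speculation the way the real helper is meant to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_QUEUES 4

/* Hypothetical stand-ins for ar9170_qmap[] and ar->edcf[]. */
const uint8_t qmap[NUM_QUEUES] = { 2, 1, 0, 3 };
int edcf[NUM_QUEUES];

/* Value-level model only: &arr[idx] when idx < sz, NULL otherwise. */
#define nospec_array_ptr_model(arr, idx, sz) \
	((size_t)(idx) < (size_t)(sz) ? &(arr)[(idx)] : NULL)

/* Returns 0 on success, -1 when queue is out of range. */
int conf_tx_sketch(unsigned int queue, int param)
{
	const uint8_t *elem = nospec_array_ptr_model(qmap, queue, NUM_QUEUES);

	if (!elem)
		return -1;
	/* The second index is read only through the sanitized pointer. */
	edcf[*elem] = param;
	return 0;
}
```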

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/ath/carl9170/main.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
index 988c8857d78c..0ff34cbe2b62 100644
--- a/drivers/net/wireless/ath/carl9170/main.c
+++ b/drivers/net/wireless/ath/carl9170/main.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "hw.h"
@@ -1384,11 +1385,12 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
   const struct ieee80211_tx_queue_params *param)
 {
struct ar9170 *ar = hw->priv;
+   const u8 *elem;
int ret;
 
mutex_lock(&ar->mutex);
-   if (queue < ar->hw->queues) {
-   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
+   if ((elem = nospec_array_ptr(ar9170_qmap, queue, ar->hw->queues))) {
+   memcpy(&ar->edcf[*elem], param, sizeof(*param));
ret = carl9170_set_qos(ar);
} else {
ret = -EINVAL;

[PATCH 09/18] p54: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'priv->qos_params' array.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'priv->qos_params[queue]'.

Based on an original patch by Elena Reshetova.

Cc: Christian Lamparter 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/intersil/p54/main.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c
index ab6d39e12069..85c9cbee35fc 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -411,12 +412,13 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
   const struct ieee80211_tx_queue_params *params)
 {
struct p54_common *priv = dev->priv;
+   struct p54_edcf_queue_param *p54_q;
int ret;
 
mutex_lock(&priv->conf_mutex);
-   if (queue < dev->queues) {
-   P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
-   params->cw_min, params->cw_max, params->txop);
+   if ((p54_q = nospec_array_ptr(priv->qos_params, queue, dev->queues))) {
+   P54_SET_QUEUE(p54_q[0], params->aifs, params->cw_min,
+   params->cw_max, params->txop);
ret = p54_set_edcf(priv);
} else
ret = -EINVAL;

[PATCH 11/18] cw1200: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read 'txq_params' from the
'priv->tx_queue_params.params' array.  In order to avoid potential leaks
of kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'txq_params'.
In this case 'txq_params' is referenced later in the function.

Based on an original patch by Elena Reshetova.

Cc: Solomon Peachy 
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/net/wireless/st/cw1200/sta.c |   10 ++
 drivers/net/wireless/st/cw1200/wsm.h |4 +---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..886942617f14 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cw1200.h"
 #include "sta.h"
@@ -612,18 +613,19 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif,
   u16 queue, const struct ieee80211_tx_queue_params *params)
 {
struct cw1200_common *priv = dev->priv;
+   struct wsm_set_tx_queue_params *txq_params;
int ret = 0;
/* To prevent re-applying PM request OID again and again*/
bool old_uapsd_flags;
 
mutex_lock(&priv->conf_mutex);
 
-   if (queue < dev->queues) {
+   if ((txq_params = nospec_array_ptr(priv->tx_queue_params.params,
+   queue, dev->queues))) {
old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);
 
-   WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);
-   ret = wsm_set_tx_queue_params(priv,
-     &priv->tx_queue_params.params[queue], queue);
+   WSM_TX_QUEUE_SET(txq_params, 0, 0, 0);
+   ret = wsm_set_tx_queue_params(priv, txq_params, queue);
if (ret) {
ret = -EINVAL;
goto out;
diff --git a/drivers/net/wireless/st/cw1200/wsm.h b/drivers/net/wireless/st/cw1200/wsm.h
index 48086e849515..8c8d9191e233 100644
--- a/drivers/net/wireless/st/cw1200/wsm.h
+++ b/drivers/net/wireless/st/cw1200/wsm.h
@@ -1099,10 +1099,8 @@ struct wsm_tx_queue_params {
 };
 
 
-#define WSM_TX_QUEUE_SET(queue_params, queue, ack_policy, allowed_time,\
-   max_life_time)  \
+#define WSM_TX_QUEUE_SET(p, ack_policy, allowed_time, max_life_time)   \
 do {   \
-   struct wsm_set_tx_queue_params *p = &(queue_params)->params[queue]; \
p->ackPolicy = (ack_policy);\
p->allowedMediumTime = (allowed_time);  \
p->maxTransmitLifetime = (max_life_time);   \

[PATCH 10/18] qla2xxx: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'handle' may be a user controlled value
that is used as a data dependency to read 'sp' from the
'req->outstanding_cmds' array.  In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'sp'. In this
case 'sp' is directly dereferenced later in the function.

Based on an original patch by Elena Reshetova.

Cc: qla2xxx-upstr...@qlogic.com
Cc: "James E.J. Bottomley" 
Cc: "Martin K. Petersen" 
Cc: linux-s...@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 drivers/scsi/qla2xxx/qla_mr.c |   15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c
index d5da3981cefe..128b41de3784 100644
--- a/drivers/scsi/qla2xxx/qla_mr.c
+++ b/drivers/scsi/qla2xxx/qla_mr.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2275,7 +2276,7 @@ qlafx00_ioctl_iosb_entry(scsi_qla_host_t *vha, struct req_que *req,
 static void
 qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)
 {
-   srb_t   *sp;
+   srb_t   *sp, **elem;
fc_port_t   *fcport;
struct scsi_cmnd *cp;
struct sts_entry_fx00 *sts;
@@ -2304,8 +2305,9 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)
req = ha->req_q_map[que];
 
/* Validate handle. */
-   if (handle < req->num_outstanding_cmds)
-   sp = req->outstanding_cmds[handle];
+   if ((elem = nospec_array_ptr(req->outstanding_cmds, handle,
+   req->num_outstanding_cmds)))
+   sp = *elem;
else
sp = NULL;
 
@@ -2626,7 +2628,7 @@ static void
 qlafx00_multistatus_entry(struct scsi_qla_host *vha,
struct rsp_que *rsp, void *pkt)
 {
-   srb_t   *sp;
+   srb_t   *sp, **elem;
struct multi_sts_entry_fx00 *stsmfx;
struct qla_hw_data *ha = vha->hw;
uint32_t handle, hindex, handle_count, i;
@@ -2655,8 +2657,9 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha,
req = ha->req_q_map[que];
 
/* Validate handle. */
-   if (handle < req->num_outstanding_cmds)
-   sp = req->outstanding_cmds[handle];
+   if ((elem = nospec_array_ptr(req->outstanding_cmds, handle,
+   req->num_outstanding_cmds)))
+   sp = *elem;
else
sp = NULL;

[PATCH 13/18] ipv6: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw6_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv6/raw.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 761a473a07c5..384e3d59d148 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -725,17 +726,17 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
   struct sk_buff *skb)
 {
struct raw6_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   if ((rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen))) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;

[PATCH 15/18] vfs, fdtable: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Expectedly, static analysis reports that 'fd' is a user controlled value
that is used as a data dependency to read from the 'fdt->fd' array.  In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid 'file *' returned from __fcheck_files.

Based on an original patch by Elena Reshetova.

Cc: Al Viro 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 include/linux/fdtable.h |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..4a147c5c2533 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -81,9 +81,10 @@ struct dentry;
 static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
 {
struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+   struct file __rcu **fdp;
 
-   if (fd < fdt->max_fds)
-   return rcu_dereference_raw(fdt->fd[fd]);
+   if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
+   return rcu_dereference_raw(*fdp);
return NULL;
 }

[PATCH 16/18] net: mpls: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array.  In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid 'rt' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/mpls/af_mpls.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..ebcf0e246cfe 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
struct mpls_route *rt = NULL;
+   struct mpls_route __rcu **platform_label =
+   rcu_dereference(net->mpls.platform_label);
+   struct mpls_route __rcu **rtp;
 
-   if (index < net->mpls.platform_labels) {
-   struct mpls_route __rcu **platform_label =
-   rcu_dereference(net->mpls.platform_label);
-   rt = rcu_dereference(platform_label[index]);
-   }
+   if ((rtp = nospec_array_ptr(platform_label, index,
+   net->mpls.platform_labels)))
+   rt = rcu_dereference(*rtp);
return rt;
 }

[PATCH 17/18] udf: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'eahd->appAttrLocation' and
'eahd->impAttrLocation' may be user controlled values that are used as
data dependencies for calculating source and destination buffers for
memmove operations. In order to avoid potential leaks of kernel memory
values, block speculative execution of the instruction stream that could
issue further reads based on invalid 'aal' or 'ial' values.

Based on an original patch by Elena Reshetova.

Cc: Jan Kara 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 fs/udf/misc.c |   39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/udf/misc.c b/fs/udf/misc.c
index 401e64cde1be..9403160822de 100644
--- a/fs/udf/misc.c
+++ b/fs/udf/misc.c
@@ -51,6 +51,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
int offset;
uint16_t crclen;
struct udf_inode_info *iinfo = UDF_I(inode);
+   uint8_t *ea_dst, *ea_src;
+   uint32_t aal, ial;
 
ea = iinfo->i_ext.i_data;
if (iinfo->i_lenEAttr) {
@@ -100,33 +102,34 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
 
offset = iinfo->i_lenEAttr;
if (type < 2048) {
-   if (le32_to_cpu(eahd->appAttrLocation) <
-   iinfo->i_lenEAttr) {
-   uint32_t aal =
-   le32_to_cpu(eahd->appAttrLocation);
-   memmove(&ea[offset - aal + size],
-   &ea[aal], offset - aal);
+   aal = le32_to_cpu(eahd->appAttrLocation);
+   if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
+  iinfo->i_lenEAttr)) &&
+   (ea_src = nospec_array_ptr(ea, aal,
+  iinfo->i_lenEAttr))) {
+   memmove(ea_dst, ea_src, offset - aal);
offset -= aal;
eahd->appAttrLocation =
cpu_to_le32(aal + size);
}
-   if (le32_to_cpu(eahd->impAttrLocation) <
-   iinfo->i_lenEAttr) {
-   uint32_t ial =
-   le32_to_cpu(eahd->impAttrLocation);
-   memmove(&ea[offset - ial + size],
-   &ea[ial], offset - ial);
+
+   ial = le32_to_cpu(eahd->impAttrLocation);
+   if ((ea_dst = nospec_array_ptr(ea, offset - ial + size,
+  iinfo->i_lenEAttr)) &&
+   (ea_src = nospec_array_ptr(ea, ial,
+  iinfo->i_lenEAttr))) {
+   memmove(ea_dst, ea_src, offset - ial);
offset -= ial;
eahd->impAttrLocation =
cpu_to_le32(ial + size);
}
} else if (type < 65536) {
-   if (le32_to_cpu(eahd->appAttrLocation) <
-   iinfo->i_lenEAttr) {
-   uint32_t aal =
-   le32_to_cpu(eahd->appAttrLocation);
-   memmove(&ea[offset - aal + size],
-   &ea[aal], offset - aal);
+   aal = le32_to_cpu(eahd->appAttrLocation);
+   if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
+  iinfo->i_lenEAttr)) &&
+   (ea_src = nospec_array_ptr(ea, aal,
+  iinfo->i_lenEAttr))) {
+   memmove(ea_dst, ea_src, offset - aal);
offset -= aal;
eahd->appAttrLocation =
cpu_to_le32(aal + size);

[PATCH 18/18] userns: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'pos' may be a user controlled value that
is used as a data dependency determining which extent to return out of
'map'. In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid speculative result from 'm_start()'.

Based on an original patch by Elena Reshetova.

Cc: "Eric W. Biederman" 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 kernel/user_namespace.c |   10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..e958f2e5c061 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -648,15 +648,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
 {
loff_t pos = *ppos;
unsigned extents = map->nr_extents;
-   smp_rmb();
 
-   if (pos >= extents)
-   return NULL;
+   /* paired with smp_wmb in map_write */
+   smp_rmb();
 
if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-   return &map->extent[pos];
-
-   return &map->forward[pos];
+   return nospec_array_ptr(map->extent, pos, extents);
+   return nospec_array_ptr(map->forward, pos, extents);
 }
 
 static void *uid_m_start(struct seq_file *seq, loff_t *ppos)

[PATCH 12/18] Thermal/int340x: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'trip' may be a user controlled value that
is used as a data dependency to read '*temp' from the 'd->aux_trips'
array.  In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid value of '*temp'.

Based on an original patch by Elena Reshetova.

Cc: Srinivas Pandruvada 
Cc: Zhang Rui 
Cc: Eduardo Valentin 
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 .../thermal/int340x_thermal/int340x_thermal_zone.c |   14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
index 145a5c53ff5c..442a1d9bf7ad 100644
--- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "int340x_thermal_zone.h"
 
 static int int340x_thermal_get_zone_temp(struct thermal_zone_device *zone,
@@ -52,20 +53,21 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
 int trip, int *temp)
 {
struct int34x_thermal_zone *d = zone->devdata;
+   unsigned long *elem;
int i;
 
if (d->override_ops && d->override_ops->get_trip_temp)
return d->override_ops->get_trip_temp(zone, trip, temp);
 
-   if (trip < d->aux_trip_nr)
-   *temp = d->aux_trips[trip];
-   else if (trip == d->crt_trip_id)
+   if ((elem = nospec_array_ptr(d->aux_trips, trip, d->aux_trip_nr))) {
+   *temp = *elem;
+   } else if (trip == d->crt_trip_id) {
*temp = d->crt_temp;
-   else if (trip == d->psv_trip_id)
+   } else if (trip == d->psv_trip_id) {
*temp = d->psv_temp;
-   else if (trip == d->hot_trip_id)
+   } else if (trip == d->hot_trip_id) {
*temp = d->hot_temp;
-   else {
+   } else {
for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) {
if (d->act_trips[i].valid &&
d->act_trips[i].id == trip) {

[PATCH 14/18] ipv4: prevent bounds-check bypass via speculative execution

2018-01-05 Thread Dan Williams

Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.

Based on an original patch by Elena Reshetova.

Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: netdev@vger.kernel.org
Signed-off-by: Elena Reshetova 
Signed-off-by: Dan Williams 
---
 net/ipv4/raw.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 125c1eab3eaa..f72b20131a15 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -472,17 +473,17 @@ static int raw_getfrag(void *from, char *to, int offset, 
int len, int odd,
   struct sk_buff *skb)
 {
struct raw_frag_vec *rfv = from;
+   char *rfv_buf;
 
-   if (offset < rfv->hlen) {
+   if ((rfv_buf = nospec_array_ptr(rfv->hdr.c, offset, rfv->hlen))) {
int copy = min(rfv->hlen - offset, len);
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
-   memcpy(to, rfv->hdr.c + offset, copy);
+   memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
-   csum_partial_copy_nocheck(rfv->hdr.c + offset,
- to, copy, 0),
+   csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
 
odd = 0;

Re: [PATCH net-next v3 00/10] net: qualcomm: rmnet: Enable csum offloads

2018-01-05 Thread Subash Abhinov Kasiviswanathan


On 2018-01-05 13:41, Subash Abhinov Kasiviswanathan wrote:

This series introduces the MAPv4 packet format for checksum
offload plus some other minor changes.

Patches 1-3 are cleanups.

Patch 4 renames the ingress format to data format so that all data
formats can be configured using this going forward.

Patch 5 uses the pacing helper to improve TCP transmit performance.

Patches 6-9 define the MAPv4 format for checksum offload for RX and TX.
A new header and trailer format are used as part of MAPv4.
For RX checksum offload, only the 1's complement of the IP payload
portion is computed by hardware. The metadata from the RX header is
used to verify the checksum field in the packet. Note that the
IP packet and its fields are not modified by hardware; the hardware
only supplies metadata to help with the RX checksum. For TX, the
required metadata is filled up so hardware can compute the
checksum.
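
For reference, a minimal sketch of the 16-bit ones' complement sum that
such checksum offload is built on, in the style of RFC 1071. This is not
rmnet code: the function names are made up for illustration, the MAPv4
header and trailer are not modelled, and the 32-bit accumulator assumes
buffers well under 128 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Folded 16-bit ones' complement sum over len bytes, big-endian words. */
static uint16_t ones_complement_sum16(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)		/* odd trailing byte is padded with zero */
		sum += (uint32_t)data[len - 1] << 8;

	while (sum >> 16)	/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* The checksum field value is the complement of the folded sum. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	return (uint16_t)~ones_complement_sum16(data, len);
}
```

Software can combine a hardware-provided partial sum with its own sum of
the remaining bytes, since ones' complement addition is associative.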

Patch 10 enables GSO on rmnet devices

v1->v2: Fix sparse errors reported by kbuild test robot

v2->v3: Update the commit message for Patch 5 based on Eric's comments

Subash Abhinov Kasiviswanathan (10):
  net: qualcomm: rmnet: Remove redundant check when stamping map header
  net: qualcomm: rmnet: Remove invalid condition while stamping mux id
  net: qualcomm: rmnet: Remove unused function declaration
  net: qualcomm: rmnet: Rename ingress data format to data format
  net: qualcomm: rmnet: Set pacing shift
  net: qualcomm: rmnet: Define the MAPv4 packet formats
  net: qualcomm: rmnet: Add support for RX checksum offload
  net: qualcomm: rmnet: Handle command packets with checksum trailer
  net: qualcomm: rmnet: Add support for TX checksum offload
  net: qualcomm: rmnet: Add support for GSO

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c |  10 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |   2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  36 ++-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  23 +-
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c|  17 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 309 
-

 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|   2 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   4 +
 8 files changed, 378 insertions(+), 25 deletions(-)


Hi David

I don't see this series in patchwork.
Do I need to resubmit this?

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

Re: [bpf-next V4 PATCH 13/14] bpf: finally expose xdp_rxq_info to XDP bpf-programs

2018-01-05 Thread Daniel Borkmann

On 01/03/2018 11:26 AM, Jesper Dangaard Brouer wrote:
> Now all XDP drivers have been updated to set up xdp_rxq_info and assign
> this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
> of the xdp_rxq_info struct members.
> 
> This patch extends xdp_md and exposes UAPI to userspace for
> ingress_ifindex and rx_queue_index.  Access happens via bpf
> instruction rewrite, which loads data directly from struct xdp_rxq_info.
> 
> * ingress_ifindex maps to xdp_rxq_info->dev->ifindex
> * rx_queue_index  maps to xdp_rxq_info->queue_index
> 
> Signed-off-by: Jesper Dangaard Brouer 
> Acked-by: Alexei Starovoitov 
> ---
>  include/uapi/linux/bpf.h |3 +++
>  net/core/filter.c|   19 +++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 69eabfcb9bdb..a6000a95d40e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -899,6 +899,9 @@ struct xdp_md {
>   __u32 data;
>   __u32 data_end;
>   __u32 data_meta;
> + /* Below access go through struct xdp_rxq_info */
> + __u32 ingress_ifindex; /* rxq->dev->ifindex */
> + __u32 rx_queue_index;  /* rxq->queue_index  */
>  };
>  
>  enum sk_action {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 130b842c3a15..acdb94c0e97f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4304,6 +4304,25 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type 
> type,
> si->dst_reg, si->src_reg,
> offsetof(struct xdp_buff, data_end));
>   break;
> + case offsetof(struct xdp_md, ingress_ifindex):
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
> +   si->dst_reg, si->src_reg,
> +   offsetof(struct xdp_buff, rxq));
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, 
> dev),
> +   si->dst_reg, si->dst_reg,
> +   offsetof(struct xdp_rxq_info, dev));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +   bpf_target_off(struct net_device,
> +  ifindex, 4, target_size));

The bpf_target_off() is actually only used in the context of narrow ctx access.

This should just be:

*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
  offsetof(struct net_device, ifindex));

> + break;
> + case offsetof(struct xdp_md, rx_queue_index):
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
> +   si->dst_reg, si->src_reg,
> +   offsetof(struct xdp_buff, rxq));
> + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +   bpf_target_off(struct xdp_rxq_info,
> + queue_index, 4, target_size));

And here:

*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
  offsetof(struct xdp_rxq_info, queue_index));

> + break;
>   }
>  
>   return insn - insn_buf;
>
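
To make the review comments above concrete, here is a userspace sketch of
what the rewritten instruction sequence amounts to: each load takes a base
pointer plus a plain offsetof() of the field being read. The mock_*
structs and the load_*() helpers are hypothetical stand-ins and do not
match the kernel layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock layouts, hypothetical stand-ins for the kernel structs. */
struct mock_net_device { char name[16]; uint32_t ifindex; };
struct mock_rxq_info { struct mock_net_device *dev; uint32_t queue_index; };
struct mock_xdp_buff { void *data; void *data_end; struct mock_rxq_info *rxq; };

/* Load a pointer-sized field located off bytes into base. */
static void *load_ptr(const void *base, size_t off)
{
	void *p;

	assert(base != NULL);
	memcpy(&p, (const char *)base + off, sizeof(p));
	return p;
}

/* Load a 32-bit field located off bytes into base. */
static uint32_t load_u32(const void *base, size_t off)
{
	uint32_t v;

	assert(base != NULL);
	memcpy(&v, (const char *)base + off, sizeof(v));
	return v;
}

/* ingress_ifindex: xdp_buff -> rxq -> dev -> ifindex (three loads). */
static uint32_t read_ingress_ifindex(const struct mock_xdp_buff *xdp)
{
	void *reg = load_ptr(xdp, offsetof(struct mock_xdp_buff, rxq));

	reg = load_ptr(reg, offsetof(struct mock_rxq_info, dev));
	return load_u32(reg, offsetof(struct mock_net_device, ifindex));
}

/* rx_queue_index: xdp_buff -> rxq -> queue_index (two loads). */
static uint32_t read_rx_queue_index(const struct mock_xdp_buff *xdp)
{
	void *reg = load_ptr(xdp, offsetof(struct mock_xdp_buff, rxq));

	return load_u32(reg, offsetof(struct mock_rxq_info, queue_index));
}
```

Since both fields are read as full 32-bit words, a plain offsetof() is
enough in this model; no narrow-access offset adjustment is involved.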

Re: [PATCH iproute2 v2 2/3] link_iptnl: Print tunnel mode

2018-01-05 Thread Stephen Hemminger

On Tue,  2 Jan 2018 23:27:58 +0200
Serhey Popovych  wrote:

> Tunnel mode does not appear in parameters print for iptnl
> supported tunnels like ipip and sit, while printed for
> ip6tnl.
> 
> Print tunnel mode as "proto" field name for JSON and
> without any name when printing to cli to follow ip6tnl
> behaviour.
> 
> For non JSON output we have:
> 
>$ ip -d link show dev sit1
> 
> Before:
> ---
> 17: sit1@NONE:  mtu 1480 qdisc noop state DOWN ...
> link/sit X.X.X.X brd 0.0.0.0 promiscuity 0
> sit remote any local X.X.X.X ...
> ~~~
> 
> After:
> --
> 17: sit1@NONE:  mtu 1480 qdisc noop state DOWN ...
> link/sit X.X.X.X brd 0.0.0.0 promiscuity 0
> sit any remote any local X.X.X.X ...
> ^^^
> 
> Signed-off-by: Serhey Popovych 

Applied all three to master.
Thanks.

[PATCH ipsec] xfrm: don't call xfrm_policy_cache_flush while holding spinlock

2018-01-05 Thread Florian Westphal

xfrm_policy_cache_flush can sleep, so it cannot be called while holding
a spinlock.  We could release the lock first, but I don't see why we need
to invoke this function here in the first place: the packet path won't
reuse an xdst entry unless it's still valid.

While at it, add a might_sleep() annotation to xfrm_policy_cache_flush;
it would probably have caught this bug sooner.
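
A toy userspace model of why such an annotation helps: the check fires on
every call made in atomic context, even when the function would not
actually have blocked on that run. All mock_* names are hypothetical; in
the kernel the real check is only active with the relevant debug options
enabled.

```c
#include <assert.h>
#include <stdbool.h>

static int atomic_depth;	/* > 0 while a (mock) spinlock is held */
static int sleep_violations;	/* how often the annotation fired */

static void mock_spin_lock(void)
{
	atomic_depth++;
}

static void mock_spin_unlock(void)
{
	assert(atomic_depth > 0);
	atomic_depth--;
}

/* Records a violation instead of printing a warning. */
static bool mock_might_sleep(void)
{
	if (atomic_depth > 0) {
		sleep_violations++;
		return false;
	}
	return true;
}

/* Stand-in for a function that may block somewhere in its body. */
static void mock_cache_flush(void)
{
	mock_might_sleep();
	/* ... work that may sleep would go here ... */
}
```

Calling mock_cache_flush() between mock_spin_lock() and
mock_spin_unlock() bumps sleep_violations; calling it with no lock held
does not.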

Fixes: ec30d78c14a813 ("xfrm: add xdst pcpu cache")
Reported-by: syzbot+e149f7d1328c26f9c...@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d8a8129b9232..688336cb9956 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -974,8 +974,6 @@ int xfrm_policy_flush(struct net *net, u8 type, bool 
task_valid)
}
if (!cnt)
err = -ESRCH;
-   else
-   xfrm_policy_cache_flush();
 out:
spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
return err;
@@ -1744,6 +1742,8 @@ void xfrm_policy_cache_flush(void)
bool found = 0;
int cpu;
 
+   might_sleep();
+
local_bh_disable();
rcu_read_lock();
for_each_possible_cpu(cpu) {
-- 
2.13.6

Re: [bpf-next V4 PATCH 00/14] xdp: new XDP rx-queue info concept

2018-01-05 Thread Alexei Starovoitov

On Wed, Jan 03, 2018 at 11:25:08AM +0100, Jesper Dangaard Brouer wrote:
> V4:
> * Added reviewers/acks to patches
> * Fix patch desc in i40e that got out-of-sync with code
> * Add SPDX license headers for the two new files added in patch 14
> 
> V3:
> * Fixed bug in virtio_net driver
> * Removed export of xdp_rxq_info_init()
> 
> V2:
> * Changed API exposed to drivers
>   - Removed invocation of "init" in drivers, and only call "reg"
> (Suggested by Saeed)
>   - Allow "reg" to fail and handle this in drivers
> (Suggested by David Ahern)
> * Removed the SINKQ qtype, instead allow to register as "unused"
> * Also fixed some drivers during testing on actual HW (noted in patches)
> 
> There is a need for XDP to know more about the RX-queue a given XDP
> frame has arrived on, for both the XDP bpf-prog and the kernel side.
> 
> Instead of extending struct xdp_buff each time new info is needed,
> this patchset takes a different approach.  Struct xdp_buff is only
> extended with a pointer to a struct xdp_rxq_info (allowing for easier
> extension later).  This xdp_rxq_info contains information related
> to how the driver has set up the individual RX-queues.  This is
> read-mostly information, and all xdp_buff frames (in the driver's
> napi_poll) point to the same xdp_rxq_info (per RX-queue).
> 
> We stress this data/cache-line is for read-mostly info.  This is NOT
> for dynamic per-packet info; use data_meta for such use-cases.
> 
> This patchset starts out small, and only exposes ingress_ifindex and the
> RX-queue index to the XDP/BPF program. Access to tangible info like
> the ingress ifindex and RX queue index is fairly easy to comprehend.
> Other future use-cases could allow XDP frames to be recycled back
> to the originating device driver, by providing info on RX device and
> queue number.
> 
> As XDP doesn't have driver feature flags, and eBPF code due to
> bpf-tail-calls cannot determine which XDP driver invokes it, this
> patchset has to update every driver that supports XDP.
> 
> For driver developers (review individual driver patches!):
> 
> The xdp_rxq_info is tied to the driver's RX-ring(s). Whenever an RX-ring
> modification requires (temporarily) stopping RX frames, the
> xdp_rxq_info should (likely) also be unregistered and re-registered,
> especially if reallocating the pages in the ring. Make sure ethtool
> set_channels does the right thing. When replacing an XDP prog, if and
> only if the RX-ring needs to be changed, then also re-register the
> xdp_rxq_info.
> 
> I'm Cc'ing the individual driver patches to the registered maintainers.
> 
> Testing:
> 
> I've only tested the NIC drivers I have hardware for.  The general
> test procedure is to (DUT = Device Under Test):
>  (1) run pktgen script pktgen_sample04_many_flows.sh   (against DUT)
>  (2) run samples/bpf program xdp_rxq_info --dev $DEV   (on DUT)
>  (3) runtime modify number of NIC queues via ethtool -L(on DUT)
>  (4) runtime modify number of NIC ring-size via ethtool -G (on DUT)
> 
> Patch based on git tree bpf-next (at commit fb982666e380c1632a):
>  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

Applied, thank you Jesper.

I think Michael's suggested micro optimization for patch 7 can
be done as a follow up.

Re: [PATCH net-next v2 09/10] bnxt_en: add support for software dynamic interrupt moderation

2018-01-05 Thread Michael Chan

On Fri, Jan 5, 2018 at 2:58 PM, Andy Gospodarek  wrote:

> @@ -5705,7 +5748,13 @@ static void bnxt_enable_napi(struct bnxt *bp)
> int i;
>
> for (i = 0; i < bp->cp_nr_rings; i++) {
> +   struct bnxt_cp_ring_info *cpr = &bp->bnapi[i]->cp_ring;
> bp->bnapi[i]->in_reset = false;
> +
> +   if (!(bp->bnapi[i]->flags & BNXT_NAPI_FLAG_XDP)) {

This actually won't work.  The XDP rings are always the ones paired with
RX rings, because we need to be able to retransmit an XDP packet from the
RX ring to the "paired" TX ring under the same NAPI.

if (bp->bnapi[i]->rx_ring)

is the better way to check.  Because MQPRIO, XDP, and ethtool channels
settings can all create TX rings without RX rings.  If the rx_ring
pointer is NULL, there is no RX ring present for this CMPL ring and we
can skip.
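
A small sketch of the suggested check, using mock types rather than the
real bnxt structures (mock_napi, mock_rx_ring and init_dim_for_rx() are
hypothetical names): entries without an RX ring are skipped, and
everything else gets its moderation state initialised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's per-NAPI bookkeeping. */
struct mock_rx_ring { int id; };
struct mock_napi {
	struct mock_rx_ring *rx_ring;	/* NULL for TX-only completion rings */
	bool dim_initialized;
};

/* Initialise moderation state only where an RX ring exists. */
static size_t init_dim_for_rx(struct mock_napi *napi, size_t n)
{
	size_t i, count = 0;

	assert(napi != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!napi[i].rx_ring)	/* no RX ring on this ring: skip */
			continue;
		napi[i].dim_initialized = true;
		count++;
	}
	return count;
}
```

Keying the decision on the pointer itself avoids having to enumerate
every feature that can produce a ring without an RX side.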

> +   INIT_WORK(&cpr->dim.work, bnxt_dim_work);
> +   cpr->dim.mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +   }
> napi_enable(&bp->bnapi[i]->napi);
> }
>  }

[iproute2 net-next 1/2] tc: implement filter block sharing to ingress and clsact qdiscs

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/pkt_sched.h | 11 +
 tc/q_clsact.c  | 56 ++
 tc/q_ingress.c | 32 +---
 3 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096..8cc554a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,15 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* Ingress/clsact */
+
+enum {
+   TCA_CLSACT_UNSPEC,
+   TCA_CLSACT_INGRESS_BLOCK,
+   TCA_CLSACT_EGRESS_BLOCK,
+   __TCA_CLSACT_MAX
+};
+
+#define TCA_CLSACT_MAX (__TCA_CLSACT_MAX - 1)
+
 #endif
diff --git a/tc/q_clsact.c b/tc/q_clsact.c
index 341f653..06d67db 100644
--- a/tc/q_clsact.c
+++ b/tc/q_clsact.c
@@ -7,23 +7,69 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... clsact\n");
+   fprintf(stderr, "Usage: ... clsact [ingress_block BLOCK_INDEX] 
[egress_block BLOCK_INDEX]\n");
 }
 
 static int clsact_parse_opt(struct qdisc_util *qu, int argc, char **argv,
struct nlmsghdr *n, const char *dev)
 {
-   if (argc > 0) {
-   fprintf(stderr, "What is \"%s\"?\n", *argv);
-   explain();
-   return -1;
+   struct rtattr *tail;
+   unsigned int ingress_block = 0;
+   unsigned int egress_block = 0;
+
+   while (argc > 0) {
+   if (strcmp(*argv, "ingress_block") == 0) {
+   NEXT_ARG();
+   if (get_unsigned(&ingress_block, *argv, 0)) {
+   fprintf(stderr, "Illegal \"ingress_block\"\n");
+   return -1;
+   }
+   } else if (strcmp(*argv, "egress_block") == 0) {
+   NEXT_ARG();
+   if (get_unsigned(&egress_block, *argv, 0)) {
+   fprintf(stderr, "Illegal \"egress_block\"\n");
+   return -1;
+   }
+   } else {
+   fprintf(stderr, "What is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   NEXT_ARG_FWD();
}
 
+   tail = NLMSG_TAIL(n);
+   addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+   if (ingress_block)
+   addattr32(n, 1024, TCA_CLSACT_INGRESS_BLOCK, ingress_block);
+   if (egress_block)
+   addattr32(n, 1024, TCA_CLSACT_EGRESS_BLOCK, egress_block);
+   tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
return 0;
 }
 
 static int clsact_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 {
+   struct rtattr *tb[TCA_CLSACT_MAX + 1];
+   unsigned int block;
+
+   if (!opt)
+   return 0;
+
+   parse_rtattr_nested(tb, TCA_CLSACT_MAX, opt);
+
+   if (tb[TCA_CLSACT_INGRESS_BLOCK] &&
+   RTA_PAYLOAD(tb[TCA_CLSACT_INGRESS_BLOCK]) >= sizeof(__u32)) {
+   block = rta_getattr_u32(tb[TCA_CLSACT_INGRESS_BLOCK]);
+   print_uint(PRINT_ANY, "ingress_block",
+  "ingress_block %u ", block);
+   }
+   if (tb[TCA_CLSACT_EGRESS_BLOCK] &&
+   RTA_PAYLOAD(tb[TCA_CLSACT_EGRESS_BLOCK]) >= sizeof(__u32)) {
+   block = rta_getattr_u32(tb[TCA_CLSACT_EGRESS_BLOCK]);
+   print_uint(PRINT_ANY, "egress_block",
+  "egress_block %u ", block);
+   }
return 0;
 }
 
diff --git a/tc/q_ingress.c b/tc/q_ingress.c
index 1e42229..6899c4d 100644
--- a/tc/q_ingress.c
+++ b/tc/q_ingress.c
@@ -17,30 +17,56 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... ingress\n");
+   fprintf(stderr, "Usage: ... ingress [block BLOCK_INDEX]\n");
 }
 
 static int ingress_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 struct nlmsghdr *n, const char *dev)
 {
+   struct rtattr *tail;
+   unsigned int block;
+
while (argc > 0) {
if (strcmp(*argv, "handle") == 0) {
NEXT_ARG();
-   argc--; argv++;
+   } else if (strcmp(*argv, "block") == 0) {
+   NEXT_ARG();
+   if (get_unsigned(&block, *argv, 0)) {
+   fprintf(stderr, "Illegal \"block\"\n");
+   return -1;
+   }
} else {
fprintf(stderr, "What is \"%s\"?\n", *argv);
explain();
return -1;
}
+   NEXT_ARG_FWD();
}
 
+   tail = NLMSG_TAIL(n);
+   addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+   if (block)
+   addattr32(n, 1024, TCA_CLSACT_INGRESS_BLOCK, block);
+

[iproute2 net-next 2/2] tc: introduce support for block-handle for filter operations

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

Signed-off-by: Jiri Pirko 
---
 tc/tc_filter.c | 100 +++--
 1 file changed, 83 insertions(+), 17 deletions(-)

diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index 545cc3a..4127090 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -28,14 +28,17 @@
 static void usage(void)
 {
fprintf(stderr,
-   "Usage: tc filter [ add | del | change | replace | show ] dev 
STRING\n"
-   "Usage: tc filter get dev STRING parent CLASSID protocol PROTO 
handle FILTERID pref PRIO FILTER_TYPE\n"
+   "Usage: tc filter [ add | del | change | replace | show ] [ dev 
STRING ]\n"
+   "   tc filter [ add | del | change | replace | show ] [ 
block BLOCK_INDEX ]\n"
+   "   tc filter get dev STRING parent CLASSID protocol PROTO 
handle FILTERID pref PRIO FILTER_TYPE\n"
+   "   tc filter get block BLOCK_INDEX protocol PROTO handle 
FILTERID pref PRIO FILTER_TYPE\n"
"   [ pref PRIO ] protocol PROTO [ chain CHAIN_INDEX ]\n"
"   [ estimator INTERVAL TIME_CONSTANT ]\n"
"   [ root | ingress | egress | parent CLASSID ]\n"
"   [ handle FILTERID ] [ [ FILTER_TYPE ] [ help | OPTIONS 
] ]\n"
"\n"
"   tc filter show [ dev STRING ] [ root | ingress | egress 
| parent CLASSID ]\n"
+   "   tc filter show [ block BLOCK_INDEX ]\n"
"Where:\n"
"FILTER_TYPE := { rsvp | u32 | bpf | fw | route | etc. }\n"
"FILTERID := ... format depends on classifier, see there\n"
@@ -60,6 +63,7 @@ static int tc_filter_modify(int cmd, unsigned int flags, int 
argc, char **argv)
int protocol_set = 0;
__u32 chain_index;
int chain_index_set = 0;
+   __u32 block_index = 0;
char *fhandle = NULL;
char  d[IFNAMSIZ] = {};
char  k[FILTER_NAMESZ] = {};
@@ -73,7 +77,21 @@ static int tc_filter_modify(int cmd, unsigned int flags, int 
argc, char **argv)
NEXT_ARG();
if (d[0])
duparg("dev", *argv);
+   if (block_index) {
+   fprintf(stderr, "Error: \"dev\" cannot be used 
in the same time as \"block\"\n");
+   return -1;
+   }
strncpy(d, *argv, sizeof(d)-1);
+   } else if (matches(*argv, "block") == 0) {
+   NEXT_ARG();
+   if (block_index)
+   duparg("block", *argv);
+   if (d[0]) {
+   fprintf(stderr, "Error: \"block\" cannot be 
used in the same time as \"dev\"\n");
+   return -1;
+   }
+   if (get_u32(&block_index, *argv, 0) || !block_index)
+   invarg("invalid block index value", *argv);
} else if (strcmp(*argv, "root") == 0) {
if (req.t.tcm_parent) {
fprintf(stderr,
@@ -168,6 +186,9 @@ static int tc_filter_modify(int cmd, unsigned int flags, 
int argc, char **argv)
fprintf(stderr, "Cannot find device \"%s\"\n", d);
return 1;
}
+   } else if (block_index) {
+   req.t.tcm_ifindex = 0;
+   req.t.tcm_parent = block_index;
}
 
if (q) {
@@ -206,6 +227,7 @@ static __u32 filter_prio;
 static __u32 filter_protocol;
 static __u32 filter_chain_index;
 static int filter_chain_index_set;
+static __u32 filter_block_index;
 __u16 f_proto;
 
 int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
@@ -252,21 +274,27 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
print_bool(PRINT_ANY, "added", "added ", true);
 
print_string(PRINT_FP, NULL, "filter ", NULL);
-   if (!filter_ifindex || filter_ifindex != t->tcm_ifindex)
-   print_string(PRINT_ANY, "dev", "dev %s ",
-ll_index_to_name(t->tcm_ifindex));
-
-   if (!filter_parent || filter_parent != t->tcm_parent) {
-   if (t->tcm_parent == TC_H_ROOT)
-   print_bool(PRINT_ANY, "root", "root ", true);
-   else if (t->tcm_parent == TC_H_MAKE(TC_H_CLSACT, 
TC_H_MIN_INGRESS))
-   print_bool(PRINT_ANY, "ingress", "ingress ", true);
-   else if (t->tcm_parent == TC_H_MAKE(TC_H_CLSACT, 
TC_H_MIN_EGRESS))
-   print_bool(PRINT_ANY, "egress", "egress ", true);
-   else {
-   print_tc_classid(abuf, sizeof(abuf), t->tcm_parent);
-   print_string(PRINT_ANY, "parent", "parent %s ",

[patch net-next v6 09/11] mlxsw: spectrum_acl: Don't store netdev and ingress for ruleset unbind

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

Instead, pass netdev and ingress flag to ruleset unbind op.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  3 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  9 --
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 33 +++---
 3 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index a0adcd8..523e64e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -477,7 +477,8 @@ struct mlxsw_sp_acl_profile_ops {
void (*ruleset_del)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv);
int (*ruleset_bind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
struct net_device *dev, bool ingress);
-   void (*ruleset_unbind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv);
+   void (*ruleset_unbind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
+  struct net_device *dev, bool ingress);
u16 (*ruleset_group_id)(void *ruleset_priv);
size_t rule_priv_size;
int (*rule_add)(struct mlxsw_sp *mlxsw_sp,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index ead4cb8..7fb41a4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -128,11 +128,12 @@ static int mlxsw_sp_acl_ruleset_bind(struct mlxsw_sp 
*mlxsw_sp,
 }
 
 static void mlxsw_sp_acl_ruleset_unbind(struct mlxsw_sp *mlxsw_sp,
-   struct mlxsw_sp_acl_ruleset *ruleset)
+   struct mlxsw_sp_acl_ruleset *ruleset,
+   struct net_device *dev, bool ingress)
 {
const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
 
-   ops->ruleset_unbind(mlxsw_sp, ruleset->priv);
+   ops->ruleset_unbind(mlxsw_sp, ruleset->priv, dev, ingress);
 }
 
 static struct mlxsw_sp_acl_ruleset *
@@ -200,7 +201,9 @@ static void mlxsw_sp_acl_ruleset_destroy(struct mlxsw_sp 
*mlxsw_sp,
struct mlxsw_sp_acl *acl = mlxsw_sp->acl;
 
if (!ruleset->ht_key.chain_index)
-   mlxsw_sp_acl_ruleset_unbind(mlxsw_sp, ruleset);
+   mlxsw_sp_acl_ruleset_unbind(mlxsw_sp, ruleset,
+   ruleset->ht_key.dev,
+   ruleset->ht_key.ingress);
rhashtable_remove_fast(&acl->ruleset_ht, &ruleset->ht_node,
   mlxsw_sp_acl_ruleset_ht_params);
ops->ruleset_del(mlxsw_sp, ruleset->priv);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
index 7e8284b..50b2f9a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
@@ -154,10 +154,6 @@ struct mlxsw_sp_acl_tcam_group {
struct list_head region_list;
unsigned int region_count;
struct rhashtable chunk_ht;
-   struct {
-   u16 local_port;
-   bool ingress;
-   } bound;
struct mlxsw_sp_acl_tcam_group_ops *ops;
const struct mlxsw_sp_acl_tcam_pattern *patterns;
unsigned int patterns_count;
@@ -271,26 +267,28 @@ mlxsw_sp_acl_tcam_group_bind(struct mlxsw_sp *mlxsw_sp,
return -EINVAL;
 
mlxsw_sp_port = netdev_priv(dev);
-   group->bound.local_port = mlxsw_sp_port->local_port;
-   group->bound.ingress = ingress;
-   mlxsw_reg_ppbt_pack(ppbt_pl,
-   group->bound.ingress ? MLXSW_REG_PXBT_E_IACL :
-  MLXSW_REG_PXBT_E_EACL,
-   MLXSW_REG_PXBT_OP_BIND, group->bound.local_port,
+   mlxsw_reg_ppbt_pack(ppbt_pl, ingress ? MLXSW_REG_PXBT_E_IACL :
+  MLXSW_REG_PXBT_E_EACL,
+   MLXSW_REG_PXBT_OP_BIND, mlxsw_sp_port->local_port,
group->id);
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ppbt), ppbt_pl);
 }
 
 static void
 mlxsw_sp_acl_tcam_group_unbind(struct mlxsw_sp *mlxsw_sp,
-  struct mlxsw_sp_acl_tcam_group *group)
+  struct mlxsw_sp_acl_tcam_group *group,
+  struct net_device *dev, bool ingress)
 {
+   struct mlxsw_sp_port *mlxsw_sp_port;
char ppbt_pl[MLXSW_REG_PPBT_LEN];
 
-   mlxsw_reg_ppbt_pack(ppbt_pl,
-   group->bound.ingress ? MLXSW_REG_PXBT_E_IACL :
-  MLXSW_REG_PXBT_E_EACL,
-   MLXSW_REG_PXBT_OP_UNBIND, group->bound.local_port,
+   if

[patch net-next v6 11/11] mlxsw: spectrum_acl: Pass mlxsw_sp_port down to ruleset bind/unbind ops

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

No need to convert from mlxsw_sp_port to net_device and back again.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  6 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  4 ++--
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 27 +-
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index ab6ada7..525552d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -477,9 +477,11 @@ struct mlxsw_sp_acl_profile_ops {
   void *priv, void *ruleset_priv);
void (*ruleset_del)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv);
int (*ruleset_bind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
-   struct net_device *dev, bool ingress);
+   struct mlxsw_sp_port *mlxsw_sp_port,
+   bool ingress);
void (*ruleset_unbind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
-  struct net_device *dev, bool ingress);
+  struct mlxsw_sp_port *mlxsw_sp_port,
+  bool ingress);
u16 (*ruleset_group_id)(void *ruleset_priv);
size_t rule_priv_size;
int (*rule_add)(struct mlxsw_sp *mlxsw_sp,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index f98bca9..9439bfa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -169,7 +169,7 @@ mlxsw_sp_acl_ruleset_bind(struct mlxsw_sp *mlxsw_sp,
const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
 
return ops->ruleset_bind(mlxsw_sp, ruleset->priv,
-binding->mlxsw_sp_port->dev, binding->ingress);
+binding->mlxsw_sp_port, binding->ingress);
 }
 
 static void
@@ -181,7 +181,7 @@ mlxsw_sp_acl_ruleset_unbind(struct mlxsw_sp *mlxsw_sp,
const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
 
ops->ruleset_unbind(mlxsw_sp, ruleset->priv,
-   binding->mlxsw_sp_port->dev, binding->ingress);
+   binding->mlxsw_sp_port, binding->ingress);
 }
 
 static bool mlxsw_sp_acl_ruleset_block_bound(struct mlxsw_sp_acl_block *block)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
index 50b2f9a..c6e180c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
@@ -258,15 +258,11 @@ static void mlxsw_sp_acl_tcam_group_del(struct mlxsw_sp 
*mlxsw_sp,
 static int
 mlxsw_sp_acl_tcam_group_bind(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_acl_tcam_group *group,
-struct net_device *dev, bool ingress)
+struct mlxsw_sp_port *mlxsw_sp_port,
+bool ingress)
 {
-   struct mlxsw_sp_port *mlxsw_sp_port;
char ppbt_pl[MLXSW_REG_PPBT_LEN];
 
-   if (!mlxsw_sp_port_dev_check(dev))
-   return -EINVAL;
-
-   mlxsw_sp_port = netdev_priv(dev);
mlxsw_reg_ppbt_pack(ppbt_pl, ingress ? MLXSW_REG_PXBT_E_IACL :
   MLXSW_REG_PXBT_E_EACL,
MLXSW_REG_PXBT_OP_BIND, mlxsw_sp_port->local_port,
@@ -277,15 +273,11 @@ mlxsw_sp_acl_tcam_group_bind(struct mlxsw_sp *mlxsw_sp,
 static void
 mlxsw_sp_acl_tcam_group_unbind(struct mlxsw_sp *mlxsw_sp,
   struct mlxsw_sp_acl_tcam_group *group,
-  struct net_device *dev, bool ingress)
+  struct mlxsw_sp_port *mlxsw_sp_port,
+  bool ingress)
 {
-   struct mlxsw_sp_port *mlxsw_sp_port;
char ppbt_pl[MLXSW_REG_PPBT_LEN];
 
-   if (WARN_ON(!mlxsw_sp_port_dev_check(dev)))
-   return;
-
-   mlxsw_sp_port = netdev_priv(dev);
mlxsw_reg_ppbt_pack(ppbt_pl, ingress ? MLXSW_REG_PXBT_E_IACL :
   MLXSW_REG_PXBT_E_EACL,
MLXSW_REG_PXBT_OP_UNBIND, mlxsw_sp_port->local_port,
@@ -1054,22 +1046,25 @@ mlxsw_sp_acl_tcam_flower_ruleset_del(struct mlxsw_sp 
*mlxsw_sp,
 static int
 mlxsw_sp_acl_tcam_flower_ruleset_bind(struct mlxsw_sp *mlxsw_sp,
  void *ruleset_priv,
- struct net_device *dev, bool ingress)
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress)
 {
struct

[patch net-next v6 02/11] net: sched: avoid usage of tp->q in tcf_classify

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

Use block index in the messages instead.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_api.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2dd584a..f50d203 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -680,8 +680,9 @@ int tcf_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
 #ifdef CONFIG_NET_CLS_ACT
 reset:
if (unlikely(limit++ >= max_reclassify_loop)) {
-   net_notice_ratelimited("%s: reclassify loop, rule prio %u, 
protocol %02x\n",
-  tp->q->ops->id, tp->prio & 0xffff,
+   net_notice_ratelimited("%u: reclassify loop, rule prio %u, 
protocol %02x\n",
+  tp->chain->block->index,
+  tp->prio & 0xffff,
   ntohs(tp->protocol));
return TC_ACT_SHOT;
}
-- 
2.9.5

[patch net-next v6 05/11] net: sched: keep track of offloaded filters and check tc offload feature

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

During block bind, we need to check the tc offload feature. If it is
disabled but the block still contains offloaded filters, forbid the
bind. Also forbid registering a callback for a block that already
contains offloaded filters, as playback is not supported now.
To keep track of offloaded filters, a new counter is introduced,
along with a couple of helpers called from cls_* code.
These helpers set and clear the TCA_CLS_FLAGS_IN_HW flag.
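
A standalone model of the bookkeeping described above may help when
reviewing. It is a sketch with hypothetical mock_* names and a made-up
flag value, not the kernel code: the per-filter flag makes the counter
updates idempotent, and a bind is refused when the device cannot offload
but the block already holds offloaded filters.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_FLAG_IN_HW (1u << 2)	/* made-up value for this sketch */

struct mock_block {
	unsigned int offloadcnt;	/* number of offloaded filters */
};

/* Count a filter as offloaded at most once. */
static void mock_offload_inc(struct mock_block *block, unsigned int *flags)
{
	if (*flags & MOCK_FLAG_IN_HW)
		return;
	*flags |= MOCK_FLAG_IN_HW;
	block->offloadcnt++;
}

/* Drop a filter from the count only if it was counted before. */
static void mock_offload_dec(struct mock_block *block, unsigned int *flags)
{
	if (!(*flags & MOCK_FLAG_IN_HW))
		return;
	assert(block->offloadcnt > 0);
	*flags &= ~MOCK_FLAG_IN_HW;
	block->offloadcnt--;
}

/* Bind is allowed unless offload is off and filters are already in HW. */
static bool mock_bind_allowed(const struct mock_block *block,
			      bool dev_can_offload)
{
	return dev_can_offload || block->offloadcnt == 0;
}
```

Because the flag lives with the filter, a classifier can call the inc
helper on every successful hardware add without tracking whether it has
already done so.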

Signed-off-by: Jiri Pirko 
---
v4->v5:
- add tracking of binding of devs that are unable to offload and check
  that before block cbs call.
v3->v4:
- propagate netdev_ops->ndo_setup_tc error up to tcf_block_offload_bind
  caller
v2->v3:
- new patch
---
 include/net/sch_generic.h | 18 +++
 net/sched/cls_api.c   | 79 ++-
 net/sched/cls_bpf.c   |  5 ++-
 net/sched/cls_flower.c|  3 +-
 net/sched/cls_matchall.c  |  3 +-
 net/sched/cls_u32.c   | 13 
 6 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index dba2214..ab86b64 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -289,8 +289,26 @@ struct tcf_block {
struct list_head cb_list;
struct list_head owner_list;
bool keep_dst;
+   unsigned int offloadcnt; /* Number of offloaded filters */
+   unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
 };
 
+static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
+{
+   if (*flags & TCA_CLS_FLAGS_IN_HW)
+   return;
+   *flags |= TCA_CLS_FLAGS_IN_HW;
+   block->offloadcnt++;
+}
+
+static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
+{
+   if (!(*flags & TCA_CLS_FLAGS_IN_HW))
+   return;
+   *flags &= ~TCA_CLS_FLAGS_IN_HW;
+   block->offloadcnt--;
+}
+
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 {
struct qdisc_skb_cb *qcb;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3e12ef9..ae60fce 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -265,31 +265,66 @@ void tcf_chain_put(struct tcf_chain *chain)
 }
 EXPORT_SYMBOL(tcf_chain_put);
 
-static void tcf_block_offload_cmd(struct tcf_block *block, struct Qdisc *q,
- struct tcf_block_ext_info *ei,
- enum tc_block_command command)
+static bool tcf_block_offload_in_use(struct tcf_block *block)
+{
+   return block->offloadcnt;
+}
+
+static int tcf_block_offload_cmd(struct tcf_block *block,
+struct net_device *dev,
+struct tcf_block_ext_info *ei,
+enum tc_block_command command)
 {
-   struct net_device *dev = q->dev_queue->dev;
struct tc_block_offload bo = {};
 
-   if (!dev->netdev_ops->ndo_setup_tc)
-   return;
bo.command = command;
bo.binder_type = ei->binder_type;
bo.block = block;
-   dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
+   return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
 }
 
-static void tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
-  struct tcf_block_ext_info *ei)
+static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
+ struct tcf_block_ext_info *ei)
 {
-   tcf_block_offload_cmd(block, q, ei, TC_BLOCK_BIND);
+   struct net_device *dev = q->dev_queue->dev;
+   int err;
+
+   if (!dev->netdev_ops->ndo_setup_tc)
+   goto no_offload_dev_inc;
+
+   /* If tc offload feature is disabled and the block we try to bind
+* to already has some offloaded filters, forbid to bind.
+*/
+   if (!tc_can_offload(dev) && tcf_block_offload_in_use(block))
+   return -EOPNOTSUPP;
+
+   err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND);
+   if (err == -EOPNOTSUPP)
+   goto no_offload_dev_inc;
+   return err;
+
+no_offload_dev_inc:
+   if (tcf_block_offload_in_use(block))
+   return -EOPNOTSUPP;
+   block->nooffloaddevcnt++;
+   return 0;
 }
 
 static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 struct tcf_block_ext_info *ei)
 {
-   tcf_block_offload_cmd(block, q, ei, TC_BLOCK_UNBIND);
+   struct net_device *dev = q->dev_queue->dev;
+   int err;
+
+   if (!dev->netdev_ops->ndo_setup_tc)
+   goto no_offload_dev_dec;
+   err = tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND);
+   if (err == -EOPNOTSUPP)
+   goto no_offload_dev_dec;
+   return;
+
+no_offload_dev_dec:
+   WARN_ON(block->nooffloaddevcnt-- == 0);
 }
 
 static int
@@

[patch net-next v6 00/11] net: sched: allow qdiscs to share filter block instances

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

Currently the filters added to qdiscs are independent. So, for example, if
you have 2 netdevices, create an ingress qdisc on both, and want to add
identical filter rules to both, you need to add them twice. This patchset
makes this easier and, mainly, saves resources by allowing all filters
within a qdisc to be shared - I call it a "filter block". This also helps to
save resources when we offload to hw, for example to expensive TCAM.

So back to the example. First, we create 2 qdiscs. Both will share
block number 22. "22" is just an identifier. If we don't pass any
block number, a new one will be generated by the kernel:

$ tc qdisc add dev ens7 ingress block 22

$ tc qdisc add dev ens8 ingress block 22


Now if we list the qdiscs, we will see the block index in the output:

$ tc qdisc
qdisc ingress : dev ens7 parent :fff1 block 22
qdisc ingress : dev ens8 parent :fff1 block 22


To make it more visual, the situation looks like this:

   ens7 ingress qdisc                 ens8 ingress qdisc
          |                                  |
          |                                  |
          +---------->  block 22  <----------+

Unlimited number of qdiscs may share the same block.
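
To illustrate the sharing idea only (this is not how the patchset stores
blocks), here is a small userspace sketch in which each bind of a qdisc to a
block index either finds the existing block and takes a reference, or
creates it. The fixed-size table and all demo_* names are assumptions made
for the example; index 0 is simply rejected here instead of having an index
generated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_BLOCKS 8 /* arbitrary table size for the sketch */

struct demo_block {
	uint32_t index;      /* 0 means the slot is free */
	unsigned int refcnt; /* number of qdiscs bound to this block */
};

static struct demo_block demo_blocks[DEMO_MAX_BLOCKS];

/* Return the block with this index, creating it on first use. */
struct demo_block *demo_block_get(uint32_t index)
{
	struct demo_block *free_slot = NULL;
	size_t i;

	if (index == 0)
		return NULL; /* index generation is out of scope here */
	for (i = 0; i < DEMO_MAX_BLOCKS; i++) {
		if (demo_blocks[i].index == index) {
			demo_blocks[i].refcnt++;
			return &demo_blocks[i];
		}
		if (!free_slot && demo_blocks[i].index == 0)
			free_slot = &demo_blocks[i];
	}
	if (!free_slot)
		return NULL; /* table full */
	free_slot->index = index;
	free_slot->refcnt = 1;
	return free_slot;
}

/* Drop one reference; the slot is released when the last qdisc goes away. */
void demo_block_put(struct demo_block *block)
{
	assert(block->refcnt > 0);
	if (--block->refcnt == 0)
		block->index = 0;
}
```

Two calls to demo_block_get(22) return the same pointer, which mirrors the
two ingress qdiscs above ending up on one block.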

Now we can add a filter using the block index:

$ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop


Note we cannot use the qdisc for filter manipulations for shared blocks:

$ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 action drop
Error: Cannot work with shared block, please use block index.


We will see the same output whether we list filters for block 22 or for
the ingress qdisc of ens7 or ens8:

$ tc filter show block 22
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens7 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens8 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

---
v5->v6:
- added patch 6 that introduces block handle

v4->v5:
- patch 5:
 - add tracking of binding of devs that are unable to offload and check
   that before block cbs call.

v3->v4:
- patch 1:
 - rebased on top of the current net-next
 - added some extack strings
- patch 3:
 - rebased on top of the current net-next
- patch 5:
 - propagate netdev_ops->ndo_setup_tc error up to tcf_block_offload_bind
   caller
- patch 7:
 - rebased on top of the current net-next

v2->v3:
- removed original patch 1, removing tp->q cls_bpf dependency. Fixed by
  Jakub in the meantime.
- patch 1:
 - rebased on top of the current net-next
- patch 5:
 - new patch
- patch 8:
 - removed "p_" prefix from block index function args
- patch 10:
 - add tc offload feature handling

Jiri Pirko (11):
  net: sched: introduce support for multiple filter chain pointers
registration
  net: sched: avoid usage of tp->q in tcf_classify
  net: sched: introduce block mechanism to handle netif_keep_dst calls
  net: sched: remove classid and q fields from tcf_proto
  net: sched: keep track of offloaded filters and check tc offload
feature
  net: sched: use block index as a handle instead of qdisc when block is
shared
  net: sched: allow ingress and clsact qdiscs to share filter blocks
  mlxsw: spectrum_acl: Reshuffle code around
mlxsw_sp_acl_ruleset_create/destroy
  mlxsw: spectrum_acl: Don't store netdev and ingress for ruleset unbind
  mlxsw: spectrum_acl: Implement TC block sharing
  mlxsw: spectrum_acl: Pass mlxsw_sp_port down to ruleset bind/unbind
ops

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 182 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  44 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 302 ---
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c|  44 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  41 +-
 include/net/pkt_cls.h  |   9 +
 include/net/sch_generic.h  |  27 +-
 include/uapi/linux/pkt_sched.h |  11 +
 net/sched/cls_api.c| 604 -
 net/sched/cls_bpf.c|   9 +-
 net/sched/cls_flow.c   |   2 +-
 net/sched/cls_flower.c |   3 +-
 net/sched/cls_matchall.c   |   3 +-
 net/sched/cls_route.c  |   2 +-
 net/sched/cls_u32.c|  13 +-
 net/sched/sch_ingress.c|  89 ++-
 16 files changed, 1079 insertions(+), 306 deletions(-)

-- 
2.9.5

[patch net-next v6 08/11] mlxsw: spectrum_acl: Reshuffle code around mlxsw_sp_acl_ruleset_create/destroy

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

In order to prepare for follow-up changes, make the bind/unbind helpers
very simple. That requires moving the ht insertion/removal and the
bind/unbind calls into mlxsw_sp_acl_ruleset_create/destroy.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 102 ++---
 1 file changed, 46 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index 93dcd31..ead4cb8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -118,8 +118,26 @@ struct mlxsw_sp_fid *mlxsw_sp_acl_dummy_fid(struct 
mlxsw_sp *mlxsw_sp)
return mlxsw_sp->acl->dummy_fid;
 }
 
+static int mlxsw_sp_acl_ruleset_bind(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_acl_ruleset *ruleset,
+struct net_device *dev, bool ingress)
+{
+   const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
+
+   return ops->ruleset_bind(mlxsw_sp, ruleset->priv, dev, ingress);
+}
+
+static void mlxsw_sp_acl_ruleset_unbind(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_acl_ruleset *ruleset)
+{
+   const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
+
+   ops->ruleset_unbind(mlxsw_sp, ruleset->priv);
+}
+
 static struct mlxsw_sp_acl_ruleset *
-mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
+mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp, struct net_device *dev,
+   bool ingress, u32 chain_index,
const struct mlxsw_sp_acl_profile_ops *ops)
 {
struct mlxsw_sp_acl *acl = mlxsw_sp->acl;
@@ -132,6 +150,9 @@ mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
if (!ruleset)
return ERR_PTR(-ENOMEM);
ruleset->ref_count = 1;
+   ruleset->ht_key.dev = dev;
+   ruleset->ht_key.ingress = ingress;
+   ruleset->ht_key.chain_index = chain_index;
ruleset->ht_key.ops = ops;
 
err = rhashtable_init(>rule_ht, _sp_acl_rule_ht_params);
@@ -142,68 +163,49 @@ mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_ops_ruleset_add;
 
-   return ruleset;
-
-err_ops_ruleset_add:
-   rhashtable_destroy(>rule_ht);
-err_rhashtable_init:
-   kfree(ruleset);
-   return ERR_PTR(err);
-}
-
-static void mlxsw_sp_acl_ruleset_destroy(struct mlxsw_sp *mlxsw_sp,
-struct mlxsw_sp_acl_ruleset *ruleset)
-{
-   const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
-
-   ops->ruleset_del(mlxsw_sp, ruleset->priv);
-   rhashtable_destroy(>rule_ht);
-   kfree(ruleset);
-}
-
-static int mlxsw_sp_acl_ruleset_bind(struct mlxsw_sp *mlxsw_sp,
-struct mlxsw_sp_acl_ruleset *ruleset,
-struct net_device *dev, bool ingress,
-u32 chain_index)
-{
-   const struct mlxsw_sp_acl_profile_ops *ops = ruleset->ht_key.ops;
-   struct mlxsw_sp_acl *acl = mlxsw_sp->acl;
-   int err;
-
-   ruleset->ht_key.dev = dev;
-   ruleset->ht_key.ingress = ingress;
-   ruleset->ht_key.chain_index = chain_index;
err = rhashtable_insert_fast(>ruleset_ht, >ht_node,
 mlxsw_sp_acl_ruleset_ht_params);
if (err)
-   return err;
-   if (!ruleset->ht_key.chain_index) {
+   goto err_ht_insert;
+
+   if (!chain_index) {
/* We only need ruleset with chain index 0, the implicit one,
 * to be directly bound to device. The rest of the rulesets
 * are bound by "Goto action set".
 */
-   err = ops->ruleset_bind(mlxsw_sp, ruleset->priv, dev, ingress);
+   err = mlxsw_sp_acl_ruleset_bind(mlxsw_sp, ruleset,
+   dev, ingress);
if (err)
-   goto err_ops_ruleset_bind;
+   goto err_ruleset_bind;
}
-   return 0;
 
-err_ops_ruleset_bind:
+   return ruleset;
+
+err_ruleset_bind:
rhashtable_remove_fast(>ruleset_ht, >ht_node,
   mlxsw_sp_acl_ruleset_ht_params);
-   return err;
+err_ht_insert:
+   ops->ruleset_del(mlxsw_sp, ruleset->priv);
+err_ops_ruleset_add:
+   rhashtable_destroy(>rule_ht);
+err_rhashtable_init:
+   kfree(ruleset);
+   return ERR_PTR(err);
 }
 
-static void mlxsw_sp_acl_ruleset_unbind(struct mlxsw_sp *mlxsw_sp,
-   struct mlxsw_sp_acl_ruleset *ruleset)
+static void mlxsw_sp_acl_ruleset_destroy(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_acl_ruleset

[patch net-next v6 01/11] net: sched: introduce support for multiple filter chain pointers registration

2018-01-05 Thread Jiri Pirko

From: Jiri Pirko 

So far, it was only possible to register a single filter chain pointer,
to block->chain[0]. However, once blocks become shareable, we need to
allow registration of multiple filter chain pointers.
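
The general shape of this change can be sketched in plain userspace C: the
chain keeps a list of callback items instead of a single callback pointer,
and a head change walks the list. This is only an illustration under
assumptions; the demo_* names are hypothetical, a hand-rolled singly linked
list stands in for the kernel's list_head, and items are pushed at the head,
so callbacks run in reverse registration order here.

```c
#include <assert.h>
#include <stdlib.h>

typedef void demo_head_change_t(void *tp_head, void *priv);

/* One registered "chain head changed" callback with its private data. */
struct demo_cb_item {
	struct demo_cb_item *next;
	demo_head_change_t *head_change;
	void *priv;
};

struct demo_chain {
	struct demo_cb_item *cb_list; /* all registered callbacks */
};

/* Register one more callback on the chain; returns 0 or -1 on OOM. */
int demo_chain_cb_add(struct demo_chain *chain, demo_head_change_t *cb,
		      void *priv)
{
	struct demo_cb_item *item;

	assert(chain != NULL);
	item = malloc(sizeof(*item));
	if (!item)
		return -1;
	item->head_change = cb;
	item->priv = priv;
	item->next = chain->cb_list;
	chain->cb_list = item;
	return 0;
}

/* Notify every registered user that the chain head is now tp_head. */
void demo_chain_head_change(struct demo_chain *chain, void *tp_head)
{
	struct demo_cb_item *item;

	for (item = chain->cb_list; item; item = item->next)
		if (item->head_change)
			item->head_change(tp_head, item->priv);
}

/* Free all callback items. */
void demo_chain_cb_del_all(struct demo_chain *chain)
{
	struct demo_cb_item *item = chain->cb_list;

	while (item) {
		struct demo_cb_item *next = item->next;

		free(item);
		item = next;
	}
	chain->cb_list = NULL;
}

/* Example callback: a user that caches the head pointer in *priv. */
void demo_store_head(void *tp_head, void *priv)
{
	*(void **)priv = tp_head;
}
```

With two users registered through demo_chain_cb_add(), one call to
demo_chain_head_change() updates both cached head pointers, which is the
behaviour shared blocks need.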

Signed-off-by: Jiri Pirko 
---
v3->v4:
- rebased on top of the current net-next
- added some extack strings
v2->v3:
- rebased on top of the current net-next
---
 include/net/pkt_cls.h |   3 +
 include/net/sch_generic.h |   5 +-
 net/sched/cls_api.c   | 236 +-
 3 files changed, 216 insertions(+), 28 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 5cd3cf5..95c90da 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -29,6 +29,8 @@ struct tcf_block_ext_info {
enum tcf_block_binder_type binder_type;
tcf_chain_head_change_t *chain_head_change;
void *chain_head_change_priv;
+   bool shareable;
+   u32 block_index;
 };
 
 struct tcf_block_cb;
@@ -50,6 +52,7 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc 
*q,
 
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 {
+   WARN_ON(block->refcnt != 1);
return block->q;
 }
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ac029d5..5cc4d71 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -275,8 +275,7 @@ typedef void tcf_chain_head_change_t(struct tcf_proto 
*tp_head, void *priv);
 
 struct tcf_chain {
struct tcf_proto __rcu *filter_chain;
-   tcf_chain_head_change_t *chain_head_change;
-   void *chain_head_change_priv;
+   struct list_head filter_chain_list;
struct list_head list;
struct tcf_block *block;
u32 index; /* chain index */
@@ -285,6 +284,8 @@ struct tcf_chain {
 
 struct tcf_block {
struct list_head chain_list;
+   u32 index; /* block index for shared blocks */
+   unsigned int refcnt;
struct net *net;
struct Qdisc *q;
struct list_head cb_list;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 6708b69..2dd584a 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -179,6 +180,12 @@ static void tcf_proto_destroy(struct tcf_proto *tp)
kfree_rcu(tp, rcu);
 }
 
+struct tcf_filter_chain_list_item {
+   struct list_head list;
+   tcf_chain_head_change_t *chain_head_change;
+   void *chain_head_change_priv;
+};
+
 static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
  u32 chain_index)
 {
@@ -187,6 +194,7 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block 
*block,
chain = kzalloc(sizeof(*chain), GFP_KERNEL);
if (!chain)
return NULL;
+   INIT_LIST_HEAD(&chain->filter_chain_list);
 list_add_tail(&chain->list, &block->chain_list);
chain->block = block;
chain->index = chain_index;
@@ -194,12 +202,19 @@ static struct tcf_chain *tcf_chain_create(struct 
tcf_block *block,
return chain;
 }
 
+static void tcf_chain_head_change_item(struct tcf_filter_chain_list_item *item,
+  struct tcf_proto *tp_head)
+{
+   if (item->chain_head_change)
+   item->chain_head_change(tp_head, item->chain_head_change_priv);
+}
 static void tcf_chain_head_change(struct tcf_chain *chain,
  struct tcf_proto *tp_head)
 {
-   if (chain->chain_head_change)
-   chain->chain_head_change(tp_head,
-chain->chain_head_change_priv);
+   struct tcf_filter_chain_list_item *item;
+
+   list_for_each_entry(item, &chain->filter_chain_list, list)
+   tcf_chain_head_change_item(item, tp_head);
 }
 
 static void tcf_chain_flush(struct tcf_chain *chain)
@@ -280,17 +295,91 @@ static void tcf_block_offload_unbind(struct tcf_block 
*block, struct Qdisc *q,
tcf_block_offload_cmd(block, q, ei, TC_BLOCK_UNBIND);
 }
 
-int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
- struct tcf_block_ext_info *ei,
- struct netlink_ext_ack *extack)
+static int
+tcf_chain_head_change_cb_add(struct tcf_chain *chain,
+struct tcf_block_ext_info *ei,
+struct netlink_ext_ack *extack)
 {
-   struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
+   struct tcf_filter_chain_list_item *item;
+
+   item = kmalloc(sizeof(*item), GFP_KERNEL);
+   if (!item) {
+   NL_SET_ERR_MSG(extack, "Memory allocation for head change 
callback item failed");
+   return -ENOMEM;
+   }
+   item->chain_head_change = ei->chain_head_change;
+   item->chain_head_change_priv = ei->chain_head_change_priv;
+   if (chain->filter_chain)
+

[PATCH bpf] selftests/bpf: fix test_align

2018-01-05 Thread Alexei Starovoitov

Since commit 82abbf8d2fc4 the verifier rejects bit-wise arithmetic on
pointers earlier.
The test 'dubious pointer arithmetic' now has less output to match on.
Adjust it.

Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
Reported-by: kernel test robot 
Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_align.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_align.c 
b/tools/testing/selftests/bpf/test_align.c
index 8591c89c0828..471bbbdb94db 100644
--- a/tools/testing/selftests/bpf/test_align.c
+++ b/tools/testing/selftests/bpf/test_align.c
@@ -474,27 +474,7 @@ static struct bpf_align_test tests[] = {
.result = REJECT,
.matches = {
{4, "R5=pkt(id=0,off=0,r=0,imm=0)"},
-   /* ptr & 0x40 == either 0 or 0x40 */
-   {5, "R5=inv(id=0,umax_value=64,var_off=(0x0; 0x40))"},
-   /* ptr << 2 == unknown, (4n) */
-   {7, 
"R5=inv(id=0,smax_value=9223372036854775804,umax_value=18446744073709551612,var_off=(0x0;
 0xfffc))"},
-   /* (4n) + 14 == (4n+2).  We blow our bounds, because
-* the add could overflow.
-*/
-   {8, "R5=inv(id=0,var_off=(0x2; 0xfffc))"},
-   /* Checked s>=0 */
-   {10, 
"R5=inv(id=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2; 
0x7ffc))"},
-   /* packet pointer + nonnegative (4n+2) */
-   {12, 
"R6=pkt(id=1,off=0,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2;
 0x7ffc))"},
-   {14, 
"R4=pkt(id=1,off=4,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2;
 0x7ffc))"},
-   /* NET_IP_ALIGN + (4n+2) == (4n), alignment is fine.
-* We checked the bounds, but it might have been able
-* to overflow if the packet pointer started in the
-* upper half of the address space.
-* So we did not get a 'range' on R6, and the access
-* attempt will fail.
-*/
-   {16, 
"R6=pkt(id=1,off=0,r=0,umin_value=2,umax_value=9223372036854775806,var_off=(0x2;
 0x7ffc))"},
+   /* R5 bitwise operator &= on pointer prohibited */
}
},
{
-- 
2.9.5

[PATCH net-next v2 02/10] net/mlx5e: Move interrupt moderation forward declarations

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

Move these to a newly created file to prepare for moving these functions
to a library.

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h | 5 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ddb5429..2ccedf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -829,10 +829,6 @@ void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 
-void mlx5e_rx_am(struct mlx5e_rq *rq);
-void mlx5e_rx_am_work(struct work_struct *work);
-struct mlx5e_cq_moder mlx5e_am_get_def_profile(u8 rx_cq_period_mode);
-
 void mlx5e_update_stats(struct mlx5e_priv *priv, bool full);
 
 int mlx5e_create_flow_steering(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
index 84b8524..f5f6535 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
@@ -72,4 +72,9 @@ enum {
MLX5_CQ_PERIOD_NUM_MODES
 };
 
+struct mlx5e_rq;
+void mlx5e_rx_am(struct mlx5e_rq *rq);
+void mlx5e_rx_am_work(struct work_struct *work);
+struct mlx5e_cq_moder mlx5e_am_get_def_profile(u8 rx_cq_period_mode);
+
 #endif /* MLX5_AM_H */
-- 
2.7.4

[PATCH net-next v2 07/10] net/mlx5e: Move dynamic interrupt coalescing code to include/linux

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

This move allows drivers to add private structure elements to track the
number of packets, bytes, and interrupt events per ring.  A driver
also defines a workqueue handler to act on this collected data once per
poll and modify the coalescing parameters per ring.
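
For readers unfamiliar with the scheme, the per-ring bookkeeping amounts to
taking two snapshots of the counters and turning the difference into rates.
A minimal userspace sketch of that step follows; the demo_* types are
hypothetical, timestamps are supplied by the caller in microseconds, and
this is not the net_dim.h code.

```c
#include <assert.h>
#include <stdint.h>

/* Snapshot of one ring's counters at a point in time. */
struct demo_sample {
	uint64_t time_us; /* caller-supplied timestamp, microseconds */
	uint64_t packets;
	uint64_t bytes;
	uint16_t events;  /* interrupt event counter, allowed to wrap */
};

/* Rates derived from two snapshots. */
struct demo_stats {
	uint64_t pkts_per_ms;
	uint64_t bytes_per_ms;
	uint64_t events_per_ms;
};

/* Fill *out from the start/end pair; returns -1 if no time has elapsed. */
int demo_calc_stats(const struct demo_sample *start,
		    const struct demo_sample *end,
		    struct demo_stats *out)
{
	uint64_t delta_us = end->time_us - start->time_us;
	/* The cast back to 16 bits keeps the difference correct across wrap. */
	uint16_t nevents = (uint16_t)(end->events - start->events);

	if (delta_us == 0)
		return -1;
	out->pkts_per_ms = (end->packets - start->packets) * 1000 / delta_us;
	out->bytes_per_ms = (end->bytes - start->bytes) * 1000 / delta_us;
	out->events_per_ms = (uint64_t)nevents * 1000 / delta_us;
	return 0;
}
```

A driver-side work handler would then compare such rates against the
previous window to decide whether to move to a different coalescing profile.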

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.c | 307 --
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h | 108 ---
 include/linux/net_dim.h   | 376 ++
 6 files changed, 379 insertions(+), 417 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/net_dim.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h
 create mode 100644 include/linux/net_dim.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index b46b6de2..c805769 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -15,7 +15,7 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o 
fpga/conn.o fpga/sdk.o \
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
-   en_arfs.o en_fs_ethtool.o en_selftest.o net_dim.o
+   en_arfs.o en_fs_ethtool.o en_selftest.o
 
 mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 732f275..cb9abc9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -46,10 +46,10 @@
 #include 
 #include 
 #include 
+#include 
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
-#include "net_dim.h"
 
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index f620325..2b89951 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 
+#include 
 #include "en.h"
 
 void mlx5e_rx_dim_work(struct work_struct *work)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/net_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/net_dim.c
deleted file mode 100644
index 00b9ae3..000
--- a/drivers/net/ethernet/mellanox/mlx5/core/net_dim.c
+++ /dev/null
@@ -1,307 +0,0 @@
-/*
- * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
- * Copyright (c) 2017, Broadcom Limiited. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- * Redistribution and use in source and binary forms, with or
- * without modification, are permitted provided that the following
- * conditions are met:
- *
- *  - Redistributions of source code must retain the above
- *copyright notice, this list of conditions and the following
- *disclaimer.
- *
- *  - Redistributions in binary form must reproduce the above
- *copyright notice, this list of conditions and the following
- *disclaimer in the documentation and/or other materials
- *provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- */
-
-#include "en.h"
-
-#define NET_DIM_PARAMS_NUM_PROFILES 5
-/* Adaptive moderation profiles */
-#define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
-#define NET_DIM_DEF_PROFILE_CQE 1
-#define NET_DIM_DEF_PROFILE_EQE 1
-
-/* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
-   {1,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
-   {8,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
-   {64,  NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
-   {128, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
-   {256,

[PATCH net-next v2 06/10] net/mlx5e: Change Mellanox references in DIM code

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

Change all appropriate mlx5_am* and MLX5_AM* references to net_dim and
NET_DIM, respectively, in code that handles dynamic interrupt
moderation.  Also change all references from 'am' to 'dim' when used as
local variables.

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  14 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  52 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.c  | 284 ++---
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h  |  63 +++--
 8 files changed, 232 insertions(+), 211 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 121f280..732f275 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -115,6 +115,9 @@
 #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES0x80
 #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW0x2
 
+#define MLX5E_CQ_PERIOD_MODE_START_FROM_EQE0x0
+#define MLX5E_CQ_PERIOD_MODE_START_FROM_CQE0x1
+
 #define MLX5E_LOG_INDIR_RQT_SIZE   0x7
 #define MLX5E_INDIR_RQT_SIZE   BIT(MLX5E_LOG_INDIR_RQT_SIZE)
 #define MLX5E_MIN_NUM_CHANNELS 0x1
@@ -237,8 +240,8 @@ struct mlx5e_params {
u16 num_channels;
u8  num_tc;
bool rx_cqe_compress_def;
-   struct mlx5e_cq_moder rx_cq_moderation;
-   struct mlx5e_cq_moder tx_cq_moderation;
+   struct net_dim_cq_moder rx_cq_moderation;
+   struct net_dim_cq_moder tx_cq_moderation;
bool lro_en;
u32 lro_wqe_sz;
u16 tx_max_inline;
@@ -248,7 +251,7 @@ struct mlx5e_params {
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
bool vlan_strip_disable;
bool scatter_fcs_en;
-   bool rx_am_enabled;
+   bool rx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
@@ -527,7 +530,7 @@ struct mlx5e_rq {
unsigned long  state;
intix;
 
-   struct mlx5e_rx_am am; /* Adaptive Moderation */
+   struct net_dim dim; /* Dynamic Interrupt Moderation */
 
/* XDP */
struct bpf_prog   *xdp_prog;
@@ -1075,4 +1078,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
struct mlx5e_params *params,
u16 max_channels);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
+void mlx5e_rx_dim_work(struct work_struct *work);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index b9b434b..f620325 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -32,17 +32,17 @@
 
 #include "en.h"
 
-void mlx5e_rx_am_work(struct work_struct *work)
+void mlx5e_rx_dim_work(struct work_struct *work)
 {
-   struct mlx5e_rx_am *am = container_of(work, struct mlx5e_rx_am,
- work);
-   struct mlx5e_rq *rq = container_of(am, struct mlx5e_rq, am);
-   struct mlx5e_cq_moder cur_profile = mlx5e_am_get_profile(am->mode,
-
am->profile_ix);
+   struct net_dim *dim = container_of(work, struct net_dim,
+  work);
+   struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
+   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
+ 
dim->profile_ix);
 
mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
   cur_profile.usec, cur_profile.pkts);
 
-   am->state = MLX5E_AM_START_MEASURE;
+   dim->state = NET_DIM_START_MEASURE;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 8f05efa..62ac4c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -480,7 +480,7 @@ int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
coal->rx_max_coalesced_frames = 
priv->channels.params.rx_cq_moderation.pkts;
coal->tx_coalesce_usecs   = 
priv->channels.params.tx_cq_moderation.usec;
coal->tx_max_coalesced_frames = 
priv->channels.params.tx_cq_moderation.pkts;
-   coal->use_adaptive_rx_coalesce = priv->channels.params.rx_am_enabled;
+

[PATCH net-next v2 08/10] net/dim: use struct net_dim_sample as arg to net_dim

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

Simplify the arguments to net_dim() by packing them into a struct
net_dim_sample before calling the function.
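
The calling convention can be pictured with a reduced userspace sketch: the
caller builds one sample value and hands it to the state machine, instead of
passing three scalars. Everything named demo_* is hypothetical, the
threshold value is assumed, and the sketch assumes the start sample is
recorded from the sample that was passed in.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NEVENTS 64 /* assumed events-per-window threshold */

enum demo_state { DEMO_START_MEASURE, DEMO_MEASURE_IN_PROGRESS };

struct demo_sample {
	uint16_t event_ctr;
	uint64_t packets;
	uint64_t bytes;
};

struct demo_dim {
	enum demo_state state;
	struct demo_sample start_sample;
	unsigned int windows_done; /* completed measurement windows */
};

/* Pack the three counters into one value for the caller to pass on. */
struct demo_sample demo_make_sample(uint16_t event_ctr, uint64_t packets,
				    uint64_t bytes)
{
	struct demo_sample s = { event_ctr, packets, bytes };

	return s;
}

/* One step of the measurement state machine, driven by a single sample. */
void demo_dim_step(struct demo_dim *dim, struct demo_sample end_sample)
{
	uint16_t nevents;

	switch (dim->state) {
	case DEMO_MEASURE_IN_PROGRESS:
		nevents = (uint16_t)(end_sample.event_ctr -
				     dim->start_sample.event_ctr);
		if (nevents < DEMO_NEVENTS)
			break;
		dim->windows_done++;
		/* fall through */
	case DEMO_START_MEASURE:
		dim->start_sample = end_sample;
		dim->state = DEMO_MEASURE_IN_PROGRESS;
		break;
	}
}
```

Passing the sample by value keeps the call site to two arguments and lets
each driver decide where its counters come from.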

Signed-off-by: Andy Gospodarek 
Suggested-by: Tal Gilboa 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 13 -
 include/linux/net_dim.h   | 10 +++---
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index dae77a9..f292bb3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -78,11 +78,14 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
for (i = 0; i < c->num_tc; i++)
mlx5e_cq_arm(>sq[i].cq);
 
-   if (MLX5E_TEST_BIT(c->rq.state, MLX5E_RQ_STATE_AM))
-   net_dim(>rq.dim,
-   c->rq.cq.event_ctr,
-   c->rq.stats.packets,
-   c->rq.stats.bytes);
+   if (MLX5E_TEST_BIT(c->rq.state, MLX5E_RQ_STATE_AM)) {
+   struct net_dim_sample dim_sample;
+   net_dim_sample(c->rq.cq.event_ctr,
+  c->rq.stats.packets,
+  c->rq.stats.bytes,
+  _sample);
+   net_dim(>rq.dim, dim_sample);
+   }
 
mlx5e_cq_arm(>rq.cq);
mlx5e_cq_arm(>icosq.cq);
diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index bb99073..2cceefa 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -341,21 +341,18 @@ static inline void net_dim_calc_stats(struct 
net_dim_sample *start,
 }
 
 static inline void net_dim(struct net_dim *dim,
-  u16 event_ctr,
-  u64 packets,
-  u64 bytes)
+  struct net_dim_sample end_sample)
 {
-   struct net_dim_sample end_sample;
struct net_dim_stats curr_stats;
u16 nevents;
 
switch (dim->state) {
case NET_DIM_MEASURE_IN_PROGRESS:
-   nevents = BIT_GAP(BITS_PER_TYPE(u16), event_ctr,
+   nevents = BIT_GAP(BITS_PER_TYPE(u16),
+ end_sample.event_ctr,
  dim->start_sample.event_ctr);
if (nevents < NET_DIM_NEVENTS)
break;
-   net_dim_sample(event_ctr, packets, bytes, _sample);
net_dim_calc_stats(>start_sample, _sample,
   _stats);
if (net_dim_decision(_stats, dim)) {
@@ -365,7 +362,6 @@ static inline void net_dim(struct net_dim *dim,
}
/* fall through */
case NET_DIM_START_MEASURE:
-   net_dim_sample(event_ctr, packets, bytes, >start_sample);
dim->state = NET_DIM_MEASURE_IN_PROGRESS;
break;
case NET_DIM_APPLY_NEW_PROFILE:
-- 
2.7.4

[PATCH net-next v2 09/10] bnxt_en: add support for software dynamic interrupt moderation

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

This implements the changes needed for the bnxt_en driver to add support
for dynamic interrupt moderation per ring.

This does add additional counters in the receive path, but testing shows
that any additional instructions are offset by throughput gain when the
default configuration is for low latency.

Signed-off-by: Andy Gospodarek 
Cc: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/Makefile   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 49 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 34 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c | 32 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 12 ++
 5 files changed, 117 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c

diff --git a/drivers/net/ethernet/broadcom/bnxt/Makefile 
b/drivers/net/ethernet/broadcom/bnxt/Makefile
index 59c8ec9..7c560d5 100644
--- a/drivers/net/ethernet/broadcom/bnxt/Makefile
+++ b/drivers/net/ethernet/broadcom/bnxt/Makefile
@@ -1,4 +1,4 @@
 obj-$(CONFIG_BNXT) += bnxt_en.o
 
-bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o 
bnxt_xdp.o bnxt_vfr.o bnxt_devlink.o
+bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o 
bnxt_xdp.o bnxt_vfr.o bnxt_devlink.o bnxt_dim.o
 bnxt_en-$(CONFIG_BNXT_FLOWER_OFFLOAD) += bnxt_tc.o
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9efbdc6..b9d4c61 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1645,6 +1645,8 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi 
*bnapi, u32 *raw_cons,
rxr->rx_next_cons = NEXT_RX(cons);
 
 next_rx_no_prod:
+   cpr->rx_packets += 1;
+   cpr->rx_bytes += len;
*raw_cons = tmp_raw_cons;
 
return rc;
@@ -1802,6 +1804,7 @@ static irqreturn_t bnxt_msix(int irq, void *dev_instance)
struct bnxt_cp_ring_info *cpr = &bnapi->cp_ring;
u32 cons = RING_CMP(cpr->cp_raw_cons);
 
+   cpr->event_ctr++;
prefetch(&cpr->cp_desc_ring[CP_RING(cons)][CP_IDX(cons)]);
napi_schedule(&bnapi->napi);
return IRQ_HANDLED;
@@ -2025,6 +2028,14 @@ static int bnxt_poll(struct napi_struct *napi, int 
budget)
break;
}
}
+   if (bp->flags & BNXT_FLAG_DIM) {
+   struct net_dim_sample dim_sample;
+   net_dim_sample(cpr->event_ctr,
+  cpr->rx_packets,
+  cpr->rx_bytes,
+  &dim_sample);
+   net_dim(&cpr->dim, dim_sample);
+   }
mmiowb();
return work_done;
 }
@@ -2610,6 +2621,8 @@ static void bnxt_init_cp_rings(struct bnxt *bp)
struct bnxt_ring_struct *ring = &cpr->cp_ring_struct;
 
ring->fw_ring_id = INVALID_HW_RING_ID;
+   cpr->rx_ring_coal.coal_ticks = bp->rx_coal.coal_ticks;
+   cpr->rx_ring_coal.coal_bufs = bp->rx_coal.coal_bufs;
}
 }
 
@@ -4583,6 +4596,36 @@ static void bnxt_hwrm_set_coal_params(struct bnxt_coal 
*hw_coal,
req->flags = cpu_to_le16(flags);
 }
 
+int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi)
+{
+   struct hwrm_ring_cmpl_ring_cfg_aggint_params_input req_rx = {0};
+   struct bnxt_cp_ring_info *cpr = &bnapi->cp_ring;
+   struct bnxt_coal coal;
+   unsigned int grp_idx;
+
+/* Tick values in micro seconds.
+ * 1 coal_buf x bufs_per_record = 1 completion record.
+ */
+   memcpy(&coal, &bp->rx_coal, sizeof(struct bnxt_coal));
+
+   coal.coal_ticks = cpr->rx_ring_coal.coal_ticks;
+   coal.coal_bufs = cpr->rx_ring_coal.coal_bufs;
+
+   if (!bnapi->rx_ring)
+   return -ENODEV;
+
+   bnxt_hwrm_cmd_hdr_init(bp, &req_rx,
+  HWRM_RING_CMPL_RING_CFG_AGGINT_PARAMS, -1, -1);
+
+   bnxt_hwrm_set_coal_params(&coal, &req_rx);
+
+   grp_idx = bnapi->index;
+   req_rx.ring_id = cpu_to_le16(bp->grp_info[grp_idx].cp_fw_ring_id);
+
+   return hwrm_send_message(bp, &req_rx, sizeof(req_rx),
+HWRM_CMD_TIMEOUT);
+}
+
 int bnxt_hwrm_set_coal(struct bnxt *bp)
 {
int i, rc = 0;
@@ -5705,7 +5748,13 @@ static void bnxt_enable_napi(struct bnxt *bp)
int i;
 
for (i = 0; i < bp->cp_nr_rings; i++) {
+   struct bnxt_cp_ring_info *cpr = &bp->bnapi[i]->cp_ring;
bp->bnapi[i]->in_reset = false;
+
+   if (!(bp->bnapi[i]->flags & BNXT_NAPI_FLAG_XDP)) {
+   INIT_WORK(&cpr->dim.work, bnxt_dim_work);
+   cpr->dim.mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+   }
napi_enable(&bp->bnapi[i]->napi);
}
 }
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h

[PATCH net-next v2 00/10] net: create dynamic software irq moderation library

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

This converts the dynamic interrupt moderation code in the mlx5e
driver into a library so it can be used by any driver.  The penultimate
patch in this set adds support for this new dynamic interrupt moderation
library in the bnxt_en driver and the last patch creates an entry in the
MAINTAINERS file for this library.

The main purpose of this code is to allow an administrator to make sure
that default coalesce settings are optimized for low latency, but
quickly adapt to handle high throughput/bulk traffic by altering how
much time passes before popping an interrupt.

For any new driver the following changes would be needed to use this
library:

- add elements in ring struct to track items needed by this library
- create function that can be called to actually set coalesce settings
  for the driver
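
The two driver-side pieces listed above can be sketched in plain userspace
C.  This is only an illustration under stated assumptions: every demo_*
name below is hypothetical and is not part of the library or of any
driver; a real driver would keep these fields in its own ring structure
and program hardware in the "set" function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical coalesce setting: a time limit and a packet limit. */
struct demo_coal {
    uint16_t usec;
    uint16_t pkts;
};

/*
 * Hypothetical per-ring state: the three counters a moderation
 * library would sample, plus the setting currently in effect.
 */
struct demo_ring {
    uint16_t event_ctr;   /* bumped once per interrupt */
    uint64_t rx_packets;  /* bumped once per received packet */
    uint64_t rx_bytes;    /* bumped by the length of each packet */
    struct demo_coal coal;
};

/* Called from the (simulated) interrupt handler. */
static void demo_ring_irq(struct demo_ring *ring)
{
    assert(ring != NULL);
    ring->event_ctr++;
}

/* Called from the (simulated) receive path for each packet. */
static void demo_ring_rx(struct demo_ring *ring, uint32_t len)
{
    assert(ring != NULL);
    ring->rx_packets += 1;
    ring->rx_bytes += len;
}

/*
 * The driver-specific hook that actually applies a setting.  Here it
 * only records the value; returns 0 on success, -1 if there is no ring.
 */
static int demo_ring_set_coal(struct demo_ring *ring, struct demo_coal c)
{
    if (ring == NULL)
        return -1;
    ring->coal = c;
    return 0;
}
```

The counters are deliberately plain integers so that they can be handed to
a generic sampling function without exposing the ring structure itself.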

Credit to Rob Rice and Lee Reed for doing some of the initial proof of
concept and testing for this patch and Tal Gilboa and Or Gerlitz for
their comments, etc on this set.

v2: Spelling fixes from Stephen Hemminger, bnxt_en suggestions from
Michael Chan, spelling and formatting fixes from Or Gerlitz, and
spelling and mlx5e changes suggested by Tal Gilboa.

Andy Gospodarek (10):
  net/mlx5e: Move interrupt moderation structs to new file
  net/mlx5e: Move interrupt moderation forward declarations
  net/mlx5e: Remove rq references in mlx5e_rx_am
  net/mlx5e: Move AM logic enums
  net/mlx5e: Move generic functions to new file
  net/mlx5e: Change Mellanox references in DIM code
  net/mlx5e: Move dynamic interrupt coalescing code to include/linux
  net/dim: use struct net_dim_sample as arg to net_dim
  bnxt_en: add support for software dynamic interrupt moderation
  MAINTAINERS: add entry for Dynamic Interrupt Moderation

 MAINTAINERS|   5 +
 drivers/net/ethernet/broadcom/bnxt/Makefile|   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  49 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h  |  34 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  32 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  |  12 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  49 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  49 +++
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  52 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 341 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  10 +-
 include/linux/mlx5/mlx5_ifc.h  |   6 -
 include/linux/net_dim.h| 372 +
 16 files changed, 604 insertions(+), 427 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
 create mode 100644 include/linux/net_dim.h

-- 
2.7.4

[PATCH net-next v2 01/10] net/mlx5e: Move interrupt moderation structs to new file

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

Create new header file to prepare to move code that handles irq
moderation to a library that lives in a header file.

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 33 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h | 75 
 include/linux/mlx5/mlx5_ifc.h|  6 --
 3 files changed, 76 insertions(+), 38 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 543060c..ddb5429 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -49,6 +49,7 @@
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
+#include "en_dim.h"
 
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
@@ -226,12 +227,6 @@ enum mlx5e_priv_flag {
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #endif
 
-struct mlx5e_cq_moder {
-   u16 usec;
-   u16 pkts;
-   u8 cq_period_mode;
-};
-
 struct mlx5e_params {
u8  log_sq_size;
u8  rq_wq_type;
@@ -472,32 +467,6 @@ struct mlx5e_mpw_info {
u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
 };
 
-struct mlx5e_rx_am_stats {
-   int ppms; /* packets per msec */
-   int bpms; /* bytes per msec */
-   int epms; /* events per msec */
-};
-
-struct mlx5e_rx_am_sample {
-   ktime_t time;
-   u32 pkt_ctr;
-   u32 byte_ctr;
-   u16 event_ctr;
-};
-
-struct mlx5e_rx_am { /* Adaptive Moderation */
-   u8  state;
-   struct mlx5e_rx_am_statsprev_stats;
-   struct mlx5e_rx_am_sample   start_sample;
-   struct work_struct  work;
-   u8  profile_ix;
-   u8  mode;
-   u8  tune_state;
-   u8  steps_right;
-   u8  steps_left;
-   u8  tired;
-};
-
 /* a single cache unit is capable to serve one napi call (for non-striding rq)
  * or a MPWQE (for striding rq).
  */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
new file mode 100644
index 000..84b8524
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
@@ -0,0 +1,75 @@
+/*
+ * Copyright (c) 2013-2015, Mellanox Technologies, Ltd.  All rights reserved.
+ * Copyright (c) 2017, Broadcom Limited
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+#ifndef MLX5_AM_H
+#define MLX5_AM_H
+
+struct mlx5e_cq_moder {
+   u16 usec;
+   u16 pkts;
+   u8 cq_period_mode;
+};
+
+struct mlx5e_rx_am_sample {
+   ktime_t time;
+   u32 pkt_ctr;
+   u32 byte_ctr;
+   u16 event_ctr;
+};
+
+struct mlx5e_rx_am_stats {
+   int ppms; /* packets per msec */
+   int bpms; /* bytes per msec */
+   int epms; /* events per msec */
+};
+
+struct mlx5e_rx_am { /* Adaptive Moderation */
+   u8  state;
+   struct mlx5e_rx_am_statsprev_stats;
+   struct mlx5e_rx_am_sample   start_sample;
+   struct work_struct  work;
+   u8

[PATCH net-next v2 10/10] MAINTAINERS: add entry for Dynamic Interrupt Moderation

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

Signed-off-by: Andy Gospodarek 
Signed-off-by: Tal Gilboa 
Acked-by: Saeed Mahameed 
---
 MAINTAINERS | 5 +
 1 file changed, 5 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 753799d..178239dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4944,6 +4944,11 @@ S:   Maintained
 F: lib/dynamic_debug.c
 F: include/linux/dynamic_debug.h
 
+DYNAMIC INTERRUPT MODERATION
+M: Tal Gilboa 
+S: Maintained
+F: include/linux/net_dim.h
+
 DZ DECSTATION DZ11 SERIAL DRIVER
 M: "Maciej W. Rozycki" 
 S: Maintained
-- 
2.7.4

[PATCH net-next v2 03/10] net/mlx5e: Remove rq references in mlx5e_rx_am

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

This makes mlx5e_am_sample more generic, so that it can be called easily
from a driver that does not keep these values together in the same data
structure.
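
As a rough userspace illustration of why passing plain counters is enough,
the sketch below takes two samples and derives per-millisecond rates from
them.  It is a simplification under stated assumptions: the demo_* names
are hypothetical, the timestamp is passed in explicitly instead of coming
from ktime_get(), and the rates are truncated integer divisions, which
may differ from the rounding the driver code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field widths as struct mlx5e_rx_am_sample, with an explicit time. */
struct demo_sample {
    uint64_t time_us;
    uint32_t pkt_ctr;
    uint32_t byte_ctr;
    uint16_t event_ctr;
};

struct demo_stats {
    int ppms; /* packets per msec */
    int bpms; /* bytes per msec */
    int epms; /* events per msec */
};

/* Record a sample from plain counters; no driver structure is needed. */
static void demo_take_sample(uint64_t now_us, uint16_t event_ctr,
                             uint64_t packets, uint64_t bytes,
                             struct demo_sample *s)
{
    assert(s != NULL);
    s->time_us = now_us;
    s->pkt_ctr = (uint32_t)packets;
    s->byte_ctr = (uint32_t)bytes;
    s->event_ctr = event_ctr;
}

/* Events elapsed between two 16-bit counter readings, wrap-safe. */
static uint16_t demo_event_gap(uint16_t end, uint16_t start)
{
    return (uint16_t)(end - start);
}

/* Returns 0 on success, -1 if no time elapsed between the samples. */
static int demo_calc_stats(const struct demo_sample *start,
                           const struct demo_sample *end,
                           struct demo_stats *out)
{
    uint64_t delta_us = end->time_us - start->time_us;
    uint32_t npkts = end->pkt_ctr - start->pkt_ctr;    /* wrap-safe */
    uint32_t nbytes = end->byte_ctr - start->byte_ctr; /* wrap-safe */
    uint16_t nevents = demo_event_gap(end->event_ctr, start->event_ctr);

    if (delta_us == 0)
        return -1;

    out->ppms = (int)((uint64_t)npkts * 1000 / delta_us);
    out->bpms = (int)((uint64_t)nbytes * 1000 / delta_us);
    out->epms = (int)((uint64_t)nevents * 1000 / delta_us);
    return 0;
}
```

Because the counters are unsigned, subtracting the start reading from the
end reading gives the right delta even when a counter has wrapped once
between the two samples.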

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h   |  6 --
 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 22 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  5 -
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
index f5f6535..b676a057 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
@@ -72,8 +72,10 @@ enum {
MLX5_CQ_PERIOD_NUM_MODES
 };
 
-struct mlx5e_rq;
-void mlx5e_rx_am(struct mlx5e_rq *rq);
+void mlx5e_rx_am(struct mlx5e_rx_am *am,
+u16 event_ctr,
+u64 packets,
+u64 bytes);
 void mlx5e_rx_am_work(struct work_struct *work);
 struct mlx5e_cq_moder mlx5e_am_get_def_profile(u8 rx_cq_period_mode);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
index e401d9d..1630076 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
@@ -264,13 +264,15 @@ static bool mlx5e_am_decision(struct mlx5e_rx_am_stats 
*curr_stats,
return am->profile_ix != prev_ix;
 }
 
-static void mlx5e_am_sample(struct mlx5e_rq *rq,
+static void mlx5e_am_sample(u16 event_ctr,
+   u64 packets,
+   u64 bytes,
struct mlx5e_rx_am_sample *s)
 {
s->time  = ktime_get();
-   s->pkt_ctr   = rq->stats.packets;
-   s->byte_ctr  = rq->stats.bytes;
-   s->event_ctr = rq->cq.event_ctr;
+   s->pkt_ctr   = packets;
+   s->byte_ctr  = bytes;
+   s->event_ctr = event_ctr;
 }
 
 #define MLX5E_AM_NEVENTS 64
@@ -309,20 +311,22 @@ void mlx5e_rx_am_work(struct work_struct *work)
am->state = MLX5E_AM_START_MEASURE;
 }
 
-void mlx5e_rx_am(struct mlx5e_rq *rq)
+void mlx5e_rx_am(struct mlx5e_rx_am *am,
+u16 event_ctr,
+u64 packets,
+u64 bytes)
 {
-   struct mlx5e_rx_am *am = &rq->am;
struct mlx5e_rx_am_sample end_sample;
struct mlx5e_rx_am_stats curr_stats;
u16 nevents;
 
switch (am->state) {
case MLX5E_AM_MEASURE_IN_PROGRESS:
-   nevents = BIT_GAP(BITS_PER_TYPE(u16), rq->cq.event_ctr,
+   nevents = BIT_GAP(BITS_PER_TYPE(u16), event_ctr,
  am->start_sample.event_ctr);
if (nevents < MLX5E_AM_NEVENTS)
break;
-   mlx5e_am_sample(rq, &end_sample);
+   mlx5e_am_sample(event_ctr, packets, bytes, &end_sample);
mlx5e_am_calc_stats(&am->start_sample, &end_sample,
&curr_stats);
if (mlx5e_am_decision(&curr_stats, am)) {
@@ -332,7 +336,7 @@ void mlx5e_rx_am(struct mlx5e_rq *rq)
}
/* fall through */
case MLX5E_AM_START_MEASURE:
-   mlx5e_am_sample(rq, &am->start_sample);
+   mlx5e_am_sample(event_ctr, packets, bytes, &am->start_sample);
am->state = MLX5E_AM_MEASURE_IN_PROGRESS;
break;
case MLX5E_AM_APPLY_NEW_PROFILE:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index ab92298..1849169 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -79,7 +79,10 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
mlx5e_cq_arm(&c->sq[i].cq);
 
if (MLX5E_TEST_BIT(c->rq.state, MLX5E_RQ_STATE_AM))
-   mlx5e_rx_am(&c->rq);
+   mlx5e_rx_am(&c->rq.am,
+   c->rq.cq.event_ctr,
+   c->rq.stats.packets,
+   c->rq.stats.bytes);
 
mlx5e_cq_arm(&c->rq.cq);
mlx5e_cq_arm(&c->icosq.cq);
-- 
2.7.4

[PATCH net-next v2 05/10] net/mlx5e: Move generic functions to new file

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

These functions were identified as ones that could be made generic and
used by multiple drivers.  Most of the contents of en_rx_am.c are moved
to net_dim.c.

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  48 
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h   | 108 ---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 320 -
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.c  | 307 
 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h  | 109 +++
 7 files changed, 467 insertions(+), 431 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/net_dim.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 19b21b4..b46b6de2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -14,8 +14,8 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o 
fpga/conn.o fpga/sdk.o \
fpga/ipsec.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o 
\
-   en_tx.o en_rx.o en_rx_am.o en_txrx.o en_stats.o vxlan.o \
-   en_arfs.o en_fs_ethtool.o en_selftest.o
+   en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
+   en_arfs.o en_fs_ethtool.o en_selftest.o net_dim.o
 
 mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2ccedf6..121f280 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -49,7 +49,7 @@
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
-#include "en_dim.h"
+#include "net_dim.h"
 
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
new file mode 100644
index 000..b9b434b
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "en.h"
+
+void mlx5e_rx_am_work(struct work_struct *work)
+{
+   struct mlx5e_rx_am *am = container_of(work, struct mlx5e_rx_am,
+ work);
+   struct mlx5e_rq *rq = container_of(am, struct mlx5e_rq, am);
+   struct mlx5e_cq_moder cur_profile = mlx5e_am_get_profile(am->mode,
+                                                            am->profile_ix);
+
+   mlx5_core_modify_cq_moderation(rq->mdev, &rq->cq.mcq,
+  cur_profile.usec, cur_profile.pkts);
+
+   am->state = MLX5E_AM_START_MEASURE;
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
deleted file mode 100644
index c9f0d05..000
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
+++

[PATCH net-next v2 04/10] net/mlx5e: Move AM logic enums

2018-01-05 Thread Andy Gospodarek

From: Andy Gospodarek 

More movement to help make this code more generic.

Signed-off-by: Andy Gospodarek 
Acked-by: Tal Gilboa 
Acked-by: Saeed Mahameed 

---
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.h   | 26 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 25 -
 2 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
index b676a057..c9f0d05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.h
@@ -72,6 +72,32 @@ enum {
MLX5_CQ_PERIOD_NUM_MODES
 };
 
+/* Adaptive moderation logic */
+enum {
+   MLX5E_AM_START_MEASURE,
+   MLX5E_AM_MEASURE_IN_PROGRESS,
+   MLX5E_AM_APPLY_NEW_PROFILE,
+};
+
+enum {
+   MLX5E_AM_PARKING_ON_TOP,
+   MLX5E_AM_PARKING_TIRED,
+   MLX5E_AM_GOING_RIGHT,
+   MLX5E_AM_GOING_LEFT,
+};
+
+enum {
+   MLX5E_AM_STATS_WORSE,
+   MLX5E_AM_STATS_SAME,
+   MLX5E_AM_STATS_BETTER,
+};
+
+enum {
+   MLX5E_AM_STEPPED,
+   MLX5E_AM_TOO_TIRED,
+   MLX5E_AM_ON_EDGE,
+};
+
 void mlx5e_rx_am(struct mlx5e_rx_am *am,
 u16 event_ctr,
 u64 packets,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
index 1630076..337dd60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
@@ -82,31 +82,6 @@ struct mlx5e_cq_moder mlx5e_am_get_def_profile(u8 
rx_cq_period_mode)
return mlx5e_am_get_profile(rx_cq_period_mode, default_profile_ix);
 }
 
-/* Adaptive moderation logic */
-enum {
-   MLX5E_AM_START_MEASURE,
-   MLX5E_AM_MEASURE_IN_PROGRESS,
-   MLX5E_AM_APPLY_NEW_PROFILE,
-};
-
-enum {
-   MLX5E_AM_PARKING_ON_TOP,
-   MLX5E_AM_PARKING_TIRED,
-   MLX5E_AM_GOING_RIGHT,
-   MLX5E_AM_GOING_LEFT,
-};
-
-enum {
-   MLX5E_AM_STATS_WORSE,
-   MLX5E_AM_STATS_SAME,
-   MLX5E_AM_STATS_BETTER,
-};
-
-enum {
-   MLX5E_AM_STEPPED,
-   MLX5E_AM_TOO_TIRED,
-   MLX5E_AM_ON_EDGE,
-};
 
 static bool mlx5e_am_on_top(struct mlx5e_rx_am *am)
 {
-- 
2.7.4

Re: [PATCH 6/6] add test for aio poll and io_pgetevents

2018-01-05 Thread Jeff Moyer

Christoph Hellwig  writes:

> + p = fork();
> + switch (p) {
[snip]
> + default:
> + close(pipe1[0]);
> + close(pipe2[1]);
> +
> + io_prep_poll(, pipe2[0], POLLIN);
> +
> + ret = io_setup(1, &ctx);
> + if (ret) {
> + printf("child: io_setup failed\n");

parent

> + return 1;
> + }
> +
> + ret = io_submit(ctx, 1, iocbs);
> + if (ret != 1) {
> + printf("child: io_submit failed\n");

parent

Other than that, looks ok to me.  Thanks for writing a test!
I can fix this up, no need to repost.

-Jeff

Re: libaio: resurrect aio poll and add io_pgetevents support

2018-01-05 Thread Jeff Moyer

Christoph Hellwig  writes:

> Hi all,
>
> this series resurrects IOCB_CMD_POLL support and adds support for the
> new io_pgetevents system call, as well as adding a test case.

This looks good to me.  There may be a couple of changes to the syscall
bits, but I can take care of that.  I'll review the kernel bits more
thoroughly next week.

-Jeff

Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h

2018-01-05 Thread Jeff Moyer

Hi, Ben,

Thanks for the quick reply.

Benjamin LaHaise  writes:

> On Fri, Jan 05, 2018 at 11:25:17AM -0500, Jeff Moyer wrote:
>> Christoph Hellwig  writes:
>> 
>> > This way it can be used for the fallback 6-argument version on
>> > all architectures.
>> >
>> > Signed-off-by: Christoph Hellwig 
>> 
>> This is a strange way to do things.  However, I was never really sold on
>> libaio having to implement its own system call wrappers.  That decision
>> definitely resulted in some maintenance overhead.
>> 
>> Ben, what was your reasoning for not just using syscall?
>
> The main issue was that glibc's pthreads implementation really sucked back
> during initial development and there was a use-case for having the io_XXX
> functions usable directly from clone()ed threads that didn't have all the
> glibc pthread state setup for per-cpu areas to handle per-thread errno.
> That made sense back then, but is rather silly today.

Thanks for the background info.

> Technically, I'm not sure the generic syscall wrapper is safe to use.  The
> io_XXX arch wrappers don't modify errno, while it appears the generic one
> does.  That said, nobody has ever noticed...

Good point.  Common architectures don't use the generic syscall wrapper,
so I'm not sure we can conclude that it won't break anything.  At the
same time, I'm not sure I want to write and test the io_syscall6
assembly for all of the supported arches.  I could save and restore
errno.  That sounds ugly, but less painful than the other options.
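
A minimal sketch of the save-and-restore option, assuming Linux and
glibc's generic syscall() wrapper.  The helper name and its three-argument
signature are hypothetical and are not libaio code; the negative-errno
return value is an assumption about how the result would be reported to
the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue a system call through the generic wrapper,
 * report failure as a negative errno value in the return code, and put
 * errno back the way the caller left it.
 */
static long demo_syscall_keep_errno(long nr, long a0, long a1, long a2)
{
    int saved_errno = errno;
    long ret;

    assert(nr >= 0);
    ret = syscall(nr, a0, a1, a2);
    if (ret == -1)
        ret = -errno;       /* carry the failure in the return value */
    errno = saved_errno;    /* hide the wrapper's errno side effect */
    return ret;
}
```

With this shape the caller's errno is untouched whether the call succeeds
or fails, at the cost of one extra load and store around each call.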

Does anyone have any strong preferences?

-Jeff

[PATCH v4 2/3] qemu: virtio-net: use 64-bit values for feature flags

2018-01-05 Thread Jason Baron

In preparation for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.
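
To illustrate why the width matters: a mask built as "1 << bit" has type
int, so it cannot represent bits 32 and above, and a 32-bit flags field
would silently drop them.  The sketch below is a standalone illustration;
the DEMO_* constants and demo_* helpers are hypothetical names, not qemu
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_F_LOW    3   /* example low-order feature bit */
#define DEMO_F_HIGH  63   /* example high-order feature bit */

/* Build a single-bit mask with a 64-bit shift operand. */
static uint64_t demo_feature_mask(unsigned int bit)
{
    assert(bit < 64);
    return 1ULL << bit;
}

/* Test a feature bit in a 64-bit feature word. */
static bool demo_has_feature(uint64_t features, unsigned int bit)
{
    return (features & demo_feature_mask(bit)) != 0;
}
```

Storing such a mask in a uint32_t would truncate it to zero for any bit
above 31, which is why both the shift operand and the flags field have to
be widened together.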

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c| 55 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..54823af 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, 
Error **errp)
 int i;
 
 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
 virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,46 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
+VIRTIO_NET_F_CSUM, true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO4, true),
-DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO6, true),
-DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ECN, true),
-DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_UFO, true),
-DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO4, true),
-DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO6, true),
-DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_ECN, true),
-DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_UFO, true),
-DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
 VIRTIO_NET_F_MRG_RXBUF, true),
-DEFINE_PROP_BIT("status", VirtIONet, host_features,
+DEFINE_PROP_BIT64("status", VirtIONet, host_features,
 VIRTIO_NET_F_STATUS, true),
-DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VQ, true),
-DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_RX, true),
-DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VLAN, true),
-DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_RX_EXTRA, true),
-

[PATCH net-next v4 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2018-01-05 Thread Jason Baron

The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
still override it later, if desired, via 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63. The intention
is that device feature bits grow down from bit 63, since the
transport feature bits start at bit 24 and grow up.
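
The update rule can be sketched in userspace as follows.  This is an
illustration only: the demo_* names and DEMO_* constants are hypothetical,
and the legality checks follow the comments on the new config fields
(speed 0 to INT_MAX, duplex 0x00 or 0x01) rather than reproducing the
kernel's ethtool validation helpers, which may accept further values.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SPEED_UNKNOWN   UINT32_MAX
#define DEMO_DUPLEX_HALF     0x00
#define DEMO_DUPLEX_FULL     0x01
#define DEMO_DUPLEX_UNKNOWN  0xff

/* Hypothetical guest-side link state. */
struct demo_link {
    uint32_t speed;   /* in units of 1Mb */
    uint8_t duplex;
};

/* Speed values 0 to INT_MAX are legal; anything else means unknown. */
static int demo_speed_is_legal(uint32_t speed)
{
    return speed <= (uint32_t)INT_MAX;
}

/* Only half and full duplex are legal; anything else means unknown. */
static int demo_duplex_is_legal(uint8_t duplex)
{
    return duplex == DEMO_DUPLEX_HALF || duplex == DEMO_DUPLEX_FULL;
}

/* Overwrite each field only when the device supplied a legal value. */
static void demo_update_link(struct demo_link *link,
                             uint32_t cfg_speed, uint8_t cfg_duplex)
{
    assert(link != NULL);
    if (demo_speed_is_legal(cfg_speed))
        link->speed = cfg_speed;
    if (demo_duplex_is_legal(cfg_duplex))
        link->duplex = cfg_duplex;
}
```

Leaving a field untouched on an out-of-range value means a guest setting
made earlier is not clobbered by a device that reports "unknown".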

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtio-...@lists.oasis-open.org
---
 drivers/net/virtio_net.c| 23 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b65..4f27508 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1894,6 +1894,24 @@ static void virtnet_init_settings(struct net_device *dev)
vi->duplex = DUPLEX_UNKNOWN;
 }
 
+static void virtnet_update_settings(struct virtnet_info *vi)
+{
+   u32 speed;
+   u8 duplex;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
+   return;
+
+   speed = virtio_cread32(vi->vdev, offsetof(struct virtio_net_config,
+ speed));
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+   duplex = virtio_cread8(vi->vdev, offsetof(struct virtio_net_config,
+ duplex));
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+}
+
 static const struct ethtool_ops virtnet_ethtool_ops = {
.get_drvinfo = virtnet_get_drvinfo,
.get_link = ethtool_op_get_link,
@@ -2147,6 +2165,7 @@ static void virtnet_config_changed_work(struct 
work_struct *work)
vi->status = v;
 
if (vi->status & VIRTIO_NET_S_LINK_UP) {
+   virtnet_update_settings(vi);
netif_carrier_on(vi->dev);
netif_tx_wake_all_queues(vi->dev);
} else {
@@ -2695,6 +2714,7 @@ static int virtnet_probe(struct virtio_device *vdev)
schedule_work(&vi->config_work);
} else {
vi->status = VIRTIO_NET_S_LINK_UP;
+   virtnet_update_settings(vi);
netif_carrier_on(dev);
}
 
@@ -2796,7 +2816,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
-   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..5de6ed3 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_config {
__u16 max_virtqueue_pairs;
/* Default maximum transmit unit advice */
__u16 mtu;
+   /*
+* speed, in units of 1Mb. All values 0 to INT_MAX are legal.
+* Any other value stands for unknown.
+*/
+   __u32 speed;
+   /*
+* 0x00 - half duplex
+* 0x01 - full duplex
+* Any other value stands for unknown.
+*/
+   __u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1

[PATCH v4 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting

2018-01-05 Thread Jason Baron

We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

The ability to set linkspeed and duplex is useful in various cases
as described here:

16032be virtio_net: add ethtool support for set and get of settings

With this patch, 'ethtool -s' can still be used to overwrite the
linkspeed/duplex settings.

Patch 1/3 is against net-next, while patches 2/3 and 3/3 are the associated
qemu changes. The qemu patches would go in afterwards, since
update-linux-headers.sh should be run first. They are included here as a
demonstration of how I intend this to work.
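
As a rough illustration of the encoding described in this series (speed in
units of 1Mb with 0 to INT_MAX legal, duplex 0x00 or 0x01, anything else
unknown), a consumer of the config fields could classify them as below. This
is only a sketch: the demo_* names are hypothetical and are not taken from
the driver.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Duplex encodings as documented in the virtio_net_config comment. */
#define DEMO_DUPLEX_HALF 0x00
#define DEMO_DUPLEX_FULL 0x01

/* Hypothetical helper: a speed (in Mb/s) is known only if it is 0..INT_MAX. */
static bool demo_speed_is_known(uint32_t speed)
{
    return speed <= (uint32_t)INT_MAX;
}

/* Hypothetical helper: only 0x00 (half) and 0x01 (full) are defined values. */
static bool demo_duplex_is_known(uint8_t duplex)
{
    return duplex == DEMO_DUPLEX_HALF || duplex == DEMO_DUPLEX_FULL;
}
```

With these helpers, a speed of 0xffffffff (that is, -1 stored in the __u32
field) and a duplex of 0xff are both treated as unknown.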

Thanks,

-Jason

linux changes:

changes from v3:
* break the speed/duplex read into a function and also call from virtnet_probe
  when status bit is not negotiated
* only do speed/duplex read in virtnet_config_changed_work() on LINK_UP

changes from v2:
* move speed/duplex read into virtnet_config_changed_work() so link up changes
  are detected

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c| 23 ++-
 include/uapi/linux/virtio_net.h | 13 +
 2 files changed, 35 insertions(+), 1 deletion(-)


qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c | 87 -
 include/hw/virtio/virtio-net.h  |  5 +-
 include/standard-headers/linux/virtio_net.h | 13 +
 3 files changed, 77 insertions(+), 28 deletions(-)
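
For context on why the first qemu patch widens the feature flags: the new
feature bit is number 63, which does not fit in a 32-bit mask. The following
is a minimal sketch of that point, with hypothetical demo_* names that are
not part of qemu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit number as defined for VIRTIO_NET_F_SPEED_DUPLEX in this series. */
#define DEMO_F_SPEED_DUPLEX 63

/* Build a 64-bit mask for a feature bit; a plain int shift would overflow. */
static uint64_t demo_feature_mask(unsigned int bit)
{
    assert(bit < 64);
    return UINT64_C(1) << bit;
}

/* Test whether a feature bit is set in a 64-bit feature word. */
static bool demo_has_feature(uint64_t features, unsigned int bit)
{
    return (features & demo_feature_mask(bit)) != 0;
}
```

Truncating the mask for bit 63 to 32 bits yields zero, so a 32-bit flags
field could never represent this feature.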

-- 
2.6.1

[PATCH v4 3/3] qemu: add linkspeed and duplex settings to virtio-net

2018-01-05 Thread Jason Baron

Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
overwrite it later, if desired, via 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=1,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].
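
The property handling described above can be sketched in isolation as
follows. This is a simplified illustration and not the code in the diff
below: the demo_* names are hypothetical, and unlike the patch the helpers
report an error through their return value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_DUPLEX_UNKNOWN 0xff
#define DEMO_DUPLEX_HALF    0x00
#define DEMO_DUPLEX_FULL    0x01
#define DEMO_SPEED_UNKNOWN  (-1)

/*
 * Hypothetical helper: map the "duplex" property string to its wire value.
 * A NULL string means the property was not given. Returns 0 on success and
 * -1 if the string is neither "half" nor "full".
 */
static int demo_parse_duplex(const char *str, uint8_t *duplex)
{
    if (str == NULL) {
        *duplex = DEMO_DUPLEX_UNKNOWN;
        return 0;
    }
    if (strcmp(str, "half") == 0) {
        *duplex = DEMO_DUPLEX_HALF;
        return 0;
    }
    if (strcmp(str, "full") == 0) {
        *duplex = DEMO_DUPLEX_FULL;
        return 0;
    }
    return -1;
}

/* Hypothetical helper: speed must lie in [-1, INT_MAX]; -1 means unknown. */
static int demo_check_speed(int32_t speed)
{
    return speed >= DEMO_SPEED_UNKNOWN ? 0 : -1;
}

/* The feature bit is offered when either property was set to a known value. */
static int demo_offer_feature(const char *duplex_str, int32_t speed)
{
    return duplex_str != NULL || speed >= 0;
}
```

For example, passing only a non-negative speed is enough for the feature to
be offered, with duplex left as unknown (0xff).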

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtio-...@lists.oasis-open.org
---
 hw/net/virtio-net.c | 32 +
 include/hw/virtio/virtio-net.h  |  3 +++
 include/standard-headers/linux/virtio_net.h | 13 
 3 files changed, 48 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 54823af..cd63659 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed */
+#define DUPLEX_UNKNOWN  0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN   -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };
 
@@ -89,6 +97,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
 memcpy(config, &netcfg, n->config_size);
 }
 
@@ -1941,6 +1951,26 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKNOWN) and "
+   "INT_MAX");
+} else if (n->net_conf.speed >= 0) {
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2161,6 +2191,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
 uint16_t rx_queue_size;
 uint16_t tx_queue_size;
 uint16_t mtu;
+int32_t speed;
+char *duplex_str;
+uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..17c8531 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,17 @@ struct virtio_net_config {
uint16_t max_virtqueue_pairs;
/* Default maximum transmit unit advice */
uint16_t mtu;
+   /*
+* speed, in units of 1Mb. All values 0 to INT_MAX are legal.
+* Any other value stands for unknown.
+

Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-05 Thread Tobias Hommel

On Fri, Jan 05, 2018 at 09:51:16PM +, Holger Hoffstätte wrote:
> On Fri, 05 Jan 2018 22:13:23 +0100, Tobias Hommel wrote:
> 
> > Hi,
> > 
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > either.
> > Does anyone have an idea what is happening here?
> 
> Try 4.14.12 because of:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.14.y=2d01ac8cc12b973668bf898b03bf9ffb12d83b83

Using tunnel mode here, not transport mode. Anyway, I tried it, the same
problem:
[  275.655170] BUG: unable to handle kernel NULL pointer dereference at 0020
[  275.663230] IP: xfrm_lookup+0x2a/0x7d0
[  275.666986] PGD 0 P4D 0 
[  275.669579] Oops:  [#1] SMP PTI
[  275.673097] Modules linked in:
[  275.676182] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.14.12 #1
[  275.682215] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
[  275.690013] task: 9b43fb0ed080 task.stack: b0af0009
[  275.695960] RIP: 0010:xfrm_lookup+0x2a/0x7d0
[  275.700256] RSP: 0018:9b43ffd83bd0 EFLAGS: 00010246
[  275.705528] RAX:  RBX: 8e074080 RCX: 
[  275.712710] RDX: 9b43ffd83c48 RSI:  RDI: 8e074080
[  275.719895] RBP: 8e074080 R08: 0002 R09: 
[  275.727071] R10: 0020 R11: 0020 R12: 9b43ffd83c48
[  275.734248] R13:  R14: 0002 R15: 9b43fb240078
[  275.741415] FS:  () GS:9b43ffd8() knlGS:
[  275.749527] CS:  0010 DS:  ES:  CR0: 80050033
[  275.755307] CR2: 0020 CR3: 00013e00a000 CR4: 001006e0
[  275.762474] Call Trace:
[  275.764939]   
[  275.766986]  __xfrm_route_forward+0xa4/0x110
[  275.771282]  ip_forward+0x3da/0x450
[  275.774803]  ? ip_rcv_finish+0x61/0x390
[  275.778666]  ip_rcv+0x2b5/0x380
[  275.781840]  ? inet_del_offload+0x30/0x30
[  275.785860]  __netif_receive_skb_core+0x751/0xb00
[  275.790593]  ? tcp_gro_receive+0x24d/0x310
[  275.794716]  ? netif_receive_skb_internal+0x47/0xf0
[  275.799620]  netif_receive_skb_internal+0x47/0xf0
[  275.804381]  napi_gro_flush+0x50/0x70
[  275.808071]  napi_complete_done+0x90/0xd0
[  275.812111]  igb_poll+0x8fd/0xe80
[  275.815458]  net_rx_action+0x1fc/0x310
[  275.819227]  __do_softirq+0xd5/0x1cf
[  275.822834]  irq_exit+0xa3/0xb0
[  275.826003]  do_IRQ+0x45/0xc0
[  275.829004]  common_interrupt+0x95/0x95
[  275.832868]  
[  275.835002] RIP: 0010:cpuidle_enter_state+0x120/0x200
[  275.840076] RSP: 0018:b0af00093eb8 EFLAGS: 0282 ORIG_RAX: ff7d
[  275.847685] RAX: 9b43ffd9ea80 RBX: 0002 RCX: 00402e53956c
[  275.854844] RDX:  RSI: 36ca RDI: 
[  275.862039] RBP: 9b43ffda71e8 R08: 0003 R09: 0018
[  275.869222] R10:  R11: 0102 R12: 00402e53956c
[  275.876398] R13: 00402e4e8838 R14: 0002 R15: 
[  275.883568]  ? cpuidle_enter_state+0x11c/0x200
[  275.888023]  do_idle+0xd6/0x170
[  275.891177]  cpu_startup_entry+0x67/0x70
[  275.895129]  start_secondary+0x167/0x190
[  275.899080]  secondary_startup_64+0xa5/0xb0
[  275.903291] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 
[  275.922273] RIP: xfrm_lookup+0x2a/0x7d0 RSP: 9b43ffd83bd0
[  275.928070] CR2: 0020
[  275.931417] ---[ end trace 453df6e200be3ed0 ]---
[  275.936061] Kernel panic - not syncing: Fatal exception in interrupt
[  275.942566] Kernel Offset: 0xc00 from 0x8100 (relocation range: 0x8000-0xbfff)
[  275.953309] Rebooting in 10 seconds..

> 
> -h
>

[PATCH net-next 1/2] net: erspan: use bitfield instead of mask and offset

2018-01-05 Thread William Tu

Originally, the erspan fields are packed together into a __be16 field and
accessed through masks and offsets.  This is more costly because each access
calls ntohs/htons.  This patch changes the definitions to use bitfields.
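
For comparison, the mask-and-offset style being replaced can be sketched in
plain C as below. The offsets are the ones removed by this patch; the mask
values and the demo_* names are assumptions made for the example (chosen to
match the offsets), and the byte-wise store stands in for htons().

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as in the removed #defines; masks are assumed to match them. */
#define DEMO_COS_OFFSET 13
#define DEMO_EN_OFFSET  11
#define DEMO_T_OFFSET   10
#define DEMO_COS_MASK   0xe000u
#define DEMO_EN_MASK    0x1800u
#define DEMO_T_MASK     0x0400u
#define DEMO_ID_MASK    0x03ffu

/* Pack session id, cos, en and t into one 16-bit word (host order). */
static uint16_t demo_pack_session_word(uint16_t id, uint8_t cos,
                                       uint8_t en, uint8_t t)
{
    return (uint16_t)((id & DEMO_ID_MASK) |
                      (((unsigned int)cos << DEMO_COS_OFFSET) & DEMO_COS_MASK) |
                      (((unsigned int)en << DEMO_EN_OFFSET) & DEMO_EN_MASK) |
                      (((unsigned int)t << DEMO_T_OFFSET) & DEMO_T_MASK));
}

/* Store the word in network byte order, as htons() plus a copy would. */
static void demo_store_be16(uint8_t out[2], uint16_t word)
{
    out[0] = (uint8_t)(word >> 8);
    out[1] = (uint8_t)(word & 0xff);
}

/* Read the word back from the wire bytes and extract the session id. */
static uint16_t demo_unpack_session_id(const uint8_t in[2])
{
    uint16_t word = (uint16_t)((in[0] << 8) | in[1]);

    return word & DEMO_ID_MASK;
}
```

Every read or write of a single field goes through a byte swap plus a
mask/shift in this style, which is the overhead the bitfield layout avoids.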

Signed-off-by: William Tu 
---
 include/net/erspan.h | 127 ++-
 net/ipv4/ip_gre.c|  38 ++-
 net/ipv6/ip6_gre.c   |  36 ++-
 3 files changed, 121 insertions(+), 80 deletions(-)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index acdf6843095d..2b75821e2ebe 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -65,16 +65,30 @@
 #define GRA_MASK   0x0006
 #define O_MASK 0x0001
 
+#define HWID_OFFSET4
+#define DIR_OFFSET 3
+
 /* ERSPAN version 2 metadata header */
 struct erspan_md2 {
__be32 timestamp;
__be16 sgt; /* security group tag */
-   __be16 flags;
-#define P_OFFSET   15
-#define FT_OFFSET  10
-#define HWID_OFFSET4
-#define DIR_OFFSET 3
-#define GRA_OFFSET 1
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8hwid_upper:2,
+   ft:5,
+   p:1;
+   __u8o:1,
+   gra:2,
+   dir:1,
+   hwid:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8p:1,
+   ft:5,
+   hwid_upper:2;
+   __u8hwid:4,
+   dir:1,
+   gra:2,
+   o:1;
+#endif
 };
 
 enum erspan_encap_type {
@@ -95,15 +109,62 @@ struct erspan_metadata {
 };
 
 struct erspan_base_hdr {
-   __be16 ver_vlan;
-#define VER_OFFSET  12
-   __be16 session_id;
-#define COS_OFFSET  13
-#define EN_OFFSET   11
-#define BSO_OFFSET  EN_OFFSET
-#define T_OFFSET10
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+   __u8vlan_upper:4,
+   ver:4;
+   __u8vlan:8;
+   __u8session_id_upper:2,
+   t:1,
+   en:2,
+   cos:3;
+   __u8session_id:8;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u8ver: 4,
+   vlan_upper:4;
+   __u8vlan:8;
+   __u8cos:3,
+   en:2,
+   t:1,
+   session_id_upper:2;
+   __u8session_id:8;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
 };
 
+static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id)
+{
+   ershdr->session_id = id & 0xff;
+   ershdr->session_id_upper = (id >> 8) & 0x3;
+}
+
+static inline u16 get_session_id(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->session_id_upper << 8) + ershdr->session_id;
+}
+
+static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan)
+{
+   ershdr->vlan = vlan & 0xff;
+   ershdr->vlan_upper = (vlan >> 8) & 0xf;
+}
+
+static inline u16 get_vlan(const struct erspan_base_hdr *ershdr)
+{
+   return (ershdr->vlan_upper << 8) + ershdr->vlan;
+}
+
+static inline void set_hwid(struct erspan_md2 *md2, u8 hwid)
+{
+   md2->hwid = hwid & 0xf;
+   md2->hwid_upper = (hwid >> 4) & 0x3;
+}
+
+static inline u8 get_hwid(const struct erspan_md2 *md2)
+{
+   return (md2->hwid_upper << 4) + md2->hwid;
+}
+
 static inline int erspan_hdr_len(int version)
 {
return sizeof(struct erspan_base_hdr) +
@@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos)
 }
 
 static inline void erspan_build_header(struct sk_buff *skb,
-   __be32 id, u32 index,
+   u32 id, u32 index,
bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff *skb,
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
 
/* Build base header */
-   ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
-(ERSPAN_VERSION << VER_OFFSET));
-   ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
-  ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
-  (enc_type << EN_OFFSET & EN_MASK) |
-  ((truncate << T_OFFSET) & T_MASK));
+   ershdr->ver = ERSPAN_VERSION;
+   ershdr->cos = tos_to_cos(tos);
+   ershdr->en = enc_type;
+   ershdr->t = truncate;
+   set_vlan(ershdr, vlan_tci);
+   set_session_id(ershdr, id);
 
/* Build metadata */
ersmd = (struct erspan_metadata *)(ershdr + 1);
@@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void)
 }
 
 static inline void erspan_build_header_v2(struct sk_buff *skb,
- __be32 id, u8 direction, u16 hwid,
+ u32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
 {
struct ethhdr *eth = eth_hdr(skb);
@@ -198,7 +259,6 @@ static inline void erspan_build_header_v2(struct sk_buff *skb,
__be16 tci;

[PATCH net-next 2/2] openvswitch: add erspan version II support

2018-01-05 Thread William Tu

The patch adds support for configuring the erspan version II
fields for openvswitch.
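
The per-option length check that the nested attribute parsing relies on can
be sketched in userspace C as below. The enum mirrors the OVS_ERSPAN_OPT_*
values added by this patch, but the demo_* names are hypothetical, and the
sketch additionally rejects the UNSPEC type, which is a choice made here
rather than something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_ERSPAN_OPT_UNSPEC,
    DEMO_ERSPAN_OPT_IDX,   /* be32 index */
    DEMO_ERSPAN_OPT_VER,   /* u8 version number */
    DEMO_ERSPAN_OPT_DIR,   /* u8 direction */
    DEMO_ERSPAN_OPT_HWID,  /* u8 hardware ID */
    DEMO_ERSPAN_OPT_COUNT,
};

#define DEMO_ERSPAN_OPT_MAX (DEMO_ERSPAN_OPT_COUNT - 1)

/* Expected payload length for each option type. */
static const size_t demo_opt_lens[DEMO_ERSPAN_OPT_MAX + 1] = {
    [DEMO_ERSPAN_OPT_IDX]  = sizeof(uint32_t),
    [DEMO_ERSPAN_OPT_VER]  = sizeof(uint8_t),
    [DEMO_ERSPAN_OPT_DIR]  = sizeof(uint8_t),
    [DEMO_ERSPAN_OPT_HWID] = sizeof(uint8_t),
};

/* Return 0 if the option type is in range and its length matches. */
static int demo_check_opt(int type, size_t len)
{
    if (type <= DEMO_ERSPAN_OPT_UNSPEC || type > DEMO_ERSPAN_OPT_MAX)
        return -EINVAL;
    if (len != demo_opt_lens[type])
        return -EINVAL;
    return 0;
}
```

Checking the type before indexing the table is what keeps an out-of-range
option from reading past the end of the length array.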

Signed-off-by: William Tu 
---
 include/uapi/linux/openvswitch.h |  12 +++-
 net/openvswitch/flow_netlink.c   | 125 +++
 2 files changed, 126 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 4265d7f9e1f2..3b1950c59a0c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -273,6 +273,16 @@ enum {
 
 #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
 
+enum {
+   OVS_ERSPAN_OPT_UNSPEC,
+   OVS_ERSPAN_OPT_IDX, /* be32 index */
+   OVS_ERSPAN_OPT_VER, /* u8 version number */
+   OVS_ERSPAN_OPT_DIR, /* u8 direction */
+   OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */
+   __OVS_ERSPAN_OPT_MAX,
+};
+
+#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1)
 
 /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
  */
@@ -363,7 +373,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 address. */
OVS_TUNNEL_KEY_ATTR_PAD,
-   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
+   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index bce1f78b0de5..696198cf3765 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -334,8 +334,10 @@ size_t ovs_tun_key_attr_size(void)
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
-   + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_DST */
-   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
+   + nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
+   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with
+* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
+*/
 }
 
 static size_t ovs_nsh_key_attr_size(void)
@@ -386,6 +388,13 @@ static const struct ovs_len_tbl ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] =
[OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
 };
 
+static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = {
+   [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },
+   [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) },
+   [OVS_ERSPAN_OPT_HWID]   = { .len = sizeof(u8) },
+};
+
static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -402,7 +411,8 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens },
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) },
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) },
-   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
+   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
+   .next = ovs_erspan_opt_lens },
 };
 
 static const struct ovs_len_tbl
@@ -640,16 +650,78 @@ static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
 {
unsigned long opt_key_offset;
struct erspan_metadata opts;
+   struct nlattr *a;
+   u16 hwid, dir;
+   int rem;
 
BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
 
memset(&opts, 0, sizeof(opts));
-   opts.u.index = nla_get_be32(attr);
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
 
-   /* Index has only 20-bit */
-   if (ntohl(opts.u.index) & ~INDEX_MASK) {
-   OVS_NLERR(log, "ERSPAN index number %x too large.",
- ntohl(opts.u.index));
+   if (type > OVS_ERSPAN_OPT_MAX) {
+   OVS_NLERR(log, "ERSPAN option %d out of range max %d",
+ type, OVS_ERSPAN_OPT_MAX);
+   return -EINVAL;
+   }
+
+   if (!check_attr_len(nla_len(a),
+   ovs_erspan_opt_lens[type].len)) {
+   OVS_NLERR(log, "ERSPAN option %d has unexpected len %d expected %d",
+ type, nla_len(a),
+ ovs_erspan_opt_lens[type].len);
+   return -EINVAL;
+   }
+
+   switch (type) {
+   case OVS_ERSPAN_OPT_IDX:
+

[PATCH net-next 0/2] net: erspan: add support for openvswitch

2018-01-05 Thread William Tu

The first patch refactors the original erspan header definitions.
Originally, the erspan fields are packed together into a __be16 field and
accessed through masks and offsets.  This is confusing and more costly,
because each access calls ntohs/htons.  The first patch changes the
definitions to use bitfields.  The second patch adds erspan support to
openvswitch.
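
To illustrate the bitfield approach, here is a minimal sketch of how a
10-bit session ID can be split across a 2-bit and an 8-bit field and put
back together, in the spirit of the set_session_id()/get_session_id()
helpers from patch 1. The struct is a simplified, hypothetical stand-in for
the real header and makes no claim about its wire layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified header: only the 10-bit session ID is modelled. */
struct demo_base_hdr {
    unsigned int session_id_upper:2; /* top 2 bits of the ID */
    unsigned int session_id:8;       /* low 8 bits of the ID */
};

/* Split the ID into its two fields; bits above bit 9 are dropped. */
static void demo_set_session_id(struct demo_base_hdr *hdr, uint16_t id)
{
    hdr->session_id = id & 0xff;
    hdr->session_id_upper = (id >> 8) & 0x3;
}

/* Reassemble the 10-bit ID from the two fields. */
static uint16_t demo_get_session_id(const struct demo_base_hdr *hdr)
{
    return (uint16_t)((hdr->session_id_upper << 8) | hdr->session_id);
}
```

Setting an ID and reading it back returns the same value for anything up to
0x3ff; larger values are truncated to their low 10 bits.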

William Tu (2):
  net: erspan: use bitfield instead of mask and offset
  openvswitch: add erspan version II support

 include/net/erspan.h | 127 +--
 include/uapi/linux/openvswitch.h |  12 +++-
 net/ipv4/ip_gre.c|  38 +---
 net/ipv6/ip6_gre.c   |  36 ---
 net/openvswitch/flow_netlink.c   | 125 +++---
 5 files changed, 247 insertions(+), 91 deletions(-)

-- 
2.7.4

Re: INFO: rcu detected stall in do_softirq

2018-01-05 Thread Dmitry Vyukov

On Fri, Jan 5, 2018 at 11:10 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 8a4816cad00bf14642f0ed6043b32d29a05006ce
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> Unfortunately, I don't have any reproducer for this bug yet.
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6a74dabfc3393d3e5...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

Looks like a hang in xfrm, so +xfrm maintainers.

> Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
> INFO: rcu_sched detected stalls on CPUs/tasks:
> (detected by 0, t=125007 jiffies, g=66299, c=66298, q=40)
> All QSes seen, last rcu_sched kthread activity 125014
> (4294991138-4294866124), jiffies_till_next_fqs=3, root ->qsmask 0x0
> syz-executor5   R  running task22056 22277   3688 0x0008
> Call Trace:
>  
>  sched_show_task+0x4a3/0x5e0 kernel/sched/core.c:5198
>  print_other_cpu_stall+0x996/0x1090 kernel/rcu/tree.c:1564
>  check_cpu_stall.isra.61+0x6e6/0x15b0 kernel/rcu/tree.c:1682
>  __rcu_pending kernel/rcu/tree.c:3440 [inline]
>  rcu_pending kernel/rcu/tree.c:3502 [inline]
>  rcu_check_callbacks+0x256/0xd00 kernel/rcu/tree.c:2842
>  update_process_times+0x30/0x60 kernel/time/timer.c:1630
>  tick_sched_handle+0x85/0x160 kernel/time/tick-sched.c:162
>  tick_sched_timer+0x42/0x120 kernel/time/tick-sched.c:1166
>  __run_hrtimer kernel/time/hrtimer.c:1211 [inline]
>  __hrtimer_run_queues+0x358/0xe20 kernel/time/hrtimer.c:1275
>  hrtimer_interrupt+0x1c2/0x5e0 kernel/time/hrtimer.c:1309
>  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
>  smp_apic_timer_interrupt+0x14a/0x700 arch/x86/kernel/apic/apic.c:1050
>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
> RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50
> RSP: 0018:8801db206738 EFLAGS: 0206 ORIG_RAX: ff11
> RAX: ed003a2965e3 RBX: 8801d14b2e40 RCX: 84c25899
> RDX: 0100 RSI: 8801c5f47f29 RDI: 79c8
> RBP: 8801db2067c8 R08: ed0038bea003 R09: ed0038bea003
> R10: 000b R11: ed0038bea002 R12: 8801c5f40580
> R13: 8801db206aa0 R14: 79d0 R15: dc00
>  __xfrm_decode_session+0x68/0x110 net/xfrm/xfrm_policy.c:2358
>  __xfrm_policy_check+0x18c/0x2350 net/xfrm/xfrm_policy.c:2393
>  __xfrm_policy_check2 include/net/xfrm.h:1170 [inline]
>  xfrm_policy_check include/net/xfrm.h:1175 [inline]
>  xfrm6_policy_check include/net/xfrm.h:1185 [inline]
>  rawv6_rcv+0x8f6/0x1200 net/ipv6/raw.c:424
>  ipv6_raw_deliver net/ipv6/raw.c:224 [inline]
>  raw6_local_deliver+0x819/0xa80 net/ipv6/raw.c:240
>  ip6_input_finish+0x3c7/0x17a0 net/ipv6/ip6_input.c:246
>  NF_HOOK include/linux/netfilter.h:250 [inline]
>  ip6_input+0xe9/0x560 net/ipv6/ip6_input.c:327
>  dst_input include/net/dst.h:449 [inline]
>  ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71
>  NF_HOOK include/linux/netfilter.h:250 [inline]
>  ipv6_rcv+0xf37/0x1fa0 net/ipv6/ip6_input.c:208
>  __netif_receive_skb_core+0x1a41/0x3460 net/core/dev.c:4499
>  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4564
>  process_backlog+0x203/0x740 net/core/dev.c:5244
>  napi_poll net/core/dev.c:5642 [inline]
>  net_rx_action+0x792/0x1910 net/core/dev.c:5708
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1115
>  
>  do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329
>  do_softirq kernel/softirq.c:177 [inline]
>  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
>  local_bh_enable include/linux/bottom_half.h:32 [inline]
>  rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline]
>  ip6_finish_output2+0xba0/0x23a0 net/ipv6/ip6_output.c:121
>  ip6_fragment+0x25f2/0x3470 net/ipv6/ip6_output.c:739
>  ip6_finish_output+0x6bb/0xaf0 net/ipv6/ip6_output.c:152
>  NF_HOOK_COND include/linux/netfilter.h:239 [inline]
>  ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171
>  dst_output include/net/dst.h:443 [inline]
>  ip6_local_out+0x95/0x160 net/ipv6/output_core.c:176
>  ip6_send_skb+0xa1/0x330 net/ipv6/ip6_output.c:1674
>  ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1694
>  rawv6_push_pending_frames net/ipv6/raw.c:616 [inline]
>  rawv6_sendmsg+0x2ee9/0x3e70 net/ipv6/raw.c:935
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:628 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:638
>  SYSC_sendto+0x361/0x5c0 net/socket.c:1719
>  SyS_sendto+0x40/0x50 net/socket.c:1687
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x452ac9
> RSP: 002b:7f042a550c58 EFLAGS: 0212 ORIG_RAX: 002c
> RAX: ffda RBX:

Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-05 Thread Holger Hoffstätte

On Fri, 05 Jan 2018 22:13:23 +0100, Tobias Hommel wrote:

> Hi,
> 
> I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> either.
> Does anyone have an idea what is happening here?

Try 4.14.12 because of:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.14.y=2d01ac8cc12b973668bf898b03bf9ffb12d83b83

-h

Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-05 Thread Tobias Hommel

On Sat, Jan 06, 2018 at 12:27:11AM +0300, Ozgur wrote:
> 
> 
> 06.01.2018, 00:20, "Tobias Hommel" :
> > Hi,
> 
> Hi Tobias,
> 
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > either.
> > Does anyone have an idea what is happening here?
> >
> > The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> > a VPN gateway running strongswan. There are several hundreds of IPSec
> > roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> > HTTP server.
> > During my tests these roadwarriors connect to the gateway, sometimes download a
> > large file from the HTTP server, disconnect and after a random delay repeat
> > these steps.
> >
> > Some observations I made:
> > * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
> >   * all affinities set to default ff is broken
> >   * setting affinity for all queues of both interfaces to the same CPU seems to
> >     work fine (running stable for more than 1 day now)
> >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
> >     2 is broken and seems to always trigger the bug on CPU 1
> > * the top 6 entries of the call trace are the same every time the system
> >   crashes, the other entries differ sometimes
> >
> > The bug is 100% reproducible on the Intel Atom machine from the log below and
> > also on a HP ProLiant Gen6 (also igb driver).
> > I can, of course, provide further information (CPU, NIC, kernel config, more
> > traces, etc.) if required.
> > If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> > (tg3).
> >
> > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 
> > 0020
> > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> > [ 7998.500759] PGD 0 P4D 0
> > [ 7998.503316] Oops:  [#1] SMP PTI
> > [ 7998.506835] Modules linked in:
> > [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> > [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 
> > 1.01 07/11/2016
> > [ 7998.524039] task: 8826bb118000 task.stack: 947ac00f
> > [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> > [ 7998.534298] RSP: 0018:947ac00f3b60 EFLAGS: 00010246
> > [ 7998.539550] RAX:  RBX: 93074040 RCX: 
> > 
> > [ 7998.546709] RDX: 947ac00f3bd8 RSI:  RDI: 
> > 93074040
> > [ 7998.553868] RBP: 93074040 R08: 0002 R09: 
> > 0001
> > [ 7998.561026] R10: 0032 R11:  R12: 
> > 947ac00f3bd8
> > [ 7998.568212] R13:  R14: 0002 R15: 
> > 8826b69a8078
> > [ 7998.575395] FS: () GS:8826bfc8() 
> > knlGS:
> > [ 7998.583550] CS: 0010 DS:  ES:  CR0: 80050033
> > [ 7998.589324] CR2: 0020 CR3: 0001781da000 CR4: 
> > 001006e0
> > [ 7998.596482] Call Trace:
> > [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> > [ 7998.603263] ip_forward+0x3e0/0x450
> > [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> > [ 7998.610645] ip_rcv+0x2c4/0x390
> > [ 7998.613818] ? inet_del_offload+0x30/0x30
> > [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> > [ 7998.622562] ? skb_send_sock+0x40/0x40
> > [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> > [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> > [ 7998.635987] napi_gro_receive+0x70/0x90
> > [ 7998.639835] gro_cell_poll+0x53/0x90
> > [ 7998.643439] net_rx_action+0x1fc/0x310
> > [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> > [ 7998.651500] __do_softirq+0xd5/0x1cf
> > [ 7998.655105] run_ksoftirqd+0x14/0x30
> > [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> > [ 7998.662723] kthread+0xef/0x130
> > [ 7998.665893] ? sort_range+0x20/0x20
> > [ 7998.669404] ? kthread_park+0x60/0x60
> > [ 7998.673098] ret_from_fork+0x1f/0x30
> > [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 
> > 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 
> > <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> > [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: 947ac00f3b60
> > [ 7998.701479] CR2: 0020
> > [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> > [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 7998.715918] Kernel Offset: 0x1100 from 0x8100 
> > (relocation range: 0x8000-0xbfff)
> 
> 
> This error doesn't look like it is specific to the latest kernel version; I
> think the problem is in the NIC driver.
> Which Ethernet card model are you using?
This is what lspci shows for both NICs:
# lspci -nns 00:14.0
00:14.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I354 
[8086:1f41] (rev 03)

I have

Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-05 Thread Ozgur



06.01.2018, 00:20, "Tobias Hommel" :
> Hi,

Hi Tobias,

> I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> either.
> Does anyone have an idea what is happening here?
>
> The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> a VPN gateway running strongswan. There are several hundreds of IPSec
> roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> HTTP server.
> During my tests these roadwarriors connect to the gateway, sometimes download a
> large file from the HTTP server, disconnect and after a random delay repeat
> these steps.
>
> Some observations I made:
> * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
>   * all affinities set to default ff is broken
>   * setting affinity for all queues of both interfaces to the same CPU seems to
>     work fine (running stable for more than 1 day now)
>   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
>     2 is broken and seems to always trigger the bug on CPU 1
> * the top 6 entries of the call trace are the same every time the system
>   crashes, the other entries differ sometimes
>
> The bug is 100% reproducible on the Intel Atom machine from the log below and
> also on a HP ProLiant Gen6 (also igb driver).
> I can, of course, provide further information (CPU, NIC, kernel config, more
> traces, etc.) if required.
> If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> (tg3).
> [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 
> 0020
> [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> [ 7998.500759] PGD 0 P4D 0
> [ 7998.503316] Oops:  [#1] SMP PTI
> [ 7998.506835] Modules linked in:
> [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 
> 07/11/2016
> [ 7998.524039] task: 8826bb118000 task.stack: 947ac00f
> [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> [ 7998.534298] RSP: 0018:947ac00f3b60 EFLAGS: 00010246
> [ 7998.539550] RAX:  RBX: 93074040 RCX: 
> 
> [ 7998.546709] RDX: 947ac00f3bd8 RSI:  RDI: 
> 93074040
> [ 7998.553868] RBP: 93074040 R08: 0002 R09: 
> 0001
> [ 7998.561026] R10: 0032 R11:  R12: 
> 947ac00f3bd8
> [ 7998.568212] R13:  R14: 0002 R15: 
> 8826b69a8078
> [ 7998.575395] FS: () GS:8826bfc8() 
> knlGS:
> [ 7998.583550] CS: 0010 DS:  ES:  CR0: 80050033
> [ 7998.589324] CR2: 0020 CR3: 0001781da000 CR4: 
> 001006e0
> [ 7998.596482] Call Trace:
> [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> [ 7998.603263] ip_forward+0x3e0/0x450
> [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> [ 7998.610645] ip_rcv+0x2c4/0x390
> [ 7998.613818] ? inet_del_offload+0x30/0x30
> [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> [ 7998.622562] ? skb_send_sock+0x40/0x40
> [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> [ 7998.635987] napi_gro_receive+0x70/0x90
> [ 7998.639835] gro_cell_poll+0x53/0x90
> [ 7998.643439] net_rx_action+0x1fc/0x310
> [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> [ 7998.651500] __do_softirq+0xd5/0x1cf
> [ 7998.655105] run_ksoftirqd+0x14/0x30
> [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> [ 7998.662723] kthread+0xef/0x130
> [ 7998.665893] ? sort_range+0x20/0x20
> [ 7998.669404] ? kthread_park+0x60/0x60
> [ 7998.673098] ret_from_fork+0x1f/0x30
> [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 
> d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 
> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: 947ac00f3b60
> [ 7998.701479] CR2: 0020
> [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> [ 7998.715918] Kernel Offset: 0x1100 from 0x8100 (relocation 
> range: 0x8000-0xbfff)


This error doesn't look specific to the latest kernel version; I think the
problem is in the NIC driver.
Which Ethernet card model are you using?
And which driver version do you use?

> Best regards,
>
> Tobias Hommel

Ozgur

BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup

2018-01-05 Thread Tobias Hommel

Hi,

I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
either.
Does anyone have an idea what is happening here?

The affected machine has 2 active ethernet interfaces (igb driver) and acts as
a VPN gateway running strongswan. There are several hundreds of IPSec
roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
HTTP server.
During my tests these roadwarriors connect to the gateway, sometimes download a
large file from the HTTP server, disconnect and after a random delay repeat
these steps.

Some observations I made:
* SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
  * all affinities set to default ff is broken
  * setting affinity for all queues of both interfaces to the same CPU seems to
work fine (running stable for more than 1 day now)
  * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
2 is broken and seems to always trigger the bug on CPU 1
* the top 6 entries of the call trace are the same every time the system
  crashes, the other entries differ sometimes

The bug is 100% reproducible on the Intel Atom machine from the log below and
also on a HP ProLiant Gen6 (also igb driver).
I can, of course, provide further information (CPU, NIC, kernel config, more
traces, etc.) if required.
If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
(tg3).

[ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 
0020
[ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
[ 7998.500759] PGD 0 P4D 0 
[ 7998.503316] Oops:  [#1] SMP PTI
[ 7998.506835] Modules linked in:
[ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
[ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 
07/11/2016
[ 7998.524039] task: 8826bb118000 task.stack: 947ac00f
[ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
[ 7998.534298] RSP: 0018:947ac00f3b60 EFLAGS: 00010246
[ 7998.539550] RAX:  RBX: 93074040 RCX: 
[ 7998.546709] RDX: 947ac00f3bd8 RSI:  RDI: 93074040
[ 7998.553868] RBP: 93074040 R08: 0002 R09: 0001
[ 7998.561026] R10: 0032 R11:  R12: 947ac00f3bd8
[ 7998.568212] R13:  R14: 0002 R15: 8826b69a8078
[ 7998.575395] FS:  () GS:8826bfc8() 
knlGS:
[ 7998.583550] CS:  0010 DS:  ES:  CR0: 80050033
[ 7998.589324] CR2: 0020 CR3: 0001781da000 CR4: 001006e0
[ 7998.596482] Call Trace:
[ 7998.598959]  __xfrm_route_forward+0xa4/0x110
[ 7998.603263]  ip_forward+0x3e0/0x450
[ 7998.606778]  ? ip_rcv_finish+0x61/0x3a0
[ 7998.610645]  ip_rcv+0x2c4/0x390
[ 7998.613818]  ? inet_del_offload+0x30/0x30
[ 7998.617857]  __netif_receive_skb_core+0x751/0xb00
[ 7998.622562]  ? skb_send_sock+0x40/0x40
[ 7998.626356]  ? netif_receive_skb_internal+0x47/0xf0
[ 7998.631252]  netif_receive_skb_internal+0x47/0xf0
[ 7998.635987]  napi_gro_receive+0x70/0x90
[ 7998.639835]  gro_cell_poll+0x53/0x90
[ 7998.643439]  net_rx_action+0x1fc/0x310
[ 7998.647210]  ? rebalance_domains+0x101/0x2b0
[ 7998.651500]  __do_softirq+0xd5/0x1cf
[ 7998.655105]  run_ksoftirqd+0x14/0x30
[ 7998.658712]  smpboot_thread_fn+0xf9/0x150
[ 7998.662723]  kthread+0xef/0x130
[ 7998.665893]  ? sort_range+0x20/0x20
[ 7998.669404]  ? kthread_park+0x60/0x60
[ 7998.673098]  ret_from_fork+0x1f/0x30
[ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 
d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 
46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84 
[ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: 947ac00f3b60
[ 7998.701479] CR2: 0020
[ 7998.704799] ---[ end trace 0544b1946919baad ]---
[ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
[ 7998.715918] Kernel Offset: 0x1100 from 0x8100 (relocation 
range: 0x8000-0xbfff)

Best regards,

Tobias Hommel

[PATCH bpf-next] bpf: fix verifier GPF in kmalloc failure path

2018-01-05 Thread Alexei Starovoitov

syzbot reported the following panic in the verifier triggered
by kmalloc error injection:

kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
Call Trace:
 pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
 push_stack kernel/bpf/verifier.c:491 [inline]
 check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
 do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
 bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
 bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
 SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
 SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769

When copy_verifier_state() aborts in the middle due to a kmalloc failure,
some of the frames could have been partially copied, while the
current free_verifier_state() loop
for (i = 0; i <= state->curframe; i++)
assumes that all frames are non-NULL.
Fix it by adding 'if (!state)' to free_func_state().
Also, to avoid stressing the copy frame logic further, free
env->cur_state right away if kzalloc fails in push_stack().
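
The failure mode described above can be reproduced in miniature outside the
kernel. What follows is a user-space sketch, not the verifier code: the demo_*
names, DEMO_MAX_FRAMES and the fail_at parameter are all made up for
illustration. It shows a copy that stops partway and a free loop that stays
safe only because the per-frame free tolerates NULL.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_FRAMES 8 /* hypothetical limit, for illustration only */

struct demo_func_state {
	int *stack;
};

struct demo_verifier_state {
	struct demo_func_state *frame[DEMO_MAX_FRAMES];
	int curframe;
};

/* Tolerates NULL, like the 'if (!state)' check the patch adds. */
void demo_free_func_state(struct demo_func_state *state)
{
	if (!state)
		return;
	free(state->stack);
	free(state);
}

/* Walks every slot up to curframe; some may be NULL after a failed copy. */
void demo_free_verifier_state(struct demo_verifier_state *state)
{
	int i;

	for (i = 0; i <= state->curframe; i++) {
		demo_free_func_state(state->frame[i]);
		state->frame[i] = NULL;
	}
}

/*
 * Allocates frames 0..src->curframe in dst (frame contents are not copied
 * in this sketch). 'fail_at' simulates an allocation failure at that frame
 * index; pass -1 for no simulated failure.
 * Returns 0 on success, -1 on failure with dst left partially filled.
 */
int demo_copy_verifier_state(struct demo_verifier_state *dst,
			     const struct demo_verifier_state *src,
			     int fail_at)
{
	int i;

	assert(src->curframe >= 0 && src->curframe < DEMO_MAX_FRAMES);
	for (i = 0; i < DEMO_MAX_FRAMES; i++)
		dst->frame[i] = NULL;
	dst->curframe = src->curframe;

	for (i = 0; i <= src->curframe; i++) {
		if (i == fail_at)
			return -1; /* simulated allocation failure */
		dst->frame[i] = calloc(1, sizeof(*dst->frame[i]));
		if (!dst->frame[i])
			return -1;
	}
	return 0;
}
```

With curframe = 2 and fail_at = 1, frame 0 is allocated while frames 1 and 2
stay NULL, and demo_free_verifier_state() still walks all three slots. Without
the NULL check in demo_free_func_state() that walk would dereference NULL.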

Reported-by: syzbot+32ac5a3e473f2e01c...@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a2b211262c25..d921ab387b0b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -375,6 +375,8 @@ static int realloc_func_state(struct bpf_func_state *state, int size,
 
 static void free_func_state(struct bpf_func_state *state)
 {
+	if (!state)
+		return;
 	kfree(state->stack);
 	kfree(state);
 }
@@ -487,6 +489,8 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
 	}
 	return &elem->st;
 err:
+	free_verifier_state(env->cur_state, true);
+	env->cur_state = NULL;
 	/* pop all elements and return */
 	while (!pop_stack(env, NULL, NULL));
 	return NULL;
-- 
2.9.5

Re: [net-next 06/10] net/mlx5e: change Mellanox references in DIM code

2018-01-05 Thread Andy Gospodarek

On Fri, Jan 05, 2018 at 10:04:50AM +0200, Tal Gilboa wrote:
> On 1/4/2018 10:21 PM, Andy Gospodarek wrote:
> > From: Andy Gospodarek 
> > 
> > Change all mlx5_am* and MLX_AM* references to net_dim and NET_DIM,
> MLX_AM->MLX5_AM
> 
> > cq_period_mode = enable ?
> > -   MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
> > -   MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
> > +   NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE :
> > +   NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> I'm not sure about this part. CQE/EQE based moderation is a feature in
> Mellanox's chips, which isn't necessarily coupled with adaptive moderation.
> net_dim lib should know which values to choose according to the selected
> mode, but I don't think mlx5 driver should use an enum from net_dim for
> enabling/disabling HW features. Another issue is that we use the enum value
> as an argument for the command to HW (0=EQE, 1=CQE). If someone would change
> the values it would break the HW feature. I think it would be safer to use
> the NET_DIM_XXX enum only when using functions from net_dim lib.

[Please ignore my earlier response, I'm not sure I fully read/parsed what you
were saying.  Sorry about that.]

I like your suggestion, so I'm going to refactor this a bit based on that.  I
made all the other suggested changes, so this should be the last one.

> 
> > current_cq_period_mode = is_rx_cq ?
> > priv->channels.params.rx_cq_moderation.cq_period_mode :
> > priv->channels.params.tx_cq_moderation.cq_period_mode;
> > mode_changed = cq_period_mode != current_cq_period_mode;
> > -   if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE &&
> > +   if (cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE &&
> > !MLX5_CAP_GEN(mdev, cq_period_start_from_cqe))
> > return -EOPNOTSUPP;
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 3aa1c90..edd4077 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -674,8 +674,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
> > wqe->data.lkey = rq->mkey_be;
> > }
> > -   INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
> > -   rq->am.mode = params->rx_cq_moderation.cq_period_mode;
> > +   INIT_WORK(&rq->dim.work, mlx5e_rx_dim_work);
> > +   rq->dim.mode = params->rx_cq_moderation.cq_period_mode;
> > rq->page_cache.head = 0;
> > rq->page_cache.tail = 0;
> > @@ -919,7 +919,7 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
> > if (err)
> > goto err_destroy_rq;
> > -   if (params->rx_am_enabled)
> > +   if (params->rx_dim_enabled)
> > c->rq.state |= BIT(MLX5E_RQ_STATE_AM);
> > return 0;
> > @@ -952,7 +952,7 @@ static void mlx5e_deactivate_rq(struct mlx5e_rq *rq)
> >   static void mlx5e_close_rq(struct mlx5e_rq *rq)
> >   {
> > -   cancel_work_sync(&rq->am.work);
> > +   cancel_work_sync(&rq->dim.work);
> > mlx5e_destroy_rq(rq);
> > mlx5e_free_rx_descs(rq);
> > mlx5e_free_rq(rq);
> > @@ -1565,7 +1565,7 @@ static void mlx5e_destroy_cq(struct mlx5e_cq *cq)
> >   }
> >   static int mlx5e_open_cq(struct mlx5e_channel *c,
> > -struct mlx5e_cq_moder moder,
> > +struct net_dim_cq_moder moder,
> >  struct mlx5e_cq_param *param,
> >  struct mlx5e_cq *cq)
> >   {
> > @@ -1747,7 +1747,7 @@ static int mlx5e_open_channel(struct mlx5e_priv 
> > *priv, int ix,
> >   struct mlx5e_channel_param *cparam,
> >   struct mlx5e_channel **cp)
> >   {
> > -   struct mlx5e_cq_moder icocq_moder = {0, 0};
> > +   struct net_dim_cq_moder icocq_moder = {0, 0};
> > struct net_device *netdev = priv->netdev;
> > int cpu = mlx5e_get_cpu(priv, ix);
> > struct mlx5e_channel *c;
> > @@ -1999,7 +1999,7 @@ static void mlx5e_build_ico_cq_param(struct 
> > mlx5e_priv *priv,
> > mlx5e_build_common_cq_param(priv, param);
> > -   param->cq_period_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
> > +   param->cq_period_mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> >   }
> >   static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
> > @@ -4016,13 +4016,13 @@ void mlx5e_set_tx_cq_mode_params(struct 
> > mlx5e_params *params, u8 cq_period_mode)
> > params->tx_cq_moderation.usec =
> > MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC;
> > -   if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
> > +   if (cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
> > params->tx_cq_moderation.usec =
> > MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE;
> > MLX5E_SET_PFLAG(params, MLX5E_PFLAG_TX_CQE_BASED_MODER,
> > params->tx_cq_moderation.cq_period_mode ==
> > -   MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
> > +

[PATCH net-next v3 07/10] net: qualcomm: rmnet: Add support for RX checksum offload

2018-01-05 Thread Subash Abhinov Kasiviswanathan

When using the MAPv4 packet format, receive checksum offload can be
enabled in hardware. The checksum computation over pseudo header is
not offloaded but the rest of the checksum computation over
the payload is offloaded. This applies only for TCP / UDP packets
which are not fragmented.

rmnet validates the TCP/UDP checksum for the packet using the checksum
from the checksum trailer added to the packet by hardware. The
validation performed is as follows:

1. Perform 1's complement over the checksum value from the trailer.
2. Compute 1's complement checksum over IPv4 / IPv6 header and
   subtract it from the value from step 1.
3. Compute 1's complement checksum over IPv4 / IPv6 pseudo header and
   add it to the value from step 2.
4. Subtract the checksum value from the TCP / UDP header from the
   value from step 3.
5. Compare the value from step 4 to the checksum value from the
   TCP / UDP header.
6. If the comparison in step 5 succeeds, CHECKSUM_UNNECESSARY is set
   and the packet is passed on to the network stack. If there is a
   failure, then the packet is passed on as such without modifying
   the ip_summed field.

The checksum field is also checked for UDP checksum 0 as per RFC 768
and for unexpected TCP checksum of 0.

If checksum offload is disabled when using MAPv4 packet format in
receive path, the packet is queued as is to network stack without
the validations above.
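
The validation steps rely on 1's complement arithmetic, where subtracting a
value is the same as adding its bitwise complement. Below is a minimal
user-space sketch of that arithmetic. It is not the rmnet implementation, and
the demo_* helper names are invented for this example. It shows that removing
a header's sum from a sum taken over header plus payload leaves the payload's
sum, which is the idea behind subtracting the IP header checksum in step 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit 1's complement addition with end-around carry. */
uint16_t demo_csum16_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffff) + (s >> 16));
}

/* 1's complement subtraction: add the bitwise complement of b. */
uint16_t demo_csum16_sub(uint16_t a, uint16_t b)
{
	return demo_csum16_add(a, (uint16_t)~b);
}

/*
 * 1's complement sum of a buffer read as big-endian 16-bit words.
 * An odd trailing byte is padded with a zero low byte.
 */
uint16_t demo_ones_sum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len <= 65535); /* keeps the 32-bit accumulator from overflowing */

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(p[i] << 8 | p[i + 1]);
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

For example, with a 4-byte header 45 00 00 1c followed by the payload
12 34 ab cd, the sum over all 8 bytes is 0x031e, the header alone sums to
0x451c, and demo_csum16_sub(0x031e, 0x451c) gives 0xbe01, which equals the sum
over the payload alone.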

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  15 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|   4 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 186 -
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   2 +
 4 files changed, 201 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 8f8c4f2..3409458 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -66,8 +66,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
struct rmnet_port *port)
 {
struct rmnet_endpoint *ep;
+   u16 len, pad;
u8 mux_id;
-   u16 len;
 
if (RMNET_MAP_GET_CD_BIT(skb)) {
if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
@@ -77,7 +77,8 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
}
 
mux_id = RMNET_MAP_GET_MUX_ID(skb);
-   len = RMNET_MAP_GET_LENGTH(skb) - RMNET_MAP_GET_PAD(skb);
+   pad = RMNET_MAP_GET_PAD(skb);
+   len = RMNET_MAP_GET_LENGTH(skb) - pad;
 
if (mux_id >= RMNET_MAX_LOGICAL_EP)
goto free_skb;
@@ -90,8 +91,14 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
 
/* Subtract MAP header */
skb_pull(skb, sizeof(struct rmnet_map_header));
-   skb_trim(skb, len);
rmnet_set_skb_proto(skb);
+
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) {
+   if (!rmnet_map_checksum_downlink_packet(skb, len + pad))
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   }
+
+   skb_trim(skb, len);
rmnet_deliver_skb(skb);
return;
 
@@ -115,7 +122,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
}
 
if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
-   while ((skbn = rmnet_map_deaggregate(skb)) != NULL)
+   while ((skbn = rmnet_map_deaggregate(skb, port)) != NULL)
__rmnet_map_ingress_handler(skbn, port);
 
consume_skb(skb);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 50c50cd..ca9f473 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -83,9 +83,11 @@ struct rmnet_map_ul_csum_header {
 #define RMNET_MAP_NO_PAD_BYTES0
 #define RMNET_MAP_ADD_PAD_BYTES   1
 
-struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb);
+struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
+ struct rmnet_port *port);
 struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb,
  int hdrlen, int pad);
 void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port);
+int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len);
 
 #endif /* _RMNET_MAP_H_ */
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 978ce26..881c1dc 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -14,6 +14,9 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include "rmnet_config.h"
 #include "rmnet_map.h"
 #include "rmnet_private.h"
@@ -21,6

[PATCH net-next v3 02/10] net: qualcomm: rmnet: Remove invalid condition while stamping mux id

2018-01-05 Thread Subash Abhinov Kasiviswanathan

rmnet devices cannot have a mux id of 255. This is validated when
assigning the mux id to the rmnet devices. As a result, checking for
mux id 255 does not apply in egress path.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 0553932..b2d317e3 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -143,10 +143,7 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
if (!map_header)
goto fail;
 
-   if (mux_id == 0xff)
-   map_header->mux_id = 0;
-   else
-   map_header->mux_id = mux_id;
+   map_header->mux_id = mux_id;
 
skb->protocol = htons(ETH_P_MAP);
 
-- 
1.9.1

[PATCH net-next v3 08/10] net: qualcomm: rmnet: Handle command packets with checksum trailer

2018-01-05 Thread Subash Abhinov Kasiviswanathan

When using the MAPv4 packet format in conjunction with MAP commands,
a dummy DL checksum trailer will be appended to the packet. Before
this packet is sent out as an ACK, the DL checksum trailer needs to be
removed.
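
The length handling can be sketched in isolation as follows. This is an
illustration only, not the driver code: demo_trim_dl_trailer() is a made-up
name, the 8-byte trailer size is taken from the fields of
struct rmnet_map_dl_csum_trailer in this series, and the 4-byte MAP header
size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAP_HDR_LEN    4 /* assumed size of struct rmnet_map_header */
#define DEMO_DL_TRAILER_LEN 8 /* u8 + u8 + 3 * u16 in the DL csum trailer */

/*
 * Returns the packet length with the DL checksum trailer removed, or 0 if
 * the packet is too short to hold header + payload + trailer, in which case
 * the caller would drop the packet instead of sending the ACK.
 */
size_t demo_trim_dl_trailer(size_t pkt_len, size_t map_payload_len)
{
	assert(map_payload_len <= 0xffff); /* the MAP length field is 16 bits */

	if (pkt_len < DEMO_MAP_HDR_LEN + map_payload_len + DEMO_DL_TRAILER_LEN)
		return 0;

	return pkt_len - DEMO_DL_TRAILER_LEN;
}
```

A 28-byte command packet with a 16-byte MAP payload is trimmed to 20 bytes,
while a 27-byte packet with the same payload length is rejected.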

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 51e6049..6bc328f 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -58,11 +58,24 @@ static u8 rmnet_map_do_flow_control(struct sk_buff *skb,
 }
 
 static void rmnet_map_send_ack(struct sk_buff *skb,
-  unsigned char type)
+  unsigned char type,
+  struct rmnet_port *port)
 {
struct rmnet_map_control_command *cmd;
int xmit_status;
 
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_CKSUMV4) {
+   if (skb->len < sizeof(struct rmnet_map_header) +
+   RMNET_MAP_GET_LENGTH(skb) +
+   sizeof(struct rmnet_map_dl_csum_trailer)) {
+   kfree_skb(skb);
+   return;
+   }
+
+   skb_trim(skb, skb->len -
+sizeof(struct rmnet_map_dl_csum_trailer));
+   }
+
skb->protocol = htons(ETH_P_MAP);
 
cmd = RMNET_MAP_GET_CMD_START(skb);
@@ -100,5 +113,5 @@ void rmnet_map_command(struct sk_buff *skb, struct 
rmnet_port *port)
break;
}
if (rc == RMNET_MAP_COMMAND_ACK)
-   rmnet_map_send_ack(skb, rc);
+   rmnet_map_send_ack(skb, rc, port);
 }
-- 
1.9.1

[PATCH net-next v3 03/10] net: qualcomm: rmnet: Remove unused function declaration

2018-01-05 Thread Subash Abhinov Kasiviswanathan

rmnet_map_demultiplex() is only declared but not defined anywhere,
so remove it.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 4df359d..ef0eff2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -67,7 +67,6 @@ struct rmnet_map_header {
 #define RMNET_MAP_NO_PAD_BYTES0
 #define RMNET_MAP_ADD_PAD_BYTES   1
 
-u8 rmnet_map_demultiplex(struct sk_buff *skb);
 struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb);
 struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb,
  int hdrlen, int pad);
-- 
1.9.1

[PATCH net-next v3 04/10] net: qualcomm: rmnet: Rename ingress data format to data format

2018-01-05 Thread Subash Abhinov Kasiviswanathan

This is done so that we can use this field for both ingress and
egress flags.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c   | 10 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h   |  2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c |  5 ++---
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index cedacdd..7e7704d 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -143,7 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
 struct nlattr *tb[], struct nlattr *data[],
 struct netlink_ext_ack *extack)
 {
-   int ingress_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
+   u32 data_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
struct net_device *real_dev;
int mode = RMNET_EPMODE_VND;
struct rmnet_endpoint *ep;
@@ -185,11 +185,11 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
struct ifla_vlan_flags *flags;
 
flags = nla_data(data[IFLA_VLAN_FLAGS]);
-   ingress_format = flags->flags & flags->mask;
+   data_format = flags->flags & flags->mask;
}
 
-   netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
-   port->ingress_data_format = ingress_format;
+   netdev_dbg(dev, "data format [0x%08X]\n", data_format);
+   port->data_format = data_format;
 
return 0;
 
@@ -353,7 +353,7 @@ static int rmnet_changelink(struct net_device *dev, struct 
nlattr *tb[],
struct ifla_vlan_flags *flags;
 
flags = nla_data(data[IFLA_VLAN_FLAGS]);
-   port->ingress_data_format = flags->flags & flags->mask;
+   port->data_format = flags->flags & flags->mask;
}
 
return 0;
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 2ea9fe3..00e4634 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -32,7 +32,7 @@ struct rmnet_endpoint {
  */
 struct rmnet_port {
struct net_device *dev;
-   u32 ingress_data_format;
+   u32 data_format;
u8 nr_rmnet_devs;
u8 rmnet_mode;
struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP];
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index b2d317e3..8e1f43a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -69,8 +69,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
u16 len;
 
if (RMNET_MAP_GET_CD_BIT(skb)) {
-   if (port->ingress_data_format
-   & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
+   if (port->data_format & RMNET_INGRESS_FORMAT_MAP_COMMANDS)
return rmnet_map_command(skb, port);
 
goto free_skb;
@@ -114,7 +113,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
skb_push(skb, ETH_HLEN);
}
 
-   if (port->ingress_data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
+   if (port->data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
while ((skbn = rmnet_map_deaggregate(skb)) != NULL)
__rmnet_map_ingress_handler(skbn, port);
 
-- 
1.9.1

[PATCH net-next v3 10/10] net: qualcomm: rmnet: Add support for GSO

2018-01-05 Thread Subash Abhinov Kasiviswanathan

Real devices may support scatter-gather (SG), so enable SG on rmnet
devices to use GSO. GSO reduces CPU cycles by 20% for a rate of
146 Mbps for a single stream TCP connection.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index f7f57ce..570a227 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -190,6 +190,7 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev,
 
rmnet_dev->hw_features = NETIF_F_RXCSUM;
rmnet_dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+   rmnet_dev->hw_features |= NETIF_F_SG;
 
rc = register_netdevice(rmnet_dev);
if (!rc) {
-- 
1.9.1

[PATCH net-next v3 06/10] net: qualcomm: rmnet: Define the MAPv4 packet formats

2018-01-05 Thread Subash Abhinov Kasiviswanathan

The MAPv4 packet format adds support for RX / TX checksum offload.
For a bi-directional UDP stream at a rate of 570 / 146 Mbps, roughly
10% CPU cycles are saved.

For receive path, there is a checksum trailer appended to the end of
the MAP packet. The valid field indicates if hardware has computed
the checksum. csum_start_offset indicates the offset from the start
of the IP header from which hardware has computed checksum.
csum_length is the number of bytes over which the checksum was
computed and the resulting value is csum_value.

In the transmit path, a header is inserted between the end of the MAP
header and the start of the IP packet. csum_start_offset is the offset
in bytes from which hardware will compute the checksum if the
csum_enabled bit is set. udp_ip4_ind indicates if the checksum
value of 0 is valid or not. csum_insert_offset is the offset from the
csum_start_offset where hardware will insert the computed checksum.

The use of this additional packet format for checksum offload is
explained in subsequent patches.
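
As an illustration of the receive-side layout, here is a user-space sketch
that reads the last 8 bytes of a packet as the checksum trailer. Several
points are assumptions rather than facts from this patch: the valid flag is
taken as bit 0 of the second byte (the layout a little-endian compiler
typically gives the bitfield), and csum_start_offset and csum_length are read
as big-endian. Only csum_value is declared __be16 in the patch. The demo_*
names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DL_TRAILER_LEN 8

/* Host-order view of the trailer fields, for illustration only. */
struct demo_dl_csum_trailer {
	uint8_t  valid;
	uint16_t csum_start_offset;
	uint16_t csum_length;
	uint16_t csum_value;
};

/* Reads a 16-bit big-endian value from two bytes. */
static uint16_t demo_get_be16(const uint8_t *p)
{
	return (uint16_t)(p[0] << 8 | p[1]);
}

/*
 * Parses the trailer found at the end of pkt.
 * Returns 0 on success, -1 if the packet is shorter than the trailer.
 */
int demo_parse_dl_trailer(const uint8_t *pkt, size_t pkt_len,
			  struct demo_dl_csum_trailer *out)
{
	const uint8_t *t;

	assert(pkt && out);
	if (pkt_len < DEMO_DL_TRAILER_LEN)
		return -1;

	t = pkt + pkt_len - DEMO_DL_TRAILER_LEN;
	/* t[0] is reserved1 */
	out->valid = t[1] & 0x01;                    /* assumed bit position */
	out->csum_start_offset = demo_get_be16(t + 2); /* assumed byte order */
	out->csum_length = demo_get_be16(t + 4);       /* assumed byte order */
	out->csum_value = demo_get_be16(t + 6);        /* __be16 in the patch */
	return 0;
}
```

A receiver would only trust csum_value when valid is set, since the valid
field indicates whether hardware computed the checksum.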

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h     | 16 ++++++++++++++++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index ef0eff2..50c50cd 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -47,6 +47,22 @@ struct rmnet_map_header {
u16 pkt_len;
 }  __aligned(1);
 
+struct rmnet_map_dl_csum_trailer {
+   u8  reserved1;
+   u8  valid:1;
+   u8  reserved2:7;
+   u16 csum_start_offset;
+   u16 csum_length;
+   __be16 csum_value;
+} __aligned(1);
+
+struct rmnet_map_ul_csum_header {
+   __be16 csum_start_offset;
+   u16 csum_insert_offset:14;
+   u16 udp_ip4_ind:1;
+   u16 csum_enabled:1;
+} __aligned(1);
+
 #define RMNET_MAP_GET_MUX_ID(Y) (((struct rmnet_map_header *) \
 (Y)->data)->mux_id)
 #define RMNET_MAP_GET_CD_BIT(Y) (((struct rmnet_map_header *) \
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index d214280..de0143e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -21,6 +21,8 @@
 /* Constants */
 #define RMNET_INGRESS_FORMAT_DEAGGREGATION  BIT(0)
 #define RMNET_INGRESS_FORMAT_MAP_COMMANDS   BIT(1)
+#define RMNET_INGRESS_FORMAT_MAP_CKSUMV4BIT(2)
+#define RMNET_EGRESS_FORMAT_MAP_CKSUMV4 BIT(3)
 
 /* Replace skb->dev to a virtual rmnet device and pass up the stack */
 #define RMNET_EPMODE_VND (1)
-- 
1.9.1

[PATCH net-next v3 05/10] net: qualcomm: rmnet: Set pacing shift

2018-01-05 Thread Subash Abhinov Kasiviswanathan

The real device over which the rmnet devices are installed also
aggregates multiple IP packets and sends them as a single large
aggregate frame to the hardware. This causes degraded throughput
for TCP TX due to bufferbloat.

To overcome this problem, a pacing shift value of 8 is set using the
sk_pacing_shift_update() helper. This value was determined based
on experiments with a single stream TCP TX using iperf for a
duration of 30s.

Pacing shift | Observed data rate (Mbps)
  10 | 9
   9 | 140
   8 | 146 (Max link rate)
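
For intuition about what the shift controls, here is a small sketch. It
assumes that the amount of data a TCP socket may keep queued below the stack
is roughly the pacing rate right-shifted by the pacing shift, and it ignores
the other clamps the stack applies. The function name is invented, and the
link rate is used as a stand-in for the socket's pacing rate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate per-socket queue budget in bytes for a given pacing rate
 * (bytes per second) and pacing shift: rate >> shift.
 * Each decrement of the shift doubles the budget.
 */
uint64_t demo_queue_budget_bytes(uint64_t pacing_rate_bytes_per_sec,
				 unsigned int shift)
{
	assert(shift < 64);
	return pacing_rate_bytes_per_sec >> shift;
}
```

At 146 Mbps (18250000 bytes per second) a shift of 10 allows about 17822
bytes, roughly 1 ms of data, while a shift of 8 allows about 71289 bytes,
roughly 4 ms. Under the stated assumption, that larger budget is what lets
the aggregating real device be kept busy.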

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 8e1f43a..8f8c4f2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "rmnet_private.h"
 #include "rmnet_config.h"
 #include "rmnet_vnd.h"
@@ -204,6 +205,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
struct rmnet_priv *priv;
u8 mux_id;
 
+   sk_pacing_shift_update(skb->sk, 8);
+
orig_dev = skb->dev;
priv = netdev_priv(orig_dev);
skb->dev = priv->real_dev;
-- 
1.9.1

[PATCH net-next v3 09/10] net: qualcomm: rmnet: Add support for TX checksum offload

2018-01-05 Thread Subash Abhinov Kasiviswanathan

TX checksum offload, using the MAPv4 checksum header, applies to
TCP / UDP packets which are not fragmented. The following needs to be
done to have checksum computed in hardware:

1. Set the checksum start offset and insert offset.
2. Set the csum_enabled bit.
3. Compute and set 1's complement of partial checksum field in
   transport header.
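
The three steps can be sketched in user space as follows. This is an
illustration, not the driver code. It packs the 4-byte uplink checksum header
with explicit shifts instead of bitfields, on the assumption that the second
16-bit word carries csum_insert_offset in the low 14 bits, udp_ip4_ind in bit
14 and csum_enabled in bit 15 before it is converted to network order (the
layout a little-endian compiler typically gives the bitfields in
struct rmnet_map_ul_csum_header). The demo_* names are invented.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Steps 1 and 2: write the start offset, insert offset and flags into
 * out[0..3] in network byte order, with csum_enabled always set.
 */
void demo_pack_ul_csum_header(uint8_t out[4], uint16_t csum_start_offset,
			      uint16_t csum_insert_offset, int is_udp_ipv4)
{
	uint16_t word;

	assert(csum_insert_offset < (1u << 14)); /* field is 14 bits wide */

	out[0] = (uint8_t)(csum_start_offset >> 8);
	out[1] = (uint8_t)(csum_start_offset & 0xff);

	word = (uint16_t)(csum_insert_offset |
			  (is_udp_ipv4 ? 0x4000u : 0u) | /* udp_ip4_ind */
			  0x8000u);                      /* csum_enabled */
	out[2] = (uint8_t)(word >> 8);
	out[3] = (uint8_t)(word & 0xff);
}

/* Step 3: 1's complement of the partial checksum in the transport header. */
uint16_t demo_complement_csum(uint16_t partial)
{
	return (uint16_t)~partial;
}
```

For a TCP packet with a 20-byte IP header and the checksum field 16 bytes
into the TCP header, the packed header is 00 14 80 10. For UDP over IPv4 with
the checksum field at offset 6, the second word becomes c0 06.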

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |   8 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|   2 +
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 120 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   1 +
 4 files changed, 131 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 3409458..601edec 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -141,11 +141,19 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
additional_header_len = 0;
required_headroom = sizeof(struct rmnet_map_header);
 
+   if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4) {
+   additional_header_len = sizeof(struct rmnet_map_ul_csum_header);
+   required_headroom += additional_header_len;
+   }
+
if (skb_headroom(skb) < required_headroom) {
if (pskb_expand_head(skb, required_headroom, 0, GFP_KERNEL))
goto fail;
}
 
+   if (port->data_format & RMNET_EGRESS_FORMAT_MAP_CKSUMV4)
+   rmnet_map_checksum_uplink_packet(skb, orig_dev);
+
map_header = rmnet_map_add_map_header(skb, additional_header_len, 0);
if (!map_header)
goto fail;
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index ca9f473..6ce31e2 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -89,5 +89,7 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct 
sk_buff *skb,
  int hdrlen, int pad);
 void rmnet_map_command(struct sk_buff *skb, struct rmnet_port *port);
 int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len);
+void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
+ struct net_device *orig_dev);
 
 #endif /* _RMNET_MAP_H_ */
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 881c1dc..c74a6c5 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -171,6 +171,86 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char 
protocol,
 }
 #endif
 
+static void rmnet_map_complement_ipv4_txporthdr_csum_field(void *iphdr)
+{
+   struct iphdr *ip4h = (struct iphdr *)iphdr;
+   void *txphdr;
+   u16 *csum;
+
+   txphdr = iphdr + ip4h->ihl * 4;
+
+   if (ip4h->protocol == IPPROTO_TCP || ip4h->protocol == IPPROTO_UDP) {
+   csum = (u16 *)rmnet_map_get_csum_field(ip4h->protocol, txphdr);
+   *csum = ~(*csum);
+   }
+}
+
+static void
+rmnet_map_ipv4_ul_csum_header(void *iphdr,
+ struct rmnet_map_ul_csum_header *ul_header,
+ struct sk_buff *skb)
+{
+   struct iphdr *ip4h = (struct iphdr *)iphdr;
+   __be16 *hdr = (__be16 *)ul_header, offset;
+
+   offset = htons((__force u16)(skb_transport_header(skb) -
+(unsigned char *)iphdr));
+   ul_header->csum_start_offset = offset;
+   ul_header->csum_insert_offset = skb->csum_offset;
+   ul_header->csum_enabled = 1;
+   if (ip4h->protocol == IPPROTO_UDP)
+   ul_header->udp_ip4_ind = 1;
+   else
+   ul_header->udp_ip4_ind = 0;
+
+   /* Changing remaining fields to network order */
+   hdr++;
+   *hdr = htons((__force u16)*hdr);
+
+   skb->ip_summed = CHECKSUM_NONE;
+
+   rmnet_map_complement_ipv4_txporthdr_csum_field(iphdr);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void rmnet_map_complement_ipv6_txporthdr_csum_field(void *ip6hdr)
+{
+   struct ipv6hdr *ip6h = (struct ipv6hdr *)ip6hdr;
+   void *txphdr;
+   u16 *csum;
+
+   txphdr = ip6hdr + sizeof(struct ipv6hdr);
+
+   if (ip6h->nexthdr == IPPROTO_TCP || ip6h->nexthdr == IPPROTO_UDP) {
+   csum = (u16 *)rmnet_map_get_csum_field(ip6h->nexthdr, txphdr);
+   *csum = ~(*csum);
+   }
+}
+
+static void
+rmnet_map_ipv6_ul_csum_header(void *ip6hdr,
+ struct rmnet_map_ul_csum_header *ul_header,
+ struct sk_buff *skb)
+{
+   __be16 *hdr = (__be16 *)ul_header, offset;
+
+   offset = htons((__force

[PATCH net-next v3 00/10] net: qualcomm: rmnet: Enable csum offloads

2018-01-05 Thread Subash Abhinov Kasiviswanathan

This series introduces the MAPv4 packet format for checksum
offload plus some other minor changes.

Patches 1-3 are cleanups.

Patch 4 renames the ingress format to data format so that all data
formats can be configured using this going forward.

Patch 5 uses the pacing helper to improve TCP transmit performance.

Patches 6-9 define the MAPv4 format for checksum offload on RX and TX.
MAPv4 uses a new header and trailer format.
For RX checksum offload, hardware computes only the 1's complement
checksum of the IP payload. The metadata from the RX header is then
used to verify the checksum field in the packet. Hardware does not
modify the IP packet or any of its fields; it only supplies metadata
to help with RX checksum verification. For TX, the driver fills in
the required metadata so that hardware can compute the checksum.
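
To illustrate the RX idea described above, here is a minimal user-space
sketch in C. It is not the driver code from this series: the function
names ones_complement_sum() and payload_csum_ok() are hypothetical, and
it assumes the simplest case, where software combines a 1's complement
sum over the IP payload (as hardware would report it) with a
pseudo-header sum it computes itself. By the usual Internet checksum
property, a packet with a valid checksum field yields 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1's complement sum of a buffer, folded to 16 bits.
 * Bytes are read as big-endian 16-bit words; an odd trailing byte is
 * padded with a zero low byte. The 32-bit accumulator is sufficient for
 * buffers up to 64 KiB, which covers a single IP packet.
 */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold carries back into the low 16 bits (end-around carry). */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Hypothetical check: hw_payload_sum stands in for the value hardware
 * would report for the IP payload (checksum field included), and
 * pseudo_hdr_sum is the folded pseudo-header sum computed in software.
 * Returns nonzero when the combined sum is 0xffff, i.e. the checksum
 * field in the packet is consistent with the data.
 */
static int payload_csum_ok(uint16_t hw_payload_sum, uint16_t pseudo_hdr_sum)
{
	uint32_t sum = (uint32_t)hw_payload_sum + pseudo_hdr_sum;

	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return sum == 0xffff;
}
```

In this sketch the packet bytes are only read, never written, which
mirrors the point that the IP packet itself is left unmodified and the
verification relies on side metadata.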

Patch 10 enables GSO on rmnet devices.

v1->v2: Fix sparse errors reported by kbuild test robot

v2->v3: Update the commit message for Patch 5 based on Eric's comments

Subash Abhinov Kasiviswanathan (10):
  net: qualcomm: rmnet: Remove redundant check when stamping map header
  net: qualcomm: rmnet: Remove invalid condition while stamping mux id
  net: qualcomm: rmnet: Remove unused function declaration
  net: qualcomm: rmnet: Rename ingress data format to data format
  net: qualcomm: rmnet: Set pacing shift
  net: qualcomm: rmnet: Define the MAPv4 packet formats
  net: qualcomm: rmnet: Add support for RX checksum offload
  net: qualcomm: rmnet: Handle command packets with checksum trailer
  net: qualcomm: rmnet: Add support for TX checksum offload
  net: qualcomm: rmnet: Add support for GSO

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c |  10 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |   2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  36 ++-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  23 +-
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c|  17 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 309 -
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|   2 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|   4 +
 8 files changed, 378 insertions(+), 25 deletions(-)

-- 
1.9.1
