Re: [bpf-next PATCH v3 7/7] bpf: sockmap set rlimit

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:08:02PM -0800, John Fastabend wrote:
> Avoid extra step of setting limit from cmdline and do it directly in
> the program.
> 
> Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 
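For context, "do it directly in the program" in the BPF samples usually amounts to a few lines like the following near the top of main() (an illustrative sketch, not the patch body itself; needs <sys/resource.h>):

	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};

	/* raise RLIMIT_MEMLOCK so BPF maps/programs can be charged against it */
	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
		perror("setrlimit(RLIMIT_MEMLOCK)");
		return 1;
	}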



Re: [bpf-next PATCH v3 4/7] bpf: sockmap sample, report bytes/sec

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:07:10PM -0800, John Fastabend wrote:
> Report bytes/sec sent as well as total bytes. Useful to get rough
> idea how different configurations and usage patterns perform with
> sockmap.
> 
> Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 
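A typical way such a bytes/sec figure is computed, for readers unfamiliar with the sample (illustrative only; needs <time.h>, and the names mirror the msg_stats struct used later in this series):

	struct timespec start, end;
	float elapsed, mbps;

	clock_gettime(CLOCK_MONOTONIC, &start);
	/* ... run the sendmsg/recvmsg loop, accumulating s.bytes_sent ... */
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (end.tv_sec - start.tv_sec) +
		  (end.tv_nsec - start.tv_nsec) / 1e9;
	mbps = s.bytes_sent / elapsed / (1024 * 1024);
	fprintf(stdout, "TX_bytes %zu, %.2f MB/s\n", s.bytes_sent, mbps);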


Re: [bpf-next PATCH v3 3/7] bpf: sockmap sample, use fork() for send and recv

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:06:54PM -0800, John Fastabend wrote:
> Currently for SENDMSG tests first send completes then recv runs. This
> does not work well for large data sizes and/or many iterations. So
> fork the recv and send handler so that we run both send and recv. In
> the future we can add a parameter to do more than a single fork of
> tx/rx.
> 
> With this we can get many GBps of data which helps exercise the
> sockmap code.
> 
> Signed-off-by: John Fastabend 
One nit.

Acked-by: Martin KaFai Lau 

> ---
[ ... ]
> @@ -274,25 +275,50 @@ static int msg_loop(int fd, int iov_count, int 
> iov_length, int cnt,
>  
>  static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
>  {
> + int txpid, rxpid, err = 0;
>   struct msg_stats s = {0};
> - int err;
> -
> - err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
> - if (err) {
> - fprintf(stderr,
> - "msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
> - iov_count, iov_buf, cnt, err);
> - return err;
> + int status;
> +
> + errno = 0;
> +
> + rxpid = fork();
> + if (rxpid == 0) {
> + err = msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
> + if (err)
> + fprintf(stderr,
> + "msg_loop_rx: iov_count %i iov_buf %i cnt %i 
> err %i\n",
> + iov_count, iov_buf, cnt, err);
> + fprintf(stdout, "rx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
> + s.bytes_sent, s.bytes_recvd);
> + shutdown(p2, SHUT_RDWR);
> + shutdown(p1, SHUT_RDWR);
> + exit(1);
> + } else if (rxpid == -1) {
> + perror("msg_loop_rx: ");
> + return errno;
>   }
>  
> - msg_loop(p2, iov_count, iov_buf, cnt, &s, false);
> - if (err)
> - fprintf(stderr,
> - "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
> - iov_count, iov_buf, cnt, err);
> + txpid = fork();
> + if (txpid == 0) {
> + err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
> + if (err)
> + fprintf(stderr,
> + "msg_loop_tx: iov_count %i iov_buf %i cnt %i 
> err %i\n",
> + iov_count, iov_buf, cnt, err);
> + fprintf(stdout, "tx_sendmsg: TX_bytes %zu RX_bytes %zu\n",
> + s.bytes_sent, s.bytes_recvd);
> + shutdown(c1, SHUT_RDWR);
> + exit(1);
> + } else if (txpid == -1) {
> + perror("msg_loop_tx: ");
> + return errno;
> + }
>  
> - fprintf(stdout, "sendmsg: TX_bytes %zu RX_bytes %zu\n",
> - s.bytes_sent, s.bytes_recvd);
> + assert(waitpid(rxpid, &status, 0) == rxpid);
> + if (!txpid)
This case won't be hit?
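For reference, the three fork() outcomes look like this; once the rxpid == -1 / txpid == -1 errors are handled, the parent only ever sees a pid greater than zero, so the !txpid test is dead code in the parent (illustrative snippet, needs <unistd.h>):

	pid_t pid = fork();
	if (pid == 0) {
		/* child: runs its loop and exits inside this branch */
		_exit(0);
	} else if (pid == -1) {
		/* parent: fork() failed */
	} else {
		/* parent: pid is the child's pid, always > 0 here */
	}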

> + goto out;
> + assert(waitpid(txpid, &status, 0) == txpid);
> +out:
>   return err;
>  }
>  
> 


Re: [bpf-next PATCH v3 5/7] bpf: sockmap sample add base test without any BPF for comparison

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:07:29PM -0800, John Fastabend wrote:
> Add a base test that does not use BPF hooks to test baseline case.
> 
> Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 


Re: [bpf-next PATCH v3 6/7] bpf: sockmap put client sockets in blocking mode

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:07:45PM -0800, John Fastabend wrote:
> Put client sockets in blocking mode otherwise with sendmsg tests
> its easy to overrun the socket buffers which results in the test
> being aborted.
> 
> The original non-blocking was added to handle listen/accept with
> a single thread the client/accepted sockets do not need to be
> non-blocking.
> 
> Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 



Re: [bpf-next PATCH v3 2/7] bpf: add sendmsg option for testing BPF programs

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:06:34PM -0800, John Fastabend wrote:
> When testing BPF programs using sockmap I often want to have more
> control over how sendmsg is exercised. This becomes even more useful
> as new sockmap program types are added.
> 
> This adds a test type option to select type of test to run. Currently,
> only "ping" and "sendmsg" are supported, but more can be added as
> needed.
> 
> The new help argument gives the following,
> 
>  Usage: ./sockmap --cgroup 
>  options:
>  --help -h
>  --cgroup   -c
>  --rate -r
>  --verbose  -v
>  --iov_count-i
>  --length   -l
>  --test -t
> 
> Signed-off-by: John Fastabend 
> ---
>  samples/sockmap/sockmap_user.c |  147 
> +++-
>  1 file changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
> index 17400d4..8ec7dbf 100644
> --- a/samples/sockmap/sockmap_user.c
> +++ b/samples/sockmap/sockmap_user.c
> @@ -56,6 +56,9 @@
>   {"cgroup",  required_argument,  NULL, 'c' },
>   {"rate",required_argument,  NULL, 'r' },
>   {"verbose", no_argument,NULL, 'v' },
> + {"iov_count",   required_argument,  NULL, 'i' },
> + {"length",  required_argument,  NULL, 'l' },
> + {"test",required_argument,  NULL, 't' },
>   {0, 0, NULL, 0 }
>  };
>  
> @@ -182,6 +185,117 @@ static int sockmap_init_sockets(void)
>   return 0;
>  }
>  
> +struct msg_stats {
> + size_t bytes_sent;
> + size_t bytes_recvd;
> +};
> +
> +static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
> + struct msg_stats *s, bool tx)
> +{
> + struct msghdr msg = {0};
> + struct iovec *iov;
> + int i, flags = 0;
> +
> + iov = calloc(iov_count, sizeof(struct iovec));
> + if (!iov)
> + return -ENOMEM;
I think errno has already been set to ENOMEM (instead of
-ENOMEM), so may directly use it instead.

> +
> + for (i = 0; i < iov_count; i++) {
> + char *d = calloc(iov_length, sizeof(char));
> +
> + if (!d) {
> + fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
> + free(iov);
> + return -ENOMEM;
The new "out_errno" label below should include freeing
all iov[i].iov_base.

Also, instead of a return here, I think you meant a "goto out_errno;"
such that the earlier "calloc(iov_length, sizeof(char));"
can also be freed during error.

Same for the following error return/goto cases
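A rough sketch of the error-path shape being suggested (illustrative, not the revised patch): a single label frees every iov_base that was allocated, relying on free(NULL) being a no-op for the rest, and errno is returned directly since calloc() already set it:

	iov = calloc(iov_count, sizeof(struct iovec));
	if (!iov)
		return errno;	/* calloc() already set errno to ENOMEM */

	for (i = 0; i < iov_count; i++) {
		char *d = calloc(iov_length, sizeof(char));

		if (!d) {
			fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
			goto out_errno;
		}
		iov[i].iov_base = d;
		iov[i].iov_len = iov_length;
	}
	...
out_errno:
	for (i = 0; i < iov_count; i++)
		free(iov[i].iov_base);
	free(iov);
	return errno;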

> + }
> + iov[i].iov_base = d;
> + iov[i].iov_len = iov_length;
> + }
> +
> + msg.msg_iov = iov;
> + msg.msg_iovlen = iov_count;
> +
> + if (tx) {
> + for (i = 0; i < cnt; i++) {
> + int sent = sendmsg(fd, &msg, flags);
> +
> + if (sent < 0) {
> + perror("send loop error:");
> + free(iov);
> + return sent;
> + }
> + s->bytes_sent += sent;
> + }
> + } else {
> + int slct, recv, max_fd = fd;
> + struct timeval timeout;
> + float total_bytes;
> + fd_set w;
> +
> + total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
> + while (s->bytes_recvd < total_bytes) {
> + timeout.tv_sec = 1;
> + timeout.tv_usec = 0;
> +
> + /* FD sets */
> + FD_ZERO(&w);
> + FD_SET(fd, &w);
> +
> + slct = select(max_fd + 1, &w, NULL, NULL, &timeout);
> + if (slct == -1) {
> + perror("select()");
> + goto out_errno;
> + } else if (!slct) {
> + fprintf(stderr, "unexpected timeout\n");
> + goto out_errno;
Just in case, I think errno == 0 here.

> + }
> +
> + recv = recvmsg(fd, &msg, flags);
> + if (recv < 0) {
> + if (errno != EWOULDBLOCK) {
> + perror("recv failed()\n");
> + goto out_errno;
> + }
> + }
> +
> + s->bytes_recvd += recv;
> + }
> + }
> +
> + for (i = 0; i < iov_count; i++)
> + free(iov[i].iov_base);
> + free(iov);
> + return 0;
> +out_errno:
> + free(iov);
> + return errno;
> +}
> +
> +static int sendmsg_test(int iov_count, int iov_buf, int cnt, int verbose)
> +{
> + struct msg_stats s = {0};
> + int err;
> +
> + err = msg_loop(c1, iov_count, iov_buf, cnt, &s, true);
> + if (err) {
> +

Re: BUG: unable to handle kernel paging request in check_memory_region

2018-01-12 Thread Dmitry Vyukov
On Fri, Jan 12, 2018 at 11:58 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> c92a9a461dff6140c539c61e457aa97df29517d6
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+32b24f3e7c9000c48...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.


Daniel, is it the same bug that was fixed by "bpf, array: fix overflow
in max_entries and undefined behavior in index_mask"?


> audit: type=1400 audit(1515790631.378:9): avc:  denied  { sys_chroot } for
> pid=3510 comm="syzkaller602893" capability=18
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=cap_userns
> permissive=1
> BUG: unable to handle kernel paging request at ed004e875e33
> IP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline]
> IP: memory_is_nonzero mm/kasan/kasan.c:184 [inline]
> IP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline]
> IP: memory_is_poisoned mm/kasan/kasan.c:241 [inline]
> IP: check_memory_region_inline mm/kasan/kasan.c:257 [inline]
> IP: check_memory_region+0x61/0x190 mm/kasan/kasan.c:267
> PGD 21ffee067 P4D 21ffee067 PUD 21ffec067 PMD 0
> Oops:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 3510 Comm: syzkaller602893 Not tainted 4.15.0-rc7+ #259
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:bytes_is_nonzero mm/kasan/kasan.c:166 [inline]
> RIP: 0010:memory_is_nonzero mm/kasan/kasan.c:184 [inline]
> RIP: 0010:memory_is_poisoned_n mm/kasan/kasan.c:210 [inline]
> RIP: 0010:memory_is_poisoned mm/kasan/kasan.c:241 [inline]
> RIP: 0010:check_memory_region_inline mm/kasan/kasan.c:257 [inline]
> RIP: 0010:check_memory_region+0x61/0x190 mm/kasan/kasan.c:267
> RSP: 0018:8801bfa0 EFLAGS: 00010202
> RAX: ed004e875e33 RBX: 8802743af19b RCX: 817deb1c
> RDX:  RSI: 0004 RDI: 8802743af198
> RBP: 8801bfa77780 R08: 11004e875e33 R09: ed004e875e33
> R10: 0001 R11: ed004e875e33 R12: ed004e875e34
> R13: 8802743af198 R14: 8801bfc9f000 R15: 8801c135a680
> FS:  01a1d880() GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ed004e875e33 CR3: 0001bfe22003 CR4: 001606f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  memcpy+0x23/0x50 mm/kasan/kasan.c:302
>  memcpy include/linux/string.h:344 [inline]
>  map_lookup_elem+0x4dc/0xbd0 kernel/bpf/syscall.c:584
>  SYSC_bpf kernel/bpf/syscall.c:1711 [inline]
>  SyS_bpf+0x922/0x4400 kernel/bpf/syscall.c:1685
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x440ac9
> RSP: 002b:007dff68 EFLAGS: 0203 ORIG_RAX: 0141
> RAX: ffda RBX:  RCX: 00440ac9
> RDX: 0018 RSI: 20eed000 RDI: 0001
> RBP:  R08:  R09: 
> R10:  R11: 0203 R12: 004022a0
> R13: 00402330 R14:  R15: 
> Code: 89 f8 49 c1 e8 03 49 89 db 49 c1 eb 03 4d 01 cb 4d 01 c1 4d 8d 63 01
> 4c 89 c8 4d 89 e2 4d 29 ca 49 83 fa 10 7f 3d 4d 85 d2 74 33 <41> 80 39 00 75
> 21 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01
> RIP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline] RSP: 8801bfa0
> RIP: memory_is_nonzero mm/kasan/kasan.c:184 [inline] RSP: 8801bfa0
> RIP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline] RSP:
> 8801bfa0
> RIP: memory_is_poisoned mm/kasan/kasan.c:241 [inline] RSP: 8801bfa0
> RIP: check_memory_region_inline mm/kasan/kasan.c:257 [inline] RSP:
> 8801bfa0
> RIP: check_memory_region+0x61/0x190 mm/kasan/kasan.c:267 RSP:
> 8801bfa0
> CR2: ed004e875e33
> ---[ end trace 769bd3705f3abe78 ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:

Re: [bpf-next PATCH v3 1/7] bpf: refactor sockmap sample program update for arg parsing

2018-01-12 Thread Martin KaFai Lau
On Thu, Jan 11, 2018 at 09:06:17PM -0800, John Fastabend wrote:
> sockmap sample program takes arguments from cmd line but it reads them
> in using offsets into the array. Because we want to add more arguments
> in the future lets do proper argument handling.
> 
> Also refactor code to pull apart sock init and ping/pong test. This
> allows us to add new tests in the future.
> 
> Signed-off-by: John Fastabend 
One nit below.

Acked-by: Martin KaFai Lau 

> ---
[ ... ]
> @@ -280,12 +333,21 @@ int main(int argc, char **argv)
>   return err;
>   }
>  
> - err = sockmap_test_sockets(rate, dot);
> + err = sockmap_init_sockets();
>   if (err) {
>   fprintf(stderr, "ERROR: test socket failed: %d\n", err);
> - return err;
> + goto out;
>   }
> - return 0;
> +
> + err = forever_ping_pong(rate, verbose);
> +out:
> + close(s1);
> + close(s2);
> + close(p1);
> + close(p2);
> + close(c1);
> + close(c2);
close(cg_fd);

> + return err;
>  }
>  
>  void running_handler(int a)
> 


Re: general protection fault in __bpf_map_put

2018-01-12 Thread Dmitry Vyukov
On Wed, Jan 10, 2018 at 1:58 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> b4464bcab38d3f7fe995a7cb960eeac6889bec08
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d2f5524fb46fd3b31...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

Daniel, is it the same bug that was fixed by "bpf, array: fix overflow
in max_entries and undefined behavior in index_mask"?

> audit: type=1400 audit(1515571663.627:11): avc:  denied  { map_read
> map_write } for  pid=3537 comm="syzkaller597104"
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=bpf
> permissive=1
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 23 Comm: kworker/1:1 Not tainted 4.15.0-rc7-next-20180110+ #93
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: events bpf_map_free_deferred
> RIP: 0010:__bpf_map_put+0x64/0x2e0 kernel/bpf/syscall.c:233
> RSP: 0018:8801d98b7458 EFLAGS: 00010293
> RAX: 8801d98ac600 RBX: ad6001bc0dd1 RCX: 817e4454
> RDX:  RSI: 0001 RDI: ad6001bc0dd1
> RBP: 8801d98b74e8 R08: 11003b316e6a R09: 
> R10:  R11:  R12: 11003b316e8c
> R13: dc00 R14: dc00 R15: 0001
> FS:  () GS:8801db30() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 204f9fe4 CR3: 06822003 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:243
>  bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
>  fd_array_map_delete_elem kernel/bpf/arraymap.c:420 [inline]
>  bpf_fd_array_map_clear kernel/bpf/arraymap.c:461 [inline]
>  array_of_map_free+0x100/0x180 kernel/bpf/arraymap.c:618
>  bpf_map_free_deferred+0xb0/0xe0 kernel/bpf/syscall.c:217
>  process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2112
>  worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>  kthread+0x33c/0x400 kernel/kthread.c:238
>  ret_from_fork+0x4b/0x60 arch/x86/entry/entry_64.S:547
> Code: b5 41 48 c7 45 80 1d 2c 59 86 48 c7 45 88 f0 43 7e 81 c7 00 f1 f1 f1
> f1 c7 40 04 00 f2 f2 f2 c7 40 08 f3 f3 f3 f3 e8 9c 1e f2 ff  ff 4b 48 74
> 2f e8 91 1e f2 ff 48 b8 00 00 00 00 00 fc ff df
> RIP: __bpf_map_put+0x64/0x2e0 kernel/bpf/syscall.c:233 RSP: 8801d98b7458
> ---[ end trace 61592f27aaa1e096 ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/94eb2c06de30df3d1605626b941b%40google.com.
> For more options, visit https://groups.google.com/d/optout.


Re: [PATCH net] sctp: removed unused var from sctp_make_auth

2018-01-12 Thread Xin Long
On Fri, Jan 12, 2018 at 12:22 AM, Marcelo Ricardo Leitner
 wrote:
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/sm_make_chunk.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 
> 9bf575f2e8ed0888e0219a872e84018ada5064e0..f08531de5682256064ce35e3d44200caa71c3db8
>  100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -1273,7 +1273,6 @@ struct sctp_chunk *sctp_make_auth(const struct 
> sctp_association *asoc)
> struct sctp_authhdr auth_hdr;
> struct sctp_hmac *hmac_desc;
> struct sctp_chunk *retval;
> -   __u8 *hmac;
>
> /* Get the first hmac that the peer told us to use */
> hmac_desc = sctp_auth_asoc_get_hmac(asoc);
> @@ -1292,7 +1291,7 @@ struct sctp_chunk *sctp_make_auth(const struct 
> sctp_association *asoc)
> retval->subh.auth_hdr = sctp_addto_chunk(retval, sizeof(auth_hdr),
>  &auth_hdr);
>
> -   hmac = skb_put_zero(retval->skb, hmac_desc->hmac_len);
> +   skb_put_zero(retval->skb, hmac_desc->hmac_len);
>
> /* Adjust the chunk header to include the empty MAC */
> retval->chunk_hdr->length =
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reviewed-by: Xin Long 


Re: [PATCH net] sctp: avoid compiler warning on implicit fallthru

2018-01-12 Thread Xin Long
On Fri, Jan 12, 2018 at 12:22 AM, Marcelo Ricardo Leitner
 wrote:
> These fall-through are expected.
>
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/ipv6.c | 1 +
>  net/sctp/outqueue.c | 4 ++--
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 
> 3b18085e3b10253f3f81be7a6747b50ef9357db2..5d4c15bf66d26219415596598a1f72d29b63a798
>  100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -826,6 +826,7 @@ static int sctp_inet6_af_supported(sa_family_t family, 
> struct sctp_sock *sp)
> case AF_INET:
> if (!__ipv6_only_sock(sctp_opt2sk(sp)))
> return 1;
> +   /* fallthru */
> default:
> return 0;
> }
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index 
> 7d67feeeffc1e758ae4be4ef1ddaea23276d1f5e..c4ec99b2015002b273071e6fb1ec3c59c9f61154
>  100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -918,9 +918,9 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
> rtx_timeout, gfp_t gfp)
> break;
>
> case SCTP_CID_ABORT:
> -   if (sctp_test_T_bit(chunk)) {
> +   if (sctp_test_T_bit(chunk))
> packet->vtag = asoc->c.my_vtag;
> -   }
> +   /* fallthru */
> /* The following chunks are "response" chunks, i.e.
>  * they are generated in response to something we
>  * received.  If we are sending these, then we can
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reviewed-by: Xin Long 


[PATCH bpf] bpf: fix 32-bit divide by zero

2018-01-12 Thread Alexei Starovoitov
due to some JITs doing if (src_reg == 0) check in 64-bit mode
for div/mod operations, mask upper 32-bits of src register
before doing the check

Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
Fixes: 7a12b5031c6b ("sparc64: Add eBPF JIT.")
Reported-by: syzbot+48340bb518e88849e...@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov 
---
arm64 jit seems to be ok
haven't analyzed other JITs
It works around the interpreter bug too, but I think
the interpreter worth fixing anyway.
---
 kernel/bpf/verifier.c | 18 ++
 net/core/filter.c |  4 
 2 files changed, 22 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 20eb04fd155e..b7448347e6b6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4445,6 +4445,24 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
int i, cnt, delta = 0;
 
for (i = 0; i < insn_cnt; i++, insn++) {
+   if (insn->code == (BPF_ALU | BPF_MOD | BPF_X) ||
+   insn->code == (BPF_ALU | BPF_DIV | BPF_X)) {
+   /* due to JIT bugs clear upper 32-bits of src register
+* before div/mod operation
+*/
+   insn_buf[0] = BPF_MOV32_REG(insn->src_reg, 
insn->src_reg);
+   insn_buf[1] = *insn;
+   cnt = 2;
+   new_prog = bpf_patch_insn_data(env, i + delta, 
insn_buf, cnt);
+   if (!new_prog)
+   return -ENOMEM;
+
+   delta+= cnt - 1;
+   env->prog = prog = new_prog;
+   insn  = new_prog->insnsi + i + delta;
+   continue;
+   }
+
if (insn->code != (BPF_JMP | BPF_CALL))
continue;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d339ef170df6..1c0eb436671f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -458,6 +458,10 @@ static int bpf_convert_filter(struct sock_filter *prog, 
int len,
> convert_bpf_extensions(fp, &insn))
break;
 
+   if (fp->code == (BPF_ALU | BPF_DIV | BPF_X) ||
+   fp->code == (BPF_ALU | BPF_MOD | BPF_X))
+   *insn++ = BPF_MOV32_REG(BPF_REG_X, BPF_REG_X);
+
*insn = BPF_RAW_INSN(fp->code, BPF_REG_A, BPF_REG_X, 0, 
fp->k);
break;
 
-- 
2.9.5



Re: [PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread Pravin Shelar
On Fri, Jan 12, 2018 at 12:29 PM, William Tu  wrote:
> This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.
>
> The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
> as a nested attribute to support all ERSPAN v1 and v2's fields.
> The current attr is a be32 supporting only one field.  Thus, this
> patch reverts it and later patch will redo it using nested attr.
>
> Signed-off-by: William Tu 
> Cc: Jiri Benc 
> Cc: Pravin Shelar 
Acked-by: Pravin B Shelar 


Re: [PATCH bpf-next v5 0/5] Separate error injection table from kprobes

2018-01-12 Thread Alexei Starovoitov
On Sat, Jan 13, 2018 at 02:53:33AM +0900, Masami Hiramatsu wrote:
> Hi,
> 
> Here are the 5th version of patches to moving error injection
> table from kprobes. This version fixes a bug and update
> fail-function to support multiple function error injection.
> 
> Here is the previous version:
> 
> https://patchwork.ozlabs.org/cover/858663/
> 
> Changes in v5:
>  - [3/5] Fix a bug that within_error_injection returns false always.
>  - [5/5] Update to support multiple function error injection.
> 
> Thank you,

Applied to bpf-next, Thank you Masami.



[PATCH v2] bpf: fix divides by zero

2018-01-12 Thread Eric Dumazet
From: Eric Dumazet 

Divides by zero are not nice, let's avoid them if possible.

Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's 
instruction set")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
v2: kernel patches 101 : do not mangle patch :/

 kernel/bpf/core.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 
51ec2dda7f08c6c90af084589bb6d80662c77d12..7949e8b8f94e9cc196e0449214493ccce61b0903
 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -956,7 +956,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct 
bpf_insn *insn,
DST = tmp;
CONT;
ALU_MOD_X:
-   if (unlikely(SRC == 0))
+   if (unlikely((u32)SRC == 0))
return 0;
tmp = (u32) DST;
DST = do_div(tmp, (u32) SRC);
@@ -975,7 +975,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct 
bpf_insn *insn,
DST = div64_u64(DST, SRC);
CONT;
ALU_DIV_X:
-   if (unlikely(SRC == 0))
+   if (unlikely((u32)SRC == 0))
return 0;
tmp = (u32) DST;
do_div(tmp, (u32) SRC);


Re: [PATCH] bpf: fix divides by zero

2018-01-12 Thread Alexei Starovoitov
On Fri, Jan 12, 2018 at 05:33:26PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Divides by zero are not nice, let's avoid them if possible.
> 
> Also do_div() seems not needed when dealing with 32bit operands,
> but this seems a minor detail.
> 
> Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's 
> instruction set")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> ---
>  kernel/bpf/core.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index
> 51ec2dda7f08c6c90af084589bb6d80662c77d12..7949e8b8f94e9cc196e0449214493
> ccce61b0903 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -956,7 +956,7 @@ static unsigned int ___bpf_prog_run(u64 *regs,
> const struct bpf_insn *insn,
>   DST = tmp;
>   CONT;
>   ALU_MOD_X:
> - if (unlikely(SRC == 0))
> + if (unlikely((u32)SRC == 0))

wow.
Acked-by: Alexei Starovoitov 
we likely need to fix all JITs as well.
At least x64, arm64, sparc have the same bug.

Long term it's probably better to move all such checks out of JITs
and interpreter into the verifier and patch div/mod with
additional 'if src == 0'. This way we can do any type of
error reporting and/or aborting execution.
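For illustration, one possible shape of that verifier-side rewrite, glossing over the exact div/mod return semantics (the insn macros below do exist in include/linux/filter.h, but this is a sketch, not the eventual fix):

	struct bpf_insn patch[] = {
		/* if src_reg != 0, skip the abort and run the original insn */
		BPF_JMP_IMM(BPF_JNE, insn->src_reg, 0, 2),
		BPF_MOV64_IMM(BPF_REG_0, 0),
		BPF_EXIT_INSN(),
		*insn,
	};

The patched sequence would then be spliced in with bpf_patch_insn_data(), much like the 32-bit masking workaround above.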



Re: [PATCH] bpf: fix divides by zero

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 17:33 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 

Sorry for the mangled patch. Will send V2



[PATCH] bpf: fix divides by zero

2018-01-12 Thread Eric Dumazet
From: Eric Dumazet 

Divides by zero are not nice, let's avoid them if possible.

Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's 
instruction set")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 kernel/bpf/core.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index
51ec2dda7f08c6c90af084589bb6d80662c77d12..7949e8b8f94e9cc196e0449214493
ccce61b0903 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -956,7 +956,7 @@ static unsigned int ___bpf_prog_run(u64 *regs,
const struct bpf_insn *insn,
    DST = tmp;
    CONT;
    ALU_MOD_X:
-   if (unlikely(SRC == 0))
+   if (unlikely((u32)SRC == 0))
    return 0;
    tmp = (u32) DST;
    DST = do_div(tmp, (u32) SRC);
@@ -975,7 +975,7 @@ static unsigned int ___bpf_prog_run(u64 *regs,
const struct bpf_insn *insn,
    DST = div64_u64(DST, SRC);
    CONT;
    ALU_DIV_X:
-   if (unlikely(SRC == 0))
+   if (unlikely((u32)SRC == 0))
    return 0;
    tmp = (u32) DST;
    do_div(tmp, (u32) SRC);



[PATCH 2/2] ixgbe: use compiler constants in Rx path

2018-01-12 Thread Shannon Nelson
Rather than swapping runtime bytes to compare to constants, let the
compiler swap the constants and save a couple of runtime cycles.

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index c5ef09f..587fd8f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -806,9 +806,9 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
struct sk_buff *skb)
 {
struct ixgbe_adapter *adapter = netdev_priv(rx_ring->netdev);
-   u16 pkt_info = le16_to_cpu(rx_desc->wb.lower.lo_dword.hs_rss.pkt_info);
-   u16 ipsec_pkt_types = IXGBE_RXDADV_PKTTYPE_IPSEC_AH |
-   IXGBE_RXDADV_PKTTYPE_IPSEC_ESP;
+   __le16 pkt_info = rx_desc->wb.lower.lo_dword.hs_rss.pkt_info;
+   __le16 ipsec_pkt_types = cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPSEC_AH |
+IXGBE_RXDADV_PKTTYPE_IPSEC_ESP);
struct ixgbe_ipsec *ipsec = adapter->ipsec;
struct xfrm_offload *xo = NULL;
struct xfrm_state *xs = NULL;
@@ -825,11 +825,11 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
iph = (struct iphdr *)(skb->data + ETH_HLEN);
c_hdr = (u8 *)iph + iph->ihl * 4;
switch (pkt_info & ipsec_pkt_types) {
-   case IXGBE_RXDADV_PKTTYPE_IPSEC_AH:
+   case cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPSEC_AH):
spi = ((struct ip_auth_hdr *)c_hdr)->spi;
proto = IPPROTO_AH;
break;
-   case IXGBE_RXDADV_PKTTYPE_IPSEC_ESP:
+   case cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPSEC_ESP):
spi = ((struct ip_esp_hdr *)c_hdr)->spi;
proto = IPPROTO_ESP;
break;
-- 
2.7.4



[PATCH 1/2] ixgbe: ipsec offload for sparc

2018-01-12 Thread Shannon Nelson
Add a couple of byteswaps needed to make the ipsec offload
work on big-endian SPARC platforms.

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 3d069a2..c5ef09f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -93,7 +93,7 @@ static void ixgbe_ipsec_set_rx_sa(struct ixgbe_hw *hw, u16 
idx, __be32 spi,
int i;
 
/* store the SPI (in bigendian) and IPidx */
-   IXGBE_WRITE_REG(hw, IXGBE_IPSRXSPI, spi);
+   IXGBE_WRITE_REG(hw, IXGBE_IPSRXSPI, cpu_to_le32(spi));
IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPIDX, ip_idx);
IXGBE_WRITE_FLUSH(hw);
 
@@ -121,7 +121,7 @@ static void ixgbe_ipsec_set_rx_ip(struct ixgbe_hw *hw, u16 
idx, __be32 addr[])
 
/* store the ip address */
for (i = 0; i < 4; i++)
-   IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPADDR(i), addr[i]);
+   IXGBE_WRITE_REG(hw, IXGBE_IPSRXIPADDR(i), cpu_to_le32(addr[i]));
IXGBE_WRITE_FLUSH(hw);
 
ixgbe_ipsec_set_rx_item(hw, idx, ips_rx_ip_tbl);
-- 
2.7.4



Re: [PATCH bpf] bpf: do not modify min/max bounds on scalars with constant values

2018-01-12 Thread Daniel Borkmann
On 01/12/2018 08:52 PM, Edward Cree wrote:
> On 12/01/18 19:23, Daniel Borkmann wrote:
>> syzkaller generated a BPF proglet and triggered a warning with
>> the following:
>>
>>   0: (b7) r0 = 0
>>   1: (d5) if r0 s<= 0x0 goto pc+0
>>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>>   2: (1f) r0 -= r1
>>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>>   verifier internal error: known but bad sbounds
>>
>> What happens is that in the first insn, r0's min/max value are
>> both 0 due to the immediate assignment, later in the jsle test
>> the bounds are updated for the min value in the false path,
>> meaning, they yield smin_val = 1 smax_val = 0,
> reg_set_min_max() refines the existing bounds, it doesn't replace
>  them, so all that's happened is that the jsle handling has
>  demonstrated that this branch can't be taken.

Correct, on the one hand one could argue that for known constants
it cannot get any more refined than this already since we know
the precise constant value the register holds (thus we make the
range worse in this case fwiw), on the other hand why should they
be treated any different than registers with unknown scalars.
True that smin > smax is a result that the branch won't be taken
at runtime.

> That AFAICT isn't confined to known constants, one could e.g.
>  obtain inconsistent bounds with two js* insns.  Updating the
>  bounds in reg_set_min_max() is right, it's where we try to use
>  those sbounds in adjust_ptr_min_max_vals() that's wrong imho;
>  instead the 'known' paths should be using off_reg->var_off.value
>  rather than smin_val everywhere.

Ok, I'll check further on the side-effects and evaluate this
option of using var_off.value for a v2 as well, thanks!

> Alternatively we could consider not following jumps/lack-thereof
>  that produce inconsistent bounds, but that can make insns
>  unreachable that previously weren't and thus reject programs
>  that we previously considered valid, so we probably can't get
>  away with that.

Agree.

Thanks,
Daniel
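To make Edward's two-jump example concrete, in the same notation as the trace above (hypothetical program, r1 an unknown scalar, both branches fall through):

	1: (65) if r1 s> 0x5 goto pc+3    ; fall-through refines smax_value = 5
	2: (c5) if r1 s< 0xa goto pc+2    ; fall-through refines smin_value = 10

After insn 2 the register carries smin_value > smax_value even though it never held a known constant, which is the same "impossible path" situation discussed above.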


[patch 1/1] net/ipv6/route.c: work around gcc-4.4.4 anon union initializer issue

2018-01-12 Thread akpm
From: Andrew Morton 
Subject: net/ipv6/route.c: work around gcc-4.4.4 anon union initializer issue

gcc-4.4.4 has problems with initializers of anonymous union fields.

net/ipv6/route.c: In function 'rt6_sync_up':
net/ipv6/route.c:3586: error: unknown field 'nh_flags' specified in initializer
net/ipv6/route.c:3586: warning: missing braces around initializer
net/ipv6/route.c:3586: warning: (near initialization for 'arg.')
net/ipv6/route.c: In function 'rt6_sync_down_dev':
net/ipv6/route.c:3695: error: unknown field 'event' specified in initializer
net/ipv6/route.c:3695: warning: missing braces around initializer
net/ipv6/route.c:3695: warning: (near initialization for 'arg.')

Fixes: 2127d95aef6c ("ipv6: Clear nexthop flags upon netdev up")
Cc: Ido Schimmel 
Cc: David Ahern 
Cc: David S. Miller 
Signed-off-by: Andrew Morton 
---

 net/ipv6/route.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN 
net/ipv6/route.c~net-ipv6-routec-work-around-gcc-444-anon-union-initializer-issue
 net/ipv6/route.c
--- 
a/net/ipv6/route.c~net-ipv6-routec-work-around-gcc-444-anon-union-initializer-issue
+++ a/net/ipv6/route.c
@@ -3583,7 +3583,7 @@ void rt6_sync_up(struct net_device *dev,
 {
struct arg_netdev_event arg = {
.dev = dev,
-   .nh_flags = nh_flags,
+   { .nh_flags = nh_flags, },
};
 
if (nh_flags & RTNH_F_DEAD && netif_carrier_ok(dev))
@@ -3692,7 +3692,7 @@ void rt6_sync_down_dev(struct net_device
 {
struct arg_netdev_event arg = {
.dev = dev,
-   .event = event,
+   { .event = event, },
};
 
fib6_clean_all(dev_net(dev), fib6_ifdown, &arg);
_
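A reduced illustration of what gcc-4.4.4 trips over (the struct shape is approximated from net/ipv6/route.c, and the snippet is only meant to show the two initializer forms):

	struct arg_netdev_event {
		const struct net_device *dev;
		union {
			unsigned int nh_flags;
			unsigned long event;
		};
	};

	/* rejected by gcc-4.4.4: designator names an anonymous-union member */
	struct arg_netdev_event a = { .dev = dev, .nh_flags = nh_flags };

	/* accepted: brace the anonymous union explicitly, as in the patch */
	struct arg_netdev_event b = { .dev = dev, { .nh_flags = nh_flags } };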


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-12 Thread Tony Luck
On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
 wrote:
> Should the array access in entry_SYSCALL_64_fastpath be made to use
> the masking approach?

That one has a bounds check for an inline constant.

 cmpq$__NR_syscall_max, %rax

so should be safe.

The classic Spectre variant #1 code sequence is:

int array_size;

   if (x < array_size) {
   something with array[x]
   }

which runs into problems because the array_size variable may not
be in cache, and while the CPU core is waiting for the value it
speculates inside the "if" body.

The syscall entry is more like:

#define ARRAY_SIZE 10

 if (x < ARRAY_SIZE) {
  something with array[x]
 }

Here there isn't any reason for speculation. The core has the
value of 'x' in a register and the upper bound encoded into the
"cmp" instruction.  Both are right there, no waiting, no speculation.

-Tony
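For reference, the "masking approach" mentioned at the top of the thread replaces the speculatable bounds branch with a data dependency, roughly as follows (an illustrative sketch of the idea, not the helper proposed in the series):

	/* all-ones when idx < size, 0 otherwise, computed without a branch
	 * the CPU could speculate past (assumes size <= LONG_MAX)
	 */
	static inline unsigned long index_mask(unsigned long idx,
					       unsigned long size)
	{
		return ~(long)(idx | (size - 1 - idx)) >> (BITS_PER_LONG - 1);
	}

	if (x < array_size)
		do_something(array[x & index_mask(x, array_size)]);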


pull-request: bpf 2018-01-13

2018-01-12 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Follow-up fix to the recent BPF out-of-bounds speculation
   fix that prevents max_entries overflows and an undefined
   behavior on 32 bit archs on index_mask calculation, from
   Daniel.

2) Reject unsupported BPF_ARSH opcode in 32 bit ALU mode that
   was otherwise throwing an unknown opcode warning in the
   interpreter, from Daniel.

3) Typo fix in one of the user facing verbose() messages that
   was added during the BPF out-of-bounds speculation fix,
   from Colin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 661e4e33a984fbd05e6b573ce4bb639ca699c130:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-01-10 
11:17:21 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to bbeb6e4323dad9b5e0ee9f60c223dd532e2403b1:

  bpf, array: fix overflow in max_entries and undefined behavior in index_mask 
(2018-01-10 14:46:39 -0800)


Colin Ian King (1):
  bpf: fix spelling mistake: "obusing" -> "abusing"

Daniel Borkmann (2):
  bpf: arsh is not supported in 32 bit alu thus reject it
  bpf, array: fix overflow in max_entries and undefined behavior in 
index_mask

 kernel/bpf/arraymap.c   | 18 ++---
 kernel/bpf/verifier.c   |  7 -
 tools/testing/selftests/bpf/test_verifier.c | 40 +
 3 files changed, 61 insertions(+), 4 deletions(-)


Re: [bpf-next PATCH] bpf: simplify xdp_convert_ctx_access for xdp_rxq_info

2018-01-12 Thread Daniel Borkmann
On 01/11/2018 05:39 PM, Jesper Dangaard Brouer wrote:
> As pointed out by Daniel Borkmann, using bpf_target_off() is not
> necessary for xdp_rxq_info when extracting queue_index and
> ifindex, as these members are u32 like BPF_W.
> 
> Also fix trivial spelling mistake introduced in same commit.
> 
> Fixes: 02dd3291b2f0 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
> Reported-by: Daniel Borkmann 
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks Jesper!


Re: [PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams
On Fri, Jan 12, 2018 at 12:01 PM, Christian Lamparter
 wrote:
> On Friday, January 12, 2018 7:39:50 PM CET Dan Williams wrote:
>> On Fri, Jan 12, 2018 at 6:42 AM, Christian Lamparter  
>> wrote:
>> > On Friday, January 12, 2018 1:47:46 AM CET Dan Williams wrote:
>> >> Static analysis reports that 'queue' may be a user controlled value that
>> >> is used as a data dependency to read from the 'ar9170_qmap' array. In
>> >> order to avoid potential leaks of kernel memory values, block
>> >> speculative execution of the instruction stream that could issue reads
>> >> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
>> >> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
>> >> 'ar->edcf' array.
>> >>
>> >> Based on an original patch by Elena Reshetova.
>> >>
>> >> Cc: Christian Lamparter 
>> >> Cc: Kalle Valo 
>> >> Cc: linux-wirel...@vger.kernel.org
>> >> Cc: netdev@vger.kernel.org
>> >> Signed-off-by: Elena Reshetova 
>> >> Signed-off-by: Dan Williams 
>> >> ---
>> > This patch (and p54, cw1200) look like the same patch?!
>> > Can you tell me what happend to:
>> >
>> > On Saturday, January 6, 2018 5:34:03 PM CET Dan Williams wrote:
>> >> On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter  
>> >> wrote:
>> >> > And Furthermore a invalid queue (param->ac) would cause a crash in
>> >> > this line in mac80211 before it even reaches the driver [1]:
>> >> > |   sdata->tx_conf[params->ac] = p;
>> >> > |   
>> >> > |   if (drv_conf_tx(local, sdata, params->ac, &p)) {
>> >> > |^^ (this is a wrapper for the *_op_conf_tx)
>> >> >
>> >> > I don't think these chin-up exercises are needed.
>> >>
>> >> Quite the contrary, you've identified a better place in the call stack
>> >> to sanitize the input and disable speculation. Then we can kill the
>> >> whole class of the wireless driver reports at once it seems.
>> > 
>>
>> I didn't see where ac is being validated against the driver specific
>> 'queues' value in that earlier patch.
> The link to the check is right there in the earlier post. It's in
> parse_txq_params():
> 
> |   if (txq_params->ac >= NL80211_NUM_ACS)
> |   return -EINVAL;
>
> NL80211_NUM_ACS is 4
> 
>
> This check was added ever since mac80211's ieee80211_set_txq_params():
> | sdata->tx_conf[params->ac] = p;
>
> For cw1200: the driver just sets the dev->queue to 4.
> In carl9170 dev->queues is set to __AR9170_NUM_TXQ and
> p54 uses P54_QUEUE_AC_NUM.
>
> Both __AR9170_NUM_TXQ and P54_QUEUE_AC_NUM are 4.
> And this is not going to change since all drivers
> have to follow mac80211's queue API:
> 
>
> Some background:
> In the old days (linux 2.6 and early 3.x), the parse_txq_params() function did
> not verify the "queue" value. That's why these drivers had to do it.
>
> Here's the relevant code from 2.6.39:
> 
> You'll notice that the check is missing there.
> Here's mac80211's ieee80211_set_txq_params for reference:
> 
>
> However over time, the check in the driver has become redundant.
>

Excellent, thank you for pointing that out and the background so clearly.

What this tells me though is that we want to inject an ifence() at
this input validation point, i.e.:

if (txq_params->ac >= NL80211_NUM_ACS) {
ifence();
return -EINVAL;
}

...but the kernel, in these patches, only has ifence() defined for
x86. The only way we can sanitize the 'txq_params->ac' value without
ifence() is to do it at array access time, but then we're stuck
touching all drivers when standard kernel development practice says
'refactor common code out of drivers'.

Ugh, the bigger concern is that this driver is being flagged and not
that initial bounds check. Imagine if cw1200 and p54 had already been
converted to assume that they can just trust 'queue'. It would never
have been flagged.

I think we should focus on the get_user path and  __fcheck_files for v3.


Re: [PATCH][next] bnxt_en: ensure len is ininitialized to zero

2018-01-12 Thread Andy Gospodarek
On Fri, Jan 12, 2018 at 10:11:17AM -0800, Michael Chan wrote:
> On Fri, Jan 12, 2018 at 9:46 AM, Colin King  wrote:
> > From: Colin Ian King 
> >
> > In the case where cmp_type == CMP_TYPE_RX_L2_TPA_START_CMP the
> > exit return path is via label next_rx_no_prod and cpr->rx_bytes
> > is being updated by an uninitialized value from len. Fix this by
> > initializing len to zero.
> >
> > Detected by CoverityScan, CID#1463807 ("Uninitialized scalar variable")
> >
> > Fixes: 6a8788f25625 ("bnxt_en: add support for software dynamic interrupt 
> > moderation")
> > Signed-off-by: Colin Ian King 
> > ---
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> > b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index cf6ebf1e324b..5b5c4f266f1b 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -1482,7 +1482,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct 
> > bnxt_napi *bnapi, u32 *raw_cons,
> > u32 tmp_raw_cons = *raw_cons;
> > u16 cfa_code, cons, prod, cp_cons = RING_CMP(tmp_raw_cons);
> > struct bnxt_sw_rx_bd *rx_buf;
> > -   unsigned int len;
> > +   unsigned int len = 0;
> 
> It might be better to add a new label next_rx_no_prod_no_len and have
> the TPA code paths jump there instead.
> 
> Andy, what do you think?
> 

Yes, I think that would be a better fix.  Making the TPA vs not-TPA as
explicit and intentional as possible seems like a good idea.

> > u8 *data_ptr, agg_bufs, cmp_type;
> > dma_addr_t dma_addr;
> > struct sk_buff *skb;
> > --
> > 2.15.1
> >
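A rough sketch of the two-label exit being discussed (hypothetical shape only; the label and field names follow the mail above, and the trailing statements are placeholders rather than the actual bnxt code):

	/* non-TPA completions set len and fall through */
next_rx_no_prod:
	cpr->rx_bytes += len;
	/* TPA start/end completions jump straight here instead */
next_rx_no_prod_no_len:
	*raw_cons = tmp_raw_cons;
	return rc;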


Re: [RFC bpf-next] bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info

2018-01-12 Thread Jakub Kicinski
On Fri, 12 Jan 2018 12:31:15 +0100, Daniel Borkmann wrote:
> On 01/12/2018 03:17 AM, Jakub Kicinski wrote:
> > On Thu, 11 Jan 2018 16:47:47 -0800, Jakub Kicinski wrote:  
> >> Hi!
> >>
> >> Jiong is working on dumping JITed NFP image via bpftool, Francois will be
> >> submitting support for NFP in binutils soon (whoop! :)).
> >>
> >> We would appreciate if you could weigh in on the uAPI.  Is it OK to reuse
> >> the existing jited_prog_len/jited_prog_insns or should we add separate
> >> 2 new fields (plus the arch name) to avoid confusing old user space?  
> > 
> > Ah, I skipped one chunk of Jiong's patch here, this would also be
> > necessary if we reuse fields:  
>
> What kind of string would sit in jited_arch_name for nfp? Given you know
> the {ifindex, netns_dev, netns_ino} 3‑tuple, isn't it possible to infer
> the driver info from ethtool (e.g. nfp_net_get_drvinfo()) already and do
> the mapping for binutils? 

Right, the inference can certainly work.  Probably by matching the PCI
ID of the device?  Or we can just assume it's the NFP since there is
only one upstream BPF offload :)

> Given we know when ifindex is 0, then it's always host JITed and the
> current approach with bfd_openr() works fine, and if ifindex is non-0
> it is /always/ device offloaded to the one bound in the ifindex so
> JIT dump will be specific to that device only and never host one. Not
> at all opposed to extending uapi, just a question on my side to get a
> better understanding first wrt arch string (maybe rough but complete
> patch with nfp + bpftool bits might help too).

I was advocating for full separate set of fields, Jiong was trying for
a reuse, and we sort of met in the middle :)  Depends on how one looks
at the definition of the jited_prog_insn field, old user space is not
guaranteed to pay attention to ifindex.  We will drop the arch and reuse
host fields if that's the direction you're leaning in.
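If the inference route were taken, a minimal user-space sketch could look like this (error handling omitted; needs <linux/ethtool.h>, <linux/sockios.h> and <net/if.h>, and the driver-name-to-bfd-arch mapping would be bpftool's own table):

	struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
	struct ifreq ifr = { 0 };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if_indextoname(info.ifindex, ifr.ifr_name);
	ifr.ifr_data = (void *)&drvinfo;
	ioctl(fd, SIOCETHTOOL, &ifr);

	if (!strcmp(drvinfo.driver, "nfp"))
		/* map to the NFP bfd arch for disassembly */;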


[PATCH v2] phy: realtek: use new helpers for paged register access

2018-01-12 Thread Heiner Kallweit
Make use of the new helpers for paged register access.

Signed-off-by: Heiner Kallweit 
---
v2:
- use accessor versions w/o locking in the read/write_page callbacks
---
 drivers/net/phy/realtek.c | 59 +++
 1 file changed, 14 insertions(+), 45 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 7c1bf688d..887bad07a 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -41,37 +41,14 @@ MODULE_DESCRIPTION("Realtek PHY driver");
 MODULE_AUTHOR("Johnson Leung");
 MODULE_LICENSE("GPL");
 
-static int rtl8211x_page_read(struct phy_device *phydev, u16 page, u16 address)
+static int rtl821x_read_page(struct phy_device *phydev)
 {
-   int ret;
-
-   ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
-   if (ret)
-   return ret;
-
-   ret = phy_read(phydev, address);
-
-   /* restore to default page 0 */
-   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
-
-   return ret;
+   return __phy_read(phydev, RTL821x_PAGE_SELECT);
 }
 
-static int rtl8211x_page_write(struct phy_device *phydev, u16 page,
-  u16 address, u16 val)
+static int rtl821x_write_page(struct phy_device *phydev, int page)
 {
-   int ret;
-
-   ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
-   if (ret)
-   return ret;
-
-   ret = phy_write(phydev, address, val);
-
-   /* restore to default page 0 */
-   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
-
-   return ret;
+   return __phy_write(phydev, RTL821x_PAGE_SELECT, page);
 }
 
 static int rtl8201_ack_interrupt(struct phy_device *phydev)
@@ -96,7 +73,7 @@ static int rtl8211f_ack_interrupt(struct phy_device *phydev)
 {
int err;
 
-   err = rtl8211x_page_read(phydev, 0xa43, RTL8211F_INSR);
+   err = phy_read_paged(phydev, 0xa43, RTL8211F_INSR);
 
return (err < 0) ? err : 0;
 }
@@ -110,7 +87,7 @@ static int rtl8201_config_intr(struct phy_device *phydev)
else
val = 0;
 
-   return rtl8211x_page_write(phydev, 0x7, RTL8201F_IER, val);
+   return phy_write_paged(phydev, 0x7, RTL8201F_IER, val);
 }
 
 static int rtl8211b_config_intr(struct phy_device *phydev)
@@ -148,36 +125,24 @@ static int rtl8211f_config_intr(struct phy_device *phydev)
else
val = 0;
 
-   return rtl8211x_page_write(phydev, 0xa42, RTL821x_INER, val);
+   return phy_write_paged(phydev, 0xa42, RTL821x_INER, val);
 }
 
 static int rtl8211f_config_init(struct phy_device *phydev)
 {
int ret;
-   u16 val;
+   u16 val = 0;
 
ret = genphy_config_init(phydev);
if (ret < 0)
return ret;
 
-   ret = rtl8211x_page_read(phydev, 0xd08, 0x11);
-   if (ret < 0)
-   return ret;
-
-   val = ret & 0x;
-
/* enable TX-delay for rgmii-id and rgmii-txid, otherwise disable it */
if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID)
-   val |= RTL8211F_TX_DELAY;
-   else
-   val &= ~RTL8211F_TX_DELAY;
-
-   ret = rtl8211x_page_write(phydev, 0xd08, 0x11, val);
-   if (ret)
-   return ret;
+   val = RTL8211F_TX_DELAY;
 
-   return 0;
+   return phy_modify_paged(phydev, 0xd08, 0x11, RTL8211F_TX_DELAY, val);
 }
 
 static struct phy_driver realtek_drvs[] = {
@@ -197,6 +162,8 @@ static struct phy_driver realtek_drvs[] = {
.config_intr= _config_intr,
.suspend= genphy_suspend,
.resume = genphy_resume,
+   .read_page  = rtl821x_read_page,
+   .write_page = rtl821x_write_page,
}, {
.phy_id = 0x001cc912,
.name   = "RTL8211B Gigabit Ethernet",
@@ -236,6 +203,8 @@ static struct phy_driver realtek_drvs[] = {
.config_intr= _config_intr,
.suspend= genphy_suspend,
.resume = genphy_resume,
+   .read_page  = rtl821x_read_page,
+   .write_page = rtl821x_write_page,
},
 };
 
-- 
2.15.1



Re: [PATCH] phy: realtek: use new helpers for paged register access

2018-01-12 Thread Heiner Kallweit
Am 12.01.2018 um 22:01 schrieb Andrew Lunn:
> On Fri, Jan 12, 2018 at 09:30:08PM +0100, Heiner Kallweit wrote:
>> Make use of the new helpers for paged register access.
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  drivers/net/phy/realtek.c | 59 
>> +++
>>  1 file changed, 14 insertions(+), 45 deletions(-)
>>
>> diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
>> index 7c1bf688d..887bad07a 100644
>> --- a/drivers/net/phy/realtek.c
>> +++ b/drivers/net/phy/realtek.c
>> @@ -41,37 +41,14 @@ MODULE_DESCRIPTION("Realtek PHY driver");
>>  MODULE_AUTHOR("Johnson Leung");
>>  MODULE_LICENSE("GPL");
>>  
>> -static int rtl8211x_page_read(struct phy_device *phydev, u16 page, u16 
>> address)
>> +static int rtl821x_read_page(struct phy_device *phydev)
>>  {
>> -int ret;
>> -
>> -ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
>> -if (ret)
>> -return ret;
>> -
>> -ret = phy_read(phydev, address);
>> -
>> -/* restore to default page 0 */
>> -phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
>> -
>> -return ret;
>> +return phy_read(phydev, RTL821x_PAGE_SELECT);
> 
> Hi Heiner
> 
> It think this is wrong. You need to use __phy_read(). phy_read() does
> an mdiobus_read(), which takes the bus->mdio_lock. However,
> 
Uh, you are of course right.

> int phy_save_page(struct phy_device *phydev)
> {
> mutex_lock(&phydev->mdio.bus->mdio_lock);
> return __phy_read_page(phydev);
> }
> 
> This also takes the same lock. So this should deadlock.
> 
> Try turning on CONFIG_PROVE_LOCKING, it will detect problems like
> this.
> 
Thanks for the hint. I have this option set, however my test system
doesn't use the irq-related callbacks. So it wasn't able to report
the issue. My fault.

>   Andrew
> 



[RFC bpf-next 2/2] tools: bpftool: improve architecture detection by using offload arch info

2018-01-12 Thread Jakub Kicinski
From: Jiong Wang 

The current architecture detection method in bpftool is designed for the host
case.

For the offload case, we can't use the architecture of "bpftool" itself. We
use the new arch name field in bpf_prog_info and use bfd_scan_arch to
return the correct bfd arch.

Signed-off-by: Jiong Wang 
---
 tools/bpf/bpftool/jit_disasm.c | 16 +++-
 tools/bpf/bpftool/main.h   |  3 ++-
 tools/bpf/bpftool/prog.c   |  7 ++-
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index 57d32e8a1391..a54fc0695a50 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -76,7 +76,8 @@ static int fprintf_json(void *out, const char *fmt, ...)
return 0;
 }
 
-void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes)
+void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
+  char *arch)
 {
disassembler_ftype disassemble;
struct disassemble_info info;
@@ -100,6 +101,19 @@ void disasm_print_insn(unsigned char *image, ssize_t len, 
int opcodes)
else
init_disassemble_info(, stdout,
  (fprintf_ftype) fprintf);
+
+   /* Update architecture info for offload. */
+   if (arch) {
+   const bfd_arch_info_type *inf = bfd_scan_arch(arch);
+
+   if (inf) {
+   bfdf->arch_info = inf;
+   } else {
+   p_err("No libfd support for %s", arch);
+   return;
+   }
+   }
+
info.arch = bfd_get_arch(bfdf);
info.mach = bfd_get_mach(bfdf);
info.buffer = image;
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 65b526fe6e7e..0c2898a68340 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -121,7 +121,8 @@ int do_cgroup(int argc, char **arg);
 
 int prog_parse_fd(int *argc, char ***argv);
 
-void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes);
+void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
+  char *arch);
 void print_hex_data_json(uint8_t *data, size_t len);
 
 #endif
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index c6a28be4665c..97c2649f71f8 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -775,7 +775,12 @@ static int do_dump(int argc, char **argv)
}
} else {
if (member_len == _prog_len) {
-   disasm_print_insn(buf, *member_len, opcodes);
+   if (info.ifindex)
+   disasm_print_insn(buf, *member_len, opcodes,
+ info.offload_arch_name);
+   else
+   disasm_print_insn(buf, *member_len, opcodes,
+ NULL);
} else {
kernel_syms_load();
if (json_output)
-- 
2.15.1



[RFC bpf-next 1/2] nfp: bpf: set new jit info fields

2018-01-12 Thread Jakub Kicinski
From: Jiong Wang 

This patch sets the new jit info fields introduced in the previous patch for NFP.

Signed-off-by: Jiong Wang 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.h|  2 ++
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 12 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 66381afee2a9..41ee68eb7ddc 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -42,6 +42,8 @@
 
 #include "../nfp_asm.h"
 
+#define NFP_ARCH_NAME  "NFP-6xxx"
+
 /* For relocation logic use up-most byte of branch instruction as scratch
  * area.  Remember to clear this before sending instructions to HW!
  */
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 320b2250d29a..8862d1e50bf5 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -124,6 +124,7 @@ static int nfp_bpf_translate(struct nfp_net *nn, struct 
bpf_prog *prog)
struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv;
unsigned int stack_size;
unsigned int max_instr;
+   int err;
 
stack_size = nn_readb(nn, NFP_NET_CFG_BPF_STACK_SZ) * 64;
if (prog->aux->stack_depth > stack_size) {
@@ -140,7 +141,16 @@ static int nfp_bpf_translate(struct nfp_net *nn, struct 
bpf_prog *prog)
if (!nfp_prog->prog)
return -ENOMEM;
 
-   return nfp_bpf_jit(nfp_prog);
+   err = nfp_bpf_jit(nfp_prog);
+   if (err)
+   return err;
+
+   prog->aux->offload->jited_len = nfp_prog->prog_len * sizeof(u64);
+   prog->aux->offload->jited_image = nfp_prog->prog;
+   memcpy(prog->aux->offload->jited_arch_name, NFP_ARCH_NAME,
+  sizeof(NFP_ARCH_NAME));
+
+   return 0;
 }
 
 static int nfp_bpf_destroy(struct nfp_net *nn, struct bpf_prog *prog)
-- 
2.15.1



Re: dvb usb issues since kernel 4.9

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 19:13 -0200, Mauro Carvalho Chehab wrote:
> 
> 
> The .config file used to build the Kernel is at:
>   https://pastebin.com/wpZghann
> 

Hi Mauro

Any chance you can try CONFIG_HZ_1000=y, CONFIG_HZ=1000 ?

Thanks.



[PATCH net-next 2/2 v11] net: ethernet: Add a driver for Gemini gigabit ethernet

2018-01-12 Thread Linus Walleij
The Gemini ethernet has been around for years as an out-of-tree
patch used with the NAS boxen and routers built on StorLink
SL3512 and SL3516, later Storm Semiconductor, later Cortina
Systems. These ASICs are still being deployed and brand new
off-the-shelf systems using them can easily be acquired.

The full name of the IP block is "Net Engine and Gigabit
Ethernet MAC" commonly just called "GMAC".

The hardware block contains a common TCP Offload Engine (TOE)
that can be used by both MACs. The current driver does not use
it.

Cc: Tobias Waldvogel 
Signed-off-by: Michał Mirosław 
Signed-off-by: Linus Walleij 
---
Changes from v10:
- Fix module compile error caused by the driver using two
  module_platform_driver() initcalls: this call has the
  limitation that you can only have one such call per file
  when compiling modules (it works fine when built-in!).
  Scratched my head and worked around it by using open coded
  module_init() and module_exit() instead like in the old
  times, it works fine (see the sketch after this change list).
- Fix print formatting warnings caused by 32 vs 64 bit difference
  by either removing the offending debug prints or using the
  %pap modifier for pointer by reference when printing
  resource_size_t-sized information.
- Make gmac_coalesce_delay_expired() after hint from the
  build robot sparse run.
- Fixed an ugly cast identified by sparse by using a helper
  variable.
- Pushed to build servers for wide compile test so I don't
  get any more surprising build failures to be ashamed of...
  branch HEAD: c5fab9815a7a1857155a7911fbce85d1af11ab8f
  net: ethernet: Add a driver for Gemini gigabit ethernet
  elapsed time: 50m
  configs tested: 120
  The following configs have been built successfully.
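
  A rough sketch of the work-around described in the first item above,
  i.e. one open coded module_init()/module_exit() pair registering both
  platform drivers (the driver struct names below are illustrative and
  may not match the ones used in the patch):

static struct platform_driver gemini_ethernet_port_driver;	/* per-port, hooks omitted */
static struct platform_driver gemini_ethernet_driver;		/* parent, hooks omitted */

static int __init gemini_ethernet_module_init(void)
{
	int ret;

	ret = platform_driver_register(&gemini_ethernet_port_driver);
	if (ret)
		return ret;

	ret = platform_driver_register(&gemini_ethernet_driver);
	if (ret)
		platform_driver_unregister(&gemini_ethernet_port_driver);

	return ret;
}
module_init(gemini_ethernet_module_init);

static void __exit gemini_ethernet_module_exit(void)
{
	platform_driver_unregister(&gemini_ethernet_driver);
	platform_driver_unregister(&gemini_ethernet_port_driver);
}
module_exit(gemini_ethernet_module_exit);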

Changes from v9:
- Get rid of dma_to_pfn() when looking up pages from their
  physical address.
- Replace the freeq_page_tab with a linear array of page
  pointers and mapping info that can be searched backwards
  to find the corresponding page for a mapping that comes
  in from the hardware.
- Drop the lock around resizing the queue after disabling
  IRQs: the lock should just protect these registers anyway.

Changes from v8:
- Remove dependency guards in Kconfig to get a wider compile
  coverage for the driver to detect broken APIs etc.

Changes from v7:
- Dropped all the typedefs and use structs and unions
  directly in the code.
- Pile all local variables in inverse christmas-tree descending
  order. Rewrite code and move assignments to make this strict.
- Cut the uppercase type names in the process.
- Drop a whole bunch of unused unions and types. If we want to
  unionize these registers when we add functionality then do so
  later.
- Do not disallow mapping 0 however unlikely.
- Do not issue any nasty BUG_ON() for unaligned allocations, but
  fail gracefully instead.
- Update stats on linearized TX fragments even if mapping fails.
- Update RX stats on the (rare) failed SKB from NAPI frags too.
- Set up the mask correctly in the IFF_ALLMULTI RX mode case.
- Pick up the DT node name changes.
- Fix up a bunch of typing: explicit unsigned int, switch to
  u32 where we certainly deal with that.
- Drop a whole slew of pointless unlikely() markups.
- Fix some UTF-8 flunky.
- Fixed a few thousand checkpatch errors/warnings. Kept a very
  few select ones I didn't find reasonable.

Changes from v6:
- Drop all arch support code using the old board files.
- Adapted for device tree probing
- Getting all resources using devm_* accessors where applicable
- Split in parent ethernet device and two per-port devices
  that get spawned from the parent. This is necessary with
  device tree and other aspects of the PHY device model and
  device tree structure that requires a 1:1 mapping between
  a device and PHY to work properly.
- Grab clocks and reset handles as resources from the clock
  and reset subsystems infrastructure instead of open coding
  access to system devices.
- Let the pin control subsystem deal with setting up the
  multiplexing and clock skew/delay settings of the RGMII
  lines.
- A separate SoC driver was created to deal with setting up
  bus arbitration and will be merged separately.
- Tested with the D-Link DNS-313 NAS box with a Realtek RTL8211B
  transceiver.
- Rename and move code around to fit better with the new device
  handling with a top level device and two children.
- Order code as net vendor Cortina and adapter Gemini. We have
  confirmed with Faraday that this network device is not from
  them (which was initially suspected).
- Rebased onto v4.15-rc1

Changes from v5:
 - merge arch setup code into the patch
 - move platform data include to include/linux/platform_data/gemini_gmac.h
 - use new hw_features instead of ethtool_ops for offload setting
 - add some #ifdefs for build testing on other arches
 - a bit of cleanups

Changes from v4:
 - rebased on upcoming 2.6.38 (removal of page_to_dma() and per-txq stats)
 - removed setting 

[PATCH net-next 1/2 v11] net: ethernet: Add DT bindings for the Gemini ethernet

2018-01-12 Thread Linus Walleij
This adds the device tree bindings for the Gemini ethernet
controller. It is pretty straight-forward, using standard
bindings and modelling the two child ports as child devices
under the parent ethernet controller device.

Cc: devicet...@vger.kernel.org
Cc: Tobias Waldvogel 
Cc: Michał Mirosław 
Reviewed-by: Rob Herring 
Signed-off-by: Linus Walleij 
---
ChangeLog v10->v11:
- Resend with the driver.
ChangeLog v9->v10:
- Resend with the driver.
ChangeLog v8->v9:
- Collect Rob's ACK.
ChangeLog v7->v8:
- Use ethernet-port@0 and ethernet-port@1 with unit names
  and following OF graph requirements.
---
 .../bindings/net/cortina,gemini-ethernet.txt   | 92 ++
 1 file changed, 92 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/cortina,gemini-ethernet.txt

diff --git a/Documentation/devicetree/bindings/net/cortina,gemini-ethernet.txt 
b/Documentation/devicetree/bindings/net/cortina,gemini-ethernet.txt
new file mode 100644
index ..6c559981d110
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/cortina,gemini-ethernet.txt
@@ -0,0 +1,92 @@
+Cortina Systems Gemini Ethernet Controller
+==
+
+This ethernet controller is found in the Gemini SoC family:
+StorLink SL3512 and SL3516, also known as Cortina Systems
+CS3512 and CS3516.
+
+Required properties:
+- compatible: must be "cortina,gemini-ethernet"
+- reg: must contain the global registers and the V-bit and A-bit
+  memory areas, in total three register sets.
+- syscon: a phandle to the system controller
+- #address-cells: must be specified, must be <1>
+- #size-cells: must be specified, must be <1>
+- ranges: should be state like this giving a 1:1 address translation
+  for the subnodes
+
+The subnodes represents the two ethernet ports in this device.
+They are not independent of each other since they share resources
+in the parent node, and are thus children.
+
+Required subnodes:
+- port0: contains the resources for ethernet port 0
+- port1: contains the resources for ethernet port 1
+
+Required subnode properties:
+- compatible: must be "cortina,gemini-ethernet-port"
+- reg: must contain two register areas: the DMA/TOE memory and
+  the GMAC memory area of the port
+- interrupts: should contain the interrupt line of the port.
+  this is nominally a level interrupt active high.
+- resets: this must provide an SoC-integrated reset line for
+  the port.
+- clocks: this should contain a handle to the PCLK clock for
+  clocking the silicon in this port
+- clock-names: must be "PCLK"
+
+Optional subnode properties:
+- phy-mode: see ethernet.txt
+- phy-handle: see ethernet.txt
+
+Example:
+
+mdio-bus {
+   (...)
+   phy0: ethernet-phy@1 {
+   reg = <1>;
+   device_type = "ethernet-phy";
+   };
+   phy1: ethernet-phy@3 {
+   reg = <3>;
+   device_type = "ethernet-phy";
+   };
+};
+
+
+ethernet@6000 {
+   compatible = "cortina,gemini-ethernet";
+   reg = <0x6000 0x4000>, /* Global registers, queue */
+ <0x60004000 0x2000>, /* V-bit */
+ <0x60006000 0x2000>; /* A-bit */
+   syscon = <>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   gmac0: ethernet-port@0 {
+   compatible = "cortina,gemini-ethernet-port";
+   reg = <0x60008000 0x2000>, /* Port 0 DMA/TOE */
+ <0x6000a000 0x2000>; /* Port 0 GMAC */
+   interrupt-parent = <>;
+   interrupts = <1 IRQ_TYPE_LEVEL_HIGH>;
+   resets = < GEMINI_RESET_GMAC0>;
+   clocks = < GEMINI_CLK_GATE_GMAC0>;
+   clock-names = "PCLK";
+   phy-mode = "rgmii";
+   phy-handle = <>;
+   };
+
+   gmac1: ethernet-port@1 {
+   compatible = "cortina,gemini-ethernet-port";
+   reg = <0x6000c000 0x2000>, /* Port 1 DMA/TOE */
+ <0x6000e000 0x2000>; /* Port 1 GMAC */
+   interrupt-parent = <>;
+   interrupts = <2 IRQ_TYPE_LEVEL_HIGH>;
+   resets = < GEMINI_RESET_GMAC1>;
+   clocks = < GEMINI_CLK_GATE_GMAC1>;
+   clock-names = "PCLK";
+   phy-mode = "rgmii";
+   phy-handle = <>;
+   };
+};
-- 
2.14.3



Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 13:01 -0800, Saeed Mahameed wrote:

> which is better to grasp ?:
> 
> update_doorbell() {
> dma_wmb();
> ring->db = prod;
> }

This one is IMO the most secure one (least surprise)

Considering the time it took to discover this bug, I would really play
safe.

But obviously I do not maintain mlx4.



Re: dvb usb issues since kernel 4.9

2018-01-12 Thread Mauro Carvalho Chehab
Em Tue, 9 Jan 2018 09:48:47 -0800
Linus Torvalds  escreveu:

> On Tue, Jan 9, 2018 at 9:27 AM, Eric Dumazet  wrote:
> >
> > So yes, commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") has
> > shown up multiple times in various 'regressions'
> > simply because it could surface the problem more often.
> > But even if you revert it, you can still make the faulty
> > driver/subsystem misbehave by adding more stress to the cpu handling
> > the IRQ.  
> 
> ..but that's always true. People sometimes live on the edge - often by
> design (ie hardware has been designed/selected to be the crappiest
> possible that still work).
> 
> That doesn't change anything. A patch that takes "bad things can
> happen" to "bad things DO happen" is a bad patch.
> 
> > Maybe the answer is to tune the kernel for small latencies at the
> > price of small throughput (situation before the patch)  
> 
> Generally we always want to tune for latency. Throughput is "easy",
> but almost never interesting.
> 
> Sure, people do batch jobs. And yes, people often _benchmark_
> throughput, because it's easy to benchmark. It's much harder to
> benchmark latency, even when it's often much more important.
> 
> A prime example is the SSD benchmarks in the last few years - they
> improved _dramatically_ when people noticed that the real problem was
> latency, not the idiotic maximum big-block bandwidth numbers that have
> almost zero impact on most people.
> 
> Put another way: we already have a very strong implicit bias towards
> bandwidth just because it's easier to see and measure.
> 
> That means that we generally should strive to have a explicit bias
> towards optimizing for latency when that choice comes up.  Just to
> balance things out (and just to not take the easy way out: bandwidth
> can often be improved by adding more layers of buffering and bigger
> buffers, and that often ends up really hurting latency).
> 
> > 1) Revert the patch  
> 
> Well, we can revert it only partially - limiting it to just networking
> for example.
> 
> Just saying "act the way you used to for tasklets" already seems to
> have fixed the issue in DVB.
> 
> > 2) get rid of ksoftirqd since it adds unexpected latencies.  
> 
> We can't get rid of it entirely, since the synchronous softirq code
> can cause problems too. It's why we have that "maximum of ten
> synchronous events" in __do_softirq().
> 
> And we don't *want* to get rid of it.
> 
> We've _always_ had that small-scale "at some point we can't do it
> synchronously any more".
> 
> That is a small-scale "don't have horrible latency for _other_ things"
> protection. So it's about latency too, it's just about protecting
> latency of the rest of the system.
> 
> The problem with commit 4cd13c21b207 is that it turns the small-scale
> latency issues in softirq handling (they get larger latencies for lots
> of hardware interrupts or even from non-preemptible kernel code) into
> the _huge_ scale latency of scheduling, and does so in a racy way too.
> 
> > 3) Let applications that expect to have high throughput make sure to
> > pin their threads on cpus that are not processing IRQ.
> > (And make sure to not use irqbalance, and setup IRQ cpu affinities)  
> 
> The only people that really deal in "thoughput only" tend to be the
> HPC people, and they already do things like this.
> 
> (The other end of the spectrum is the realtime people that have
> extreme latency requirements, who do things like that for the reverse
> reason: keeping one or more CPU's reserved for the particular
> low-latency realtime job).

Ok, it took me some time - and a faster microSD - in order to be sure that
the data losses weren't due to bad storage performance, but I have now some
test results.

In summary, indeed the ksoftirq commit 4cd13c21b207 ("softirq: Let ksoftirqd
do its job") is causing data losses. In my tests, it generates at least one
continuity error every 1-5 minutes.

Either reverting it or applying Linus proposal of partially reverting
it fixes the issues. Increasing the number of URBs doesn't seem to
help.

I'm enclosing the dirty details below.

Linus/Eric,

Now that I have an environment set up, I can test whatever other alternative
would fix the UDP packet flow attack without breaking the softirq
handling code.

Regards,
Mauro

---

All tests below were done on a Raspberry Pi3 with a SanDisk Extreme U3 microSD
card with 32GB and a DVBSky S960C DVB-S2 tuner with an external power supply,
connected to a TCP/IP network via Ethernet (which uses USB on the RPi). It also
has a serial cable connected to it.

It was installed with LibreELEC 8.2.2, using tvheadend backend.

I'm recording one MPEG-TS service/"channel" composed of one audio and
one video stream. The total traffic collected by tvheadend was about
4 Mbit/s (audio+video+EPG tables). It is part of a 58 Mbit/s MPEG
transport stream, with 23 TV services/"channels" on it.

While handling this issue, I found 

Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

2018-01-12 Thread John Fastabend
On 01/12/2018 12:46 PM, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 12:26 -0800, John Fastabend wrote:
>> On 01/12/2018 12:10 PM, Eric Dumazet wrote:
>>> On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
 When calling do_tcp_sendpages() from in kernel and we know the data
 has no references from user side we can omit SKBTX_SHARED_FRAG flag.
 This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
 to omit setting SKBTX_SHARED_FRAG.

 Signed-off-by: John Fastabend 
 ---
  include/linux/socket.h |1 +
  net/ipv4/tcp.c |4 +++-
  2 files changed, 4 insertions(+), 1 deletion(-)

 diff --git a/include/linux/socket.h b/include/linux/socket.h
 index 9286a5a..add9360 100644
 --- a/include/linux/socket.h
 +++ b/include/linux/socket.h
 @@ -287,6 +287,7 @@ struct ucred {
  #define MSG_SENDPAGE_NOTLAST 0x2 /* sendpage() internal : not the 
 last page */
  #define MSG_BATCH 0x4 /* sendmmsg(): more messages coming */
  #define MSG_EOF MSG_FIN
 +#define MSG_NO_SHARED_FRAGS 0x8 /* sendpage() internal : page frags 
 are not shared */
  
  #define MSG_ZEROCOPY  0x400   /* Use user data in kernel path 
 */
  #define MSG_FASTOPEN  0x2000  /* Send data in TCP SYN */
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index 7ac583a..56c6f49 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
 *page, int offset,
get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}
 -  skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
 +
 +  if (!(flags & MSG_NO_SHARED_FRAGS))
 +  skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
  
skb->len += copy;
skb->data_len += copy;
>>>
>>> What would prevent user space from using this flag ?
>>>
>>
>> Nothing in the current patches. So user could set this, change the data,
>> and then presumably get incorrect checksums with bad timing. Seems like
>> this should be blocked so we don't allow users to try and send bad csums.
> 
> Are you sure user can set it ? How would this happen ?
> 

Ah OK I thought you might have a path that I missed. Just
rechecked and I don't see any paths where user flags get
to sendpage without being masked.

> It would be nice to check (sorry I was lazy/busy and did not check
> before asking)

No problem.

The splice path using pipe_to_sendpage() already masks the
flags before sendpage is called. The only other call sites I
see are in o2net and lowcomms; in both places the flags are hard coded
in-kernel.

So we should be safe.


>> How about masking the flags coming from userland? Alternatively could add
>> a bool to do_tcp_sendpages().
>>
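
For completeness, a minimal sketch of the defensive masking discussed
above, stripping kernel-internal sendpage flags on a user-facing entry
point before do_tcp_sendpages() ever sees them (the mask macro and the
wrapper name are hypothetical, not existing definitions):

/* Hypothetical mask of kernel-internal sendpage flags. */
#define MSG_SENDPAGE_INTERNAL_FLAGS \
	(MSG_SENDPAGE_NOTLAST | MSG_NO_SHARED_FRAGS)

static ssize_t tcp_sendpage_from_user(struct sock *sk, struct page *page,
				      int offset, size_t size, int flags)
{
	/* Caller is assumed to hold the socket lock, as for tcp_sendpage(). */
	flags &= ~MSG_SENDPAGE_INTERNAL_FLAGS;	/* user may not set these */
	return do_tcp_sendpages(sk, page, offset, size, flags);
}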



Re: NOARP devices and NOARP arp_cache entires

2018-01-12 Thread Jim Westfall
Jim Westfall  wrote [01.11.18]:
> Hi
> 
> I'm seeing some weird behavior related to NOARP devices and NOARP 
> entries in the arp cache.
> 
> I have a couple gre tunnels between a linux box and a upstream router that 
> send/recv a large amount of packets with unique ips.  On the order of 10k+ 
> unique ips per second seen by the linux box.
> 
> Each one of the ips ends up getting added to the arp cache as
> 
>  dev tun1234 lladdr 0.0.0.0 NOARP
> 
> This of course makes the arp cache grow extremely fast and overflow.  
> While I can tweak gc_thresh1/2/3 to make arp cache size huge, it doesn't 
> seem like the right answer as the kernel is spinning its wheels having to
> adding/expunging entries for the high rate of unique ips.
> 
> I'm unclear why the kernel is even tracking them in the arp cache.  If 
> routing table says to route the packet out a NOARP interface then there is 
> no arp, why involve the arp cache at all?  
> 
> You can see the behavior with the following
> 
> [root@jwestfall:~]# uname -a
> Linux jwestfall.jwestfall.net 4.14.10_1 #1 SMP PREEMPT Sun Dec 31 20:23:29 
> UTC 2017 x86_64 GNU/Linux
> 
> [root@jwestfall:~]# ip neigh show nud noarp 
> 10.0.0.172 dev lo lladdr 00:00:00:00:00:00 NOARP
> 10.70.50.5 dev tun0 lladdr 08 NOARP
> 127.0.0.1 dev lo lladdr 00:00:00:00:00:00 NOARP
> 
> Setup a bogus gre tunnel, the remote ip doesn't matter
> [root@jwestfall:~]# ip tunnel add tun1234 mode gre local 10.0.0.172 remote 
> 10.0.0.156 dev enp4s0
> [root@jwestfall:~]# ip link set up dev tun1234
> 
> Route a bogus network to the tunnel
> [root@jwestfall:~]# ip route add 192.168.111.0/24 dev tun1234
> 
> Ping ips on the bogus network
> [root@jwestfall:~]# nmap -sP 192.168.111.0/24
> 
> Starting Nmap 7.60 ( https://nmap.org ) at 2018-01-11 12:06 PST
> ...
> 
> [root@jwestfall:~]# ip neigh show nud noarp 
> 192.168.111.18 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.4 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.28 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.17 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.14 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.34 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.3 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.20 dev tun1234 lladdr 0.0.0.0 NOARP
> 10.0.0.172 dev lo lladdr 00:00:00:00:00:00 NOARP
> 192.168.111.6 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.27 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.13 dev tun1234 lladdr 0.0.0.0 NOARP
> 192.168.111.33 dev tun1234 lladdr 0.0.0.0 NOARP
> ...
> 
> Also somewhat interesting is that on older kernels (3.2 time range) these 
> NOARP entries didn't get added for ipv4, but they did for ipv6 if you 
> pushed ipv6 through the ipv4 tunnel.
> 
> 2804:14c:f281:a1d8:61a2:a30:989f:3eb1 dev tun1 lladdr 0.0.0.0 NOARP
> 2607:8400:2122:4:e9f9:dbb8:2d44:75d1 dev tun2 lladdr 0.0.0.0 NOARP
> 
> Thanks
> Jim Westfall
> 
> 

Digging into this a bit, in older kernels there was the following:

static struct neighbour *ipv4_neigh_lookup(const struct dst_entry *dst, const 
void *daddr)
{
static const __be32 inaddr_any = 0;
struct net_device *dev = dst->dev;
const __be32 *pkey = daddr;
struct neighbour *n;

if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
pkey = &inaddr_any;

which was forcing the hash key to be 0.0.0.0 for tunnels.  This was removed as 
part of a263b3093641fb1ec377582c90986a7fd0625184 which was part of a larger set 
that "Disconnect neigh from dst_entry"

Would there be any aversion to me submitting a patch to mimic this older 
behavior?

Thanks
jim


Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-12 Thread Saeed Mahameed


On 01/12/2018 12:16 PM, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 11:53 -0800, Saeed Mahameed wrote:
>>
>> On 01/12/2018 08:46 AM, Eric Dumazet wrote:
>>> On Fri, 2018-01-12 at 09:32 -0700, Jason Gunthorpe wrote:
 On Fri, Jan 12, 2018 at 11:42:22AM +0800, Jianchao Wang wrote:
> Customer reported memory corruption issue on previous mlx4_en driver
> version where the order-3 pages and multiple page reference counting
> were still used.
>
> Finally, find out one of the root causes is that the HW may see stale
> rx_descs due to prod db updating reaches HW before rx_desc. Especially
> when cross order-3 pages boundary and update a new one, HW may write
> on the pages which may has been freed and allocated again by others.
>
> To fix it, add a wmb between rx_desc and prod db updating to ensure
> the order. Even thougth order-0 and page recycling has been introduced,
> the disorder between rx_desc and prod db still could lead to corruption
> on different inbound packages.
>
> Signed-off-by: Jianchao Wang 
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 85e28ef..eefa82c 100644
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -555,7 +555,7 @@ static void mlx4_en_refill_rx_buffers(struct 
> mlx4_en_priv *priv,
>   break;
>   ring->prod++;
>   } while (likely(--missing));
> -
> + wmb(); /* ensure rx_desc updating reaches HW before prod db updating */
>   mlx4_en_update_rx_prod_db(ring);
>   }
>   

 Does this need to be dma_wmb(), and should it be in
 mlx4_en_update_rx_prod_db ?

>>>
>>> +1 on dma_wmb()
>>>
>>> On what architecture bug was observed ?
>>>
>>> In any case, the barrier should be moved in mlx4_en_update_rx_prod_db()
>>> I think.
>>>
>>
>> +1 on dma_wmb(), thanks Eric for reviewing this.
>>
>> The barrier is also needed elsewhere in the code as well, but I wouldn't 
>> put it in mlx4_en_update_rx_prod_db(), just to allow batch filling of 
>> all rx rings and then hit the barrier only once. As a rule of thumb, mem 
>> barriers are the ring API caller responsibility.
>>
>> e.g. in mlx4_en_activate_rx_rings():
>> between mlx4_en_fill_rx_buffers(priv); and the loop that updates rx prod 
>> for all rings ring, the dma_wmb is needed, see below.
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
>> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> index b4d144e67514..65541721a240 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> @@ -370,6 +370,8 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
>>  if (err)
>>  goto err_buffers;
>>
>> +   dma_wmb();
>> +
>>  for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
>>  ring = priv->rx_ring[ring_ind];
> 
> 
> Why bother, considering dma_wmb() is a nop on x86,
> simply a compiler barrier.
> 
> Putting it in mlx4_en_update_rx_prod_db() and have no obscure bugs...
> 

Simply putting a memory barrier at the top or the bottom of a function
means nothing unless you are looking at the whole picture, i.e. at all the
callers of that function, to understand why it is there.

which is better to grasp ?:

update_doorbell() {
dma_wmb();
ring->db = prod;
}

or

fill buffers();
dma_wmb();
update_doorbell();

I simply like the 2nd one since with one look you can understand what this 
dma_wmb is protecting.

Anyway this is truly a nit, Tariq can decide what is better for him :).

> Also we might change the existing wmb() in mlx4_en_process_rx_cq() by
> dma_wmb(), that would help performance a bit.
> 
> 

+1, Tariq will you handle ? 
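
To make the two styles being debated concrete, a minimal sketch (the ring
layout and helper names are illustrative, not the real mlx4 structures):

struct ring {
	u32 prod;
	__be32 *db;	/* doorbell record in DMA-coherent memory */
};

/* Style 1: the barrier lives inside the doorbell helper, so no caller
 * can forget it (the "least surprise" variant).
 */
static void ring_update_doorbell(struct ring *ring)
{
	dma_wmb();	/* descriptor writes visible before the new prod */
	*ring->db = cpu_to_be32(ring->prod);
}

/* Style 2: the barrier is the caller's responsibility, so one barrier
 * can cover a whole batch of rings before their doorbells are written.
 */
static void rings_refill_and_ring_doorbells(struct ring **rings, int n)
{
	int i;

	for (i = 0; i < n; i++)
		refill_rx_buffers(rings[i]);	/* hypothetical helper */

	dma_wmb();	/* one barrier for all the descriptor writes */

	for (i = 0; i < n; i++)
		*rings[i]->db = cpu_to_be32(rings[i]->prod);
}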


Re: [PATCH] phy: realtek: use new helpers for paged register access

2018-01-12 Thread Andrew Lunn
On Fri, Jan 12, 2018 at 09:30:08PM +0100, Heiner Kallweit wrote:
> Make use of the new helpers for paged register access.
> 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/phy/realtek.c | 59 
> +++
>  1 file changed, 14 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
> index 7c1bf688d..887bad07a 100644
> --- a/drivers/net/phy/realtek.c
> +++ b/drivers/net/phy/realtek.c
> @@ -41,37 +41,14 @@ MODULE_DESCRIPTION("Realtek PHY driver");
>  MODULE_AUTHOR("Johnson Leung");
>  MODULE_LICENSE("GPL");
>  
> -static int rtl8211x_page_read(struct phy_device *phydev, u16 page, u16 
> address)
> +static int rtl821x_read_page(struct phy_device *phydev)
>  {
> - int ret;
> -
> - ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
> - if (ret)
> - return ret;
> -
> - ret = phy_read(phydev, address);
> -
> - /* restore to default page 0 */
> - phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
> -
> - return ret;
> + return phy_read(phydev, RTL821x_PAGE_SELECT);

Hi Heiner

It think this is wrong. You need to use __phy_read(). phy_read() does
an mdiobus_read(), which takes the bus->mdio_lock. However,

int phy_save_page(struct phy_device *phydev)
{
mutex_lock(&phydev->mdio.bus->mdio_lock);
return __phy_read_page(phydev);
}

This also takes the same lock. So this should deadlock.

Try turning on CONFIG_PROVE_LOCKING, it will detect problems like
this.

Andrew


Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 12:26 -0800, John Fastabend wrote:
> On 01/12/2018 12:10 PM, Eric Dumazet wrote:
> > On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
> > > When calling do_tcp_sendpages() from in kernel and we know the data
> > > has no references from user side we can omit SKBTX_SHARED_FRAG flag.
> > > This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
> > > to omit setting SKBTX_SHARED_FRAG.
> > > 
> > > Signed-off-by: John Fastabend 
> > > ---
> > >  include/linux/socket.h |1 +
> > >  net/ipv4/tcp.c |4 +++-
> > >  2 files changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > index 9286a5a..add9360 100644
> > > --- a/include/linux/socket.h
> > > +++ b/include/linux/socket.h
> > > @@ -287,6 +287,7 @@ struct ucred {
> > >  #define MSG_SENDPAGE_NOTLAST 0x2 /* sendpage() internal : not the 
> > > last page */
> > >  #define MSG_BATCH0x4 /* sendmmsg(): more messages coming */
> > >  #define MSG_EOF MSG_FIN
> > > +#define MSG_NO_SHARED_FRAGS 0x8 /* sendpage() internal : page frags 
> > > are not shared */
> > >  
> > >  #define MSG_ZEROCOPY 0x400   /* Use user data in kernel path 
> > > */
> > >  #define MSG_FASTOPEN 0x2000  /* Send data in TCP SYN */
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 7ac583a..56c6f49 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
> > > *page, int offset,
> > >   get_page(page);
> > >   skb_fill_page_desc(skb, i, page, offset, copy);
> > >   }
> > > - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> > > +
> > > + if (!(flags & MSG_NO_SHARED_FRAGS))
> > > + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> > >  
> > >   skb->len += copy;
> > >   skb->data_len += copy;
> > 
> > What would prevent user space from using this flag ?
> > 
> 
> Nothing in the current patches. So user could set this, change the data,
> and then presumably get incorrect checksums with bad timing. Seems like
> this should be blocked so we don't allow users to try and send bad csums.

Are you sure user can set it ? How would this happen ?

It would be nice to check (sorry I was lazy/busy and did not check
before asking)

> How about masking the flags coming from userland? Alternatively could add
> a bool to do_tcp_sendpages().
> 


Re: [PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread Jiri Benc
On Fri, 12 Jan 2018 12:29:22 -0800, William Tu wrote:
> This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.
> 
> The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
> as a nested attribute to support all ERSPAN v1 and v2's fields.
> The current attr is a be32 supporting only one field.  Thus, this
> patch reverts it and later patch will redo it using nested attr.
> 
> Signed-off-by: William Tu 
> Cc: Jiri Benc 
> Cc: Pravin Shelar 

Acked-by: Jiri Benc 


Re: [PATCH 32/32] aio: implement io_pgetevents

2018-01-12 Thread Jeff Moyer
Christoph Hellwig  writes:

> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
>
>   sigset_t origmask;
>
>   pthread_sigmask(SIG_SETMASK, , );
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, , NULL);
>
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.

pselect, as an example, crams the sigmask and size together.  Why not
just do that?  libaio can take care of setting that up.
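
(For reference, the pselect6-style convention is roughly the following:
user space passes a pointer to a two-field pack as the last argument, so
the syscall stays at six arguments.  The struct and field names below are
illustrative, not the kernel's actual ones.)

struct sigset_argpack {
	const sigset_t __user *sigmask;
	size_t sigsetsize;
};

/* The syscall would then copy the pack in, instead of taking the sigmask
 * pointer and its size as two separate arguments, e.g.:
 *
 *	struct sigset_argpack ksig;
 *
 *	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
 *		return -EFAULT;
 */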

Cheers,
Jeff


>
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  fs/aio.c   | 96 
> ++
>  include/linux/compat.h |  6 +++
>  include/linux/syscalls.h   |  6 +++
>  include/uapi/asm-generic/unistd.h  |  4 +-
>  kernel/sys_ni.c|  2 +
>  7 files changed, 105 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..5997c3e9ac3e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382  i386pkey_free   sys_pkey_free
>  383  i386statx   sys_statx
>  384  i386arch_prctl  sys_arch_prctl  
> compat_sys_arch_prctl
> +385  i386io_pgetevents   sys_io_pgetevents   
> compat_sys_io_pgetevents
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..e995cd2b4e65 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330  common  pkey_alloc  sys_pkey_alloc
>  331  common  pkey_free   sys_pkey_free
>  332  common  statx   sys_statx
> +333  common  io_pgetevents   sys_io_pgetevents
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/aio.c b/fs/aio.c
> index cae90ac6e4a3..57a4e8d89f78 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1299,10 +1299,6 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>   wait_event_interruptible_hrtimeout(ctx->wait,
>   aio_read_events(ctx, min_nr, nr, event, &ret),
>   until);
> -
> - if (!ret && signal_pending(current))
> - ret = -EINTR;
> -
>   return ret;
>  }
>  
> @@ -1978,13 +1974,54 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec __user *, timeout)
>  {
>   struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
>  
> - if (timeout) {
> - if (unlikely(get_timespec64(&ts, timeout)))
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (!ret && signal_pending(current))
> + ret = -EINTR;
> + return ret;
> +}
> +
> +SYSCALL_DEFINE6(io_pgetevents,
> + aio_context_t, ctx_id,
> + long, min_nr,
> + long, nr,
> + struct io_event __user *, events,
> + struct timespec __user *, timeout,
> + const sigset_t __user *, sigmask)
> +{
> + sigset_t ksigmask, sigsaved;
> + struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
> +
> + if (sigmask) {
> + if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
>   return -EFAULT;
> + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
>   }
>  
> - return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (signal_pending(current)) {
> + if (sigmask) {
> + current->saved_sigmask = sigsaved;
> + set_restore_sigmask();
> + }
> +
> + if (!ret)
> + ret = -ERESTARTNOHAND;
> + } else {
> + if (sigmask)
> + sigprocmask(SIG_SETMASK, &sigsaved, NULL);
> + }
> +
> + return ret;
>  }
>  
>  #ifdef CONFIG_COMPAT
> @@ -1995,13 +2032,52 @@ COMPAT_SYSCALL_DEFINE5(io_getevents, 
> compat_aio_context_t, ctx_id,
>

[PATCH] phy: realtek: use new helpers for paged register access

2018-01-12 Thread Heiner Kallweit
Make use of the new helpers for paged register access.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/realtek.c | 59 +++
 1 file changed, 14 insertions(+), 45 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 7c1bf688d..887bad07a 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -41,37 +41,14 @@ MODULE_DESCRIPTION("Realtek PHY driver");
 MODULE_AUTHOR("Johnson Leung");
 MODULE_LICENSE("GPL");
 
-static int rtl8211x_page_read(struct phy_device *phydev, u16 page, u16 address)
+static int rtl821x_read_page(struct phy_device *phydev)
 {
-   int ret;
-
-   ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
-   if (ret)
-   return ret;
-
-   ret = phy_read(phydev, address);
-
-   /* restore to default page 0 */
-   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
-
-   return ret;
+   return phy_read(phydev, RTL821x_PAGE_SELECT);
 }
 
-static int rtl8211x_page_write(struct phy_device *phydev, u16 page,
-  u16 address, u16 val)
+static int rtl821x_write_page(struct phy_device *phydev, int page)
 {
-   int ret;
-
-   ret = phy_write(phydev, RTL821x_PAGE_SELECT, page);
-   if (ret)
-   return ret;
-
-   ret = phy_write(phydev, address, val);
-
-   /* restore to default page 0 */
-   phy_write(phydev, RTL821x_PAGE_SELECT, 0x0);
-
-   return ret;
+   return phy_write(phydev, RTL821x_PAGE_SELECT, page);
 }
 
 static int rtl8201_ack_interrupt(struct phy_device *phydev)
@@ -96,7 +73,7 @@ static int rtl8211f_ack_interrupt(struct phy_device *phydev)
 {
int err;
 
-   err = rtl8211x_page_read(phydev, 0xa43, RTL8211F_INSR);
+   err = phy_read_paged(phydev, 0xa43, RTL8211F_INSR);
 
return (err < 0) ? err : 0;
 }
@@ -110,7 +87,7 @@ static int rtl8201_config_intr(struct phy_device *phydev)
else
val = 0;
 
-   return rtl8211x_page_write(phydev, 0x7, RTL8201F_IER, val);
+   return phy_write_paged(phydev, 0x7, RTL8201F_IER, val);
 }
 
 static int rtl8211b_config_intr(struct phy_device *phydev)
@@ -148,36 +125,24 @@ static int rtl8211f_config_intr(struct phy_device *phydev)
else
val = 0;
 
-   return rtl8211x_page_write(phydev, 0xa42, RTL821x_INER, val);
+   return phy_write_paged(phydev, 0xa42, RTL821x_INER, val);
 }
 
 static int rtl8211f_config_init(struct phy_device *phydev)
 {
int ret;
-   u16 val;
+   u16 val = 0;
 
ret = genphy_config_init(phydev);
if (ret < 0)
return ret;
 
-   ret = rtl8211x_page_read(phydev, 0xd08, 0x11);
-   if (ret < 0)
-   return ret;
-
-   val = ret & 0xffff;
-
/* enable TX-delay for rgmii-id and rgmii-txid, otherwise disable it */
if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID ||
phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID)
-   val |= RTL8211F_TX_DELAY;
-   else
-   val &= ~RTL8211F_TX_DELAY;
-
-   ret = rtl8211x_page_write(phydev, 0xd08, 0x11, val);
-   if (ret)
-   return ret;
+   val = RTL8211F_TX_DELAY;
 
-   return 0;
+   return phy_modify_paged(phydev, 0xd08, 0x11, RTL8211F_TX_DELAY, val);
 }
 
 static struct phy_driver realtek_drvs[] = {
@@ -197,6 +162,8 @@ static struct phy_driver realtek_drvs[] = {
.config_intr= _config_intr,
.suspend= genphy_suspend,
.resume = genphy_resume,
+   .read_page  = rtl821x_read_page,
+   .write_page = rtl821x_write_page,
}, {
.phy_id = 0x001cc912,
.name   = "RTL8211B Gigabit Ethernet",
@@ -236,6 +203,8 @@ static struct phy_driver realtek_drvs[] = {
.config_intr= _config_intr,
.suspend= genphy_suspend,
.resume = genphy_resume,
+   .read_page  = rtl821x_read_page,
+   .write_page = rtl821x_write_page,
},
 };
 
-- 
2.15.1



[PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread William Tu
This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.

The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
as a nested attribute to support all ERSPAN v1 and v2's fields.
The current attr is a be32 supporting only one field.  Thus, this
patch reverts it and later patch will redo it using nested attr.

Signed-off-by: William Tu 
Cc: Jiri Benc 
Cc: Pravin Shelar 
---
 include/uapi/linux/openvswitch.h |  1 -
 net/openvswitch/flow_netlink.c   | 51 +---
 2 files changed, 1 insertion(+), 51 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 4265d7f9e1f2..dcfab5e3b55c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -363,7 +363,6 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
address. */
OVS_TUNNEL_KEY_ATTR_PAD,
-   OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 624ea74353dd..f143908b651d 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,7 +49,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "flow_netlink.h"
 
@@ -334,8 +333,7 @@ size_t ovs_tun_key_attr_size(void)
 * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
 */
+ nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */
-   + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_DST */
-   + nla_total_size(4);   /* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS */
+   + nla_total_size(2);   /* OVS_TUNNEL_KEY_ATTR_TP_DST */
 }
 
 static size_t ovs_nsh_key_attr_size(void)
@@ -402,7 +400,6 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
.next = ovs_vxlan_ext_key_lens 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_SRC]  = { .len = sizeof(struct in6_addr) 
},
[OVS_TUNNEL_KEY_ATTR_IPV6_DST]  = { .len = sizeof(struct in6_addr) 
},
-   [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
 };
 
 static const struct ovs_len_tbl
@@ -634,33 +631,6 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr 
*attr,
return 0;
 }
 
-static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
- struct sw_flow_match *match, bool is_mask,
- bool log)
-{
-   unsigned long opt_key_offset;
-   struct erspan_metadata opts;
-
-   BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
-
-   memset(&opts, 0, sizeof(opts));
-   opts.index = nla_get_be32(attr);
-
-   /* Index has only 20-bit */
-   if (ntohl(opts.index) & ~INDEX_MASK) {
-   OVS_NLERR(log, "ERSPAN index number %x too large.",
- ntohl(opts.index));
-   return -EINVAL;
-   }
-
-   SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), is_mask);
-   opt_key_offset = TUN_METADATA_OFFSET(sizeof(opts));
-   SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, &opts, sizeof(opts),
- is_mask);
-
-   return 0;
-}
-
 static int ip_tun_from_nlattr(const struct nlattr *attr,
  struct sw_flow_match *match, bool is_mask,
  bool log)
@@ -768,19 +738,6 @@ static int ip_tun_from_nlattr(const struct nlattr *attr,
break;
case OVS_TUNNEL_KEY_ATTR_PAD:
break;
-   case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
-   if (opts_type) {
-   OVS_NLERR(log, "Multiple metadata blocks 
provided");
-   return -EINVAL;
-   }
-
-   err = erspan_tun_opt_from_nlattr(a, match, is_mask, 
log);
-   if (err)
-   return err;
-
-   tun_flags |= TUNNEL_ERSPAN_OPT;
-   opts_type = type;
-   break;
default:
OVS_NLERR(log, "Unknown IP tunnel attribute %d",
  type);
@@ -905,10 +862,6 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
 vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
return -EMSGSIZE;
-   else if (output->tun_flags & TUNNEL_ERSPAN_OPT &&
-nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
- ((struct erspan_metadata 
*)tun_opts)->index))
- 

Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

2018-01-12 Thread John Fastabend
On 01/12/2018 12:10 PM, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
>> When calling do_tcp_sendpages() from in kernel and we know the data
>> has no references from user side we can omit SKBTX_SHARED_FRAG flag.
>> This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
>> to omit setting SKBTX_SHARED_FRAG.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  include/linux/socket.h |1 +
>>  net/ipv4/tcp.c |4 +++-
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 9286a5a..add9360 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -287,6 +287,7 @@ struct ucred {
>>  #define MSG_SENDPAGE_NOTLAST 0x2 /* sendpage() internal : not the last 
>> page */
>>  #define MSG_BATCH   0x4 /* sendmmsg(): more messages coming */
>>  #define MSG_EOF MSG_FIN
>> +#define MSG_NO_SHARED_FRAGS 0x8 /* sendpage() internal : page frags are 
>> not shared */
>>  
>>  #define MSG_ZEROCOPY0x400   /* Use user data in kernel path 
>> */
>>  #define MSG_FASTOPEN0x2000  /* Send data in TCP SYN */
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 7ac583a..56c6f49 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
>> *page, int offset,
>>  get_page(page);
>>  skb_fill_page_desc(skb, i, page, offset, copy);
>>  }
>> -skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>> +
>> +if (!(flags & MSG_NO_SHARED_FRAGS))
>> +skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>>  
>>  skb->len += copy;
>>  skb->data_len += copy;
> 
> What would prevent user space from using this flag ?
> 

Nothing in the current patches. So user could set this, change the data,
and then presumably get incorrect checksums with bad timing. Seems like
this should be blocked so we don't allow users to try and send bad csums.

How about masking the flags coming from userland? Alternatively could add
a bool to do_tcp_sendpages().

.John




[PATCH net-next 1/2] phy: add helpers for setting/clearing bits in PHY registers

2018-01-12 Thread Heiner Kallweit
Based on the recent introduction of phy_modify add helpers for setting
and clearing bits in PHY registers.

Signed-off-by: Heiner Kallweit 
---
 include/linux/phy.h | 49 +
 1 file changed, 49 insertions(+)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 47715a311..5a0c3e53e 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -764,6 +764,55 @@ static inline int __phy_write(struct phy_device *phydev, 
u32 regnum, u16 val)
 int __phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set);
 int phy_modify(struct phy_device *phydev, u32 regnum, u16 mask, u16 set);
 
+/**
+ * __phy_set_bits - Convenience function for setting bits in a PHY register
+ * @phydev: the phy_device struct
+ * @regnum: register number to write
+ * @val: bits to set
+ *
+ * The caller must have taken the MDIO bus lock.
+ */
+static inline int __phy_set_bits(struct phy_device *phydev, u32 regnum, u16 
val)
+{
+   return __phy_modify(phydev, regnum, 0, val);
+}
+
+/**
+ * __phy_clear_bits - Convenience function for clearing bits in a PHY register
+ * @phydev: the phy_device struct
+ * @regnum: register number to write
+ * @val: bits to clear
+ *
+ * The caller must have taken the MDIO bus lock.
+ */
+static inline int __phy_clear_bits(struct phy_device *phydev, u32 regnum,
+  u16 val)
+{
+   return __phy_modify(phydev, regnum, val, 0);
+}
+
+/**
+ * phy_set_bits - Convenience function for setting bits in a PHY register
+ * @phydev: the phy_device struct
+ * @regnum: register number to write
+ * @val: bits to set
+ */
+static inline int phy_set_bits(struct phy_device *phydev, u32 regnum, u16 val)
+{
+   return phy_modify(phydev, regnum, 0, val);
+}
+
+/**
+ * phy_clear_bits - Convenience function for clearing bits in a PHY register
+ * @phydev: the phy_device struct
+ * @regnum: register number to write
+ * @val: bits to clear
+ */
+static inline int phy_clear_bits(struct phy_device *phydev, u32 regnum, u16 
val)
+{
+   return phy_modify(phydev, regnum, val, 0);
+}
+
 /**
  * phy_interrupt_is_valid - Convenience function for testing a given PHY irq
  * @phydev: the phy_device struct
-- 
2.15.1




[PATCH net-next 2/2] phy: use new helpers phy_set_bits/phy_clear_bits in phylib

2018-01-12 Thread Heiner Kallweit
Use new helpers phy_set_bits / phy_clear_bits in phylib.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 6bd11a070..b13eed21c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1660,13 +1660,13 @@ EXPORT_SYMBOL(genphy_config_init);
 
 int genphy_suspend(struct phy_device *phydev)
 {
-   return phy_modify(phydev, MII_BMCR, 0, BMCR_PDOWN);
+   return phy_set_bits(phydev, MII_BMCR, BMCR_PDOWN);
 }
 EXPORT_SYMBOL(genphy_suspend);
 
 int genphy_resume(struct phy_device *phydev)
 {
-   return phy_modify(phydev, MII_BMCR, BMCR_PDOWN, 0);
+   return phy_clear_bits(phydev, MII_BMCR, BMCR_PDOWN);
 }
 EXPORT_SYMBOL(genphy_resume);
 
-- 
2.15.1




Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 11:53 -0800, Saeed Mahameed wrote:
> 
> On 01/12/2018 08:46 AM, Eric Dumazet wrote:
> > On Fri, 2018-01-12 at 09:32 -0700, Jason Gunthorpe wrote:
> > > On Fri, Jan 12, 2018 at 11:42:22AM +0800, Jianchao Wang wrote:
> > > > Customer reported memory corruption issue on previous mlx4_en driver
> > > > version where the order-3 pages and multiple page reference counting
> > > > were still used.
> > > > 
> > > > Finally, find out one of the root causes is that the HW may see stale
> > > > rx_descs due to prod db updating reaches HW before rx_desc. Especially
> > > > when cross order-3 pages boundary and update a new one, HW may write
> > > > on the pages which may has been freed and allocated again by others.
> > > > 
> > > > To fix it, add a wmb between rx_desc and prod db updating to ensure
> > > > the order. Even thougth order-0 and page recycling has been introduced,
> > > > the disorder between rx_desc and prod db still could lead to corruption
> > > > on different inbound packages.
> > > > 
> > > > Signed-off-by: Jianchao Wang 
> > > >   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> > > > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > index 85e28ef..eefa82c 100644
> > > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > @@ -555,7 +555,7 @@ static void mlx4_en_refill_rx_buffers(struct 
> > > > mlx4_en_priv *priv,
> > > > break;
> > > > ring->prod++;
> > > > } while (likely(--missing));
> > > > -
> > > > +   wmb(); /* ensure rx_desc updating reaches HW before prod db 
> > > > updating */
> > > > mlx4_en_update_rx_prod_db(ring);
> > > >   }
> > > >   
> > > 
> > > Does this need to be dma_wmb(), and should it be in
> > > mlx4_en_update_rx_prod_db ?
> > > 
> > 
> > +1 on dma_wmb()
> > 
> > On what architecture bug was observed ?
> > 
> > In any case, the barrier should be moved in mlx4_en_update_rx_prod_db()
> > I think.
> > 
> 
> +1 on dma_wmb(), thanks Eric for reviewing this.
> 
> The barrier is also needed elsewhere in the code as well, but I wouldn't 
> put it in mlx4_en_update_rx_prod_db(), just to allow batch filling of 
> all rx rings and then hit the barrier only once. As a rule of thumb, mem 
> barriers are the ring API caller responsibility.
> 
> e.g. in mlx4_en_activate_rx_rings():
> between mlx4_en_fill_rx_buffers(priv); and the loop that updates rx prod 
> for all rings ring, the dma_wmb is needed, see below.
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index b4d144e67514..65541721a240 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -370,6 +370,8 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
>  if (err)
>  goto err_buffers;
> 
> +   dma_wmb();
> +
>  for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
>  ring = priv->rx_ring[ring_ind];


Why bother, considering dma_wmb() is a nop on x86,
simply a compiler barrier.

Putting it in mlx4_en_update_rx_prod_db() and have no obscure bugs...

Also we might change the existing wmb() in mlx4_en_process_rx_cq() by
dma_wmb(), that would help performance a bit.




[PATCH net-next 0/2] phy: add helpers for setting/clearing bits in PHY registers

2018-01-12 Thread Heiner Kallweit
Based on the recent introduction of phy_modify add helpers for setting
and clearing bits in PHY registers. First user is phylib.

Heiner Kallweit (2):
  phy: add helpers for setting/clearing bits in PHY registers
  phy: use new helpers phy_set_bits/phy_clear_bits in phylib

 drivers/net/phy/phy_device.c |  4 ++--
 include/linux/phy.h  | 49 
 2 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.15.1



Re: [PATCH net] sctp: removed unused var from sctp_make_auth

2018-01-12 Thread Neil Horman
On Thu, Jan 11, 2018 at 02:22:07PM -0200, Marcelo Ricardo Leitner wrote:
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/sm_make_chunk.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 
> 9bf575f2e8ed0888e0219a872e84018ada5064e0..f08531de5682256064ce35e3d44200caa71c3db8
>  100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -1273,7 +1273,6 @@ struct sctp_chunk *sctp_make_auth(const struct 
> sctp_association *asoc)
>   struct sctp_authhdr auth_hdr;
>   struct sctp_hmac *hmac_desc;
>   struct sctp_chunk *retval;
> - __u8 *hmac;
>  
>   /* Get the first hmac that the peer told us to use */
>   hmac_desc = sctp_auth_asoc_get_hmac(asoc);
> @@ -1292,7 +1291,7 @@ struct sctp_chunk *sctp_make_auth(const struct 
> sctp_association *asoc)
>   retval->subh.auth_hdr = sctp_addto_chunk(retval, sizeof(auth_hdr),
>   &auth_hdr);
>  
> - hmac = skb_put_zero(retval->skb, hmac_desc->hmac_len);
> + skb_put_zero(retval->skb, hmac_desc->hmac_len);
>  
>   /* Adjust the chunk header to include the empty MAC */
>   retval->chunk_hdr->length =
> -- 
> 2.14.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman 


Re: [PATCH net] sctp: avoid compiler warning on implicit fallthru

2018-01-12 Thread Neil Horman
On Thu, Jan 11, 2018 at 02:22:06PM -0200, Marcelo Ricardo Leitner wrote:
> These fall-through are expected.
> 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/ipv6.c | 1 +
>  net/sctp/outqueue.c | 4 ++--
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 
> 3b18085e3b10253f3f81be7a6747b50ef9357db2..5d4c15bf66d26219415596598a1f72d29b63a798
>  100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -826,6 +826,7 @@ static int sctp_inet6_af_supported(sa_family_t family, 
> struct sctp_sock *sp)
>   case AF_INET:
>   if (!__ipv6_only_sock(sctp_opt2sk(sp)))
>   return 1;
> + /* fallthru */
>   default:
>   return 0;
>   }
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index 
> 7d67feeeffc1e758ae4be4ef1ddaea23276d1f5e..c4ec99b2015002b273071e6fb1ec3c59c9f61154
>  100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -918,9 +918,9 @@ static void sctp_outq_flush(struct sctp_outq *q, int 
> rtx_timeout, gfp_t gfp)
>   break;
>  
>   case SCTP_CID_ABORT:
> - if (sctp_test_T_bit(chunk)) {
> + if (sctp_test_T_bit(chunk))
>   packet->vtag = asoc->c.my_vtag;
> - }
> + /* fallthru */
>   /* The following chunks are "response" chunks, i.e.
>* they are generated in response to something we
>* received.  If we are sending these, then we can
> -- 
> 2.14.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman 


Re: [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

2018-01-12 Thread Eric Dumazet
On Fri, 2018-01-12 at 10:10 -0800, John Fastabend wrote:
> When calling do_tcp_sendpages() from in kernel and we know the data
> has no references from user side we can omit SKBTX_SHARED_FRAG flag.
> This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
> to omit setting SKBTX_SHARED_FRAG.
> 
> Signed-off-by: John Fastabend 
> ---
>  include/linux/socket.h |1 +
>  net/ipv4/tcp.c |4 +++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 9286a5a..add9360 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -287,6 +287,7 @@ struct ucred {
>  #define MSG_SENDPAGE_NOTLAST 0x2 /* sendpage() internal : not the last 
> page */
>  #define MSG_BATCH0x4 /* sendmmsg(): more messages coming */
>  #define MSG_EOF MSG_FIN
> +#define MSG_NO_SHARED_FRAGS 0x8 /* sendpage() internal : page frags are 
> not shared */
>  
>  #define MSG_ZEROCOPY 0x400   /* Use user data in kernel path */
>  #define MSG_FASTOPEN 0x2000  /* Send data in TCP SYN */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 7ac583a..56c6f49 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
> *page, int offset,
>   get_page(page);
>   skb_fill_page_desc(skb, i, page, offset, copy);
>   }
> - skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> +
> + if (!(flags & MSG_NO_SHARED_FRAGS))
> + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>  
>   skb->len += copy;
>   skb->data_len += copy;

What would prevent user space from using this flag ?



[PATCH net-next] ipv6: Fix build with gcc-4.4.5

2018-01-12 Thread Ido Schimmel
Emil reported the following compiler errors:

net/ipv6/route.c: In function `rt6_sync_up`:
net/ipv6/route.c:3586: error: unknown field `nh_flags` specified in initializer
net/ipv6/route.c:3586: warning: missing braces around initializer
net/ipv6/route.c:3586: warning: (near initialization for `arg.`)
net/ipv6/route.c: In function `rt6_sync_down_dev`:
net/ipv6/route.c:3695: error: unknown field `event` specified in initializer
net/ipv6/route.c:3695: warning: missing braces around initializer
net/ipv6/route.c:3695: warning: (near initialization for `arg.`)

Problem is with the named initializers for the anonymous union members.
Fix this by adding curly braces around the initialization.
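
For reference, a minimal sketch of the construct involved (illustrative struct
and field names, not the actual arg_netdev_event definition): gcc-4.4 rejects a
designator that names a member of an anonymous union directly, but accepts it
once the union initializer gets its own braces.

struct demo_arg {
	int dev;
	union {
		unsigned int nh_flags;
		unsigned long event;
	};
};

/* Rejected by gcc-4.4: .nh_flags designates a member of the anonymous union. */
static struct demo_arg demo_bad = {
	.dev = 1,
	.nh_flags = 2,
};

/* Accepted by old and new compilers: the anonymous union gets its own braces. */
static struct demo_arg demo_good = {
	.dev = 1,
	{
		.nh_flags = 2,
	},
};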

Fixes: 4c981e28d373 ("ipv6: Prepare to handle multiple netdev events")
Signed-off-by: Ido Schimmel 
Reported-by: Emil S Tantilov 
Tested-by: Emil S Tantilov 
---
 net/ipv6/route.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1076ae0ea9d5..c37bd9569172 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3583,7 +3583,9 @@ void rt6_sync_up(struct net_device *dev, unsigned int 
nh_flags)
 {
struct arg_netdev_event arg = {
.dev = dev,
-   .nh_flags = nh_flags,
+   {
+   .nh_flags = nh_flags,
+   },
};
 
if (nh_flags & RTNH_F_DEAD && netif_carrier_ok(dev))
@@ -3692,7 +3694,9 @@ void rt6_sync_down_dev(struct net_device *dev, unsigned 
long event)
 {
struct arg_netdev_event arg = {
.dev = dev,
-   .event = event,
+   {
+   .event = event,
+   },
};
 
fib6_clean_all(dev_net(dev), fib6_ifdown, &arg);
-- 
2.14.3



Re: [PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Christian Lamparter
On Friday, January 12, 2018 7:39:50 PM CET Dan Williams wrote:
> On Fri, Jan 12, 2018 at 6:42 AM, Christian Lamparter  
> wrote:
> > On Friday, January 12, 2018 1:47:46 AM CET Dan Williams wrote:
> >> Static analysis reports that 'queue' may be a user controlled value that
> >> is used as a data dependency to read from the 'ar9170_qmap' array. In
> >> order to avoid potential leaks of kernel memory values, block
> >> speculative execution of the instruction stream that could issue reads
> >> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
> >> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
> >> 'ar->edcf' array.
> >>
> >> Based on an original patch by Elena Reshetova.
> >>
> >> Cc: Christian Lamparter 
> >> Cc: Kalle Valo 
> >> Cc: linux-wirel...@vger.kernel.org
> >> Cc: netdev@vger.kernel.org
> >> Signed-off-by: Elena Reshetova 
> >> Signed-off-by: Dan Williams 
> >> ---
> > This patch (and p54, cw1200) look like the same patch?!
> > Can you tell me what happend to:
> >
> > On Saturday, January 6, 2018 5:34:03 PM CET Dan Williams wrote:
> >> On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter  
> >> wrote:
> >> > And furthermore, an invalid queue (param->ac) would cause a crash in
> >> > this line in mac80211 before it even reaches the driver [1]:
> >> > |   sdata->tx_conf[params->ac] = p;
> >> > |   
> >> > |   if (drv_conf_tx(local, sdata,  params->ac , &p)) {
> >> > |^^ (this is a wrapper for the *_op_conf_tx)
> >> >
> >> > I don't think these chin-up exercises are needed.
> >>
> >> Quite the contrary, you've identified a better place in the call stack
> >> to sanitize the input and disable speculation. Then we can kill the
> >> whole class of the wireless driver reports at once it seems.
> > 
> 
> I didn't see where ac is being validated against the driver specific
> 'queues' value in that earlier patch.
The link to the check is right there in the earlier post. It's in 
parse_txq_params():

|   if (txq_params->ac >= NL80211_NUM_ACS)
|   return -EINVAL;

NL80211_NUM_ACS is 4


This check was added ever since mac80211's ieee80211_set_txq_params():
| sdata->tx_conf[params->ac] = p;

For cw1200: the driver just sets dev->queues to 4.
In carl9170 dev->queues is set to __AR9170_NUM_TXQ and
p54 uses P54_QUEUE_AC_NUM.

Both __AR9170_NUM_TXQ and P54_QUEUE_AC_NUM are 4.
And this is not going to change since all drivers
have to follow mac80211's queue API:


Some background:
In the old days (linux 2.6 and early 3.x), the parse_txq_params() function did
not verify the "queue" value. That's why these drivers had to do it.

Here's the relevant code from 2.6.39:

You'll notice that the check is missing there.
Here's mac80211's ieee80211_set_txq_params for reference:


However over time, the check in the driver has become redundant.

> > Anyway, I think there's an easy way to solve this: remove the
> > "if (queue < ar->hw->queues)" check altogether. It's no longer needed
> > anymore as the "queue" value is validated long before the driver code
> > gets called.
> > 
> > And from my understanding, this will fix the "In this case
> > the value of 'ar9170_qmap[queue]' is immediately reused as an index to
> > the 'ar->edcf' array." gripe your tool complains about.
> >
> > This is here's a quick test-case for carl9170.:
> > ---
> > diff --git a/drivers/net/wireless/ath/carl9170/main.c 
> > b/drivers/net/wireless/ath/carl9170/main.c
> > index 988c8857d78c..2d3afb15bb62 100644
> > --- a/drivers/net/wireless/ath/carl9170/main.c
> > +++ b/drivers/net/wireless/ath/carl9170/main.c
> > @@ -1387,13 +1387,8 @@ static int carl9170_op_conf_tx(struct ieee80211_hw 
> > *hw,
> > int ret;
> >
> > mutex_lock(&ar->mutex);
> > -   if (queue < ar->hw->queues) {
> > -   memcpy(&ar->edcf[ar9170_qmap[queue]], param, 
> > sizeof(*param));
> > -   ret = carl9170_set_qos(ar);
> > -   } else {
> > -   ret = -EINVAL;
> > -   }
> > -
> > +   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> > +   ret = carl9170_set_qos(ar);
> > mutex_unlock(&ar->mutex);
> > return ret;
> >  }
> > ---
> > What does your tool say about this?
> 
> If you take away the 'if' then I don't think the tool will report on this.
> 
> > (If necessary, the "queue" value could be even sanitized with a
> > 

[net-next 1/1] tipc: fix bug during lookup of multicast destination nodes

2018-01-12 Thread Jon Maloy
In commit 232d07b74a33 ("tipc: improve groupcast scope handling") we
inadvertently broke non-group multicast transmission when changing the
parameter 'domain' to 'scope' in the function
tipc_nametbl_lookup_dst_nodes(). We missed to make the corresponding
change in the calling function, with the result that the lookup always
fails.

A closer analysis reveals that this parameter is not needed at all.
Non-group multicast is hard coded to use CLUSTER_SCOPE, and in the
current implementation this will be delivered to all matching
destinations except those which are published with NODE_SCOPE on other
nodes. Since such publications never will be visible on the sending node
anyway, it makes no sense to discriminate by scope at all.

We now remove this parameter altogether.

Signed-off-by: Jon Maloy 
---
 net/tipc/name_table.c | 6 ++
 net/tipc/name_table.h | 3 +--
 net/tipc/socket.c | 3 +--
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 64cdd3c..ed0457c 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -680,8 +680,7 @@ int tipc_nametbl_mc_lookup(struct net *net, u32 type, u32 
lower, u32 upper,
  * - Determines if any node local ports overlap
  */
 void tipc_nametbl_lookup_dst_nodes(struct net *net, u32 type, u32 lower,
-  u32 upper, u32 scope,
-  struct tipc_nlist *nodes)
+  u32 upper, struct tipc_nlist *nodes)
 {
struct sub_seq *sseq, *stop;
struct publication *publ;
@@ -699,8 +698,7 @@ void tipc_nametbl_lookup_dst_nodes(struct net *net, u32 
type, u32 lower,
for (; sseq != stop && sseq->lower <= upper; sseq++) {
info = sseq->info;
list_for_each_entry(publ, &info->zone_list, zone_list) {
-   if (publ->scope == scope)
-   tipc_nlist_add(nodes, publ->node);
+   tipc_nlist_add(nodes, publ->node);
}
}
spin_unlock_bh(&seq->lock);
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index b595d8a..f56e7cb 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -105,8 +105,7 @@ int tipc_nametbl_mc_lookup(struct net *net, u32 type, u32 
lower, u32 upper,
 void tipc_nametbl_build_group(struct net *net, struct tipc_group *grp,
  u32 type, u32 domain);
 void tipc_nametbl_lookup_dst_nodes(struct net *net, u32 type, u32 lower,
-  u32 upper, u32 domain,
-  struct tipc_nlist *nodes);
+  u32 upper, struct tipc_nlist *nodes);
 bool tipc_nametbl_lookup(struct net *net, u32 type, u32 instance, u32 domain,
 struct list_head *dsts, int *dstcnt, u32 exclude,
 bool all);
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1f23627..18fca5b 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -772,7 +772,6 @@ static int tipc_sendmcast(struct  socket *sock, struct 
tipc_name_seq *seq,
struct net *net = sock_net(sk);
int mtu = tipc_bcast_get_mtu(net);
struct tipc_mc_method *method = >mc_method;
-   u32 domain = addr_domain(net, TIPC_CLUSTER_SCOPE);
struct sk_buff_head pkts;
struct tipc_nlist dsts;
int rc;
@@ -788,7 +787,7 @@ static int tipc_sendmcast(struct  socket *sock, struct 
tipc_name_seq *seq,
/* Lookup destination nodes */
tipc_nlist_init(&dsts, tipc_own_addr(net));
tipc_nametbl_lookup_dst_nodes(net, seq->type, seq->lower,
- seq->upper, domain, &dsts);
+ seq->upper, &dsts);
if (!dsts.local && !dsts.remote)
return -EHOSTUNREACH;
 
-- 
2.1.4



Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-12 Thread Saeed Mahameed



On 01/12/2018 08:46 AM, Eric Dumazet wrote:

On Fri, 2018-01-12 at 09:32 -0700, Jason Gunthorpe wrote:

On Fri, Jan 12, 2018 at 11:42:22AM +0800, Jianchao Wang wrote:

Customer reported memory corruption issue on previous mlx4_en driver
version where the order-3 pages and multiple page reference counting
were still used.

Finally, we found that one of the root causes is that the HW may see stale
rx_descs because the prod db update reaches the HW before the rx_desc writes.
Especially when crossing an order-3 page boundary and updating to a new one,
the HW may write on pages which may have been freed and allocated again by others.

To fix it, add a wmb between rx_desc and prod db updating to ensure
the order. Even though order-0 pages and page recycling have been introduced,
the reordering between rx_desc and prod db updates could still lead to corruption
of different inbound packets.

Signed-off-by: Jianchao Wang 
  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 85e28ef..eefa82c 100644
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -555,7 +555,7 @@ static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv 
*priv,
break;
ring->prod++;
} while (likely(--missing));
-
+   wmb(); /* ensure rx_desc updating reaches HW before prod db updating */
mlx4_en_update_rx_prod_db(ring);
  }
  


Does this need to be dma_wmb(), and should it be in
mlx4_en_update_rx_prod_db ?



+1 on dma_wmb()

On what architecture bug was observed ?

In any case, the barrier should be moved in mlx4_en_update_rx_prod_db()
I think.



+1 on dma_wmb(), thanks Eric for reviewing this.

The barrier is also needed elsewhere in the code, but I wouldn't
put it in mlx4_en_update_rx_prod_db(), just to allow batch filling of
all rx rings and then hitting the barrier only once. As a rule of thumb, memory
barriers are the ring API caller's responsibility.
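
In other words, callers follow the usual producer pattern; a generic sketch
(made-up ring structure and fill helper, not the mlx4 code):

struct demo_rx_ring {
	void __iomem *doorbell;	/* producer index register read by the HW */
	u32 prod;
	/* descriptors live in DMA-coherent memory, written by the driver */
};

void demo_fill_rx_desc(struct demo_rx_ring *ring, u32 index);	/* assumed helper */

static void demo_refill_and_ring(struct demo_rx_ring *ring, int budget)
{
	while (budget--)
		demo_fill_rx_desc(ring, ring->prod++);	/* plain memory writes */

	/* The device must not observe the new producer index before it can
	 * observe the descriptor contents written above.
	 */
	dma_wmb();

	writel(ring->prod & 0xffff, ring->doorbell);
}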


e.g. in mlx4_en_activate_rx_rings():
between mlx4_en_fill_rx_buffers(priv); and the loop that updates rx prod
for all rings, the dma_wmb() is needed, see below.


diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c

index b4d144e67514..65541721a240 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -370,6 +370,8 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
if (err)
goto err_buffers;

+   dma_wmb();
+
for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
ring = priv->rx_ring[ring_ind];





Re: [PATCH bpf] bpf: do not modify min/max bounds on scalars with constant values

2018-01-12 Thread Edward Cree
On 12/01/18 19:23, Daniel Borkmann wrote:
> syzkaller generated a BPF proglet and triggered a warning with
> the following:
>
>   0: (b7) r0 = 0
>   1: (d5) if r0 s<= 0x0 goto pc+0
>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>   2: (1f) r0 -= r1
>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>   verifier internal error: known but bad sbounds
>
> What happens is that in the first insn, r0's min/max value are
> both 0 due to the immediate assignment, later in the jsle test
> the bounds are updated for the min value in the false path,
> meaning, they yield smin_val = 1 smax_val = 0,
reg_set_min_max() refines the existing bounds, it doesn't replace
 them, so all that's happened is that the jsle handling has
 demonstrated that this branch can't be taken.
That AFAICT isn't confined to known constants, one could e.g.
 obtain inconsistent bounds with two js* insns.  Updating the
 bounds in reg_set_min_max() is right, it's where we try to use
 those sbounds in adjust_ptr_min_max_vals() that's wrong imho;
 instead the 'known' paths should be using off_reg->var_off.value
 rather than smin_val everywhere.

Alternatively we could consider not following jumps/lack-thereof
 that produce inconsistent bounds, but that can make insns
 unreachable that previously weren't and thus reject programs
 that we previously considered valid, so we probably can't get
 away with that.

-Ed


[PATCH bpf] bpf: do not modify min/max bounds on scalars with constant values

2018-01-12 Thread Daniel Borkmann
syzkaller generated a BPF proglet and triggered a warning with
the following:

  0: (b7) r0 = 0
  1: (d5) if r0 s<= 0x0 goto pc+0
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  2: (1f) r0 -= r1
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  verifier internal error: known but bad sbounds

What happens is that in the first insn, r0's min/max value are
both 0 due to the immediate assignment, later in the jsle test
the bounds are updated for the min value in the false path,
meaning, they yield smin_val = 1 smax_val = 0, and when ctx
pointer is subtracted from r0, verifier bails out with the
internal error and throwing a WARN since smin_val != smax_val
for the known constant.

Fix is to not update any bounds of the register holding the
constant, thus in reg_set_min_max() and reg_set_min_max_inv()
we return right away, similarly as in the case when the dst
register holds a pointer value. The reason for doing so is rather
straightforward: when we have a register holding a constant
as dst, then {s,u}min_val == {s,u}max_val, thus it cannot get
any more specific in terms of upper/lower bounds than this.
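
Concretely, "holding a constant" is detectable from the register's tnum alone;
the check added below is equivalent to a small helper like this (sketch):

/* A SCALAR_VALUE whose tnum has no unknown bits is a known constant, so its
 * {s,u}min and {s,u}max already coincide and cannot be narrowed any further.
 */
static bool reg_is_const_scalar(const struct bpf_reg_state *reg)
{
	return reg->type == SCALAR_VALUE && tnum_is_const(reg->var_off);
}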

In reg_set_min_max() and reg_set_min_max_inv() it's fine to
only check false_reg similarly as with __is_pointer_value()
check since at this point in time, false_reg and true_reg
both hold the same state, so we only need to pick one. This
fixes the bug and properly rejects the program with an error
of 'R0 tried to subtract pointer from scalar'.

I've been thinking to additionally reject arithmetic on ctx
pointer in adjust_ptr_min_max_vals() right upfront as well
since we reject actual access in such case later on anyway,
but there's a use case in tracing (in bcc) in combination
with passing such ctx to bpf_probe_read(), so we cannot do
that part.

Reported-by: syzbot+6d362cadd45dc0a12...@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c   | 11 
 tools/testing/selftests/bpf/test_verifier.c | 95 +
 2 files changed, 106 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b414d6b..6bf1275 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2617,6 +2617,14 @@ static void reg_set_min_max(struct bpf_reg_state 
*true_reg,
 */
if (__is_pointer_value(false, false_reg))
return;
+   /* Same goes for constant values. They have {s,u}min == {s,u}max,
+* it cannot get narrower than this. Same here, at the branch
+* point false_reg and true_reg have the same type, so we only
+* check false_reg here as well.
+*/
+   if (false_reg->type == SCALAR_VALUE &&
+   tnum_is_const(false_reg->var_off))
+   return;
 
switch (opcode) {
case BPF_JEQ:
@@ -2689,6 +2697,9 @@ static void reg_set_min_max_inv(struct bpf_reg_state 
*true_reg,
 {
if (__is_pointer_value(false, false_reg))
return;
+   if (false_reg->type == SCALAR_VALUE &&
+   tnum_is_const(false_reg->var_off))
+   return;
 
switch (opcode) {
case BPF_JEQ:
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index b510174..162d497 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8569,6 +8569,101 @@ static struct bpf_test tests[] = {
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
+   "check deducing bounds from const, 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 0),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R0 tried to subtract pointer from scalar",
+   },
+   {
+   "check deducing bounds from const, 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R0 tried to subtract pointer from scalar",
+   },
+   {
+   "check deducing bounds from const, 3",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1),
+   BPF_EXIT_INSN(),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R0 tried to subtract pointer from scalar",
+   },
+   {
+   "check deducing bounds from const, 4",
+   .insns 

Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-12 Thread William Tu
On Fri, Jan 12, 2018 at 10:39 AM, Pravin Shelar  wrote:
> On Fri, Jan 12, 2018 at 12:27 AM, Jiri Benc  wrote:
>> On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote:
>>> I'd also prefer reverting ceaa001a170e since it's more clean but I
>>> also hope to have this feature in 4.15.
>>> How long does reverting take? Am I only able to submit the new patch
>>> after the reverting is merged? Or I can submit revert and this new
>>> patch in one series? I have little experience in reverting, can you
>>> suggest which way is better?
>>
>> Send the revert for net (subject will be "[PATCH net] revert:
>> openvswitch: Add erspan tunnel support."). Don't forget to explain why
>> you're proposing a revert.
>>
>> After it is accepted and applied to net.git, wait until the patch
>> appears in net-next.git. It may take a little while. After that, send
>> the new patch(es) for net-next.
>>
>
> I agree. Once we have the V2 interface, this current ERSPAN interface
> is unlikely to be used by anyone, so it would be nice to get rid of the
> old interface while we can.

Thanks Jiri and Pravin.
I will send out revert patch request.
William


Re: [PATCH v2 10/19] ipv4: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams
On Thu, Jan 11, 2018 at 11:59 PM, Greg KH  wrote:
> On Thu, Jan 11, 2018 at 04:47:18PM -0800, Dan Williams wrote:
>> Static analysis reports that 'offset' may be a user controlled value
>> that is used as a data dependency reading from a raw_frag_vec buffer.
>> In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid '*(rfv->c + offset)' value.
>>
>> Based on an original patch by Elena Reshetova.
>
> There is the "Co-Developed-by:" tag now, if you want to use it...

Ok, thanks.

>
>> Cc: "David S. Miller" 
>> Cc: Alexey Kuznetsov 
>> Cc: Hideaki YOSHIFUJI 
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
>>  net/ipv4/raw.c |   10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> Ugh, what is this, the 4th time I've said "I don't think this is an
> issue, so why are you changing this code." to this patch.  To be
> followed by a "oh yeah, you are right, I'll drop it", only to see it
> show back up in the next time this patch series is sent out?
>
> Same for the other patches in this series that I have reviewed 4, maybe
> 5, times already.  The "v2" is not very true here...

The theme of the review feedback on v1 was 'don't put ifence in any
net/ code', and that was addressed.

I honestly thought the new definition of array_ptr() changed the
calculus on this patch. Given the same pattern appears in the ipv6
case, and I have yet to hear that we should drop the ipv6 patch, make
the code symmetric just for readability purposes. Otherwise we need a
comment saying why this is safe for ipv4, but maybe not safe for ipv6,
I think 'array_ptr' is effectively that comment. I.e. 'array_ptr()' is
designed to be low impact for instrumenting false positives. If that
new argument does not hold water I will definitely drop this patch.


Re: [PATCH v2 15/19] carl9170: prevent bounds-check bypass via speculative execution

2018-01-12 Thread Dan Williams
On Fri, Jan 12, 2018 at 6:42 AM, Christian Lamparter  wrote:
> On Friday, January 12, 2018 1:47:46 AM CET Dan Williams wrote:
>> Static analysis reports that 'queue' may be a user controlled value that
>> is used as a data dependency to read from the 'ar9170_qmap' array. In
>> order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue reads
>> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
>> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
>> 'ar->edcf' array.
>>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Christian Lamparter 
>> Cc: Kalle Valo 
>> Cc: linux-wirel...@vger.kernel.org
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Elena Reshetova 
>> Signed-off-by: Dan Williams 
>> ---
> This patch (and p54, cw1200) look like the same patch?!
> Can you tell me what happend to:
>
> On Saturday, January 6, 2018 5:34:03 PM CET Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter  
>> wrote:
> >> > And furthermore, an invalid queue (param->ac) would cause a crash in
>> > this line in mac80211 before it even reaches the driver [1]:
>> > |   sdata->tx_conf[params->ac] = p;
>> > |   
> >> > |   if (drv_conf_tx(local, sdata,  params->ac , &p)) {
>> > |^^ (this is a wrapper for the *_op_conf_tx)
>> >
>> > I don't think these chin-up exercises are needed.
>>
>> Quite the contrary, you've identified a better place in the call stack
>> to sanitize the input and disable speculation. Then we can kill the
>> whole class of the wireless driver reports at once it seems.
> 

I didn't see where ac is being validated against the driver specific
'queues' value in that earlier patch.

>
> Anyway, I think there's an easy way to solve this: remove the
> "if (queue < ar->hw->queues)" check altogether. It's no longer needed
> anymore as the "queue" value is validated long before the driver code
> gets called.

Can you point me to where that validation happens?

> And from my understanding, this will fix the "In this case
> the value of 'ar9170_qmap[queue]' is immediately reused as an index to
> the 'ar->edcf' array." gripe your tool complains about.
>
> This is here's a quick test-case for carl9170.:
> ---
> diff --git a/drivers/net/wireless/ath/carl9170/main.c 
> b/drivers/net/wireless/ath/carl9170/main.c
> index 988c8857d78c..2d3afb15bb62 100644
> --- a/drivers/net/wireless/ath/carl9170/main.c
> +++ b/drivers/net/wireless/ath/carl9170/main.c
> @@ -1387,13 +1387,8 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
> int ret;
>
> mutex_lock(&ar->mutex);
> -   if (queue < ar->hw->queues) {
> -   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> -   ret = carl9170_set_qos(ar);
> -   } else {
> -   ret = -EINVAL;
> -   }
> -
> +   memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> +   ret = carl9170_set_qos(ar);
> mutex_unlock(&ar->mutex);
> return ret;
>  }
> ---
> What does your tool say about this?

If you take away the 'if' then I don't think the tool will report on this.

> (If necessary, the "queue" value could be even sanitized with a
> queue %= ARRAY_SIZE(ar9170_qmap); before the mutex_lock.)

That is what array_ptr() is doing in a more sophisticated way.


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-12 Thread Pravin Shelar
On Fri, Jan 12, 2018 at 12:27 AM, Jiri Benc  wrote:
> On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote:
>> I'd also prefer reverting ceaa001a170e since it's more clean but I
>> also hope to have this feature in 4.15.
>> How long does reverting take? Am I only able to submit the new patch
>> after the reverting is merged? Or I can submit revert and this new
>> patch in one series? I have little experience in reverting, can you
>> suggest which way is better?
>
> Send the revert for net (subject will be "[PATCH net] revert:
> openvswitch: Add erspan tunnel support."). Don't forget to explain why
> you're proposing a revert.
>
> After it is accepted and applied to net.git, wait until the patch
> appears in net-next.git. It may take a little while. After that, send
> the new patch(es) for net-next.
>

I agree. Once we have the V2 interface, this current ERSPAN interface
is unlikely to be used by anyone, so it would be nice to get rid of the
old interface while we can.


Re: KASAN: use-after-free Read in rds_tcp_tune

2018-01-12 Thread Sowmini Varadhan
On (01/11/18 21:29), syzbot wrote:
> ==
> BUG: KASAN: use-after-free in rds_tcp_tune+0x491/0x520 net/rds/tcp.c:397
> Read of size 4 at addr 8801cd5f6c58 by task kworker/u4:4/4954

Just had an offline discussion with santosh around this, here's a summary
of that discussion for the archives:

Looks like an rds_connect_worker workq got scheduled after the 
netns was deleted. This could happen if an an rds_connection got
added between lines 528 and 529 of 

  506 static void rds_tcp_kill_sock(struct net *net)
  :
  /* code to pull out all the rds_connections that should be destroyed */
  :
  528 spin_unlock_irq(&rds_tcp_conn_lock);
  529 list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
  530 rds_conn_destroy(tc->t_cpath->cp_conn);

Such an rds_connection would miss out the rds_conn_destroy() 
loop (that cancels all pending work) and (if it was scheduled
after netns deletion) could trigger the use-after-free.

Evaluating various fixes for this (including using _bh instead of _irq
as suggested by santosh), I'll get back with a patch soon.

--Sowmini






RE: [PATCH net-next] i40evf: use GFP_ATOMIC under spin lock

2018-01-12 Thread Keller, Jacob E
> -Original Message-
> From: Wei Yongjun [mailto:weiyongj...@huawei.com]
> Sent: Thursday, January 11, 2018 6:27 PM
> To: Kirsher, Jeffrey T ; Keller, Jacob E
> 
> Cc: Wei Yongjun ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] i40evf: use GFP_ATOMIC under spin lock
> 
> A spin lock is taken here so we should use GFP_ATOMIC.
> 

You are correct, good catch!

> Fixes: 504398f0a78e ("i40evf: use spinlock to protect (mac|vlan)_filter_list")
> Signed-off-by: Wei Yongjun 
> ---

Acked-by: Jacob Keller 

>  drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> index feb95b6..ca5b538 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> @@ -459,7 +459,7 @@ void i40evf_add_ether_addrs(struct i40evf_adapter
> *adapter)
>   more = true;
>   }
> 
> - veal = kzalloc(len, GFP_KERNEL);
> + veal = kzalloc(len, GFP_ATOMIC);
>   if (!veal) {
>   spin_unlock_bh(&adapter->mac_vlan_list_lock);
>   return;
> @@ -532,7 +532,7 @@ void i40evf_del_ether_addrs(struct i40evf_adapter
> *adapter)
> (count * sizeof(struct virtchnl_ether_addr));
>   more = true;
>   }
> - veal = kzalloc(len, GFP_KERNEL);
> + veal = kzalloc(len, GFP_ATOMIC);
>   if (!veal) {
>   spin_unlock_bh(&adapter->mac_vlan_list_lock);
>   return;
> @@ -606,7 +606,7 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)
> (count * sizeof(u16));
>   more = true;
>   }
> - vvfl = kzalloc(len, GFP_KERNEL);
> + vvfl = kzalloc(len, GFP_ATOMIC);
>   if (!vvfl) {
>   spin_unlock_bh(&adapter->mac_vlan_list_lock);
>   return;
> @@ -678,7 +678,7 @@ void i40evf_del_vlans(struct i40evf_adapter *adapter)
> (count * sizeof(u16));
>   more = true;
>   }
> - vvfl = kzalloc(len, GFP_KERNEL);
> + vvfl = kzalloc(len, GFP_ATOMIC);
>   if (!vvfl) {
>   spin_unlock_bh(&adapter->mac_vlan_list_lock);
>   return;



[bpf-next PATCH 7/7] bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG

2018-01-12 Thread John Fastabend
Test read and writes for BPF_PROG_TYPE_SK_MSG.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_verifier.c |   54 +++
 1 file changed, 54 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 5438479..013791b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1227,6 +1227,60 @@ struct test_val {
.prog_type = BPF_PROG_TYPE_SK_SKB,
},
{
+   "direct packet read for SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct sk_msg_md, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct sk_msg_md, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "direct packet write for SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct sk_msg_md, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct sk_msg_md, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+   BPF_STX_MEM(BPF_B, BPF_REG_2, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "overlapping checks for direct packet access SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct sk_msg_md, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct sk_msg_md, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 4),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 6),
+   BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1),
+   BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_2, 6),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
"check skb->mark is not writeable by sockets",
.insns = {
BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,



[bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG

2018-01-12 Thread John Fastabend
Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h |   16 ++
 tools/testing/selftests/bpf/Makefile   |3 +
 tools/testing/selftests/bpf/bpf_helpers.h  |2 +
 tools/testing/selftests/bpf/sockmap_parse_prog.c   |   15 +-
 tools/testing/selftests/bpf/sockmap_verdict_prog.c |7 +++
 tools/testing/selftests/bpf/test_maps.c|   54 +++-
 6 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e8c60a..131d541 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
+   BPF_PROG_TYPE_SK_MSG,
 };
 
 enum bpf_attach_type {
@@ -143,6 +144,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
+   BPF_SK_MSG_VERDICT,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -906,6 +908,20 @@ enum sk_action {
SK_PASS,
 };
 
+/* User return codes for SK_MSG prog type. */
+enum sk_msg_action {
+   SK_MSG_DROP = 0,
+   SK_MSG_PASS,
+};
+
+/* user accessible metadata for SK_MSG packet hook, new fields must
+ * be added to the end of this structure
+ */
+struct sk_msg_md {
+   __u32 data;
+   __u32 data_end;
+};
+
 #define BPF_TAG_SIZE   8
 
 struct bpf_prog_info {
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index a8aa7e2..e399ca3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -19,7 +19,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
-   test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o
+   test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+   sockmap_tcp_msg_prog.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 33cb00e..997c95e 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -121,6 +121,8 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, 
int index) =
(void *) BPF_FUNC_skb_under_cgroup;
 static int (*bpf_skb_change_head)(void *, int len, int flags) =
(void *) BPF_FUNC_skb_change_head;
+static int (*bpf_skb_pull_data)(void *, int len) =
+   (void *) BPF_FUNC_skb_pull_data;
 
 /* Scan the ARCH passed in from ARCH env variable (see Makefile) */
 #if defined(__TARGET_ARCH_x86)
diff --git a/tools/testing/selftests/bpf/sockmap_parse_prog.c 
b/tools/testing/selftests/bpf/sockmap_parse_prog.c
index a1dec2b..0f92858 100644
--- a/tools/testing/selftests/bpf/sockmap_parse_prog.c
+++ b/tools/testing/selftests/bpf/sockmap_parse_prog.c
@@ -20,14 +20,25 @@ int bpf_prog1(struct __sk_buff *skb)
__u32 lport = skb->local_port;
__u32 rport = skb->remote_port;
__u8 *d = data;
+   __u32 len = (__u32) data_end - (__u32) data;
+   int err;
 
-   if (data + 10 > data_end)
-   return skb->len;
+   if (data + 10 > data_end) {
+   err = bpf_skb_pull_data(skb, 10);
+   if (err)
+   return SK_DROP;
+
+   data_end = (void *)(long)skb->data_end;
+   data = (void *)(long)skb->data;
+   if (data + 10 > data_end)
+   return SK_DROP;
+   }
 
/* This write/read is a bit pointless but tests the verifier and
 * strparser handler for read/write pkt data and access into sk
 * fields.
 */
+   d = data;
d[7] = 1;
return skb->len;
 }
diff --git a/tools/testing/selftests/bpf/sockmap_verdict_prog.c 
b/tools/testing/selftests/bpf/sockmap_verdict_prog.c
index d7bea97..2ce7634 100644
--- a/tools/testing/selftests/bpf/sockmap_verdict_prog.c
+++ b/tools/testing/selftests/bpf/sockmap_verdict_prog.c
@@ -26,6 +26,13 @@ struct bpf_map_def SEC("maps") sock_map_tx = {
.max_entries = 20,
 };
 
+struct bpf_map_def SEC("maps") sock_map_msg = {
+   .type = BPF_MAP_TYPE_SOCKMAP,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 20,
+};
+
 struct bpf_map_def SEC("maps") sock_map_break = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c

[bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-01-12 Thread John Fastabend
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages, when a sock is added to a
sockmap via a map update and the map has a BPF_SK_MSG_VERDICT
program attached, the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in the sendmsg case and every page/offset in the sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
case and leaves the data untouched in the sendpage case. Both cases
return -EACCES to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. In the sendpage case data is never copied.
The implication being users may change data after BPF programs run in
the sendpage case. (A flag to force the copy will be added shortly
for cases where the copy must always be performed.)

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. The readable/writeable data provided to the program
in the sendmsg case may not be the entire message, in fact for
large sends this is likely the case. The data range that can be
read is part of the sk_msg_md structure. This is because similar
to the tc bpf_cls case the data is stored in a scatter gather list.
Future work will address this short-coming to allow users to pull
in more data if needed (similar to TC BPF).

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
	      &obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests bpf/sockmap_tcp_msg_*.c and
test_maps.c
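
For illustration only, a minimal msg verdict program would look roughly like
the following. The SEC() name and the pointer casts on the __u32 data/data_end
fields are assumptions following the existing sk_skb convention; this is not
one of the selftests added in this series.

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sk_msg")
int bpf_msg_verdict(struct sk_msg_md *msg)
{
	void *data_end = (void *)(long)msg->data_end;
	void *data = (void *)(long)msg->data;

	/* Drop anything too short to carry a 4-byte application header. */
	if (data + 4 > data_end)
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";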

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other flags TBD include ZEROCOPY flag.

Yet another detail that needs some thought is the size of scatterlist.
Currently, we use MAX_SKB_FRAGS simply because this was being used
already in the TLS case. Future work to improve the kernel sk APIs to
tune this depending on workload may be useful. This is a trade-off
between memory usage and B/s performance.

Signed-off-by: John Fastabend 
---
 include/linux/bpf.h   |1 
 include/linux/bpf_types.h |1 
 include/linux/filter.h|   10 +
 include/net/tcp.h |2 
 include/uapi/linux/bpf.h  |   28 +++
 kernel/bpf/sockmap.c  |  485 -
 kernel/bpf/syscall.c  |   14 +
 kernel/bpf/verifier.c |5 
 net/core/filter.c |  106 ++
 9 files changed, 638 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e03046..14cdb4d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@
 struct perf_event;
 struct bpf_prog;
 struct bpf_map;
+struct sock;
 
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349..5e2e8a4 100644
--- 

Re: [PATCH][next] bnxt_en: ensure len is ininitialized to zero

2018-01-12 Thread Michael Chan
On Fri, Jan 12, 2018 at 9:46 AM, Colin King  wrote:
> From: Colin Ian King 
>
> In the case where cmp_type == CMP_TYPE_RX_L2_TPA_START_CMP the
> exit return path is via label next_rx_no_prod and cpr->rx_bytes
> is being updated by an uninitialized value from len. Fix this by
> initializing len to zero.
>
> Detected by CoverityScan, CID#1463807 ("Uninitialized scalar variable")
>
> Fixes: 6a8788f25625 ("bnxt_en: add support for software dynamic interrupt 
> moderation")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index cf6ebf1e324b..5b5c4f266f1b 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -1482,7 +1482,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct 
> bnxt_napi *bnapi, u32 *raw_cons,
> u32 tmp_raw_cons = *raw_cons;
> u16 cfa_code, cons, prod, cp_cons = RING_CMP(tmp_raw_cons);
> struct bnxt_sw_rx_bd *rx_buf;
> -   unsigned int len;
> +   unsigned int len = 0;

It might be better to add a new label next_rx_no_prod_no_len and have
the TPA code paths jump there instead.
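
Roughly along these lines (sketch of the idea only; the exact placement inside
bnxt_rx_pkt() is an assumption, not the actual driver code):

		/* in the TPA start/end handling: */
		goto next_rx_no_prod_no_len;

	/* ... */

next_rx_no_prod:
	cpr->rx_bytes += len;

next_rx_no_prod_no_len:
	*raw_cons = tmp_raw_cons;
	return rc;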

Andy, what do you think?

> u8 *data_ptr, agg_bufs, cmp_type;
> dma_addr_t dma_addr;
> struct sk_buff *skb;
> --
> 2.15.1
>


[bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG

2018-01-12 Thread John Fastabend
When calling do_tcp_sendpages() from in kernel and we know the data
has no references from user side we can omit SKBTX_SHARED_FRAG flag.
This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
to omit setting SKBTX_SHARED_FRAG.
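
An in-kernel caller that owns the page would then do something like this
(sketch; the surrounding variables and error handling are assumed context):

	/* The page is kernel-owned, so user space cannot touch the frag after
	 * this point and the SKBTX_SHARED_FRAG marking can be skipped.
	 */
	lock_sock(sk);
	ret = do_tcp_sendpages(sk, page, offset, size,
			       flags | MSG_NO_SHARED_FRAGS);
	release_sock(sk);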

Signed-off-by: John Fastabend 
---
 include/linux/socket.h |1 +
 net/ipv4/tcp.c |4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a..add9360 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last 
page */
 #define MSG_BATCH  0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF MSG_FIN
+#define MSG_NO_SHARED_FRAGS 0x80000 /* sendpage() internal : page frags are 
not shared */
 
 #define MSG_ZEROCOPY   0x4000000   /* Use user data in kernel path */
 #define MSG_FASTOPEN   0x20000000  /* Send data in TCP SYN */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7ac583a..56c6f49 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -995,7 +995,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page 
*page, int offset,
get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}
-   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+
+   if (!(flags & MSG_NO_SHARED_FRAGS))
+   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
 
skb->len += copy;
skb->data_len += copy;



[bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper

2018-01-12 Thread John Fastabend
The TLS ULP module builds scatterlists from a sock using
page_frag_refill(). This is going to be useful for other ULPs
so move it into sock file for more general use.

In the process remove useless goto at end of while loop.
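
A caller outside TLS would use it roughly as follows (sketch; everything
except the sk_alloc_sg() signature itself is illustrative):

	struct scatterlist sg[MAX_SKB_FRAGS];
	unsigned int sg_size = 0;
	int sg_num_elem = 0;
	int err;

	sg_init_table(sg, MAX_SKB_FRAGS);
	err = sk_alloc_sg(sk, bytes, sg, &sg_num_elem, &sg_size, 0);
	if (err == -ENOSPC) {
		/* scatterlist is full: flush what was collected so far */
	} else if (err) {
		/* -ENOMEM: back off and apply memory pressure handling */
	}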

Signed-off-by: John Fastabend 
---
 include/net/sock.h |4 +++
 net/core/sock.c|   56 ++
 net/tls/tls_sw.c   |   69 +---
 3 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 66fd395..2808343 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2148,6 +2148,10 @@ static inline struct page_frag *sk_page_frag(struct sock 
*sk)
 
 bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
 
+int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
+   int *sg_num_elem, unsigned int *sg_size,
+   int first_coalesce);
+
 /*
  * Default write policy as shown to user space via poll/select/SIGIO
  */
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..e84c03f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2237,6 +2237,62 @@ bool sk_page_frag_refill(struct sock *sk, struct 
page_frag *pfrag)
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
 
+int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
+   int *sg_num_elem, unsigned int *sg_size,
+   int first_coalesce)
+{
+   struct page_frag *pfrag;
+   unsigned int size = *sg_size;
+   int num_elem = *sg_num_elem, use = 0, rc = 0;
+   struct scatterlist *sge;
+   unsigned int orig_offset;
+
+   len -= size;
+   pfrag = sk_page_frag(sk);
+
+   while (len > 0) {
+   if (!sk_page_frag_refill(sk, pfrag)) {
+   rc = -ENOMEM;
+   goto out;
+   }
+
+   use = min_t(int, len, pfrag->size - pfrag->offset);
+
+   if (!sk_wmem_schedule(sk, use)) {
+   rc = -ENOMEM;
+   goto out;
+   }
+
+   sk_mem_charge(sk, use);
+   size += use;
+   orig_offset = pfrag->offset;
+   pfrag->offset += use;
+
+   sge = sg + num_elem - 1;
+   if (num_elem > first_coalesce && sg_page(sg) == pfrag->page &&
+   sg->offset + sg->length == orig_offset) {
+   sg->length += use;
+   } else {
+   sge++;
+   sg_unmark_end(sge);
+   sg_set_page(sge, pfrag->page, use, orig_offset);
+   get_page(pfrag->page);
+   ++num_elem;
+   if (num_elem == MAX_SKB_FRAGS) {
+   rc = -ENOSPC;
+   break;
+   }
+   }
+
+   len -= use;
+   }
+out:
+   *sg_size = size;
+   *sg_num_elem = num_elem;
+   return rc;
+}
+EXPORT_SYMBOL(sk_alloc_sg);
+
 static void __lock_sock(struct sock *sk)
__releases(&sk->sk_lock.slock)
__acquires(&sk->sk_lock.slock)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 73d1921..dabdd1a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -87,71 +87,16 @@ static void trim_both_sgl(struct sock *sk, int target_size)
target_size);
 }
 
-static int alloc_sg(struct sock *sk, int len, struct scatterlist *sg,
-   int *sg_num_elem, unsigned int *sg_size,
-   int first_coalesce)
-{
-   struct page_frag *pfrag;
-   unsigned int size = *sg_size;
-   int num_elem = *sg_num_elem, use = 0, rc = 0;
-   struct scatterlist *sge;
-   unsigned int orig_offset;
-
-   len -= size;
-   pfrag = sk_page_frag(sk);
-
-   while (len > 0) {
-   if (!sk_page_frag_refill(sk, pfrag)) {
-   rc = -ENOMEM;
-   goto out;
-   }
-
-   use = min_t(int, len, pfrag->size - pfrag->offset);
-
-   if (!sk_wmem_schedule(sk, use)) {
-   rc = -ENOMEM;
-   goto out;
-   }
-
-   sk_mem_charge(sk, use);
-   size += use;
-   orig_offset = pfrag->offset;
-   pfrag->offset += use;
-
-   sge = sg + num_elem - 1;
-   if (num_elem > first_coalesce && sg_page(sg) == pfrag->page &&
-   sg->offset + sg->length == orig_offset) {
-   sg->length += use;
-   } else {
-   sge++;
-   sg_unmark_end(sge);
-   sg_set_page(sge, pfrag->page, use, orig_offset);
-   get_page(pfrag->page);
-   ++num_elem;
-   if (num_elem == MAX_SKB_FRAGS) {
-   rc = -ENOSPC;
-   

[bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt

2018-01-12 Thread John Fastabend
The sockmap refcnt up until now has been wrapped in the
sk_callback_lock(), so it has not actually needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three possible delete sock reference decrement
paths: first, through the normal sockmap process, the user deletes
the socket from the map. Second, the map is removed and all sockets
in the map are removed; the delete path is similar to case 1. The third
case is an asynchronous socket event such as closing the socket. The
last case handles removing sockets that are no longer available.
For completeness, although inc does not pose any problems in this
patch series, the inc case only happens when a psock is added to a
map.

Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this however we will need to
keep a reference count open across the sendmsg/sendpage call and
holding the sk_callback_lock() here (on every send) seems less than
ideal; it may also sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way, simply make
the reference counting a refcount_t type and do proper atomic ops.
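
The intended TX-path pattern then becomes roughly the following (sketch;
smap_psock_sk() is the existing sk_user_data accessor, and the remaining
names are assumed context rather than final code):

	rcu_read_lock();
	psock = smap_psock_sk(sk);
	if (unlikely(!psock)) {
		rcu_read_unlock();
		return tcp_sendmsg(sk, msg, size);	/* no policy attached */
	}
	refcount_inc(&psock->refcnt);	/* pin psock across the send */
	rcu_read_unlock();

	/* ... run the SK_MSG verdict program, hand data to the TCP stack ... */

	smap_release_sock(psock, sk);	/* drops the reference taken above */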

Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 3f662ee..972608f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -62,8 +62,7 @@ struct smap_psock_map_entry {
 
 struct smap_psock {
struct rcu_head rcu;
-   /* refcnt is used inside sk_callback_lock */
-   u32 refcnt;
+   refcount_t refcnt;
 
/* datapath variables */
struct sk_buff_head rxqueue;
@@ -346,14 +345,12 @@ static void smap_destroy_psock(struct rcu_head *rcu)
 
 static void smap_release_sock(struct smap_psock *psock, struct sock *sock)
 {
-   psock->refcnt--;
-   if (psock->refcnt)
-   return;
-
-   smap_stop_sock(psock, sock);
-   clear_bit(SMAP_TX_RUNNING, &psock->state);
-   rcu_assign_sk_user_data(sock, NULL);
-   call_rcu_sched(&psock->rcu, smap_destroy_psock);
+   if (refcount_dec_and_test(&psock->refcnt)) {
+   smap_stop_sock(psock, sock);
+   clear_bit(SMAP_TX_RUNNING, &psock->state);
+   rcu_assign_sk_user_data(sock, NULL);
+   call_rcu_sched(&psock->rcu, smap_destroy_psock);
+   }
 }
 
 static int smap_parse_func_strparser(struct strparser *strp,
@@ -485,7 +482,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
INIT_WORK(&psock->tx_work, smap_tx_work);
INIT_WORK(&psock->gc_work, smap_gc_work);
INIT_LIST_HEAD(&psock->maps);
-   psock->refcnt = 1;
+   refcount_set(&psock->refcnt, 1);
 
rcu_assign_sk_user_data(sock, psock);
sock_hold(sock);
@@ -745,7 +742,7 @@ static int sock_map_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
err = -EBUSY;
goto out_progs;
}
-   psock->refcnt++;
+   refcount_inc(&psock->refcnt);
} else {
psock = smap_init_psock(sock, stab);
if (IS_ERR(psock)) {



Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread Jiri Benc
On Fri, 12 Jan 2018 09:50:35 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> (Looks like you missed the last update I mentioned)

I did not miss it. The proposed behavior is inconsistent and has no
clear pattern (I used the word "magic" for that). I guess examples will
help more. See below.

> Here is the approach in detail -

> (a) At slave creation time - it's exactly how it's done currently,
> where the slave copies the master's mtu. At the same time, max_mtu of the slave
> is set to the current mtu of the master.
> (b) If the slave updates its mtu - ipvlan_change_mtu() will be called, the
> slave's mtu will get set, and a flag will be set indicating that the slave
> has changed its mtu (dissociation from the master if the mtu is different
> from the master's). If the slave mtu is set to the same value as the master's,
> then this flag will get reset, indicating it wants to follow the master (current
> behavior).

Consider these two cases:

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1400.

# ip l a link eth0 type ipvlan 
# ip l s eth0 mtu 1400
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1500.

See why I call that behavior "magic"? Before the last step, it's
impossible to tell from the user space what will happen. And no, don't
propose to artificially cover the first example by forcing the auto
follow flag, it doesn't help. It just moves the magic behavior to
different scenarios.

> (c) When the master updates its mtu - ipvlan_adj_mtu() gets called where all
> slaves' max_mtu changes to the master's mtu value (clamping applies
> for all slaves which are not following the master). All the slaves
> which are following the master (flag per slave) will get this new mtu.
> Another consequence of this is that a slave's flag might get reset if
> the master's mtu is reduced to the value that was set earlier for the
> slave (and it will start following the master again).

Okay, you're actually proposing that. So, another example:

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1500.

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1600
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1400. There's still no consistency here.

> The above should work? The user-space can query the mtu of the slave
> device just like any other device. I was thinking about 'mtu_adj' with
> some additional future extention but for now; we can live with a flag
> on the slave device(s).

It absolutely does not. Never introduce magic behavior, it just forces
you to add more layers of hacks later.

Really, the only way is to introduce user visible and user changeable
flag. Yes, it's more work. But that's something you'll have to swallow.
Introducing hacky behavior is not the way to go. We try hard to
preserve user space compatibility which is pretty much impossible if
you introduce magic hacky behavior. Do it right from the start.

 Jiri


[bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment

2018-01-12 Thread John Fastabend
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.

For sockmap, which is added in the next patches, a ULP is used
as a TX hook for TCP sockets. In this case the ULP is added at map
insert time from the kernel side, so the ULP is already known and
the named lookup is not needed.

Remove pr_notice, user gets an error code back and should
check that rather than rely on logs.
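
A kernel-side caller then attaches a ULP along these lines (sketch, using the
TLS id defined in this patch):

	/* No string compare and no request_module() based autoload here. */
	err = tcp_set_ulp_id(sk, TCP_ULP_TLS);
	if (err)
		return err;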

Signed-off-by: John Fastabend 
---
 include/net/tcp.h  |5 +
 net/ipv4/tcp_ulp.c |   51 ++-
 net/tls/tls_main.c |1 +
 3 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69..a99ceb8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1982,6 +1982,10 @@ static inline void tcp_listendrop(const struct sock *sk)
 #define TCP_ULP_MAX128
 #define TCP_ULP_BUF_MAX(TCP_ULP_NAME_MAX*TCP_ULP_MAX)
 
+enum {
+   TCP_ULP_TLS,
+};
+
 struct tcp_ulp_ops {
struct list_headlist;
 
@@ -1990,6 +1994,7 @@ struct tcp_ulp_ops {
/* cleanup ulp */
void (*release)(struct sock *sk);
 
+   int uid;
charname[TCP_ULP_NAME_MAX];
struct module   *owner;
 };
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 6bb9e14..58cadc5 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -29,6 +29,18 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name)
return NULL;
 }
 
+static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp)
+{
+   struct tcp_ulp_ops *e;
+
list_for_each_entry_rcu(e, &tcp_ulp_list, list) {
+   if (e->uid == ulp)
+   return e;
+   }
+
+   return NULL;
+}
+
 static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name)
 {
const struct tcp_ulp_ops *ulp = NULL;
@@ -51,6 +63,19 @@ static const struct tcp_ulp_ops 
*__tcp_ulp_find_autoload(const char *name)
return ulp;
 }
 
+static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid)
+{
+   const struct tcp_ulp_ops *ulp = NULL;
+
+   rcu_read_lock();
+   ulp = tcp_ulp_find_id(uid);
+   if (!ulp || !try_module_get(ulp->owner))
+   ulp = NULL;
+
+   rcu_read_unlock();
+   return ulp;
+}
+
 /* Attach new upper layer protocol to the list
  * of available protocols.
  */
@@ -59,13 +84,10 @@ int tcp_register_ulp(struct tcp_ulp_ops *ulp)
int ret = 0;
 
 	spin_lock(&tcp_ulp_list_lock);
-   if (tcp_ulp_find(ulp->name)) {
-   pr_notice("%s already registered or non-unique name\n",
- ulp->name);
+   if (tcp_ulp_find(ulp->name))
ret = -EEXIST;
-   } else {
+   else
 		list_add_tail_rcu(&ulp->list, &tcp_ulp_list);
-   }
 	spin_unlock(&tcp_ulp_list_lock);
 
return ret;
@@ -133,3 +155,22 @@ int tcp_set_ulp(struct sock *sk, const char *name)
icsk->icsk_ulp_ops = ulp_ops;
return 0;
 }
+
+int tcp_set_ulp_id(struct sock *sk, int ulp)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+   const struct tcp_ulp_ops *ulp_ops;
+   int err;
+
+   if (icsk->icsk_ulp_ops)
+   return -EEXIST;
+
+   ulp_ops = __tcp_ulp_lookup(ulp);
+   if (!ulp_ops)
+   return -ENOENT;
+
+   err = ulp_ops->init(sk);
+   if (!err)
+   icsk->icsk_ulp_ops = ulp_ops;
+   return err;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index e07ee3a..1563a9e 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -473,6 +473,7 @@ static int tls_init(struct sock *sk)
 
 static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.name   = "tls",
+   .uid= TCP_ULP_TLS,
.owner  = THIS_MODULE,
.init   = tls_init,
 };
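
A minimal sketch of how a kernel-side caller could use the new numeric-id
path instead of the string lookup. tcp_set_ulp_id() and TCP_ULP_TLS come
from the patch above; the wrapper function is made up, the declaration is
assumed to be visible via net/tcp.h, and the caller is assumed to hold the
socket lock, as the setsockopt() path does for tcp_set_ulp().

#include <net/tcp.h>

static int example_set_tls_ulp(struct sock *sk)
{
	/* no string compare and no request_module(), unlike tcp_set_ulp(sk, "tls") */
	return tcp_set_ulp_id(sk, TCP_ULP_TLS);
}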



[bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt

2018-01-12 Thread John Fastabend
This series includes 4 setup patches (1-4) to allow the ULP layer and
sockmap refcounting to support another ULP type. There is one small
change (5 lines!) on the TCP side in patch 4. This adds a flag so
that the ULP layer can inform the TCP stack not to mark the frags
as shared. When the ULP layer "owns" the frags and 'gives' them
to the TCP stack we know there can be no additional updates from
user side to the data.

Patch 5 is the bulk of the work. This adds a new program type
BPF_PROG_TYPE_SK_MSG, a sockmap program attach type
BPF_SK_MSG_VERDICT and a new ULP layer (TCP_ULP_BPF) to allow BPF
programs to be run on sendmsg/sendfile system calls and inspect data.
For now only TCP is supported; when sockmap is extended, other protos
can be added. See the patch description for a lengthy description
of the details. After this patch users can now attach BPF policy
and monitoring programs to socket send hooks.

Finally patches 6/7 and 7/7 add tests to test_maps and the verifier
in selftests/bpf so we get some automated coverage. One open
question I have is if we should move the samples/sockmap program
into selftests/bpf and start running it to get even more coverage
on the automated side. We can push this as an independent patch
set.

---

John Fastabend (7):
  net: add a UID to use for ULP socket assignment
  sock: make static tls function alloc_sg generic sock helper
  sockmap: convert refcnt to an atomic refcnt
  net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
  bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
  bpf: add map tests for BPF_PROG_TYPE_SK_MSG
  bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG


 include/linux/bpf.h|1 
 include/linux/bpf_types.h  |1 
 include/linux/filter.h |   10 
 include/linux/socket.h |1 
 include/net/sock.h |4 
 include/net/tcp.h  |7 
 include/uapi/linux/bpf.h   |   28 +
 kernel/bpf/sockmap.c   |  504 +++-
 kernel/bpf/syscall.c   |   14 -
 kernel/bpf/verifier.c  |5 
 net/core/filter.c  |  106 
 net/core/sock.c|   56 ++
 net/ipv4/tcp.c |4 
 net/ipv4/tcp_ulp.c |   51 ++
 net/tls/tls_main.c |1 
 net/tls/tls_sw.c   |   69 ---
 tools/include/uapi/linux/bpf.h |   16 +
 tools/testing/selftests/bpf/Makefile   |3 
 tools/testing/selftests/bpf/bpf_helpers.h  |2 
 tools/testing/selftests/bpf/sockmap_parse_prog.c   |   15 +
 tools/testing/selftests/bpf/sockmap_verdict_prog.c |7 
 tools/testing/selftests/bpf/test_maps.c|   54 ++
 tools/testing/selftests/bpf/test_verifier.c|   54 ++
 23 files changed, 913 insertions(+), 100 deletions(-)

--
Signature
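
For reference, a hypothetical user-space attach flow for the new hook,
mirroring how SK_SKB verdict programs are attached to a sockmap today.
prog_fd is an already-loaded BPF_PROG_TYPE_SK_MSG program and map_fd a
sockmap; BPF_SK_MSG_VERDICT is the attach type this series introduces,
and the exact uapi may differ from this sketch.

#include <linux/bpf.h>
#include <bpf/bpf.h>	/* tools/lib/bpf, provides bpf_prog_attach() */

static int attach_msg_verdict(int prog_fd, int map_fd)
{
	/* run the SK_MSG program on sendmsg()/sendfile() for sockets in the map */
	return bpf_prog_attach(prog_fd, map_fd, BPF_SK_MSG_VERDICT, 0);
}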


[PATCH iproute2] ipaddress: Use family_name() for better code reuse

2018-01-12 Thread Serhey Popovych
Signed-off-by: Serhey Popovych 
---
 ip/ipaddress.c |   23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index f150d91..a3595c1 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1558,6 +1558,8 @@ int print_addrinfo(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
print_bool(PRINT_ANY, "deleted", "Deleted ", true);
 
if (!brief) {
+   const char *name;
+
if (filter.oneline || filter.flushb) {
const char *dev = ll_index_to_name(ifa->ifa_index);
 
@@ -1570,20 +1572,13 @@ int print_addrinfo(const struct sockaddr_nl *who, 
struct nlmsghdr *n,
}
}
 
-   int family = ifa->ifa_family;
-
-   if (ifa->ifa_family == AF_INET)
-   print_string(PRINT_ANY, "family", "%s ", "inet");
-   else if (ifa->ifa_family == AF_INET6)
-   print_string(PRINT_ANY, "family", "%s ", "inet6");
-   else if (ifa->ifa_family == AF_DECnet)
-   print_string(PRINT_ANY, "family", "%s ", "dnet");
-   else if (ifa->ifa_family == AF_IPX)
-   print_string(PRINT_ANY, "family", " %s ", "ipx");
-   else
-   print_int(PRINT_ANY,
- "family_index",
- "family %d ", family);
+   name = family_name(ifa->ifa_family);
+   if (*name != '?') {
+   print_string(PRINT_ANY, "family", "%s ", name);
+   } else {
+   print_int(PRINT_ANY, "family_index", "family %d ",
+ ifa->ifa_family);
+   }
}
 
if (rta_tb[IFA_LOCAL]) {
-- 
1.7.10.4



[PATCH bpf-next v5 3/5] error-injection: Separate error-injection from kprobe

2018-01-12 Thread Masami Hiramatsu
The error-injection framework is not limited to being used by
kprobes or bpf; other kernel subsystems can use it freely to
check the safety of error injection, e.g. livepatch, ftrace etc.
So separate the error-injection framework from kprobes.

Some differences have been made:

- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
  ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
  feature. It is automatically enabled if the arch supports
  error injection feature for kprobe or ftrace etc.

Signed-off-by: Masami Hiramatsu 
Reviewed-by: Josef Bacik 
---
  Changes in v3:
   - Fix a build error for asmlinkage on i386 by including compiler.h
   - Fix "CONFIG_FUNCTION_ERROR_INJECT" typo.
   - Separate CONFIG_MODULES dependent code
   - Add CONFIG_KPROBES dependency for arch_deref_entry_point()
   - Call error-injection init function in late_initcall stage.
   - Fix read-side mutex lock
   - Some cosmetic cleanups
  Changes in v4:
   - Include error-injection.h for each ALLOW_ERROR_INJECTION
 macro user, instead of bpf.h
  Changes in v5:
   - Fix within_error_injection_list to return correct result.
---
 arch/Kconfig   |2 
 arch/x86/Kconfig   |2 
 arch/x86/include/asm/error-injection.h |   13 ++
 arch/x86/kernel/kprobes/core.c |   14 --
 arch/x86/lib/Makefile  |1 
 arch/x86/lib/error-inject.c|   19 +++
 fs/btrfs/disk-io.c |4 -
 fs/btrfs/free-space-cache.c|4 -
 include/asm-generic/error-injection.h  |   20 +++
 include/asm-generic/vmlinux.lds.h  |   14 +-
 include/linux/bpf.h|   11 --
 include/linux/error-injection.h|   21 +++
 include/linux/kprobes.h|1 
 include/linux/module.h |6 -
 kernel/kprobes.c   |  163 
 kernel/module.c|8 +
 kernel/trace/Kconfig   |2 
 kernel/trace/bpf_trace.c   |4 -
 kernel/trace/trace_kprobe.c|3 
 lib/Kconfig.debug  |4 +
 lib/Makefile   |1 
 lib/error-inject.c |  213 
 22 files changed, 317 insertions(+), 213 deletions(-)
 create mode 100644 arch/x86/include/asm/error-injection.h
 create mode 100644 arch/x86/lib/error-inject.c
 create mode 100644 include/asm-generic/error-injection.h
 create mode 100644 include/linux/error-injection.h
 create mode 100644 lib/error-inject.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d3f4aaf9cb7a..97376accfb14 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -196,7 +196,7 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
-config HAVE_KPROBE_OVERRIDE
+config HAVE_FUNCTION_ERROR_INJECTION
bool
 
 config HAVE_NMI
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 45dc6233f2b9..366b19cb79b7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -154,7 +154,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
-   select HAVE_KPROBE_OVERRIDE
+   select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/error-injection.h 
b/arch/x86/include/asm/error-injection.h
new file mode 100644
index ..47b7a1296245
--- /dev/null
+++ b/arch/x86/include/asm/error-injection.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ERROR_INJECTION_H
+#define _ASM_ERROR_INJECTION_H
+
+#include 
+#include 
+#include 
+#include 
+
+asmlinkage void just_return_func(void);
+void override_function_with_return(struct pt_regs *regs);
+
+#endif /* _ASM_ERROR_INJECTION_H */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b02a377d5905..bd36f3c33cd0 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1183,17 +1183,3 @@ int arch_trampoline_kprobe(struct kprobe *p)
 {
return 0;
 }
-
-asmlinkage void override_func(void);
-asm(
-   ".type override_func, @function\n"
-   "override_func:\n"
-   "   ret\n"
-   ".size override_func, .-override_func\n"
-);
-
-void arch_kprobe_override_function(struct pt_regs *regs)
-{
-	regs->ip = (unsigned long)&override_func;
-}
-NOKPROBE_SYMBOL(arch_kprobe_override_function);
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 7b181b61170e..171377b83be1 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -26,6 +26,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
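
A sketch of what marking a function error-injectable looks like with the
renamed macro. my_prepare_io() is a made-up function and the include path
is an assumption; only ALLOW_ERROR_INJECTION() itself comes from this
patch (the error-type argument is added later in the series).

#include <linux/error-injection.h>

static int my_prepare_io(void)
{
	/* real work elided; returns 0 or a negative errno */
	return 0;
}
ALLOW_ERROR_INJECTION(my_prepare_io);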

[PATCH bpf-next v5 5/5] error-injection: Support fault injection framework

2018-01-12 Thread Masami Hiramatsu
Support the in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error into a specified
function using debugfs interfaces.

Here is the result of test script described in
Documentation/fault-injection/fault-injection.txt

  ===
  # ./test_fail_function.sh
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
  btrfs-progs v4.4
  See http://btrfs.wiki.kernel.org for more information.

  Label:  (null)
  UUID:   bfa96010-12e9-4360-aed0-42eec7af5798
  Node size:  16384
  Sector size:4096
  Filesystem size:1001.00MiB
  Block group profiles:
Data: single8.00MiB
Metadata: DUP  58.00MiB
System:   DUP  12.00MiB
  SSD detected:   no
  Incompat features:  extref, skinny-metadata
  Number of devices:  1
  Devices:
 IDSIZE  PATH
  1  1001.00MiB  /dev/loop2

  mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
  SUCCESS!
  ===


Signed-off-by: Masami Hiramatsu 
Reviewed-by: Josef Bacik 
---
  Changes in v3:
   - Check and adjust error value for each target function
   - Clear kprobe flag for reuse
   - Add more documents and example
  Changes in v5:
   - Support multi-function error injection
---
 Documentation/fault-injection/fault-injection.txt |   68 
 kernel/Makefile   |1 
 kernel/fail_function.c|  349 +
 lib/Kconfig.debug |   10 +
 4 files changed, 428 insertions(+)
 create mode 100644 kernel/fail_function.c

diff --git a/Documentation/fault-injection/fault-injection.txt 
b/Documentation/fault-injection/fault-injection.txt
index 918972babcd8..f4a32463ca48 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -30,6 +30,12 @@ o fail_mmc_request
   injects MMC data errors on devices permitted by setting
   debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
 
+o fail_function
+
+  injects error return on specific functions, which are marked by
+  ALLOW_ERROR_INJECTION() macro, by setting debugfs entries
+  under /sys/kernel/debug/fail_function. No boot option supported.
+
 Configure fault-injection capabilities behavior
 ---
 
@@ -123,6 +129,29 @@ configuration of fault-injection capabilities.
default is 'N', setting it to 'Y' will disable failure injections
when dealing with private (address space) futexes.
 
+- /sys/kernel/debug/fail_function/inject:
+
+   Format: { 'function-name' | '!function-name' | '' }
+   specifies the target function of error injection by name.
+   If the function name leads '!' prefix, given function is
+   removed from injection list. If nothing specified ('')
+   injection list is cleared.
+
+- /sys/kernel/debug/fail_function/injectable:
+
+   (read only) shows error injectable functions and what type of
+   error values can be specified. The error type will be one of
+   below;
+   - NULL: retval must be 0.
+   - ERRNO: retval must be -1 to -MAX_ERRNO (-4096).
+   - ERR_NULL: retval must be 0 or -1 to -MAX_ERRNO (-4096).
+
+- /sys/kernel/debug/fail_function//retval:
+
+   specifies the "error" return value to inject to the given
+   function for given function. This will be created when
+   user specifies new injection entry.
+
 o Boot option
 
 In order to inject faults while debugfs is not available (early boot time),
@@ -268,6 +297,45 @@ trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" 
SIGINT SIGTERM EXIT
 echo "Injecting errors into the module $module... (interrupt to stop)"
 sleep 100
 
+--
+
+o Inject open_ctree error while btrfs mount
+
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir -p tmpmnt
+
+FAILTYPE=fail_function
+FAILFUNC=open_ctree
+echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
+echo -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
+echo N > /sys/kernel/debug/$FAILTYPE/task-filter
+echo 100 > /sys/kernel/debug/$FAILTYPE/probability
+echo 0 > /sys/kernel/debug/$FAILTYPE/interval
+echo -1 > /sys/kernel/debug/$FAILTYPE/times
+echo 0 > /sys/kernel/debug/$FAILTYPE/space
+echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
+
+mount -t btrfs $DEVICE tmpmnt
+if [ $? -ne 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+   umount tmpmnt
+fi
+
+echo > /sys/kernel/debug/$FAILTYPE/inject
+
+rmdir tmpmnt
+losetup -d $DEVICE
+rm testfile.img
+
+
 Tool to run command with failslab or fail_page_alloc
 
 In order to make it easier 

[PATCH bpf-next v5 4/5] error-injection: Add injectable error types

2018-01-12 Thread Masami Hiramatsu
Add injectable error types for each error-injectable function.

One motivation of error injection testing is to find software flaws,
mistakes or mis-handling of expectable errors. If we find such
flaws by the test, that is a program bug, so we need to fix it.

But if the tester injects the wrong error (e.g. just returns a success
code without processing anything), it causes unexpected behavior
even if the caller is correctly programmed to handle any errors.
That is not what we want to test by error injection.

To clarify what type of errors the caller must expect for each
injectable function, this introduces injectable error types:

 - EI_ETYPE_NULL : means the function will return NULL if it
fails. No ERR_PTR, just a NULL.
 - EI_ETYPE_ERRNO : means the function will return -ERRNO
if it fails.
 - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO
   (ERR_PTR) or NULL.

The ALLOW_ERROR_INJECTION() macro is expanded to take one of
NULL, ERRNO, ERRNO_NULL and record the error type for
each function, e.g.

 ALLOW_ERROR_INJECTION(open_ctree, ERRNO)

These error types are shown in debugfs as below.

  
  / # cat /sys/kernel/debug/error_injection/list
  open_ctree [btrfs]ERRNO
  io_ctl_init [btrfs]   ERRNO
  

Signed-off-by: Masami Hiramatsu 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/disk-io.c|2 +-
 fs/btrfs/free-space-cache.c   |2 +-
 include/asm-generic/error-injection.h |   23 +++---
 include/asm-generic/vmlinux.lds.h |2 +-
 include/linux/error-injection.h   |6 +
 include/linux/module.h|3 ++
 lib/error-inject.c|   43 -
 7 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9798e21ebe9d..83e2349e1362 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3124,7 +3124,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
-ALLOW_ERROR_INJECTION(open_ctree);
+ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ef847699031a..586bb06472bb 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -333,7 +333,7 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct 
inode *inode,
 
return 0;
 }
-ALLOW_ERROR_INJECTION(io_ctl_init);
+ALLOW_ERROR_INJECTION(io_ctl_init, ERRNO);
 
 static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 {
diff --git a/include/asm-generic/error-injection.h 
b/include/asm-generic/error-injection.h
index 08352c9d9f97..296c65442f00 100644
--- a/include/asm-generic/error-injection.h
+++ b/include/asm-generic/error-injection.h
@@ -3,17 +3,32 @@
 #define _ASM_GENERIC_ERROR_INJECTION_H
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+enum {
+   EI_ETYPE_NONE,  /* Dummy value for undefined case */
+   EI_ETYPE_NULL,  /* Return NULL if failure */
+   EI_ETYPE_ERRNO, /* Return -ERRNO if failure */
+   EI_ETYPE_ERRNO_NULL,/* Return -ERRNO or NULL if failure */
+};
+
+struct error_injection_entry {
+   unsigned long   addr;
+   int etype;
+};
+
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 /*
  * Whitelist ganerating macro. Specify functions which can be
  * error-injectable using this macro.
  */
-#define ALLOW_ERROR_INJECTION(fname)   \
-static unsigned long __used\
+#define ALLOW_ERROR_INJECTION(fname, _etype)   \
+static struct error_injection_entry __used \
__attribute__((__section__("_error_injection_whitelist")))  \
-   _eil_addr_##fname = (unsigned long)fname;
+   _eil_addr_##fname = {   \
+   .addr = (unsigned long)fname,   \
+   .etype = EI_ETYPE_##_etype, \
+   };
 #else
-#define ALLOW_ERROR_INJECTION(fname)
+#define ALLOW_ERROR_INJECTION(fname, _etype)
 #endif
 #endif
 
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index f2068cca5206..ebe544e048cd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -137,7 +137,7 @@
 #endif
 
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
-#define ERROR_INJECT_WHITELIST()   . = ALIGN(8); \
+#define ERROR_INJECT_WHITELIST()   STRUCT_ALIGN();   \
VMLINUX_SYMBOL(__start_error_injection_whitelist) = .;\
KEEP(*(_error_injection_whitelist))   \

[PATCH bpf-next v5 2/5] tracing/kprobe: bpf: Compare instruction pointer with original one

2018-01-12 Thread Masami Hiramatsu
Compare the instruction pointer with the original one on the
stack instead of using the per-cpu bpf_kprobe_override flag.

This patch also consolidates reset_current_kprobe() and
preempt_enable_no_resched() blocks. Those can be done
in one place.

Signed-off-by: Masami Hiramatsu 
Reviewed-by: Josef Bacik 
---
 kernel/trace/bpf_trace.c|1 -
 kernel/trace/trace_kprobe.c |   21 +++--
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 1966ad3bf3e0..24ed6363e00f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -83,7 +83,6 @@ EXPORT_SYMBOL_GPL(trace_call_bpf);
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
-   __this_cpu_write(bpf_kprobe_override, 1);
regs_set_return_value(regs, rc);
arch_kprobe_override_function(regs);
return 0;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 3c8deb977a8b..b8c90441bc87 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -42,8 +42,6 @@ struct trace_kprobe {
(offsetof(struct trace_kprobe, tp.args) +   \
(sizeof(struct probe_arg) * (n)))
 
-DEFINE_PER_CPU(int, bpf_kprobe_override);
-
 static nokprobe_inline bool trace_kprobe_is_return(struct trace_kprobe *tk)
 {
return tk->rp.handler != NULL;
@@ -1205,6 +1203,7 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs 
*regs)
int rctx;
 
if (bpf_prog_array_valid(call)) {
+   unsigned long orig_ip = instruction_pointer(regs);
int ret;
 
ret = trace_call_bpf(call, regs);
@@ -1212,12 +1211,13 @@ kprobe_perf_func(struct trace_kprobe *tk, struct 
pt_regs *regs)
/*
 * We need to check and see if we modified the pc of the
 * pt_regs, and if so clear the kprobe and return 1 so that we
-* don't do the instruction skipping.  Also reset our state so
-* we are clean the next pass through.
+* don't do the single stepping.
+* The ftrace kprobe handler leaves it up to us to re-enable
+* preemption here before returning if we've modified the ip.
 */
-   if (__this_cpu_read(bpf_kprobe_override)) {
-   __this_cpu_write(bpf_kprobe_override, 0);
+   if (orig_ip != instruction_pointer(regs)) {
reset_current_kprobe();
+   preempt_enable_no_resched();
return 1;
}
if (!ret)
@@ -1325,15 +1325,8 @@ static int kprobe_dispatcher(struct kprobe *kp, struct 
pt_regs *regs)
if (tk->tp.flags & TP_FLAG_TRACE)
kprobe_trace_func(tk, regs);
 #ifdef CONFIG_PERF_EVENTS
-   if (tk->tp.flags & TP_FLAG_PROFILE) {
+   if (tk->tp.flags & TP_FLAG_PROFILE)
ret = kprobe_perf_func(tk, regs);
-   /*
-* The ftrace kprobe handler leaves it up to us to re-enable
-* preemption here before returning if we've modified the ip.
-*/
-   if (ret)
-   preempt_enable_no_resched();
-   }
 #endif
return ret;
 }



[PATCH bpf-next v5 1/5] tracing/kprobe: bpf: Check error injectable event is on function entry

2018-01-12 Thread Masami Hiramatsu
Check whether an error-injectable event is on a function entry or not.
Currently it checks whether the event is an ftrace-based kprobe,
but that is wrong. It should check whether the event is on the entry
of the target function. Since error injection overrides a function
to just return with a modified return value, that operation must
be done before the target function starts building its stack frame.

As a side effect, bpf error injection no longer needs to depend on
the function tracer. It can work with software-breakpoint based
kprobe events too.

Signed-off-by: Masami Hiramatsu 
Reviewed-by: Josef Bacik 
---
  Changes in v3:
   - Move arch_ftrace_kprobe_override_function() to
 core.c because it is no longer depending on ftrace.
   - Fix a bug to skip passing kprobes target name to
 kprobe_on_func_entry(). Passing both @addr and @symbol
 to that function will result in failure.
---
 arch/x86/include/asm/kprobes.h   |4 +---
 arch/x86/kernel/kprobes/core.c   |   14 ++
 arch/x86/kernel/kprobes/ftrace.c |   14 --
 kernel/trace/Kconfig |2 --
 kernel/trace/bpf_trace.c |8 
 kernel/trace/trace_kprobe.c  |9 ++---
 kernel/trace/trace_probe.h   |   12 ++--
 7 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 36abb23a7a35..367d99cff426 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,9 +67,7 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
-#ifdef CONFIG_KPROBES_ON_FTRACE
-extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
-#endif
+extern void arch_kprobe_override_function(struct pt_regs *regs);
 
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index bd36f3c33cd0..b02a377d5905 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1183,3 +1183,17 @@ int arch_trampoline_kprobe(struct kprobe *p)
 {
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_kprobe_override_function(struct pt_regs *regs)
+{
+	regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_kprobe_override_function);
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 1ea748d682fd..8dc0161cec8f 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,17 +97,3 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
-
-asmlinkage void override_func(void);
-asm(
-   ".type override_func, @function\n"
-   "override_func:\n"
-   "   ret\n"
-   ".size override_func, .-override_func\n"
-);
-
-void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
-{
-	regs->ip = (unsigned long)&override_func;
-}
-NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ae3a2d519e50..6400e1bf97c5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -533,9 +533,7 @@ config FUNCTION_PROFILER
 config BPF_KPROBE_OVERRIDE
bool "Enable BPF programs to override a kprobed function"
depends on BPF_EVENTS
-   depends on KPROBES_ON_FTRACE
depends on HAVE_KPROBE_OVERRIDE
-   depends on DYNAMIC_FTRACE_WITH_REGS
default n
help
 Allows BPF to override the execution of a probed function and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f6d2327ecb59..1966ad3bf3e0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -85,7 +85,7 @@ BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, 
unsigned long, rc)
 {
__this_cpu_write(bpf_kprobe_override, 1);
regs_set_return_value(regs, rc);
-   arch_ftrace_kprobe_override_function(regs);
+   arch_kprobe_override_function(regs);
return 0;
 }
 
@@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
int ret = -EEXIST;
 
/*
-* Kprobe override only works for ftrace based kprobes, and only if they
-* are on the opt-in list.
+* Kprobe override only works if they are on the function entry,
+* and only if they are on the opt-in list.
 */
if (prog->kprobe_override &&
-   (!trace_kprobe_ftrace(event->tp_event) ||
+   (!trace_kprobe_on_func_entry(event->tp_event) ||
 !trace_kprobe_error_injectable(event->tp_event)))
return -EINVAL;
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 91f4b57dab82..3c8deb977a8b 100644

[PATCH bpf-next v5 0/5] Separate error injection table from kprobes

2018-01-12 Thread Masami Hiramatsu
Hi,

Here is the 5th version of the patches moving the error injection
table out of kprobes. This version fixes a bug and updates
fail-function to support multiple-function error injection.

Here is the previous version:

https://patchwork.ozlabs.org/cover/858663/

Changes in v5:
 - [3/5] Fix a bug that within_error_injection returns false always.
 - [5/5] Update to support multiple function error injection.

Thank you,

---

Masami Hiramatsu (5):
  tracing/kprobe: bpf: Check error injectable event is on function entry
  tracing/kprobe: bpf: Compare instruction pointer with original one
  error-injection: Separate error-injection from kprobe
  error-injection: Add injectable error types
  error-injection: Support fault injection framework


 Documentation/fault-injection/fault-injection.txt |   68 
 arch/Kconfig  |2 
 arch/x86/Kconfig  |2 
 arch/x86/include/asm/error-injection.h|   13 +
 arch/x86/include/asm/kprobes.h|4 
 arch/x86/kernel/kprobes/ftrace.c  |   14 -
 arch/x86/lib/Makefile |1 
 arch/x86/lib/error-inject.c   |   19 +
 fs/btrfs/disk-io.c|4 
 fs/btrfs/free-space-cache.c   |4 
 include/asm-generic/error-injection.h |   35 ++
 include/asm-generic/vmlinux.lds.h |   14 -
 include/linux/bpf.h   |   11 -
 include/linux/error-injection.h   |   27 ++
 include/linux/kprobes.h   |1 
 include/linux/module.h|7 
 kernel/Makefile   |1 
 kernel/fail_function.c|  349 +
 kernel/kprobes.c  |  163 --
 kernel/module.c   |8 
 kernel/trace/Kconfig  |4 
 kernel/trace/bpf_trace.c  |   11 -
 kernel/trace/trace_kprobe.c   |   33 +-
 kernel/trace/trace_probe.h|   12 -
 lib/Kconfig.debug |   14 +
 lib/Makefile  |1 
 lib/error-inject.c|  242 +++
 27 files changed, 819 insertions(+), 245 deletions(-)
 create mode 100644 arch/x86/include/asm/error-injection.h
 create mode 100644 arch/x86/lib/error-inject.c
 create mode 100644 include/asm-generic/error-injection.h
 create mode 100644 include/linux/error-injection.h
 create mode 100644 kernel/fail_function.c
 create mode 100644 lib/error-inject.c

--
Masami Hiramatsu (Linaro)


Re: [PATCH][next] ixgbe: fix comparison of offset with zero or NVM_INVALID_PTR

2018-01-12 Thread Jeff Kirsher
On Fri, 2018-01-12 at 17:13 +, Colin King wrote:
> From: Colin Ian King 
> 
> The incorrect operator && is being used and will always return false
> as offset can never be two different values at the same time. Fix
> this
> by using the || operator instead.
> 
> Detected by CoverityScan, CID#1463806 ("Logically dead code")
> 
> Fixes: 73834aec7199 ("ixgbe: extend firmware version support")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Dan Carpenter beat you to it, see his patch in my tree:

commit 7352baadcc2ad2ed214e52bd8b50ac6eb01968cd
Author: Dan Carpenter 
Date:   Fri Jan 12 09:45:00 2018 -0800

ixgbe: Fix && vs || typo

"offset" can't be both 0x0 and 0x so presumably || was intended
instead of &&.  That matches with how this check is done in other
functions.

Fixes: 73834aec7199 ("ixgbe: extend firmware version support")
Signed-off-by: Dan Carpenter 
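
A standalone illustration of why the original test was dead code: offset
is a single value, so it can never equal 0x0 and 0xFFFF at the same time,
and the early bail-out was never taken. The NVM_INVALID_PTR value and the
helper name are assumed for illustration only.

#include <stdbool.h>
#include <stdint.h>

#define NVM_INVALID_PTR	0xFFFF	/* sentinel value assumed for illustration */

static bool nvm_ptr_usable(uint16_t offset)
{
	/* the intended guard uses ||, as in the fix above */
	return !(offset == 0x0 || offset == NVM_INVALID_PTR);
}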



Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread महेश बंडेवार
On Fri, Jan 12, 2018 at 12:48 AM, Jiri Benc  wrote:
> On Fri, 12 Jan 2018 09:34:13 +0100, Jiri Benc wrote:
>> I don't think this works currently. When someone (does not have to be
>> you, it can be a management software running in background) sets the
>> MTU to the current value, the magic behavior is lost without any way to
>> restore it (unless I'm missing a way to restore it, see my question
>> above). So any user that depends on the magic behavior is broken anyway
>> even now.
>
> Upon further inspection, it seems that currently, slaves always follow
> master's MTU without a way to change it. Tough situation. Even
> implementing user space toggleable mtu_adj could break users in the way
> I described. But it seems to be the lesser evil, at least there would
> be a way to unbreak the scripts with one line addition.
>
> But it's absolute must to have this visible to the user space and
> changeable. Something like this:
>
> # ip a
> 123: ipvlan0:  mtu 1500 (auto) qdisc ...
> # ip l s ipvlan0 mtu 1400
> # ip a
> 123: ipvlan0:  mtu 1400 qdisc ...
> # ip l s ipvlan0 mtu auto
> # ip a
> 123: ipvlan0:  mtu 1500 (auto) qdisc ...
>
(Looks like you missed the last update I mentioned)
Here is the approach in detail -

(a) At slave creation time - exactly how it's done currently:
the slave copies the master's mtu. At the same time the slave's
max_mtu is set to the current mtu of the master.
(b) If the slave updates its mtu - ipvlan_change_mtu() will be called,
the slave's mtu will get set, and a flag will be set indicating that
the slave has changed its mtu (dissociating it from the master if the
mtu is different from the master's). If the slave mtu is set to the
same value as the master's, then this flag will get reset, indicating
it wants to follow the master (current behavior).
(c) When the master updates its mtu - ipvlan_adj_mtu() gets called,
where all slaves' max_mtu changes to the master's mtu value (clamping
applies for all slaves which are not following the master). All the
slaves which are following the master (flag per slave) will get this
new mtu. Another consequence of this is that a slave's flag might get
reset if the master's mtu is reduced to the value that was set earlier
for the slave (and it will start following the master).

The above should work? User space can query the mtu of the slave
device just like any other device. I was thinking about 'mtu_adj' with
some additional future extension, but for now we can live with a flag
on the slave device(s).

thanks,
--mahesh..

>  Jiri
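
A rough sketch of the (a)-(c) proposal above. The follow_master flag is
hypothetical and the structure layout only loosely follows the ipvlan
driver (struct ipvl_dev and struct ipvl_port live in the driver's private
ipvlan.h); this is not the actual implementation.

#include <linux/netdevice.h>

static int ipvlan_change_mtu(struct net_device *dev, int new_mtu)
{
	struct ipvl_dev *ipvlan = netdev_priv(dev);

	/* (b) explicit set: keep following the master only if the values match */
	ipvlan->follow_master = (new_mtu == ipvlan->phy_dev->mtu);
	dev->mtu = new_mtu;
	return 0;
}

static void ipvlan_propagate_master_mtu(struct ipvl_port *port, int new_mtu)
{
	struct ipvl_dev *ipvlan;

	/* (c) master changed: re-clamp every slave, update only the followers */
	list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
		ipvlan->dev->max_mtu = new_mtu;
		if (ipvlan->follow_master || ipvlan->dev->mtu == new_mtu) {
			ipvlan->dev->mtu = new_mtu;
			ipvlan->follow_master = true;
		}
	}
}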


[PATCH iproute2] tunnel: Return constant string without copying it

2018-01-12 Thread Serhey Popovych
We return a constant string from tnl_strproto(); there is no
need to copy it to a temporary buffer and then return that
buffer as const. Return the constant string directly instead.

Signed-off-by: Serhey Popovych 
---
 ip/tunnel.c |   25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/ip/tunnel.c b/ip/tunnel.c
index f860103..946a36c 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -39,33 +39,22 @@
 
 const char *tnl_strproto(__u8 proto)
 {
-   static char buf[16];
-
switch (proto) {
case IPPROTO_IPIP:
-   strcpy(buf, "ip");
-   break;
+   return "ip";
case IPPROTO_GRE:
-   strcpy(buf, "gre");
-   break;
+   return "gre";
case IPPROTO_IPV6:
-   strcpy(buf, "ipv6");
-   break;
+   return "ipv6";
case IPPROTO_ESP:
-   strcpy(buf, "esp");
-   break;
+   return "esp";
case IPPROTO_MPLS:
-   strcpy(buf, "mpls");
-   break;
+   return "mpls";
case 0:
-   strcpy(buf, "any");
-   break;
+   return "any";
default:
-   strcpy(buf, "unknown");
-   break;
+   return "unknown";
}
-
-   return buf;
 }
 
 int tnl_get_ioctl(const char *basedev, void *p)
-- 
1.7.10.4



[PATCH][next] bnxt_en: ensure len is ininitialized to zero

2018-01-12 Thread Colin King
From: Colin Ian King 

In the case where cmp_type == CMP_TYPE_RX_L2_TPA_START_CMP the
exit return path is via label next_rx_no_prod and cpr->rx_bytes
is being updated by an uninitialized value from len. Fix this by
initializing len to zero.

Detected by CoverityScan, CID#1463807 ("Uninitialized scalar variable")

Fixes: 6a8788f25625 ("bnxt_en: add support for software dynamic interrupt 
moderation")
Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index cf6ebf1e324b..5b5c4f266f1b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1482,7 +1482,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_napi 
*bnapi, u32 *raw_cons,
u32 tmp_raw_cons = *raw_cons;
u16 cfa_code, cons, prod, cp_cons = RING_CMP(tmp_raw_cons);
struct bnxt_sw_rx_bd *rx_buf;
-   unsigned int len;
+   unsigned int len = 0;
u8 *data_ptr, agg_bufs, cmp_type;
dma_addr_t dma_addr;
struct sk_buff *skb;
-- 
2.15.1



[PATCH iproute2 2/9] ip/tunnel: Correct and unify ttl/hoplimit printing

2018-01-12 Thread Serhey Popovych
Both ttl and hoplimit range from 1 to 255. Zero has a special meaning:
use the encapsulated packet's value. In ip-link(8) -d output this
looks like "ttl/hoplimit inherit". In JSON we have an "int" type
for ttl and therefore values from 0 (inherit) to 255.

To handle ttl/hoplimit properly we need to accept both cases:
a missing attribute in the netlink dump and a zero value for
the "inherit" case. The latter has been broken since JSON output
was introduced for gre/iptnl and was never handled for
gre6/ip6tnl.

For all tunnels except ip6tnl, change the JSON type from "int" to
"uint" to reflect the true nature of the ttl.

Signed-off-by: Serhey Popovych 
---
 ip/ip6tunnel.c|5 -
 ip/iplink_geneve.c|   13 +++--
 ip/iplink_vxlan.c |   15 +++
 ip/iproute_lwtunnel.c |4 ++--
 ip/iptunnel.c |2 +-
 ip/link_gre.c |   15 ++-
 ip/link_gre6.c|   15 +++
 ip/link_ip6tnl.c  |   13 +++--
 ip/link_iptnl.c   |   15 ++-
 9 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
index b8db49c..783e28a 100644
--- a/ip/ip6tunnel.c
+++ b/ip/ip6tunnel.c
@@ -92,7 +92,10 @@ static void print_tunnel(struct ip6_tnl_parm2 *p)
else
printf(" encaplimit %u", p->encap_limit);
 
-   printf(" hoplimit %u", p->hop_limit);
+   if (p->hop_limit)
+   printf(" hoplimit %u", p->hop_limit);
+   else
+   printf(" hoplimit inherit");
 
if (p->flags & IP6_TNL_F_USE_ORIG_TCLASS)
printf(" tclass inherit");
diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index f0f1d1c..5a0afd4 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -227,6 +227,7 @@ static int geneve_parse_opt(struct link_util *lu, int argc, 
char **argv,
 static void geneve_print_opt(struct link_util *lu, FILE *f, struct rtattr 
*tb[])
 {
__u32 vni;
+   __u8 ttl = 0;
__u8 tos;
 
if (!tb)
@@ -262,12 +263,12 @@ static void geneve_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
}
}
 
-   if (tb[IFLA_GENEVE_TTL]) {
-   __u8 ttl = rta_getattr_u8(tb[IFLA_GENEVE_TTL]);
-
-   if (ttl)
-   print_int(PRINT_ANY, "ttl", "ttl %d ", ttl);
-   }
+   if (tb[IFLA_GENEVE_TTL])
+   ttl = rta_getattr_u8(tb[IFLA_GENEVE_TTL]);
+   if (is_json_context() || ttl)
+   print_uint(PRINT_ANY, "ttl", "ttl %u ", ttl);
+   else
+   print_string(PRINT_FP, NULL, "ttl %s ", "inherit");
 
if (tb[IFLA_GENEVE_TOS] &&
(tos = rta_getattr_u8(tb[IFLA_GENEVE_TOS]))) {
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index a6c964a..bac326d 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -395,6 +395,7 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
 {
__u32 vni;
unsigned int link;
+   __u8 ttl = 0;
__u8 tos;
__u32 maxaddr;
 
@@ -525,14 +526,12 @@ static void vxlan_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
}
}
 
-   if (tb[IFLA_VXLAN_TTL]) {
-   __u8 ttl = rta_getattr_u8(tb[IFLA_VXLAN_TTL]);
-
-   if (ttl)
-   print_int(PRINT_ANY, "ttl", "ttl %d ", ttl);
-   else
-   print_int(PRINT_JSON, "ttl", NULL, ttl);
-   }
+   if (tb[IFLA_VXLAN_TTL])
+   ttl = rta_getattr_u8(tb[IFLA_VXLAN_TTL]);
+   if (is_json_context() || ttl)
+   print_uint(PRINT_ANY, "ttl", "ttl %u ", ttl);
+   else
+   print_string(PRINT_FP, NULL, "ttl %s ", "inherit");
 
if (tb[IFLA_VXLAN_LABEL]) {
__u32 label = rta_getattr_u32(tb[IFLA_VXLAN_LABEL]);
diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index a8d7171..a1d36ba 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -271,7 +271,7 @@ static void print_encap_ip(FILE *fp, struct rtattr *encap)
rt_addr_n2a_rta(AF_INET, tb[LWTUNNEL_IP_DST]));
 
if (tb[LWTUNNEL_IP_TTL])
-   fprintf(fp, "ttl %d ", rta_getattr_u8(tb[LWTUNNEL_IP_TTL]));
+   fprintf(fp, "ttl %u ", rta_getattr_u8(tb[LWTUNNEL_IP_TTL]));
 
if (tb[LWTUNNEL_IP_TOS])
fprintf(fp, "tos %d ", rta_getattr_u8(tb[LWTUNNEL_IP_TOS]));
@@ -326,7 +326,7 @@ static void print_encap_ip6(FILE *fp, struct rtattr *encap)
rt_addr_n2a_rta(AF_INET6, tb[LWTUNNEL_IP6_DST]));
 
if (tb[LWTUNNEL_IP6_HOPLIMIT])
-   fprintf(fp, "hoplimit %d ",
+   fprintf(fp, "hoplimit %u ",
rta_getattr_u8(tb[LWTUNNEL_IP6_HOPLIMIT]));
 
if (tb[LWTUNNEL_IP6_TC])
diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index ce610f8..0aa3b33 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -326,7 +326,7 @@ static void 

[PATCH iproute2 9/9] vti6/tunnel: Unify and simplify link type help functions

2018-01-12 Thread Serhey Popovych
Both of these changes are missing from link_vti6.c:

  o commit 8b47135474cd (ip: link: Unify link type help functions a bit)
  o commit 561e650eff67 (ip link: Shortify printing the usage of link type)

Replay them on link_vti6.c to bring the link type help functions
in line with the rest of the tunneling code.

Signed-off-by: Serhey Popovych 
---
 ip/link_vti6.c |   32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/ip/link_vti6.c b/ip/link_vti6.c
index a4ad650..b9fe83a 100644
--- a/ip/link_vti6.c
+++ b/ip/link_vti6.c
@@ -24,20 +24,25 @@
 #include "ip_common.h"
 #include "tunnel.h"
 
+static void print_usage(FILE *f)
+{
+   fprintf(f,
+   "Usage: ... vti6 [ remote ADDR ]\n"
+   "[ local ADDR ]\n"
+   "[ [i|o]key KEY ]\n"
+   "[ dev PHYS_DEV ]\n"
+   "[ fwmark MARK ]\n"
+   "\n"
+   "Where: ADDR := { IPV6_ADDRESS }\n"
+   "   KEY  := { DOTTED_QUAD | NUMBER }\n"
+   "   MARK := { 0x0..0x }\n"
+   );
+}
 
 static void usage(void) __attribute__((noreturn));
 static void usage(void)
 {
-   fprintf(stderr, "Usage: ip link { add | set | change | replace | del } 
NAME\n");
-   fprintf(stderr, "  type { vti6 } [ remote ADDR ] [ local ADDR 
]\n");
-   fprintf(stderr, "  [ [i|o]key KEY ]\n");
-   fprintf(stderr, "  [ dev PHYS_DEV ]\n");
-   fprintf(stderr, "  [ fwmark MARK ]\n");
-   fprintf(stderr, "\n");
-   fprintf(stderr, "Where: NAME := STRING\n");
-   fprintf(stderr, "   ADDR := { IPV6_ADDRESS }\n");
-   fprintf(stderr, "   KEY  := { DOTTED_QUAD | NUMBER }\n");
-   fprintf(stderr, "   MARK := { 0x0..0x }\n");
+   print_usage(stderr);
exit(-1);
 }
 
@@ -216,9 +221,16 @@ static void vti6_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
print_0xhex(PRINT_ANY, "fwmark", "fwmark 0x%x ", fwmark);
 }
 
+static void vti6_print_help(struct link_util *lu, int argc, char **argv,
+   FILE *f)
+{
+   print_usage(f);
+}
+
 struct link_util vti6_link_util = {
.id = "vti6",
.maxattr = IFLA_VTI_MAX,
.parse_opt = vti6_parse_opt,
.print_opt = vti6_print_opt,
+   .print_help = vti6_print_help,
 };
-- 
1.7.10.4



[PATCH iproute2 4/9] ip/tunnel: Use print_0xhex() instead of print_string()

2018-01-12 Thread Serhey Popovych
No need for a custom SPRINT_BUF() and an snprintf() of the 0x%x
value into that buffer: we can use print_0xhex() instead
of print_string().

In link_iptnl.c use the s2 buffer instead of s1 and remove s1.

While there, adjust the fwmark option print order in iptnl
and ip6tnl to make them match each other.

Signed-off-by: Serhey Popovych 
---
 ip/link_gre.c|5 ++---
 ip/link_gre6.c   |9 -
 ip/link_ip6tnl.c |   18 --
 ip/link_iptnl.c  |   22 ++
 ip/link_vti.c|   14 --
 ip/link_vti6.c   |   13 -
 6 files changed, 32 insertions(+), 49 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 11a131f..2ae2194 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -442,9 +442,8 @@ static void gre_print_direct_opt(FILE *f, struct rtattr 
*tb[])
__u32 fwmark = rta_getattr_u32(tb[IFLA_GRE_FWMARK]);
 
if (fwmark) {
-   snprintf(s2, sizeof(s2), "0x%x", fwmark);
-
-   print_string(PRINT_ANY, "fwmark", "fwmark %s ", s2);
+   print_0xhex(PRINT_ANY,
+   "fwmark", "fwmark 0x%x ", fwmark);
}
}
 }
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 9b08656..9576354 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -496,18 +496,17 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
if (oflags & GRE_CSUM)
print_bool(PRINT_ANY, "ocsum", "ocsum ", true);
 
-   if (flags & IP6_TNL_F_USE_ORIG_FWMARK)
+   if (flags & IP6_TNL_F_USE_ORIG_FWMARK) {
print_bool(PRINT_ANY,
   "ip6_tnl_f_use_orig_fwmark",
   "fwmark inherit ",
   true);
-   else if (tb[IFLA_GRE_FWMARK]) {
+   } else if (tb[IFLA_GRE_FWMARK]) {
__u32 fwmark = rta_getattr_u32(tb[IFLA_GRE_FWMARK]);
 
if (fwmark) {
-   snprintf(s2, sizeof(s2), "0x%x", fwmark);
-
-   print_string(PRINT_ANY, "fwmark", "fwmark %s ", s2);
+   print_0xhex(PRINT_ANY,
+   "fwmark", "fwmark 0x%x ", fwmark);
}
}
 
diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
index bd06e46..9594c3e 100644
--- a/ip/link_ip6tnl.c
+++ b/ip/link_ip6tnl.c
@@ -437,6 +437,12 @@ static void ip6tunnel_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb
if (flags & IP6_TNL_F_MIP6_DEV)
print_bool(PRINT_ANY, "ip6_tnl_f_mip6_dev", "mip6 ", true);
 
+   if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
+   print_bool(PRINT_ANY,
+  "ip6_tnl_f_allow_local_remote",
+  "allow-localremote ",
+  true);
+
if (flags & IP6_TNL_F_USE_ORIG_FWMARK) {
print_bool(PRINT_ANY,
   "ip6_tnl_f_use_orig_fwmark",
@@ -446,19 +452,11 @@ static void ip6tunnel_print_opt(struct link_util *lu, 
FILE *f, struct rtattr *tb
__u32 fwmark = rta_getattr_u32(tb[IFLA_IPTUN_FWMARK]);
 
if (fwmark) {
-   SPRINT_BUF(b1);
-
-   snprintf(b1, sizeof(b1), "0x%x", fwmark);
-   print_string(PRINT_ANY, "fwmark", "fwmark %s ", b1);
+   print_0xhex(PRINT_ANY,
+   "fwmark", "fwmark 0x%x ", fwmark);
}
}
 
-   if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
-   print_bool(PRINT_ANY,
-  "ip6_tnl_f_allow_local_remote",
-  "allow-localremote ",
-  true);
-
if (tb[IFLA_IPTUN_ENCAP_TYPE] &&
rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE) {
__u16 type = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_TYPE]);
diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
index fd08bd9..fcb0795 100644
--- a/ip/link_iptnl.c
+++ b/ip/link_iptnl.c
@@ -360,7 +360,6 @@ get_failed:
 
 static void iptunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr 
*tb[])
 {
-   char s1[1024];
char s2[64];
const char *local = "any";
const char *remote = "any";
@@ -454,7 +453,7 @@ static void iptunnel_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[
 
const char *prefix = inet_ntop(AF_INET6,
   
RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIX]),
-  s1, sizeof(s1));
+  s2, sizeof(s2));
 
if (is_json_context()) {
print_string(PRINT_JSON, "prefix", NULL, prefix);
@@ -481,6 +480,15 @@ static void iptunnel_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[
}
}
 
+   if (tb[IFLA_IPTUN_FWMARK]) {
+   

[PATCH iproute2 5/9] ip/tunnel: Use print_string() and simplify encap option printing

2018-01-12 Thread Serhey Popovych
Use print_string() instead of fputs() and fprintf() to
print the encapsulation options for !is_json_context().

Introduce and use tnl_encap_optstr() to format the encapsulation
option string according to a template and the given values, to
avoid code duplication and simplify the code.

Signed-off-by: Serhey Popovych 
---
 ip/link_gre.c|   54 +-
 ip/link_gre6.c   |   46 ++
 ip/link_ip6tnl.c |   40 +++-
 ip/link_iptnl.c  |   52 ++--
 ip/tunnel.c  |   24 
 ip/tunnel.h  |1 +
 6 files changed, 109 insertions(+), 108 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 2ae2194..b70f73b 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -491,50 +491,38 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
}
 
if (is_json_context()) {
-   print_uint(PRINT_JSON,
-  "sport",
-  NULL,
-  sport ? ntohs(sport) : 0);
+   print_uint(PRINT_JSON, "sport", NULL, ntohs(sport));
print_uint(PRINT_JSON, "dport", NULL, ntohs(dport));
-
-   print_bool(PRINT_JSON,
-  "csum",
-  NULL,
+   print_bool(PRINT_JSON, "csum", NULL,
   flags & TUNNEL_ENCAP_FLAG_CSUM);
-
-   print_bool(PRINT_JSON,
-  "csum6",
-  NULL,
+   print_bool(PRINT_JSON, "csum6", NULL,
   flags & TUNNEL_ENCAP_FLAG_CSUM6);
-
-   print_bool(PRINT_JSON,
-  "remcsum",
-  NULL,
+   print_bool(PRINT_JSON, "remcsum", NULL,
   flags & TUNNEL_ENCAP_FLAG_REMCSUM);
 
close_json_object();
} else {
-   if (sport == 0)
-   fputs("encap-sport auto ", f);
-   else
-   fprintf(f, "encap-sport %u", ntohs(sport));
+   int t;
 
-   fprintf(f, "encap-dport %u ", ntohs(dport));
+   t = sport ? ntohs(sport) + 1 : 0;
+   print_string(PRINT_FP, NULL, "%s",
+tnl_encap_optstr("sport", 1, t));
 
-   if (flags & TUNNEL_ENCAP_FLAG_CSUM)
-   fputs("encap-csum ", f);
-   else
-   fputs("noencap-csum ", f);
+   t = ntohs(dport) + 1;
+   print_string(PRINT_FP, NULL, "%s",
+tnl_encap_optstr("dport", 1, t));
 
-   if (flags & TUNNEL_ENCAP_FLAG_CSUM6)
-   fputs("encap-csum6 ", f);
-   else
-   fputs("noencap-csum6 ", f);
+   t = flags & TUNNEL_ENCAP_FLAG_CSUM;
+   print_string(PRINT_FP, NULL, "%s",
+tnl_encap_optstr("csum", t, -1));
 
-   if (flags & TUNNEL_ENCAP_FLAG_REMCSUM)
-   fputs("encap-remcsum ", f);
-   else
-   fputs("noencap-remcsum ", f);
+   t = flags & TUNNEL_ENCAP_FLAG_CSUM6;
+   print_string(PRINT_FP, NULL, "%s",
+tnl_encap_optstr("csum6", t, -1));
+
+   t = flags & TUNNEL_ENCAP_FLAG_REMCSUM;
+   print_string(PRINT_FP, NULL, "%s",
+tnl_encap_optstr("remcsum", t, -1));
}
}
 }
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 9576354..41180bb 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -538,40 +538,38 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
}
 
if (is_json_context()) {
-   print_uint(PRINT_JSON,
-  "sport",
-  NULL,
-  sport ? ntohs(sport) : 0);
+   print_uint(PRINT_JSON, "sport", NULL, ntohs(sport));
print_uint(PRINT_JSON, "dport", NULL, ntohs(dport));
print_bool(PRINT_JSON, "csum", NULL,
-  flags & TUNNEL_ENCAP_FLAG_CSUM);
+  flags & TUNNEL_ENCAP_FLAG_CSUM);
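
The ip/tunnel.c hunk that adds tnl_encap_optstr() is not visible in the
quoted diff, so the helper below is only a guess at its shape, inferred
from the call sites above (value 0 prints "auto", a positive value prints
value - 1, and the enable flag selects the encap-/noencap- prefix). The
real helper may well differ.

#include <stdio.h>

const char *tnl_encap_optstr(const char *name, int enable, int value)
{
	static char buf[64];

	if (!enable)
		snprintf(buf, sizeof(buf), "noencap-%s ", name);
	else if (value < 0)
		snprintf(buf, sizeof(buf), "encap-%s ", name);
	else if (value == 0)
		snprintf(buf, sizeof(buf), "encap-%s auto ", name);
	else
		snprintf(buf, sizeof(buf), "encap-%s %d ", name, value - 1);

	return buf;
}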
 

[PATCH iproute2 8/9] vti/tunnel: Unify ikey/okey printing

2018-01-12 Thread Serhey Popovych
For the vti6 tunnel we print [io]key in dotted-quad notation
(IPv4 address) while for vti we do that in hex format.

For the vti tunnel we print [io]key only if the value is not
zero, while for vti6 we miss such a check.

Unify the vti and vti6 tunnel [io]key output.

While here, enlarge the s2 buffer to the same size as in the rest
of the tunnel support code (64 bytes).

Signed-off-by: Serhey Popovych 
---
 ip/link_vti.c  |   15 +--
 ip/link_vti6.c |7 +--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/ip/link_vti.c b/ip/link_vti.c
index eebf542..5fccf74 100644
--- a/ip/link_vti.c
+++ b/ip/link_vti.c
@@ -170,7 +170,7 @@ static void vti_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
__u32 key;
__u32 fwmark;
unsigned int link;
-   char s2[IFNAMSIZ];
+   char s2[64];
 
if (!tb)
return;
@@ -201,13 +201,16 @@ static void vti_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
}
 
if (tb[IFLA_VTI_IKEY] &&
-   (key = rta_getattr_u32(tb[IFLA_VTI_IKEY])))
-   print_0xhex(PRINT_ANY, "ikey", "ikey %#x ", ntohl(key));
-
+   (key = rta_getattr_u32(tb[IFLA_VTI_IKEY]))) {
+   inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_IKEY]), s2, sizeof(s2));
+   print_string(PRINT_ANY, "ikey", "ikey %s ", s2);
+   }
 
if (tb[IFLA_VTI_OKEY] &&
-   (key = rta_getattr_u32(tb[IFLA_VTI_OKEY])))
-   print_0xhex(PRINT_ANY, "okey", "okey %#x ", ntohl(key));
+   (key = rta_getattr_u32(tb[IFLA_VTI_OKEY]))) {
+   inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_OKEY]), s2, sizeof(s2));
+   print_string(PRINT_ANY, "okey", "okey %s ", s2);
+   }
 
if (tb[IFLA_VTI_FWMARK] &&
(fwmark = rta_getattr_u32(tb[IFLA_VTI_FWMARK])))
diff --git a/ip/link_vti6.c b/ip/link_vti6.c
index 29a7062..a4ad650 100644
--- a/ip/link_vti6.c
+++ b/ip/link_vti6.c
@@ -168,6 +168,7 @@ static void vti6_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
const char *remote = "any";
struct in6_addr saddr;
struct in6_addr daddr;
+   __u32 key;
__u32 fwmark;
unsigned int link;
char s2[64];
@@ -198,12 +199,14 @@ static void vti6_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
print_string(PRINT_ANY, "link", "dev %s ", n);
}
 
-   if (tb[IFLA_VTI_IKEY]) {
+   if (tb[IFLA_VTI_IKEY] &&
+   (key = rta_getattr_u32(tb[IFLA_VTI_IKEY]))) {
inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_IKEY]), s2, sizeof(s2));
print_string(PRINT_ANY, "ikey", "ikey %s ", s2);
}
 
-   if (tb[IFLA_VTI_OKEY]) {
+   if (tb[IFLA_VTI_OKEY] &&
+   (key = rta_getattr_u32(tb[IFLA_VTI_OKEY]))) {
inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_OKEY]), s2, sizeof(s2));
print_string(PRINT_ANY, "okey", "okey %s ", s2);
}
-- 
1.7.10.4



[PATCH iproute2 6/9] gre/tunnel: Print erspan_index using print_uint()

2018-01-12 Thread Serhey Popovych
erspan_index is missing from the JSON output because fprintf()
is used instead of print_uint().

Signed-off-by: Serhey Popovych 
---
 ip/link_gre.c  |3 ++-
 ip/link_gre6.c |4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index b70f73b..a7d1cd1 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -464,7 +464,8 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
if (tb[IFLA_GRE_ERSPAN_INDEX]) {
__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
 
-   fprintf(f, "erspan_index %u ", erspan_idx);
+   print_uint(PRINT_ANY,
+  "erspan_index", "erspan_index %u ", erspan_idx);
}
 
if (tb[IFLA_GRE_ENCAP_TYPE] &&
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 41180bb..200846e 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -512,7 +512,9 @@ static void gre_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
 
if (tb[IFLA_GRE_ERSPAN_INDEX]) {
__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
-   fprintf(f, "erspan_index %u ", erspan_idx);
+
+   print_uint(PRINT_ANY,
+  "erspan_index", "erspan_index %u ", erspan_idx);
}
 
if (tb[IFLA_GRE_ENCAP_TYPE] &&
-- 
1.7.10.4



[PATCH iproute2 3/9] ip/tunnel: Simplify and unify tos printing

2018-01-12 Thread Serhey Popovych
For IP tunnels, tos can be 0 when not configured, 1 when
inherited from the encapsulated packet, with the remaining values
specifying diffserv (RFC 2474) or tos (RFC 1349) bits. It is stored
in the packet tos/diffserv field and returned to userspace in the
tos netlink attribute.

Simplify and unify tos printing by using print_0xhex()
and print_string() instead of fprintf() to output the values.

Signed-off-by: Serhey Popovych 
---
 ip/iplink_geneve.c |   23 ---
 ip/iplink_vxlan.c  |   19 ---
 ip/link_gre.c  |   23 ---
 ip/link_iptnl.c|   22 --
 4 files changed, 32 insertions(+), 55 deletions(-)

diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index 5a0afd4..2d0a041 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -228,7 +228,7 @@ static void geneve_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
 {
__u32 vni;
__u8 ttl = 0;
-   __u8 tos;
+   __u8 tos = 0;
 
if (!tb)
return;
@@ -270,20 +270,13 @@ static void geneve_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
else
print_string(PRINT_FP, NULL, "ttl %s ", "inherit");
 
-   if (tb[IFLA_GENEVE_TOS] &&
-   (tos = rta_getattr_u8(tb[IFLA_GENEVE_TOS]))) {
-   if (is_json_context()) {
-   print_0xhex(PRINT_JSON, "tos", "%#x", tos);
-   } else {
-   if (tos == 1) {
-   print_string(PRINT_FP,
-"tos",
-"tos %s ",
-"inherit");
-   } else {
-   fprintf(f, "tos %#x ", tos);
-   }
-   }
+   if (tb[IFLA_GENEVE_TOS])
+   tos = rta_getattr_u8(tb[IFLA_GENEVE_TOS]);
+   if (tos) {
+   if (is_json_context() || tos != 1)
+   print_0xhex(PRINT_ANY, "tos", "tos 0x%x ", tos);
+   else
+   print_string(PRINT_FP, NULL, "tos %s ", "inherit");
}
 
if (tb[IFLA_GENEVE_LABEL]) {
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index bac326d..e292369 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -396,7 +396,7 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
__u32 vni;
unsigned int link;
__u8 ttl = 0;
-   __u8 tos;
+   __u8 tos = 0;
__u32 maxaddr;
 
if (!tb)
@@ -514,16 +514,13 @@ static void vxlan_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
if (tb[IFLA_VXLAN_L3MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L3MISS]))
print_bool(PRINT_ANY, "l3miss", "l3miss ", true);
 
-   if (tb[IFLA_VXLAN_TOS] &&
-   (tos = rta_getattr_u8(tb[IFLA_VXLAN_TOS]))) {
-   if (is_json_context()) {
-   print_0xhex(PRINT_JSON, "tos", "%#x", tos);
-   } else {
-   if (tos == 1)
-   fprintf(f, "tos %s ", "inherit");
-   else
-   fprintf(f, "tos %#x ", tos);
-   }
+   if (tb[IFLA_VXLAN_TOS])
+   tos = rta_getattr_u8(tb[IFLA_VXLAN_TOS]);
+   if (tos) {
+   if (is_json_context() || tos != 1)
+   print_0xhex(PRINT_ANY, "tos", "tos 0x%x ", tos);
+   else
+   print_string(PRINT_FP, NULL, "tos %s ", "inherit");
}
 
if (tb[IFLA_VXLAN_TTL])
diff --git a/ip/link_gre.c b/ip/link_gre.c
index 2f6be20..11a131f 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -363,6 +363,7 @@ static void gre_print_direct_opt(FILE *f, struct rtattr 
*tb[])
unsigned int iflags = 0;
unsigned int oflags = 0;
__u8 ttl = 0;
+   __u8 tos = 0;
 
if (tb[IFLA_GRE_REMOTE]) {
unsigned int addr = rta_getattr_u32(tb[IFLA_GRE_REMOTE]);
@@ -396,21 +397,13 @@ static void gre_print_direct_opt(FILE *f, struct rtattr 
*tb[])
else
print_string(PRINT_FP, NULL, "ttl %s ", "inherit");
 
-   if (tb[IFLA_GRE_TOS] && rta_getattr_u8(tb[IFLA_GRE_TOS])) {
-   int tos = rta_getattr_u8(tb[IFLA_GRE_TOS]);
-
-   if (is_json_context()) {
-   SPRINT_BUF(b1);
-
-   snprintf(b1, sizeof(b1), "0x%x", tos);
-   print_string(PRINT_JSON, "tos", NULL, b1);
-   } else {
-   fputs("tos ", f);
-   if (tos == 1)
-   fputs("inherit ", f);
-   else
-   fprintf(f, "0x%x ", tos);
-   }
+   if (tb[IFLA_GRE_TOS])
+   tos = rta_getattr_u8(tb[IFLA_GRE_TOS]);
+   if (tos) {
+   if 

[PATCH iproute2 1/9] iplink: Use ll_index_to_name() instead of if_indextoname()

2018-01-12 Thread Serhey Popovych
There are two reasons for switching to the cached variant:

  1) ll_index_to_name() may return the result from its cache,
     eliminating an expensive ioctl() to the kernel.

     Note that most of the code in the print path has already switched
     from plain if_indextoname() to the cached ll_index_to_name(),
     because in most cases the cache is populated.

  2) It always returns a name, in the form "if%d" even if the
     entry is not in the cache and the ioctl() fails. This drops
     "link_index" from the JSON output.

Signed-off-by: Serhey Popovych 
---
 bridge/fdb.c  |4 ++--
 bridge/link.c |   18 ++
 ip/iplink_bond.c  |   25 -
 ip/iplink_vxlan.c |8 ++--
 ip/iproute_lwtunnel.c |7 ++-
 ip/link_gre.c |   12 +---
 ip/link_gre6.c|   12 +---
 ip/link_ip6tnl.c  |   12 +---
 ip/link_iptnl.c   |   12 +---
 ip/link_vti.c |7 ++-
 ip/link_vti6.c|   10 --
 11 files changed, 42 insertions(+), 85 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index 376713b..2cc0268 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -219,10 +219,10 @@ int print_fdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
unsigned int ifindex = rta_getattr_u32(tb[NDA_IFINDEX]);
 
if (ifindex) {
-   char ifname[IF_NAMESIZE];
+   const char *ifname;
 
if (!tb[NDA_LINK_NETNSID] &&
-   if_indextoname(ifindex, ifname)) {
+   (ifname = ll_index_to_name(ifindex))) {
if (jw_global)
jsonw_string_field(jw_global, "viaIf",
   ifname);
diff --git a/bridge/link.c b/bridge/link.c
index e2371d0..9c846c9 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -26,8 +26,6 @@ static const char *port_states[] = {
[BR_STATE_BLOCKING] = "blocking",
 };
 
-extern char *if_indextoname(unsigned int __ifindex, char *__ifname);
-
 static void print_link_flags(FILE *fp, unsigned int flags)
 {
fprintf(fp, "<");
@@ -104,7 +102,6 @@ int print_linkinfo(const struct sockaddr_nl *who,
int len = n->nlmsg_len;
struct ifinfomsg *ifi = NLMSG_DATA(n);
struct rtattr *tb[IFLA_MAX+1];
-   char b1[IFNAMSIZ];
 
len -= NLMSG_LENGTH(sizeof(*ifi));
if (len < 0) {
@@ -135,14 +132,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
print_operstate(fp, rta_getattr_u8(tb[IFLA_OPERSTATE]));
 
if (tb[IFLA_LINK]) {
-   SPRINT_BUF(b1);
int iflink = rta_getattr_u32(tb[IFLA_LINK]);
 
-   if (iflink == 0)
-   fprintf(fp, "@NONE: ");
-   else
-   fprintf(fp, "@%s: ",
-   if_indextoname(iflink, b1));
+   fprintf(fp, "@%s: ", iflink ? ll_index_to_name(iflink) : "NONE");
} else
fprintf(fp, ": ");
 
@@ -151,9 +143,11 @@ int print_linkinfo(const struct sockaddr_nl *who,
if (tb[IFLA_MTU])
fprintf(fp, "mtu %u ", rta_getattr_u32(tb[IFLA_MTU]));
 
-   if (tb[IFLA_MASTER])
-   fprintf(fp, "master %s ",
-   if_indextoname(rta_getattr_u32(tb[IFLA_MASTER]), b1));
+   if (tb[IFLA_MASTER]) {
+   int master = rta_getattr_u32(tb[IFLA_MASTER]);
+
+   fprintf(fp, "master %s ", ll_index_to_name(master));
+   }
 
if (tb[IFLA_PROTINFO]) {
if (tb[IFLA_PROTINFO]->rta_type & NLA_F_NESTED) {
diff --git a/ip/iplink_bond.c b/ip/iplink_bond.c
index 2b5cf4f..45900c8 100644
--- a/ip/iplink_bond.c
+++ b/ip/iplink_bond.c
@@ -382,19 +382,9 @@ static void bond_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
if (tb[IFLA_BOND_ACTIVE_SLAVE] &&
(ifindex = rta_getattr_u32(tb[IFLA_BOND_ACTIVE_SLAVE]))) {
-   char buf[IFNAMSIZ];
-   const char *n = if_indextoname(ifindex, buf);
+   const char *n = ll_index_to_name(ifindex);
 
-   if (n)
-   print_string(PRINT_ANY,
-"active_slave",
-"active_slave %s ",
-n);
-   else
-   print_uint(PRINT_ANY,
-  "active_slave_index",
-  "active_slave %u ",
-  ifindex);
+   print_string(PRINT_ANY, "active_slave", "active_slave %s ", n);
}
 
if (tb[IFLA_BOND_MIIMON])
@@ -481,16 +471,9 @@ static void bond_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
if (tb[IFLA_BOND_PRIMARY] &&
(ifindex = 

[PATCH iproute2 0/9] ip/tunnel: Improve tunnel parameters printing

2018-01-12 Thread Serhey Popovych
Continue improvements to the tunnel modules. The final goal
is to merge the IP and IPv6 variants and to make that
transition as transparent as possible.

Everything within this series is open for your comments,
suggestions and criticism.

See the individual patch descriptions for details.

Thanks,
Serhii

Serhey Popovych (9):
  iplink: Use ll_index_to_name() instead of if_indextoname()
  ip/tunnel: Correct and unify ttl/hoplimit printing
  ip/tunnel: Simplify and unify tos printing
  ip/tunnel: Use print_0xhex() instead of print_string()
  ip/tunnel: Use print_string() and simplify encap option printing
  gre/tunnel: Print erspan_index using print_uint()
  ip/tunnel: Minor cleanups in print routines
  vti/tunnel: Unify ikey/okey printing
  vti6/tunnel: Unify and simplify link type help functions

 bridge/fdb.c  |4 +-
 bridge/link.c |   18 +++-
 ip/ip6tunnel.c|5 +-
 ip/iplink_bond.c  |   25 ++
 ip/iplink_geneve.c|   38 +++
 ip/iplink_vxlan.c |   42 +++--
 ip/iproute_lwtunnel.c |   11 ++---
 ip/iptunnel.c |2 +-
 ip/link_gre.c |  117 +++---
 ip/link_gre6.c|   94 ++---
 ip/link_ip6tnl.c  |   89 +--
 ip/link_iptnl.c   |  123 -
 ip/link_vti.c |   36 ++-
 ip/link_vti6.c|   60 +---
 ip/tunnel.c   |   24 ++
 ip/tunnel.h   |1 +
 16 files changed, 313 insertions(+), 376 deletions(-)

-- 
1.7.10.4



[PATCH iproute2 7/9] ip/tunnel: Minor cleanups in print routines

2018-01-12 Thread Serhey Popovych
Print the "unknown" value for the "encap" type in PRINT_FP
context using the "%s " format specifier, to benefit from
compile-time string merging.

Unify the encapsulation type checks.
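
For illustration, a small standalone sketch of the string-merging point
(fp_print() below is a made-up stand-in for the PRINT_FP path of
print_string()/print_null(), not the real json_print API):

#include <stdio.h>

/* Toy stand-in: print a value using the supplied format. */
static void fp_print(const char *fmt, const char *value)
{
	printf(fmt, value);
}

int main(void)
{
	/* Every call site reuses the single "%s " format literal, which the
	 * compiler/linker can pool into one read-only string, instead of
	 * emitting separate "fou ", "gue " and "unknown " format strings.
	 */
	fp_print("%s ", "fou");
	fp_print("%s ", "gue");
	fp_print("%s ", "unknown");
	putchar('\n');
	return 0;
}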

Signed-off-by: Serhey Popovych 
---
 ip/link_gre.c|5 +++--
 ip/link_gre6.c   |8 
 ip/link_ip6tnl.c |6 +++---
 ip/link_iptnl.c  |2 +-
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index a7d1cd1..b4cde62 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -450,6 +450,8 @@ static void gre_print_direct_opt(FILE *f, struct rtattr *tb[])
 
 static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 {
+   __u16 type;
+
if (!tb)
return;
 
@@ -469,8 +471,7 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
}
 
if (tb[IFLA_GRE_ENCAP_TYPE] &&
-   rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE) {
-   __u16 type = rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]);
+   (type = rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE])) != TUNNEL_ENCAP_NONE) {
__u16 flags = rta_getattr_u16(tb[IFLA_GRE_ENCAP_FLAGS]);
__u16 sport = rta_getattr_u16(tb[IFLA_GRE_ENCAP_SPORT]);
__u16 dport = rta_getattr_u16(tb[IFLA_GRE_ENCAP_DPORT]);
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 200846e..557151f 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -381,6 +381,7 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
unsigned int iflags = 0;
unsigned int oflags = 0;
unsigned int flags = 0;
+   __u16 type;
__u32 flowinfo = 0;
struct in6_addr in6_addr_any = IN6ADDR_ANY_INIT;
__u8 ttl = 0;
@@ -518,15 +519,14 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
}
 
if (tb[IFLA_GRE_ENCAP_TYPE] &&
-   rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE) {
-   __u16 type = rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]);
+   (type = rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE])) != TUNNEL_ENCAP_NONE) {
__u16 flags = rta_getattr_u16(tb[IFLA_GRE_ENCAP_FLAGS]);
__u16 sport = rta_getattr_u16(tb[IFLA_GRE_ENCAP_SPORT]);
__u16 dport = rta_getattr_u16(tb[IFLA_GRE_ENCAP_DPORT]);
 
open_json_object("encap");
-
print_string(PRINT_FP, NULL, "encap ", NULL);
+
switch (type) {
case TUNNEL_ENCAP_FOU:
print_string(PRINT_ANY, "type", "%s ", "fou");
@@ -535,7 +535,7 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
print_string(PRINT_ANY, "type", "%s ", "gue");
break;
default:
-   print_null(PRINT_ANY, "type", "unknown ", NULL);
+   print_null(PRINT_ANY, "type", "%s ", "unknown");
break;
}
 
diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
index aa6f6fa..51c73dc 100644
--- a/ip/link_ip6tnl.c
+++ b/ip/link_ip6tnl.c
@@ -336,6 +336,7 @@ static void ip6tunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb
char s2[64];
unsigned int link;
int flags = 0;
+   __u16 type;
__u32 flowinfo = 0;
__u8 ttl = 0;
 
@@ -458,8 +459,7 @@ static void ip6tunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb
}
 
if (tb[IFLA_IPTUN_ENCAP_TYPE] &&
-   rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE) {
-   __u16 type = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_TYPE]);
+   (type = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_TYPE])) != TUNNEL_ENCAP_NONE) {
__u16 flags = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_FLAGS]);
__u16 sport = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_SPORT]);
__u16 dport = rta_getattr_u16(tb[IFLA_IPTUN_ENCAP_DPORT]);
@@ -474,7 +474,7 @@ static void ip6tunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb
print_string(PRINT_ANY, "type", "%s ", "gue");
break;
default:
-   print_null(PRINT_ANY, "type", "unknown ", NULL);
+   print_null(PRINT_ANY, "type", "%s ", "unknown");
break;
}
 
diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
index 83a524f..17d28ec 100644
--- a/ip/link_iptnl.c
+++ b/ip/link_iptnl.c
@@ -505,7 +505,7 @@ static void iptunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[
print_string(PRINT_ANY, "type", "%s ", "gue");
break;
default:
-   print_null(PRINT_ANY, "type", "unknown ", NULL);
+   print_null(PRINT_ANY, "type", "%s ", "unknown");
  

[net-next 02/10] ixgbe: Default to 1 pool always being allocated

2018-01-12 Thread Jeff Kirsher
From: Alexander Duyck 

We might as well configure the limit to always default to 1 pool for the
interface. This accounts for the fact that the PF counts as 1 pool if
SR-IOV is enabled, and in general we are always running in 1 pool mode when
RSS or DCB is enabled as well, though we don't need to actually evaluate
any of the VMDq features in those cases.
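
A toy model of the before/after flow described above (stub types stand in
for the driver structures; only the fields this sketch touches exist here,
and the function names are made up):

#include <stdio.h>

enum { RING_F_VMDQ = 0, RING_F_MAX = 4 };

struct ring_feature {
	unsigned int limit;
};

struct fake_adapter {
	unsigned int flags;
	struct ring_feature ring_feature[RING_F_MAX];
};

#define FLAG_SRIOV_ENABLED	0x1
#define FLAG_VMDQ_ENABLED	0x2

/* After this change: init always starts with 1 VMDq pool allocated... */
static void sw_init(struct fake_adapter *adapter)
{
	adapter->ring_feature[RING_F_VMDQ].limit = 1;
}

/* ...so the SR-IOV path only sets the mode flags and no longer needs the
 * "if the limit is unset, force it to 1" check.
 */
static void enable_sriov(struct fake_adapter *adapter)
{
	adapter->flags |= FLAG_SRIOV_ENABLED | FLAG_VMDQ_ENABLED;
}

int main(void)
{
	struct fake_adapter adapter = { 0 };

	sw_init(&adapter);
	enable_sriov(&adapter);
	printf("VMDq pool limit: %u\n",
	       adapter.ring_feature[RING_F_VMDQ].limit);
	return 0;
}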

Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 7 ++-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9416531335f7..ffd9619f4c80 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6129,6 +6129,7 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter,
fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
adapter->ring_feature[RING_F_FDIR].limit = fdir;
adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
+   adapter->ring_feature[RING_F_VMDQ].limit = 1;
 #ifdef CONFIG_IXGBE_DCA
adapter->flags |= IXGBE_FLAG_DCA_CAPABLE;
 #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 0085f4632966..543f2e60e4b7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -78,12 +78,9 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
struct ixgbe_hw *hw = &adapter->hw;
int i;
 
-   adapter->flags |= IXGBE_FLAG_SRIOV_ENABLED;
-
/* Enable VMDq flag so device will be set in VM mode */
-   adapter->flags |= IXGBE_FLAG_VMDQ_ENABLED;
-   if (!adapter->ring_feature[RING_F_VMDQ].limit)
-   adapter->ring_feature[RING_F_VMDQ].limit = 1;
+   adapter->flags |= IXGBE_FLAG_SRIOV_ENABLED |
+ IXGBE_FLAG_VMDQ_ENABLED;
 
/* Allocate memory for per VF control structures */
adapter->vfinfo = kcalloc(num_vfs, sizeof(struct vf_data_storage),
-- 
2.15.1



[net-next 00/10][pull request] 10GbE Intel Wired LAN Driver Updates 2018-01-12

2018-01-12 Thread Jeff Kirsher
This series contains updates to ixgbe, fm10k and net core.

Alex updates the driver to remove a duplicate MAC address check and to
verify that we have not run out of resources to configure a MAC rule
in our filter table.  He also stops assuming that dev->num_tc was
populated and configured by the driver, since it can be configured via
mqprio without any hardware coordination.  He fixes ixgbe and fm10k to
record MACVLAN stats instead of the receive queue on MACVLAN offloaded
frames.  When handling a MACVLAN offload, we should be stopping/starting
traffic on our own queues instead of the upper device's transmit queues.
He also fixes possible race conditions between the MACVLAN cleanup and
the interface cleanup on shutdown.  With the recent fixes to ixgbe, we
can cap the number of queues regardless of whether accel_priv is in use,
since the actual number of queues is reported via real_num_tx_queues.
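
The queue-capping point at the end boils down to something like the sketch
below (a standalone illustration, not the actual net/core/dev.c change;
fake_net_device and pick_tx_queue() are made-up names):

#include <stdio.h>

/* Minimal stand-in for the two net_device queue counters involved. */
struct fake_net_device {
	unsigned int num_tx_queues;	 /* queues allocated at register time */
	unsigned int real_num_tx_queues; /* queues currently in use */
};

/* Clamp a computed queue index against the number of queues the device
 * actually exposes, no matter whether the index came from the normal
 * path or from an accel_priv (macvlan offload) context.
 */
static unsigned int pick_tx_queue(const struct fake_net_device *dev,
				  unsigned int queue_index)
{
	if (queue_index >= dev->real_num_tx_queues)
		queue_index %= dev->real_num_tx_queues;
	return queue_index;
}

int main(void)
{
	struct fake_net_device dev = {
		.num_tx_queues = 64,
		.real_num_tx_queues = 8,
	};

	/* Index 13 came from a stale mapping; cap it to the live range. */
	printf("picked queue %u\n", pick_tx_queue(&dev, 13));
	return 0;
}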

Tony fixes up the kernel documentation for ixgbe and ixgbevf to resolve
warnings when W=1 is used.

The following are changes since commit 6bd39bc3da0f4a301fae69c4a32db2768f5118be:
  Merge branch 'hns3-add-some-new-features-and-fix-some-bugs'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Alexander Duyck (8):
  ixgbe: Assume provided MAC filter has been verified by macvlan
  ixgbe: Default to 1 pool always being allocated
  ixgbe: Don't assume dev->num_tc is equal to hardware TC config
  ixgbe/fm10k: Record macvlan stats instead of Rx queue for macvlan
offloaded rings
  ixgbe: Do not manipulate macvlan Tx queues when performing macvlan
offload
  ixgbe: avoid bringing rings up/down as macvlans are added/removed
  ixgbe: Fix handling of macvlan Tx offload
  net: Cap number of queues even with accel_priv

Tony Nguyen (2):
  ixgbe: Fix kernel-doc format warnings
  ixgbevf: Fix kernel-doc format warnings

 drivers/net/ethernet/intel/fm10k/fm10k_main.c  |  14 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_82599.c |  11 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c|   9 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb.c   |  10 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82598.c |  22 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_82599.c |   5 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c|   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_debugfs.c   |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |   6 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c  |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |  61 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 299 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c   |  15 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c   |   8 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |  15 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c  |  19 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |   6 +-
 drivers/net/ethernet/intel/ixgbevf/vf.c|  17 +-
 net/core/dev.c |   3 +-
 21 files changed, 273 insertions(+), 258 deletions(-)

-- 
2.15.1


