Re: [PATCH] bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform

2018-04-27 Thread Wang YanQing
On Sat, Apr 28, 2018 at 01:33:15AM +0200, Daniel Borkmann wrote:
> On 04/28/2018 12:48 AM, Alexei Starovoitov wrote:
> > On Thu, Apr 26, 2018 at 05:57:49PM +0800, Wang YanQing wrote:
> >> All the testcases for BPF_PROG_TYPE_PERF_EVENT program type in
> >> test_verifier(kselftest) report below errors on x86_32:
> >> "
> >> 172/p unpriv: spill/fill of different pointers ldx FAIL
> >> Unexpected error message!
> >> 0: (bf) r6 = r10
> >> 1: (07) r6 += -8
> >> 2: (15) if r1 == 0x0 goto pc+3
> >> R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
> >> 3: (bf) r2 = r10
> >> 4: (07) r2 += -76
> >> 5: (7b) *(u64 *)(r6 +0) = r2
> >> 6: (55) if r1 != 0x0 goto pc+1
> >> R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 
> >> fp-8=fp
> >> 7: (7b) *(u64 *)(r6 +0) = r1
> >> 8: (79) r1 = *(u64 *)(r6 +0)
> >> 9: (79) r1 = *(u64 *)(r1 +68)
> >> invalid bpf_context access off=68 size=8
> >>
> >> 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
> >> Failed to load prog 'Permission denied'!
> >> 0: (b7) r0 = 0
> >> 1: (71) r0 = *(u8 *)(r1 +68)
> >> invalid bpf_context access off=68 size=1
> >>
> >> 379/p check bpf_perf_event_data->sample_period half load permitted FAIL
> >> Failed to load prog 'Permission denied'!
> >> 0: (b7) r0 = 0
> >> 1: (69) r0 = *(u16 *)(r1 +68)
> >> invalid bpf_context access off=68 size=2
> >>
> >> 380/p check bpf_perf_event_data->sample_period word load permitted FAIL
> >> Failed to load prog 'Permission denied'!
> >> 0: (b7) r0 = 0
> >> 1: (61) r0 = *(u32 *)(r1 +68)
> >> invalid bpf_context access off=68 size=4
> >>
> >> 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
> >> Failed to load prog 'Permission denied'!
> >> 0: (b7) r0 = 0
> >> 1: (79) r0 = *(u64 *)(r1 +68)
> >> invalid bpf_context access off=68 size=8
> >> "
> >>
> >> This patch fixes it. The fix isn't only necessary for x86_32; it fixes the
> >> same problem on any other platform whose sizeof(bpf_user_pt_regs_t) is not
> >> a multiple of 8.
> >>
> >> Signed-off-by: Wang YanQing 
> >> ---
> >>  Hi all!
> >>  After mainline accepts this patch, we need to submit a sync patch
> >>  to update tools/include/uapi/linux/bpf_perf_event.h.
> >>
> >>  Thanks.
> >>
> >>  include/uapi/linux/bpf_perf_event.h | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/include/uapi/linux/bpf_perf_event.h 
> >> b/include/uapi/linux/bpf_perf_event.h
> >> index eb1b9d2..ff4c092 100644
> >> --- a/include/uapi/linux/bpf_perf_event.h
> >> +++ b/include/uapi/linux/bpf_perf_event.h
> >> @@ -12,7 +12,7 @@
> >>  
> >>  struct bpf_perf_event_data {
> >>bpf_user_pt_regs_t regs;
> >> -  __u64 sample_period;
> >> +  __u64 sample_period __attribute__((aligned(8)));
> > 
> > I don't think this is necessary.
> > imo it's a bug in pe_prog_is_valid_access
> > that should have allowed 8-byte access to 4-byte aligned sample_period.
> > The access is rewritten by pe_prog_convert_ctx_access anyway,
> > so there are no alignment issues as far as I can see.
> 
> Right, good point. Wang, could you give the below a test run:
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 56ba0f2..95b9142 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -833,8 +833,14 @@ static bool pe_prog_is_valid_access(int off, int size, 
> enum bpf_access_type type
>   return false;
>   if (type != BPF_READ)
>   return false;
> - if (off % size != 0)
> - return false;
> + if (off % size != 0) {
> + if (sizeof(long) != 4)
> + return false;
> + if (size != 8)
> + return false;
> + if (off % size != 4)
> + return false;
> + }
> 
>   switch (off) {
>   case bpf_ctx_range(struct bpf_perf_event_data, sample_period):
Hi all!

I have tested this patch, but test_verifier reports the same errors
for the five testcases.

The reason is that they all fail the bpf_ctx_narrow_access_ok() check.
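For reference, a minimal user-space sketch of the layout question behind these
failures; it assumes an ILP32 ABI where 64-bit integers only need 4-byte
alignment, and uses a stand-in struct rather than the real bpf_user_pt_regs_t:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for bpf_user_pt_regs_t on x86_32: 17 registers * 4 bytes = 68,
 * matching the off=68 seen in the verifier logs above. */
struct fake_pt_regs { unsigned int gpr[17]; };

struct pedata_plain {
	struct fake_pt_regs regs;
	unsigned long long sample_period;
};

struct pedata_aligned {
	struct fake_pt_regs regs;
	unsigned long long sample_period __attribute__((aligned(8)));
};

int main(void)
{
	/* On an ILP32 ABI the plain field can land at offset 68 (not 8-byte
	 * aligned); the aligned(8) attribute pads it out to offset 72. */
	printf("plain: %zu, aligned: %zu\n",
	       offsetof(struct pedata_plain, sample_period),
	       offsetof(struct pedata_aligned, sample_period));
	return 0;
}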

Thanks.


Re: WARNING in perf_trace_buf_alloc (2)

2018-04-27 Thread Alexei Starovoitov
On Sat, Apr 21, 2018 at 12:37:01PM -0700, Eric Biggers wrote:
> [+bpf maintainers and netdev]
> 
> On Mon, Nov 06, 2017 at 03:56:01AM -0800, syzbot wrote:
> > Hello,
> > 
> > syzkaller hit the following crash on
> > 5cb0512c02ecd7e6214e912e4c150f4219ac78e0
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> > 
> > 
> > [ cut here ]
> > WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
> > perf_trace_buf_alloc+0x12d/0x160 kernel/trace/trace_event_perf.c:273
> > Kernel panic - not syncing: panic_on_warn set ...
> > 
> > CPU: 0 PID: 3008 Comm: syzkaller609027 Not tainted 4.14.0-rc7+ #159
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >  panic+0x1e4/0x417 kernel/panic.c:181
> >  __warn+0x1c4/0x1d9 kernel/panic.c:542
> >  report_bug+0x211/0x2d0 lib/bug.c:184
> >  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
> >  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
> >  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
> >  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
> >  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:906
> > RIP: 0010:perf_trace_buf_alloc+0x12d/0x160
> > kernel/trace/trace_event_perf.c:273
> > RSP: 0018:8801c0fdf760 EFLAGS: 00010286
> > RAX: 001c RBX: 1100381fbefe RCX: 
> > RDX: 001c RSI: 1100381fbeac RDI: ed00381fbee0
> > RBP: 8801c0fdf780 R08: 0001 R09: 
> > R10: 8801c0fdf7a0 R11:  R12: 082c
> > R13: 8801c0fdf810 R14: 8801c0fdf890 R15: 8801d8b34b40
> >  perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
> >  trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
> >  map_update_elem kernel/bpf/syscall.c:597 [inline]
> >  SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
> >  SyS_bpf+0x33eb/0x46a0 kernel/bpf/syscall.c:1453
> >  entry_SYSCALL_64_fastpath+0x1f/0xbe
> > RIP: 0033:0x445c29
> > RSP: 002b:007eff68 EFLAGS: 0246 ORIG_RAX: 0141
> > RAX: ffda RBX: 7ffe66adb340 RCX: 00445c29
> > RDX: 0020 RSI: 2053dfe0 RDI: 0002
> > RBP: 0082 R08:  R09: 
> > R10:  R11: 0246 R12: 00403280
> > R13: 00403310 R14:  R15: 
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> > 
> > 
> > ---
> > This bug is generated by a dumb bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for details.
> > Direct all questions to syzkal...@googlegroups.com.
> > Please credit me with: Reported-by: syzbot 
> > 
> > syzbot will keep track of this bug report.
> > Once a fix for this bug is committed, please reply to this email with:
> > #syz fix: exact-commit-title
> > To mark this as a duplicate of another syzbot report, please reply with:
> > #syz dup: exact-subject-of-another-report
> > If it's a one-off invalid bug report, please reply with:
> > #syz invalid
> > Note: if the crash happens again, it will cause creation of a new bug
> > report.
> > Note: all commands must start from beginning of the line.
> 
> This still happens on Linus' tree.  It seems one of the BPF tracepoints is
> trying to pass a buffer that is too long.  Here's a simplified reproducer that

right. this is easily reproducible.
looks like tracepoints in bpf core rot quite a bit.
will send a patch to address that soon.

> works on Linus' tree (commit 5e7c7806111ade5).  Note: it's not 100% reliable 
> for
> some reason; you may have to run it a couple times.  Daniel or Alexei, can one
> of you please look into this more?  Thanks!
> 
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <linux/bpf.h>
> #include <linux/perf_event.h>
> 
> int main()
> {
> int tracepoint_id;
> FILE *f;
> 
> f = fopen("/sys/kernel/debug/tracing/events/bpf/bpf_map_update_elem/id",
>   "r");
> fscanf(f, "%d", &tracepoint_id);
> 
> struct perf_event_attr perf_attr = {
> .type = PERF_TYPE_TRACEPOINT,
> .size = sizeof(perf_attr),
> .config = tracepoint_id,
> };
> syscall(__NR_perf_event_open, &perf_attr, 0, 0, -1, 0);
> 
> for (;;) {
> union bpf_attr create_attr = {
> .map_type = BPF_MAP_TYPE_HASH,
> .key_size = 4,
> .value_size = 2048,
> 

Re: [PATCH bpf-next v7 05/10] bpf/verifier: improve register value range tracking with ARSH

2018-04-27 Thread Yonghong Song



On 4/27/18 4:48 PM, Alexei Starovoitov wrote:

On Wed, Apr 25, 2018 at 12:29:05PM -0700, Yonghong Song wrote:

When a helper like bpf_get_stack returns an int value
that is later used in arithmetic computation, LSH and ARSH
operations are often required to get proper sign extension into
64 bits. For example, without this patch:
 54: R0=inv(id=0,umax_value=800)
 54: (bf) r8 = r0
 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
 55: (67) r8 <<= 32
 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
 56: (c7) r8 s>>= 32
 57: R8=inv(id=0)
With this patch:
 54: R0=inv(id=0,umax_value=800)
 54: (bf) r8 = r0
 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
 55: (67) r8 <<= 32
 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
 56: (c7) r8 s>>= 32
 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With a better range for "R8", when "R8" is later added to another register,
e.g., a map pointer or a scalar-value register, a better register
range can be derived and a verifier failure may be avoided.

In our later example,
 ..
 usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
 if (usize < 0)
 return 0;
 ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
 ..
Without improved ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by the verifier.
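As a user-space illustration (not part of the patch) of the shift idiom the
verifier is tracking here, the following sketch shows why the <<= 32 / s>>= 32
pair sign-extends a 32-bit value held in a 64-bit register; it assumes the
compiler implements signed right shift arithmetically, as BPF_ARSH does:

#include <stdio.h>
#include <stdint.h>

/* Mimic the BPF_LSH-by-32 / BPF_ARSH-by-32 pair from the verifier trace. */
static int64_t sext32(uint64_t r)
{
	r <<= 32;                 /* BPF_LSH: move the low 32 bits up   */
	return (int64_t)r >> 32;  /* BPF_ARSH: shift back, copying sign */
}

int main(void)
{
	printf("%lld\n", (long long)sext32(0xfffffff0u)); /* prints -16 */
	printf("%lld\n", (long long)sext32(800));         /* prints 800 */
	return 0;
}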

Signed-off-by: Yonghong Song 
---
  include/linux/tnum.h  |  4 +++-
  kernel/bpf/tnum.c | 10 ++
  kernel/bpf/verifier.c | 41 +
  3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/tnum.h b/include/linux/tnum.h
index 0d2d3da..c7dc2b5 100644
--- a/include/linux/tnum.h
+++ b/include/linux/tnum.h
@@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max);
  /* Arithmetic and logical ops */
  /* Shift a tnum left (by a fixed shift) */
  struct tnum tnum_lshift(struct tnum a, u8 shift);
-/* Shift a tnum right (by a fixed shift) */
+/* Shift (rsh) a tnum right (by a fixed shift) */
  struct tnum tnum_rshift(struct tnum a, u8 shift);
+/* Shift (arsh) a tnum right (by a fixed min_shift) */
+struct tnum tnum_arshift(struct tnum a, u8 min_shift);
  /* Add two tnums, return @a + @b */
  struct tnum tnum_add(struct tnum a, struct tnum b);
  /* Subtract two tnums, return @a - @b */
diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
index 1f4bf68..938d412 100644
--- a/kernel/bpf/tnum.c
+++ b/kernel/bpf/tnum.c
@@ -43,6 +43,16 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
return TNUM(a.value >> shift, a.mask >> shift);
  }
  
+struct tnum tnum_arshift(struct tnum a, u8 min_shift)

+{
+   /* if a.value is negative, arithmetic shifting by minimum shift
+* will have larger negative offset compared to more shifting.
+* If a.value is nonnegative, arithmetic shifting by minimum shift
+* will have larger positive offset compared to more shifting.
+*/
+   return TNUM((s64)a.value >> min_shift, (s64)a.mask >> min_shift);
+}
+
  struct tnum tnum_add(struct tnum a, struct tnum b)
  {
u64 sm, sv, sigma, chi, mu;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6e3f859..573807f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2974,6 +2974,47 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
/* We may learn something more from the var_off */
__update_reg_bounds(dst_reg);
break;
+   case BPF_ARSH:
+   if (umax_val >= insn_bitness) {
+   /* Shifts greater than 31 or 63 are undefined.
+* This includes shifts by a negative number.
+*/
+   mark_reg_unknown(env, regs, insn->dst_reg);
+   break;
+   }
+
+   /* BPF_ARSH is an arithmetic shift. The new range of
+* smin_value and smax_value should take the sign
+* into consideration.
+*
+* For example, if smin_value = -16, umin_val = 0
+* and umax_val = 2, the new smin_value should be
+* -16 >> 0 = -16 since -16 >> 2 = -4.
+* If smin_value = 16, umin_val = 0 and umax_val = 2,
+* the new smin_value should be 16 >> 2 = 4.
+*
+* Now suppose smax_value = -4, umin_val = 0 and
+* umax_val = 2, the new smax_value should be
+* -4 >> 2 = -1. If smax_value = 32 with the same
+* umin_val/umax_val, the new smax_value should remain 32.
+*/
+   if (dst_reg->smin_value < 0)
+   dst_reg->smin_value >>= umin_val;
+   else
+   

Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally

2018-04-27 Thread Greg KH
On Fri, Apr 27, 2018 at 07:43:52PM +0100, Eddie Chapman wrote:
> On 27/04/18 19:07, Thomas Deutschmann wrote:
> > Hi Greg,
> > 
> > first, we need to cherry-pick another patch first:
> > >  From 52a589d51f1008f62569bf89e95b26221ee76690 Mon Sep 17 00:00:00 2001
> > > From: Xin Long 
> > > Date: Mon, 25 Dec 2017 14:43:58 +0800
> > > Subject: [PATCH] geneve: update skb dst pmtu on tx path
> > > 
> > > Commit a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") has fixed
> > > a performance issue caused by the change of lower dev's mtu for vxlan.
> > > 
> > > The same thing needs to be done for geneve as well.
> > > 
> > > Note that geneve cannot adjust its mtu according to lower dev's mtu
> > > when creating it. The performance is very low later when netperfing
> > > over it without fixing the mtu manually. This patch could also avoid
> > > this issue.
> > > 
> > > Signed-off-by: Xin Long 
> > > Signed-off-by: David S. Miller 
> 
> Oops, I completely missed that the coreos patch doesn't have the geneve hunk
> that is in the original 4.15 patch. I don't load the geneve module on my box,
> which is why no problems surfaced on my machine.
> 
> Thanks Thomas for the correct instructions. Ignore my message Greg, I'll
> drop back into the shadows where I belong, sorry for the noise!

Talking about patches and pointing me at them is not noise at all, don't
be sorry! :)

I'll work on this after these next kernels are released, thanks all for
the details on what needs to be done.

greg k-h


[PATCH net-next] libcxgb,cxgb4: use __skb_put_zero to simplify code

2018-04-27 Thread YueHaibing
Use the helper __skb_put_zero() to replace the pattern of __skb_put() followed by memset().
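For context, the helper relies on the equivalence sketched below (the real
definition lives in include/linux/skbuff.h; this is just a restatement of the
two-step pattern being replaced):

/* Sketch of the equivalence this cleanup relies on: reserving len bytes
 * in the skb and zeroing them in one call instead of two. */
static inline void *skb_put_zero_equiv(struct sk_buff *skb, unsigned int len)
{
	void *p = __skb_put(skb, len);	/* grow the skb data area by len */

	memset(p, 0, len);		/* and clear the new bytes */
	return p;
}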

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c |  3 +--
 drivers/net/ethernet/chelsio/cxgb4/srq.c  |  3 +--
 drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h | 15 +--
 3 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
index db92f18..aae9802 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
@@ -64,8 +64,7 @@ static int set_tcb_field(struct adapter *adap, struct 
filter_entry *f,
if (!skb)
return -ENOMEM;
 
-   req = (struct cpl_set_tcb_field *)__skb_put(skb, sizeof(*req));
-   memset(req, 0, sizeof(*req));
+   req = (struct cpl_set_tcb_field *)__skb_put_zero(skb, sizeof(*req));
INIT_TP_WR_CPL(req, CPL_SET_TCB_FIELD, ftid);
req->reply_ctrl = htons(REPLY_CHAN_V(0) |
QUEUENO_V(adap->sge.fw_evtq.abs_id) |
diff --git a/drivers/net/ethernet/chelsio/cxgb4/srq.c 
b/drivers/net/ethernet/chelsio/cxgb4/srq.c
index 6228a57..82b70a5 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/srq.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/srq.c
@@ -84,8 +84,7 @@ int cxgb4_get_srq_entry(struct net_device *dev,
if (!skb)
return -ENOMEM;
req = (struct cpl_srq_table_req *)
-   __skb_put(skb, sizeof(*req));
-   memset(req, 0, sizeof(*req));
+   __skb_put_zero(skb, sizeof(*req));
INIT_TP_WR(req, 0);
OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SRQ_TABLE_REQ,
  TID_TID_V(srq_idx) |
diff --git a/drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h 
b/drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h
index 4b5aacc..240ba9d 100644
--- a/drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h
+++ b/drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h
@@ -90,8 +90,7 @@ cxgb_mk_tid_release(struct sk_buff *skb, u32 len, u32 tid, 
u16 chan)
 {
struct cpl_tid_release *req;
 
-   req = __skb_put(skb, len);
-   memset(req, 0, len);
+   req = __skb_put_zero(skb, len);
 
INIT_TP_WR(req, tid);
OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_TID_RELEASE, tid));
@@ -104,8 +103,7 @@ cxgb_mk_close_con_req(struct sk_buff *skb, u32 len, u32 
tid, u16 chan,
 {
struct cpl_close_con_req *req;
 
-   req = __skb_put(skb, len);
-   memset(req, 0, len);
+   req = __skb_put_zero(skb, len);
 
INIT_TP_WR(req, tid);
OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, tid));
@@ -119,8 +117,7 @@ cxgb_mk_abort_req(struct sk_buff *skb, u32 len, u32 tid, 
u16 chan,
 {
struct cpl_abort_req *req;
 
-   req = __skb_put(skb, len);
-   memset(req, 0, len);
+   req = __skb_put_zero(skb, len);
 
INIT_TP_WR(req, tid);
OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_REQ, tid));
@@ -134,8 +131,7 @@ cxgb_mk_abort_rpl(struct sk_buff *skb, u32 len, u32 tid, 
u16 chan)
 {
struct cpl_abort_rpl *rpl;
 
-   rpl = __skb_put(skb, len);
-   memset(rpl, 0, len);
+   rpl = __skb_put_zero(skb, len);
 
INIT_TP_WR(rpl, tid);
OPCODE_TID(rpl) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_RPL, tid));
@@ -149,8 +145,7 @@ cxgb_mk_rx_data_ack(struct sk_buff *skb, u32 len, u32 tid, 
u16 chan,
 {
struct cpl_rx_data_ack *req;
 
-   req = __skb_put(skb, len);
-   memset(req, 0, len);
+   req = __skb_put_zero(skb, len);
 
INIT_TP_WR(req, tid);
OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_RX_DATA_ACK, tid));
-- 
2.7.0




[PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats

2018-04-27 Thread Zumeng Chen
Reading hw_stats only returns actual data after a successful tg3_reset_hw,
which happens after tg3_timer_start, so tp->hw_stats_flag is introduced to
tell tg3_get_stats64 when hw_stats is ready to be read; the flag becomes false
after the memset(tp->hw_stats, 0) in tg3_halt. Both tg3_get_stats64 and
tg3_halt are protected by tp->lock throughout.

Meanwhile, this patch also fixes a kernel BUG_ON(in_interrupt()) crash where
tg3_free_consistent is stuck on tp->lock, which may accumulate a lot of
in_softirq counts (512 or so), and then hits the BUG_ON when vunmap is called
to unmap hw_stats.
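The scheme described above amounts to something like the following sketch;
apart from tp->hw_stats, tp->lock and the new hw_stats_flag mentioned in the
changelog, the helper names are purely illustrative and not the actual patch:

/* Illustrative only: clear the flag together with the stats in tg3_halt ... */
static void tg3_stats_invalidate(struct tg3 *tp)
{
	/* caller holds tp->lock */
	memset(tp->hw_stats, 0, sizeof(*tp->hw_stats));
	tp->hw_stats_flag = false;
}

/* ... set it once tg3_reset_hw has succeeded and the timer is running ... */
static void tg3_stats_mark_ready(struct tg3 *tp)
{
	tp->hw_stats_flag = true;
}

/* ... and have tg3_get_stats64() read hw_stats only when the flag is set,
 * with all three paths serialized by tp->lock. */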

[ cut here ]
kernel BUG at /kernel-source//mm/vmalloc.c:1621!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
task: ffc87431 task.stack: ffc8742bc000
PC is at vunmap+0x48/0x50
LR is at __dma_free+0x98/0xa0
pc : [] lr : [] pstate: 0145
sp : ffc8742bfad0
x29: ffc8742bfad0 x28: ffc87431
x27: ffc878931200 x26: ffc87453
x25: 0003 x24: ff800b3aa000
x23: 700bb000 x22: 
x21:  x20: ffc87aafd0a0
x19: ff800b3aa000 x18: 0020
x17: 007f9e191e10 x16: ff8008eb0d28
x15: 000a x14: 00070cc8
x13: ff8008c65000 x12: 
x11: 000a x10: ffbf21d0e220
x9 : 0004 x8 : ff8008c65000
x7 : 3ff0 x6 : 
x5 : ff8008097f20 x4 : 
x3 : ff8008fd4fff x2 : ffc87b361788
x1 : ff800b3aafff x0 : 0201
Process connmand (pid: 785, stack limit = 0xffc8742bc000)
Stack: (0xffc8742bfad0 to 0xffc8742c)
fac0:   ffc8742bfaf0 ff8008097fb8
fae0: 1000 ff80 ffc8742bfb30 ff8000b717d4
fb00: ffc87aafd0a0 ff8008a38000 ff800b3aa000 ffc874530904
fb20: ffc874530900 700bb000 ffc8742bfb80 ff8000b80324
fb40: 0001 ffc874530900 0100 0200
fb60: 9003 ffc87453 0003 ffc87453
fb80: ffc8742bfbd0 ff8000b8aa5c ffc874530900 ffc87453
fba0: 0001  9003 ffc878931210
fbc0: 9002 ffc87453 ffc8742bfc00 ff80088bf44c
fbe0: ffc87453 ffc8742bfc50 0001 ffc87431
fc00: ffc8742bfc30 ff80088bf5e4 ffc87453 9002
fc20: ffc8742bfc40 ffc87453 ffc8742bfc60 ff80088c9d58
fc40: ffc87453 ff80088c9d34 ffc874530080 ffc874530080
fc60: ffc8742bfca0 ff80088c9e4c ffc87453 9003
fc80: 8914  007ffd94ba10 ffc8742bfd38
fca0: ffc8742bfcd0 ff80089509f8  ff9d
fcc0: 8914  ffc8742bfd60 ff8008953088
fce0: 8914 ffc874b49b80 007ffd94ba10 ff8008e9b400
fd00: 0004 007ffd94ba10 0124 001d
fd20: ff8008a32000 ff8008e9b400 0004 34687465
fd40:  9002  
fd60: ffc8742bfd90 ff80088a1720 ffc874b49b80 8914
fd80: 007ffd94ba10  ffc8742bfdc0 ff80088a2648
fda0: 8914 007ffd94ba10 ff8008e9b400 ffc878a73c00
fdc0: ffc8742bfe00 ff800822e9e0 8914 007ffd94ba10
fde0: ffc874b49bb0 ffc8747e5800 ffc8742bfe50 ff800823cd58
fe00: ffc8742bfe80 ff800822f0ec  ffc878a73c00
fe20: ffc878a73c00 0004 8914 0008
fe40: ffc8742bfe80 ff800822f0b0  ffc878a73c00
fe60: ffc878a73c00 0004 8914 ff8008083730
fe80:  ff8008083730  0048771fb000
fea0:  007f9e191e1c  0015
fec0: 0004 8914 007ffd94ba10 
fee0: 002f 0004 0010 
ff00: 001d 000f 0101010101010101 
ff20: 6532336338646634 00656c6261635f38 007f9e46a220 007f9e45f318
ff40: 004c1a58 007f9e191e10 06df 
ff60: 0004 004c6470 004c3c40 00512d20
ff80: 0001   
ffa0:  007ffd94b9f0 00463dec 007ffd94b9f0
ffc0: 007f9e191e1c  0004 001d
ffe0:    
Call trace:
Exception stack(0xffc8742bf900 to 0xffc8742bfa30)
f900: ff800b3aa000 0080 ffc8742bfad0 ff80081eb420
f920: ff80 ff80081a58ec ffc8742bf940 ff80081c3ea8
f940: ffc8742bf990 

Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-27 Thread Steven Rostedt

We'd like this email archived in the netdev list, but since netdev is
notorious for blocking Outlook email as spam, it didn't go through. So
I'm replying here to help get it into the archives.

Thanks!

-- Steve


On Fri, 27 Apr 2018 23:05:46 +
Michael Wenig  wrote:

> As part of VMware's performance testing with the Linux 4.15 kernel,
> we identified CPU cost and throughput regressions when comparing to
> the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
> send tests when using small message sizes. The regressions are
> significant (up to 3x) and were tracked down to be a side effect of Eric
> Dumazet's RB tree changes that went into the Linux 4.15 kernel.
> Further investigation showed our use of the TCP_NODELAY flag in
> conjunction with Eric's change caused the regressions to show and
> simply disabling TCP_NODELAY brought performance back to normal.
> Eric's change also resulted in significant improvements in our
> TCP_RR test cases.
> 
> 
> 
> Based on these results, our theory is that Eric's change made the
> system overall faster (reduced latency) but as a side effect less
> aggregation is happening (with TCP_NODELAY) and that results in lower
> throughput. Previously even though TCP_NODELAY was set, system was
> slower and we still got some benefit of aggregation. Aggregation
> helps in better efficiency and higher throughput although it can
> increase the latency. If you are seeing a regression in your
> application throughput after this change, using TCP_NODELAY might
> help bring performance back however that might increase latency.
> 
> 
> 
> As such, we are not asking for a fix but simply want to document for
> others what we have found.
> 
> 
> 
> Michael Wenig
> 
> Performance Engineering
> 
> VMware, Inc.
> 
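For anyone reproducing the comparison above, TCP_NODELAY is toggled per socket
with setsockopt(); a minimal sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable (1) or disable (0) TCP_NODELAY on a connected TCP socket. */
static int set_nodelay(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}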



Re: [RFC v3 0/5] virtio: support packed ring

2018-04-27 Thread Jason Wang



On 2018年04月27日 17:12, Tiwei Bie wrote:

On Fri, Apr 27, 2018 at 02:17:51PM +0800, Jason Wang wrote:

On 2018年04月27日 12:18, Michael S. Tsirkin wrote:

On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:

On 2018年04月25日 13:15, Tiwei Bie wrote:

Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both of ping and netperf worked as expected (with EVENT_IDX
disabled). But there are below known issues:

1. Reloading the guest driver will break the Tx/Rx;

Will have a look at this issue.


2. Zeroing the flags when detaching a used desc will
  break the guest -> host path.

I still think zeroing the flags is unnecessary or even a bug. At the host, I track
the last observed avail wrap counter and detect availability like this (as
suggested by the example code in the spec):

static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
{
     bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);

     return avail == vq->avail_wrap_counter;
}

So zeroing the flags can not work with this, obviously.

Thanks

I agree. I think what one should do is flip the available bit.


But is this flipping a must?

Thanks

Yeah, that's my question too. It seems to be a requirement
for the driver that the only change to the desc status a
driver can do while running is to mark the desc as avail,
and any other changes to the desc status are not allowed.
Similarly, the device can only mark the desc as used, and
any other changes to the desc status are also not allowed.
So the question is, are there such requirements?


Looks not, but I think we need clarify this in the spec.

Thanks



Based on below contents in the spec:

"""
Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different
for an available descriptor and equal for a used descriptor.

Note that this observation is mostly useful for sanity-checking
as these are necessary but not sufficient conditions
"""

It seems that it's necessary for devices to check whether
the AVAIL bit and the USED bit are different.
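A sketch of that necessary-but-not-sufficient check, using the flag names from
the packed-ring spec (this is not the vhost implementation discussed above):

/* Per the spec text quoted above: AVAIL != USED for an available
 * descriptor, AVAIL == USED for a used one. Necessary, not sufficient. */
static bool desc_avail_used_differ(u16 flags)
{
	bool avail = !!(flags & VIRTQ_DESC_F_AVAIL);
	bool used  = !!(flags & VIRTQ_DESC_F_USED);

	return avail != used;
}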

Best regards,
Tiwei Bie




Re: [PATCH net-next] udp: remove stray export symbol

2018-04-27 Thread David Miller
From: Willem de Bruijn 
Date: Fri, 27 Apr 2018 11:12:10 -0400

> From: Willem de Bruijn 
> 
> UDP GSO needs to export __udp_gso_segment to call it from ipv6.
> 
> I accidentally exported static ipv4 function __udp4_gso_segment.
> Remove that EXPORT_SYMBOL_GPL.
> 
> Fixes: ee80d1ebe5ba ("udp: add udp gso")
> Signed-off-by: Willem de Bruijn 

Applied, thanks.


Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node

2018-04-27 Thread Jason Wang



On 2018年04月28日 09:51, Kevin Easton wrote:

On Fri, Apr 27, 2018 at 09:07:56PM -0400, Kevin Easton wrote:

On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:

On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:

The struct vhost_msg within struct vhost_msg_node is copied to userspace,
so it should be allocated with kzalloc() to ensure all structure padding
is zeroed.

Signed-off-by: Kevin Easton 
Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com

Does it help if a patch naming the padding is applied,
and then we init just the relevant field?
Just curious.

No, I don't believe that is sufficient to fix the problem.

Scratch that, somehow I missed the "..and then we init just the
relevant field" part, sorry.

There's still the padding after the vhost_iotlb_msg to consider.  It's
named in the union but I don't think that's guaranteed to be initialised
when the iotlb member of the union is used to initialise things.


I didn't name the padding in my original patch because I wasn't sure
if the padding actually exists on 32 bit architectures?

This might still be a concern, too?


Yes.

print &((struct vhost_msg *)0)->iotlb
$3 = (struct vhost_iotlb_msg *) 0x4




At the end of the day, zeroing 96 bytes (the full size of vhost_msg_node)
is pretty quick.

 - Kevin


Right, and even if it may be used heavily in the data-path, zeroing is 
not the main delay in that path.


Thanks


[PATCH v2 0/2] net: stmmac: dwmac-meson: 100M phy mode support for AXG SoC

2018-04-27 Thread Yixun Lan
Because a register in the dwmac glue layer has changed, we need to
introduce a new compatible name for the Meson-AXG SoC
to support the RMII 100M ethernet PHY.

Change since v1 at [1]:
  - implement set_phy_mode() for each SoC

[1] https://lkml.kernel.org/r/20180426160508.29380-1-yixun@amlogic.com

Yixun Lan (2):
  dt-bindings: net: meson-dwmac: new compatible name for AXG SoC
  net: stmmac: dwmac-meson: extend phy mode setting

 .../devicetree/bindings/net/meson-dwmac.txt   |   1 +
 .../ethernet/stmicro/stmmac/dwmac-meson8b.c   | 120 +++---
 2 files changed, 105 insertions(+), 16 deletions(-)

-- 
2.17.0



[PATCH v2 1/2] dt-bindings: net: meson-dwmac: new compatible name for AXG SoC

2018-04-27 Thread Yixun Lan
We need to introduce a new compatible name for the Meson-AXG SoC
in order to support the RMII 100M ethernet PHY, since the PRG_ETH0
register of the dwmac glue layer has changed from the previous SoCs.

Signed-off-by: Yixun Lan 
---
 Documentation/devicetree/bindings/net/meson-dwmac.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt 
b/Documentation/devicetree/bindings/net/meson-dwmac.txt
index 61cada22ae6c..1321bb194ed9 100644
--- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
@@ -11,6 +11,7 @@ Required properties on all platforms:
- "amlogic,meson8b-dwmac"
- "amlogic,meson8m2-dwmac"
- "amlogic,meson-gxbb-dwmac"
+   - "amlogic,meson-axg-dwmac"
Additionally "snps,dwmac" and any applicable more
detailed version number described in net/stmmac.txt
should be used.
-- 
2.17.0



[PATCH v2 2/2] net: stmmac: dwmac-meson: extend phy mode setting

2018-04-27 Thread Yixun Lan
  In the Meson-AXG SoC, the phy mode setting of PRG_ETH0 in the glue layer
is extended from bit[0] to bit[2:0].
  There is no problem if we configure it for the RGMII 1000M PHY mode,
since the register setting is coincidentally compatible with the previous one,
but for the RMII 100M PHY mode, the configuration needs to be changed to
the value 0b100.
  This patch was verified with an RTL8201F 100M ethernet PHY.

Signed-off-by: Yixun Lan 
---
 .../ethernet/stmicro/stmmac/dwmac-meson8b.c   | 120 +++---
 1 file changed, 104 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 7cb794094a70..4ff231df7322 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -29,6 +30,10 @@
 
 #define PRG_ETH0_RGMII_MODEBIT(0)
 
+#define PRG_ETH0_EXT_PHY_MODE_MASK GENMASK(2, 0)
+#define PRG_ETH0_EXT_RGMII_MODE1
+#define PRG_ETH0_EXT_RMII_MODE 4
+
 /* mux to choose between fclk_div2 (bit unset) and mpll2 (bit set) */
 #define PRG_ETH0_CLK_M250_SEL_SHIFT4
 #define PRG_ETH0_CLK_M250_SEL_MASK GENMASK(4, 4)
@@ -47,12 +52,20 @@
 
 #define MUX_CLK_NUM_PARENTS2
 
+struct meson8b_dwmac;
+
+struct meson8b_dwmac_data {
+   int (*set_phy_mode)(struct meson8b_dwmac *dwmac);
+};
+
 struct meson8b_dwmac {
-   struct device   *dev;
-   void __iomem*regs;
-   phy_interface_t phy_mode;
-   struct clk  *rgmii_tx_clk;
-   u32 tx_delay_ns;
+   struct device   *dev;
+   void __iomem*regs;
+
+   const struct meson8b_dwmac_data *data;
+   phy_interface_t phy_mode;
+   struct clk  *rgmii_tx_clk;
+   u32 tx_delay_ns;
 };
 
 struct meson8b_dwmac_clk_configs {
@@ -171,6 +184,59 @@ static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac 
*dwmac)
return 0;
 }
 
+static int meson8b_set_phy_mode(struct meson8b_dwmac *dwmac)
+{
+   switch (dwmac->phy_mode) {
+   case PHY_INTERFACE_MODE_RGMII:
+   case PHY_INTERFACE_MODE_RGMII_RXID:
+   case PHY_INTERFACE_MODE_RGMII_ID:
+   case PHY_INTERFACE_MODE_RGMII_TXID:
+   /* enable RGMII mode */
+   meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
+   PRG_ETH0_RGMII_MODE,
+   PRG_ETH0_RGMII_MODE);
+   break;
+   case PHY_INTERFACE_MODE_RMII:
+   /* disable RGMII mode -> enables RMII mode */
+   meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
+   PRG_ETH0_RGMII_MODE, 0);
+   break;
+   default:
+   dev_err(dwmac->dev, "fail to set phy-mode %s\n",
+   phy_modes(dwmac->phy_mode));
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int meson_axg_set_phy_mode(struct meson8b_dwmac *dwmac)
+{
+   switch (dwmac->phy_mode) {
+   case PHY_INTERFACE_MODE_RGMII:
+   case PHY_INTERFACE_MODE_RGMII_RXID:
+   case PHY_INTERFACE_MODE_RGMII_ID:
+   case PHY_INTERFACE_MODE_RGMII_TXID:
+   /* enable RGMII mode */
+   meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
+   PRG_ETH0_EXT_PHY_MODE_MASK,
+   PRG_ETH0_EXT_RGMII_MODE);
+   break;
+   case PHY_INTERFACE_MODE_RMII:
+   /* disable RGMII mode -> enables RMII mode */
+   meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
+   PRG_ETH0_EXT_PHY_MODE_MASK,
+   PRG_ETH0_EXT_RMII_MODE);
+   break;
+   default:
+   dev_err(dwmac->dev, "fail to set phy-mode %s\n",
+   phy_modes(dwmac->phy_mode));
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int meson8b_init_prg_eth(struct meson8b_dwmac *dwmac)
 {
int ret;
@@ -188,10 +254,6 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac 
*dwmac)
 
case PHY_INTERFACE_MODE_RGMII_ID:
case PHY_INTERFACE_MODE_RGMII_TXID:
-   /* enable RGMII mode */
-   meson8b_dwmac_mask_bits(dwmac, PRG_ETH0, PRG_ETH0_RGMII_MODE,
-   PRG_ETH0_RGMII_MODE);
-
/* only relevant for RMII mode -> disable in RGMII mode */
meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
PRG_ETH0_INVERTED_RMII_CLK, 0);
@@ -224,10 +286,6 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac 
*dwmac)
break;
 
case PHY_INTERFACE_MODE_RMII:
-   

Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node

2018-04-27 Thread Kevin Easton
On Fri, Apr 27, 2018 at 09:07:56PM -0400, Kevin Easton wrote:
> On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > > so it should be allocated with kzalloc() to ensure all structure padding
> > > is zeroed.
> > > 
> > > Signed-off-by: Kevin Easton 
> > > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> > 
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
> 
> No, I don't believe that is sufficient to fix the problem.

Scratch that, somehow I missed the "..and then we init just the
relevant field" part, sorry.

There's still the padding after the vhost_iotlb_msg to consider.  It's
named in the union but I don't think that's guaranteed to be initialised
when the iotlb member of the union is used to initialise things.

> I didn't name the padding in my original patch because I wasn't sure
> if the padding actually exists on 32 bit architectures?

This might still be a concern, too?

At the end of the day, zeroing 96 bytes (the full size of vhost_msg_node)
is pretty quick.

- Kevin


Re: [PATCH 2/2] bpf: btf: remove a couple conditions

2018-04-27 Thread Martin KaFai Lau
On Fri, Apr 27, 2018 at 02:26:50PM -0700, Martin KaFai Lau wrote:
> On Fri, Apr 27, 2018 at 11:31:36PM +0300, Dan Carpenter wrote:
> > On Fri, Apr 27, 2018 at 10:21:17PM +0200, Daniel Borkmann wrote:
> > > On 04/27/2018 09:39 PM, Dan Carpenter wrote:
> > > > On Fri, Apr 27, 2018 at 10:55:46AM -0700, Martin KaFai Lau wrote:
> > > >> On Fri, Apr 27, 2018 at 10:20:25AM -0700, Martin KaFai Lau wrote:
> > > >>> On Fri, Apr 27, 2018 at 05:04:59PM +0300, Dan Carpenter wrote:
> > >  We know "err" is zero so we can remove these and pull the code in one
> > >  indent level.
> > > 
> > >  Signed-off-by: Dan Carpenter 
> > > >>> Thanks for the simplification!
> > > >>>
> > > >>> Acked-by: Martin KaFai Lau 
> > > >> btw, it should be for bpf-next.  Please tag the subject with bpf-next 
> > > >> when
> > > >> you respin. Thanks!
> > > 
> > > Dan, thanks a lot for your fixes! Please respin with addressing Martin's
> > > feedback when you get a chance.
> > > 
> > 
> > My understanding is that he'd prefer we just ignore the static checker
> > warning since it's a false positive.
> Right, I think patch 1 is not needed.  I would prefer to use a comment
> in those cases.
> 
> > Should I instead initialize the
> > size to zero or something just to silence it?
After another thought, I think initializing size to zero is
fine, which is less intrusive.

Thanks!
Martin

> > 
> > regards,
> > dan carpenter
> > 


Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node

2018-04-27 Thread Kevin Easton
On Fri, Apr 27, 2018 at 07:05:45PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> > The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> > so it should be allocated with kzalloc() to ensure all structure padding
> > is zeroed.
> > 
> > Signed-off-by: Kevin Easton 
> > Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> 
> Does it help if a patch naming the padding is applied,
> and then we init just the relevant field?
> Just curious.

No, I don't believe that is sufficient to fix the problem.

The structure is allocated by kmalloc(), then individual fields are
initialised.  The named padding would be forced to be initialised if
the struct were initialised with a struct initialiser, but that's not the case.
The compiler is free to leave padding0 with whatever junk kmalloc()
left there.

Having said that, naming the padding *does* help - technically, the
compiler is allowed to put whatever it likes in the padding every time
you modify the struct.  It really needs both.
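In other words, the difference being discussed is roughly this (sketch only;
the surrounding vhost code is elided):

struct vhost_msg_node *node;

/* kmalloc() + per-field init: any compiler-inserted padding inside
 * node->msg keeps whatever the allocator left there, and copying the
 * whole struct vhost_msg to userspace later leaks those bytes. */
node = kmalloc(sizeof(*node), GFP_KERNEL);

/* kzalloc() zeroes the entire allocation, padding included, so the
 * same copy can no longer leak heap contents. */
node = kzalloc(sizeof(*node), GFP_KERNEL);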

I didn't name the padding in my original patch because I wasn't sure
if the padding actually exists on 32 bit architectures?

- Kevin


Re: Suggestions on iterating eBPF maps

2018-04-27 Thread Alexei Starovoitov
On Fri, Apr 27, 2018 at 06:33:56PM +, Chenbo Feng wrote:
> resend with  plain text
> 
> On Fri, Apr 27, 2018 at 11:22 AM Chenbo Feng  wrote:
> 
> > Hi net-next,
> 
> > When doing eBPF tools user-space development I noticed that the map
> iteration process in user space has some little flaws. If we want to dump
> the whole map, the only way I currently know of is to use a null key to start the
> iteration and keep calling bpf_get_next_key and bpf_look_up_elem for each
> new key/value pair until we reach the end of the map. I noticed the
> recently added bpftool uses a similar approach.
> 
> > The overhead of the repeated syscalls is acceptable, but the race problem
> that comes with this iteration process is a little annoying. If the current key
> we are using gets deleted before we do the syscall to get the next key, the
> next key returned will start from the beginning of the map again, and some
> entries will be dumped again depending on the position of the deleted key. If
> the race is within the same userspace process, it can easily be
> fixed by adding some read/write locks. However, if multiple processes are
> reading the map through a pinned fd while one process is editing the
> map entries or the kernel program is deleting entries, it becomes harder to
> get a consistent and correct map dump.
> 
> > We are wondering if there is already an implementation we didn't notice in
> the mainline kernel that helps improve this iteration process and addresses the
> race problem mentioned above? If not, what can be done to address the
> issue? One thing we came up with is to use a single-entry bpf map as
> a cross-process lock to prevent multiple userspace processes from reading/writing
> other maps at the same time. But I don't know how safe this solution is,
> since there will still be a race to read the lock map value and set up the
> lock.

To avoid seeing duplicate keys due to parallel removal, one can walk all
keys with get_next first, remove duplicate keys, and then look up their values.
By that time some elements could have been removed and those lookups will fail.
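A sketch of that two-pass approach with the libbpf syscall wrappers
(deduplication and error handling trimmed; the key/value sizes and the
bpf.h include path are assumptions, not something fixed by this thread):

#include <bpf/bpf.h>	/* bpf_map_get_next_key(), bpf_map_lookup_elem() */

#define KEY_SZ		4	/* assumed key size   */
#define VAL_SZ		2048	/* assumed value size */
#define MAX_KEYS	4096

/* Pass 1: snapshot all keys. Pass 2: look them up, tolerating failures
 * for entries that were deleted in between. */
static void dump_map(int map_fd,
		     void (*cb)(const void *key, const void *val))
{
	static unsigned char keys[MAX_KEYS][KEY_SZ];
	unsigned char value[VAL_SZ];
	void *prev = NULL;
	int i, n = 0;

	while (n < MAX_KEYS && !bpf_map_get_next_key(map_fd, prev, keys[n]))
		prev = keys[n++];

	for (i = 0; i < n; i++)
		if (!bpf_map_lookup_elem(map_fd, keys[i], value))
			cb(keys[i], value);
}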

Another approach could be to use map-in-map and do an almost atomic
replace of the whole map with a new, potentially empty map. The prog
can continue using the new map, while user space walks the no longer
accessed old map.

Yet another approach would be to introduce a knob to the prog
that user space controls and make the program obey that knob.
When it's on, the prog won't be deleting/updating maps.



Re: [Cake] [PATCH iproute2-next v7] Add support for cake qdisc

2018-04-27 Thread Stephen Hemminger
On Fri, 27 Apr 2018 21:57:20 +0200
Toke Høiland-Jørgensen  wrote:

> sch_cake is intended to squeeze the most bandwidth and latency out of even
> the slowest ISP links and routers, while presenting an API simple enough
> that even an ISP can configure it.
> 
> Example of use on a cable ISP uplink:
> 
> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
> 
> To shape a cable download link (ifb and tc-mirred setup elided)
> 
> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash 
> besteffort
> 
> Cake is filled with:
> 
> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
> * A novel "triple-isolate" mode (the default) which balances per-host
>   and per-flow FQ even through NAT.
> * A deficit-based shaper that can also be used in an unlimited mode.
> * 8 way set associative hashing to reduce flow collisions to a minimum.
> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
> * Support for zeroing diffserv markings for entering and exiting traffic.
> * Support for interacting well with Docsis 3.0 shaper framing.
> * Support for DSL framing types and shapers.
> * Support for ack filtering.
> * Extensive statistics for measuring loss, ECN markings, and latency variation.
> 
> Various versions have been baking as an out-of-tree build for
> kernel versions going back to 3.10, as the embedded router world has been
> running a few years behind mainline Linux. A stable version has been
> generally available on lede-17.01 and later.
> 
> sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
> in the sqm-scripts, with sane defaults and vastly simpler configuration.
> 
> Cake's principal author is Jonathan Morton, with contributions from
> Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
> Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
> and Loganaden Velvindron.
> 
> Testing from Pete Heist, Georgios Amanakis, and the many other members of
> the c...@lists.bufferbloat.net mailing list.
> 
> Signed-off-by: Dave Taht 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
> Changelog:
> v7:
>   - Move the target/interval presets to a table and check that only
> one is passed.
> 
> v6:
>   - Identical to v5 because apparently I don't git so well... :/
> 
> v5:
>   - Print the SPLIT_GSO flag
>   - Switch to print_u64() for JSON output
>   - Fix a format string for mpu option output
> 
> v4:
>   - Switch stats parsing to use nested netlink attributes
>   - Tweaks to JSON stats output keys
> 
> v3:
>   - Remove accidentally included test flag
> 
> v2:
>   - Updated netlink config ABI
>   - Remove diffserv-llt mode
>   - Various tweaks and clean-ups of stats output
>  man/man8/tc-cake.8 | 632 ++
>  man/man8/tc.8  |   1 +
>  tc/Makefile|   1 +
>  tc/q_cake.c| 748 +
>  4 files changed, 1382 insertions(+)
>  create mode 100644 man/man8/tc-cake.8
>  create mode 100644 tc/q_cake.c

Looks good to me, when cake makes it into net-next.


Re: [PATCH v4] bpf, x86_32: add eBPF JIT compiler for ia32

2018-04-27 Thread Daniel Borkmann
On 04/26/2018 12:12 PM, Wang YanQing wrote:
[...]
> +/* encode 'dst_reg' and 'src_reg' registers into x86_32 opcode 'byte' */
> +static u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg)
> +{
> + return byte + dst_reg + (src_reg << 3);
> +}
> +
> +static void jit_fill_hole(void *area, unsigned int size)
> +{
> + /* fill whole space with int3 instructions */
> + memset(area, 0xcc, size);
> +}
> +
> +/* Checks whether BPF register is on scratch stack space or not. */
> +static inline bool is_on_stack(u8 bpf_reg)
> +{
> + static u8 stack_regs[] = {BPF_REG_AX};

Nit: you call this stack_regs here ...

> + int i, reg_len = sizeof(stack_regs);
> +
> + for (i = 0 ; i < reg_len ; i++) {
> + if (bpf_reg == stack_regs[i])
> + return false;

... but [BPF_REG_AX] = {IA32_ESI, IA32_EDI} is the only one
that is not on stack?

> + }
> + return true;
> +}
> +
> +static inline void emit_ia32_mov_i(const u8 dst, const u32 val, bool dstk,
> +u8 **pprog)
> +{
> + u8 *prog = *pprog;
> + int cnt = 0;
> +
> + if (dstk) {
> + if (val == 0) {
> + /* xor eax,eax */
> + EMIT2(0x33, add_2reg(0xC0, IA32_EAX, IA32_EAX));
> + /* mov dword ptr [ebp+off],eax */
> + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
> +   STACK_VAR(dst));
> + } else {
> + EMIT3_off32(0xC7, add_1reg(0x40, IA32_EBP),
> + STACK_VAR(dst), val);
> + }
> + } else {
> + if (val == 0)
> + EMIT2(0x33, add_2reg(0xC0, dst, dst));
> + else
> + EMIT2_off32(0xC7, add_1reg(0xC0, dst),
> + val);
> + }
> + *pprog = prog;
> +}
> +
[...]
> + if (is_imm8(jmp_offset)) {
> + EMIT2(jmp_cond, jmp_offset);
> + } else if (is_simm32(jmp_offset)) {
> + EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
> + } else {
> + pr_err("cond_jmp gen bug %llx\n", jmp_offset);
> + return -EFAULT;
> + }
> +
> + break;
> + }
> + case BPF_JMP | BPF_JA:
> + jmp_offset = addrs[i + insn->off] - addrs[i];
> + if (!jmp_offset)
> + /* optimize out nop jumps */
> + break;

Needs same fix as in x86-64 JIT in 1612a981b766 ("bpf, x64: fix JIT emission
for dead code").

> +emit_jmp:
> + if (is_imm8(jmp_offset)) {
> + EMIT2(0xEB, jmp_offset);
> + } else if (is_simm32(jmp_offset)) {
> + EMIT1_off32(0xE9, jmp_offset);
> + } else {
> + pr_err("jmp gen bug %llx\n", jmp_offset);
> + return -EFAULT;
> + }
> + break;


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-27 Thread Siwei Liu
On Thu, Apr 26, 2018 at 4:42 PM, Michael S. Tsirkin  wrote:
> On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin  wrote:
>> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin  
>> >> wrote:
>> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin  
>> >> >> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin 
>> >> >> >>  wrote:
>> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> >> "Michael S. Tsirkin"  wrote:
>> >> >> >> >>
>> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger 
>> >> >> >> >> > wrote:
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> >> >> > > > >especially the
>> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors 
>> >> >> >> >> > > > >to support transparent
>> >> >> >> >> > > > >mode, ans we really can't have a new model; or do 
>> >> >> >> >> > > > >backport.
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> >> > > >
>> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> >> >> > > > oddities.
>> >> >> >> >> > >
>> >> >> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> >> >> > > userspace application.
>> >> >> >> >> >
>> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at 
>> >> >> >> >> > netdevs?
>> >> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >> >
>> >> >> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> >> >> > > application.
>> >> >> >> >> >
>> >> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> >> >> > userspace
>> >> >> >> >> > API at all.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux 
>> >> >> >> >> network devices
>> >> >> >> >> to look for Linux netvsc device and the paired VF device and 
>> >> >> >> >> setup the
>> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe 
>> >> >> >> >> (bondingish) instance
>> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as 
>> >> >> >> >> the Mellanox
>> >> >> >> >> VF device.
>> >> >> >> >>
>> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 
>> >> >> >> >> device model
>> >> >> >> >> or start hiding devices from userspace.
>> >> >> >> >
>> >> >> >> > Okay so how does the existing patch break that? IIUC does not go 
>> >> >> >> > to
>> >> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >> >
>> >> >> >> >> Also, I am working on associating netvsc and VF device based on 
>> >> >> >> >> serial number
>> >> >> >> >> rather than MAC address. The serial number is how Windows works 
>> >> >> >> >> now, and it makes
>> >> >> >> >> sense for Linux and Windows to use the same mechanism if 
>> >> >> >> >> possible.
>> >> >> >> >
>> >> >> >> > Maybe we should support same for virtio ...
>> >> >> >> > Which serial do you mean? From vpd?
>> >> >> >> >
>> >> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >> >> >
>> >> >> >> > It all seems like a reasonable thing to support in the generic 
>> >> >> >> > core.
>> >> >> >>
>> >> >> >> That's the reason why I chose explicit identifier rather than rely 
>> >> >> >> on
>> >> >> >> MAC address to bind/pair a device. MAC address can change. Even if 
>> >> >> >> it
>> >> >> >> can't, malicious guest user can fake MAC address to skip binding.
>> >> >> >>
>> >> >> >> -Siwei
>> >> >> >
>> >> >> > Address should be sampled at device creation to prevent this
>> >> >> > kind of hack. Not that it buys the malicious user much:
>> >> >> > if you can poke at MAC addresses you probably already can
>> >> >> > break networking.
>> >> >>
>> >> >> I don't understand why poking at MAC address may potentially break
>> >> >> networking.
>> >> >
>> >> > Set a MAC address to match another device on the same LAN,
>> >> > packets will stop reaching that MAC.
>> >>
>> >> What I meant was guest users may create a virtual link, say veth that
>> >> has exactly the same MAC address as that for the VF, which can easily
>> >> get around of the binding procedure.
>> >
>> > This patchset limits binding to PCI devices so it won't be affected
>> > by any hacks around virtual devices.
>>
>> Wait, I vaguely recall you seemed to like to generalize this feature
>> 

[RFC net-next 2/5] net: ethtool: Add UAPI for PHY test modes

2018-04-27 Thread Florian Fainelli
Add the necessary UAPI changes to support querying the implemented PHY test
modes and, optionally, associated test-specific data. This will be
used as the foundation for supporting:

- IEEE standard electrical test modes
- cable diagnostics
- packet tester

Signed-off-by: Florian Fainelli 
---
 include/uapi/linux/ethtool.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 4ca65b56084f..a8befecfe853 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -567,6 +567,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
  * @ETH_SS_PHY_TUNABLES: PHY tunable names
+ * @ETH_SS_PHY_TESTS: PHY tests, for use with %ETHTOOL_GPHYTEST
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -578,6 +579,7 @@ enum ethtool_stringset {
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
ETH_SS_PHY_TUNABLES,
+   ETH_SS_PHY_TESTS,
 };
 
 /**
@@ -1296,6 +1298,25 @@ enum ethtool_fec_config_bits {
ETHTOOL_FEC_BASER_BIT,
 };
 
+/**
+ * struct ethtool_phy_test - Ethernet PHY test mode
+ * @cmd: Command number = %ETHTOOL_GPHYTEST or %ETHTOOL_SPHYTEST
+ * @flags: A bitmask of flags from  ethtool_test_flags.  Some
+ * flags may be set by the user on entry; others may be set by
+ * the driver on return.
+ * @mode: PHY test mode to enter. The index should be a valid test mode
+ * obtained through ethtool_get_strings with %ETH_SS_PHY_TESTS
+ * @len: The length of the test specific array @data
+ * @data: Array of test specific results to be interpreted with @mode
+ */
+struct ethtool_phy_test {
+   __u32   cmd;
+   __u32   flags;
+   __u32   mode;
+   __u32   len;
+   __u8data[0];
+};
+
 #define ETHTOOL_FEC_NONE   (1 << ETHTOOL_FEC_NONE_BIT)
 #define ETHTOOL_FEC_AUTO   (1 << ETHTOOL_FEC_AUTO_BIT)
 #define ETHTOOL_FEC_OFF(1 << ETHTOOL_FEC_OFF_BIT)
@@ -1396,6 +1417,8 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 #define ETHTOOL_GFECPARAM  0x0050 /* Get FEC settings */
 #define ETHTOOL_SFECPARAM  0x0051 /* Set FEC settings */
+#define ETHTOOL_GPHYTEST   0x0052 /* Get PHY test mode(s) */
+#define ETHTOOL_SPHYTEST   0x0053 /* Set PHY test mode */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
-- 
2.14.1



[RFC net-next 0/5] Support for PHY test modes

2018-04-27 Thread Florian Fainelli
Hi all,

This patch series adds support for specifying PHY test modes through ethtool
and paves the way for adding support for more complex test modes that might
require data to be exchanged between user and kernel space.

As an example, patches are included to add support for the IEEE electrical test
modes for 100BaseT2 and 1000BaseT. Those do not require data to be passed back
and forth.

I believe the infrastructure to be usable enough to add support for other things
like:

- cable diagnostics
- pattern generator/waveform generator with specific pattern being indicated
  for instance

Questions for Andrew, and others:

- there could be room for adding additional ETH_TEST_FL_* values in order to
  help determine how the test should be run
- some of these tests can be disruptive to connectivity; the minimum we could
  do is stop the PHY state machine and restart it when "normal" is used to exit
  those test modes

Comments welcome!

Example:

# ethtool --get-phy-tests gphy
PHY tests gphy:
 normal (Test data: No)
 100baseT2-tx-waveform (Test data: No)
 100baseT2-tx-jitter (Test data: No)
 100baseT2-tx-idle (Test data: No)
 1000baseT-tx-waveform (Test data: No)
 1000baseT-tx-jitter-master (Test data: No)
 1000baseT-tx-jitter-slave (Test data: No)
 1000BaseT-tx-distorsion (Test data: No)
# ethtool --set-phy-test gphy 100baseT2-tx-waveform
# [   65.262513] brcm-sf2 f0b0.ethernet_switch gphy: Link is Down


Florian Fainelli (5):
  net: phy: Pass stringset argument to ethtool operations
  net: ethtool: Add UAPI for PHY test modes
  net: ethtool: Add plumbing to get/set PHY test modes
  net: phy: Add support for IEEE standard test modes
  net: phy: broadcom: Add support for PHY test modes

 drivers/net/dsa/b53/b53_common.c |   4 +-
 drivers/net/phy/Kconfig  |   6 ++
 drivers/net/phy/Makefile |   4 +-
 drivers/net/phy/bcm-phy-lib.c|  21 --
 drivers/net/phy/bcm-phy-lib.h|   4 +-
 drivers/net/phy/bcm7xxx.c|   9 ++-
 drivers/net/phy/broadcom.c   |   6 +-
 drivers/net/phy/marvell.c|  11 ++-
 drivers/net/phy/micrel.c |  11 ++-
 drivers/net/phy/phy-tests.c  | 159 +++
 drivers/net/phy/smsc.c   |  10 ++-
 include/linux/phy.h  |  99 +---
 include/net/dsa.h|   4 +-
 include/uapi/linux/ethtool.h |  23 ++
 net/core/ethtool.c   |  86 +++--
 net/dsa/master.c |   9 ++-
 net/dsa/port.c   |   8 +-
 17 files changed, 427 insertions(+), 47 deletions(-)
 create mode 100644 drivers/net/phy/phy-tests.c

-- 
2.14.1



[RFC net-next 3/5] net: ethtool: Add plumbing to get/set PHY test modes

2018-04-27 Thread Florian Fainelli
Implement the core ethtool changes to get/set PHY test modes. No driver
implements them yet, but the internal API is defined and now allows it.
We also provide the required helpers in PHYLIB in order to call the
appropriate functions within the drivers.
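For illustration, a hypothetical driver would hook the new callbacks roughly
like this; the callback names match the phy_driver additions in this patch,
while everything prefixed foo_ is made up:

/* Hypothetical driver wiring for the new test-mode callbacks. */
static int foo_phy_set_test(struct phy_device *phydev,
			    struct ethtool_phy_test *test, const u8 *data)
{
	/* enter test->mode, e.g. by writing vendor-specific test registers */
	return 0;
}

static struct phy_driver foo_phy_driver = {
	/* ... the usual phy_driver fields ... */
	.set_test	= foo_phy_set_test,
	/* .get_test_len/.get_test are only needed for modes reporting data */
};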

Signed-off-by: Florian Fainelli 
---
 include/linux/phy.h | 63 --
 net/core/ethtool.c  | 79 ++---
 2 files changed, 135 insertions(+), 7 deletions(-)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index deba0c11647f..449afde7ca7c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -659,6 +659,14 @@ struct phy_driver {
struct ethtool_tunable *tuna,
const void *data);
int (*set_loopback)(struct phy_device *dev, bool enable);
+
+   /* Get and Set PHY test modes */
+   int (*get_test_len)(struct phy_device *dev, u32 mode);
+   int (*get_test)(struct phy_device *dev,
+   struct ethtool_phy_test *test, u8 *data);
+   int (*set_test)(struct phy_device *dev,
+   struct ethtool_phy_test *test,
+   const u8 *data);
 };
 #define to_phy_driver(d) container_of(to_mdio_common_driver(d),
\
  struct phy_driver, mdiodrv)
@@ -1090,9 +1098,11 @@ static inline int phy_ethtool_get_sset_count(struct 
phy_device *phydev,
if (!phydev->drv)
return -EIO;
 
-   if (phydev->drv->get_sset_count &&
-   phydev->drv->get_strings &&
-   phydev->drv->get_stats) {
+   if (!phydev->drv->get_sset_count || !phydev->drv->get_strings)
+   return -EOPNOTSUPP;
+
+   if (phydev->drv->get_stats || phydev->drv->get_test_len ||
+   phydev->drv->get_test || phydev->drv->set_test) {
mutex_lock(&phydev->lock);
ret = phydev->drv->get_sset_count(phydev, sset);
mutex_unlock(&phydev->lock);
@@ -1116,6 +1126,53 @@ static inline int phy_ethtool_get_stats(struct 
phy_device *phydev,
return 0;
 }
 
+static inline int phy_ethtool_get_test_len(struct phy_device *phydev,
+  u32 mode)
+{
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->get_test_len(phydev, mode);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+}
+
+static inline int phy_ethtool_get_test(struct phy_device *phydev,
+  struct ethtool_phy_test *test,
+  u8 *data)
+{
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->get_test(phydev, test, data);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+}
+
+static inline int phy_ethtool_set_test(struct phy_device *phydev,
+  struct ethtool_phy_test *test,
+  const u8 *data)
+{
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->set_test(phydev, test, data);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+}
+
 extern struct bus_type mdio_bus_type;
 
 struct mdio_board_info {
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 0b9e2a44e1d1..52d2c9bc49b4 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -227,8 +227,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, 
int sset)
if (sset == ETH_SS_PHY_TUNABLES)
return ARRAY_SIZE(phy_tunable_strings);
 
-   if (sset == ETH_SS_PHY_STATS && dev->phydev &&
-   !ops->get_ethtool_phy_stats)
+   if ((sset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats) ||
+   (sset == ETH_SS_PHY_TESTS && dev->phydev))
return phy_ethtool_get_sset_count(dev->phydev, sset);
 
if (ops->get_sset_count && ops->get_strings)
@@ -252,8 +253,9 @@ static void __ethtool_get_strings(struct net_device *dev,
memcpy(data, tunable_strings, sizeof(tunable_strings));
else if (stringset == ETH_SS_PHY_TUNABLES)
memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
-   else if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
-!ops->get_ethtool_phy_stats)
+   else if ((stringset == ETH_SS_PHY_STATS && dev->phydev &&
+!ops->get_ethtool_phy_stats) ||
+(stringset == ETH_SS_PHY_TESTS && dev->phydev))
phy_ethtool_get_strings(dev->phydev, stringset, data);
else
/* ops->get_strings is valid because checked earlier */
@@ -2016,6 +2018,68 @@ static int ethtool_get_phy_stats(struct net_device *dev, 
void __user *useraddr)
return ret;
 }
 
+static int ethtool_get_phy_test(struct net_device 
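
For illustration only (this is a sketch, not the code from this patch, which
is truncated above): such a handler would likely mirror ethtool_self_test(),
sizing the reply via the new phy_ethtool_get_test_len() helper and copying
the variable-length result back to user space:

#include <linux/ethtool.h>
#include <linux/netdevice.h>
#include <linux/phy.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

static int ethtool_get_phy_test_sketch(struct net_device *dev,
				       void __user *useraddr)
{
	struct phy_device *phydev = dev->phydev;
	struct ethtool_phy_test test;
	u8 *data;
	int ret, len;

	if (!phydev)
		return -EOPNOTSUPP;

	if (copy_from_user(&test, useraddr, sizeof(test)))
		return -EFAULT;

	/* Ask the PHY driver how much test-specific data this mode returns */
	len = phy_ethtool_get_test_len(phydev, test.mode);
	if (len < 0)
		return len;

	data = kcalloc(len, sizeof(u8), GFP_USER);
	if (len && !data)
		return -ENOMEM;

	ret = phy_ethtool_get_test(phydev, &test, data);
	if (ret)
		goto out;

	test.len = len;
	ret = -EFAULT;
	if (copy_to_user(useraddr, &test, sizeof(test)))
		goto out;
	useraddr += sizeof(test);
	if (len && copy_to_user(useraddr, data, len))
		goto out;
	ret = 0;
out:
	kfree(data);
	return ret;
}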

[RFC net-next 4/5] net: phy: Add support for IEEE standard test modes

2018-04-27 Thread Florian Fainelli
Add support for the 100BaseT2 and 1000BaseT standard test modes as
defined by IEEE 802.3-2012, sections two and three. We provide a set
of helper functions for PHY drivers to either punt entirely onto the
genphy_* functions or, if they desire, build additional tests on top of
the standard ones.
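
As an illustration only (not part of this patch), a driver layering extra
modes on top of the generic implementation could look like the sketch below,
where foo_phy_enter_vendor_mode() is hypothetical:

#include <linux/phy.h>

static int foo_phy_set_test(struct phy_device *phydev,
			    struct ethtool_phy_test *test, const u8 *data)
{
	/* Standard IEEE modes are punted to the generic helper */
	if (test->mode < PHY_STD_TEST_MODE_MAX)
		return genphy_set_test(phydev, test, data);

	/* Vendor-specific modes are numbered after the standard ones */
	return foo_phy_enter_vendor_mode(phydev, test->mode, data);
}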

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/Kconfig |   6 ++
 drivers/net/phy/Makefile|   4 +-
 drivers/net/phy/phy-tests.c | 159 
 include/linux/phy.h |  22 ++
 4 files changed, 190 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/phy/phy-tests.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index edb8b9ab827f..ef3f2f1ae990 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -200,6 +200,12 @@ config LED_TRIGGER_PHY
Mbps OR Gbps OR link
for any speed known to the PHY.
 
+config PHYLIB_TEST_MODES
+   bool "Support for test modes"
+   ---help---
+ Selecting this option will allow the PHY library to support
+ test modes: electrical, cable diagnostics, pattern generator etc.
+
 
 comment "MII PHY device drivers"
 
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 701ca0b8717e..e9905432e164 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 # Makefile for Linux PHY drivers and MDIO bus drivers
 
-libphy-y   := phy.o phy-c45.o phy-core.o phy_device.o
+libphy-y   := phy.o phy-c45.o phy-core.o phy_device.o \
+  phy-tests.o
 mdio-bus-y += mdio_bus.o mdio_device.o
 
 ifdef CONFIG_MDIO_DEVICE
@@ -18,6 +19,7 @@ obj-$(CONFIG_MDIO_DEVICE) += mdio-bus.o
 endif
 libphy-$(CONFIG_SWPHY) += swphy.o
 libphy-$(CONFIG_LED_TRIGGER_PHY)   += phy_led_triggers.o
+libphy-$(CONFIG_PHYLIB_TEST_MODES) += phy-tests.o
 
 obj-$(CONFIG_PHYLINK)  += phylink.o
 obj-$(CONFIG_PHYLIB)   += libphy.o
diff --git a/drivers/net/phy/phy-tests.c b/drivers/net/phy/phy-tests.c
new file mode 100644
index ..5709d7821925
--- /dev/null
+++ b/drivers/net/phy/phy-tests.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/* PHY library common test modes
+ */
+#include 
+#include 
+
+/* genphy_get_test - Get PHY test specific data
+ * @phydev: the PHY device instance
+ * @test: the desired test mode
+ * @data: test specific data (none)
+ */
+int genphy_get_test(struct phy_device *phydev, struct ethtool_phy_test *test,
+   u8 *data)
+{
+   if (test->mode >= PHY_STD_TEST_MODE_MAX)
+   return -EOPNOTSUPP;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(genphy_get_test);
+
+/* genphy_set_test - Make a PHY enter one of the standard IEEE defined
+ * test modes
+ * @phydev: the PHY device instance
+ * @test: the desired test mode
+ * @data: test specific data (none)
+ *
+ * This function makes the designated @phydev enter the desired standard
+ * 100BaseT2 or 1000BaseT test mode as defined in IEEE 802.3-2012 section TWO
+ * and THREE under 32.6.1.2.1 and 40.6.1.1.2 respectively
+ */
+int genphy_set_test(struct phy_device *phydev,
+   struct ethtool_phy_test *test, const u8 *data)
+{
+   u16 shift, base, bmcr = 0;
+   int ret;
+
+   /* Exit test mode */
+   if (test->mode == PHY_STD_TEST_MODE_NORMAL) {
+   ret = phy_read(phydev, MII_CTRL1000);
+   if (ret < 0)
+   return ret;
+
+   ret &= ~GENMASK(15, 13);
+
+   return phy_write(phydev, MII_CTRL1000, ret);
+   }
+
+   switch (test->mode) {
+   case PHY_STD_TEST_MODE_100BASET2_1:
+   case PHY_STD_TEST_MODE_100BASET2_2:
+   case PHY_STD_TEST_MODE_100BASET2_3:
+   if (!(phydev->supported & PHY_100BT_FEATURES))
+   return -EOPNOTSUPP;
+
+   shift = 14;
+   base = test->mode - PHY_STD_TEST_MODE_NORMAL;
+   bmcr = BMCR_SPEED100;
+   break;
+
+   case PHY_STD_TEST_MODE_1000BASET_1:
+   case PHY_STD_TEST_MODE_1000BASET_2:
+   case PHY_STD_TEST_MODE_1000BASET_3:
+   case PHY_STD_TEST_MODE_1000BASET_4:
+   if (!(phydev->supported & PHY_1000BT_FEATURES))
+   return -EOPNOTSUPP;
+
+   shift = 13;
+   base = test->mode - PHY_STD_TEST_MODE_100BASET2_MAX;
+   bmcr = BMCR_SPEED1000;
+   break;
+
+   default:
+   /* Let an upper driver deal with additional modes it may
+* support
+*/
+   return -EOPNOTSUPP;
+   }
+
+   /* Force speed and duplex */
+   ret = phy_write(phydev, MII_BMCR, bmcr | BMCR_FULLDPLX);
+   if (ret < 0)
+   return ret;
+
+   /* Set the desired test mode bit */
+   

[RFC net-next 5/5] net: phy: broadcom: Add support for PHY test modes

2018-04-27 Thread Florian Fainelli
Re-use the generic PHY library test modes for 100BaseT2 and 1000BaseT
and advertise support for those through the newly added ethtool knobs.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm-phy-lib.c | 15 +--
 drivers/net/phy/bcm7xxx.c |  3 +++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index e797e0863895..cb3081e523a5 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -334,6 +334,8 @@ int bcm_phy_get_sset_count(struct phy_device *phydev, int 
sset)
 {
if (sset == ETH_SS_PHY_STATS)
return ARRAY_SIZE(bcm_phy_hw_stats);
+   else if (sset == ETH_SS_PHY_TESTS)
+   return genphy_get_test_count(phydev);
 
return -EOPNOTSUPP;
 }
@@ -343,12 +345,13 @@ void bcm_phy_get_strings(struct phy_device *phydev, u32 
stringset, u8 *data)
 {
unsigned int i;
 
-   if (stringset != ETH_SS_PHY_STATS)
-   return;
-
-   for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
-   strlcpy(data + i * ETH_GSTRING_LEN,
-   bcm_phy_hw_stats[i].string, ETH_GSTRING_LEN);
+   if (stringset == ETH_SS_PHY_STATS) {
+   for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
+   strlcpy(data + i * ETH_GSTRING_LEN,
+   bcm_phy_hw_stats[i].string, ETH_GSTRING_LEN);
+   } else if (stringset == ETH_SS_PHY_TESTS) {
+   genphy_get_test_strings(phydev, data);
+   }
 }
 EXPORT_SYMBOL_GPL(bcm_phy_get_strings);
 
diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 1835af147eea..1efd287ed320 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -619,6 +619,9 @@ static int bcm7xxx_28nm_probe(struct phy_device *phydev)
.get_sset_count = bcm_phy_get_sset_count,   \
.get_strings= bcm_phy_get_strings,  \
.get_stats  = bcm7xxx_28nm_get_phy_stats,   \
+   .set_test   = genphy_set_test,  \
+   .get_test   = genphy_get_test,  \
+   .get_test_len   = genphy_get_test_len,  \
.probe  = bcm7xxx_28nm_probe,   \
 }
 
-- 
2.14.1



[RFC net-next 1/5] net: phy: Pass stringset argument to ethtool operations

2018-04-27 Thread Florian Fainelli
In preparation for returning string sets other than ETH_SS_STATS, update
the PHY drivers, helpers, and consumers of these functions.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c |  4 ++--
 drivers/net/phy/bcm-phy-lib.c| 12 +---
 drivers/net/phy/bcm-phy-lib.h|  4 ++--
 drivers/net/phy/bcm7xxx.c|  6 --
 drivers/net/phy/broadcom.c   |  6 --
 drivers/net/phy/marvell.c| 11 +--
 drivers/net/phy/micrel.c | 11 +--
 drivers/net/phy/smsc.c   | 10 --
 include/linux/phy.h  | 14 --
 include/net/dsa.h|  4 ++--
 net/core/ethtool.c   |  7 ---
 net/dsa/master.c |  9 +
 net/dsa/port.c   |  8 
 13 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 9f561fe505cb..8201e8f5c028 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -837,7 +837,7 @@ void b53_get_strings(struct dsa_switch *ds, int port, u32 
stringset,
if (!phydev)
return;
 
-   phy_ethtool_get_strings(phydev, data);
+   phy_ethtool_get_strings(phydev, stringset, data);
}
 }
 EXPORT_SYMBOL(b53_get_strings);
@@ -899,7 +899,7 @@ int b53_get_sset_count(struct dsa_switch *ds, int port, int 
sset)
if (!phydev)
return 0;
 
-   return phy_ethtool_get_sset_count(phydev);
+   return phy_ethtool_get_sset_count(phydev, sset);
}
 
return 0;
diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index 0876aec7328c..e797e0863895 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -330,16 +330,22 @@ static const struct bcm_phy_hw_stat bcm_phy_hw_stats[] = {
{ "phy_remote_rcv_nok", MII_BRCM_CORE_BASE14, 0, 8 },
 };
 
-int bcm_phy_get_sset_count(struct phy_device *phydev)
+int bcm_phy_get_sset_count(struct phy_device *phydev, int sset)
 {
-   return ARRAY_SIZE(bcm_phy_hw_stats);
+   if (sset == ETH_SS_PHY_STATS)
+   return ARRAY_SIZE(bcm_phy_hw_stats);
+
+   return -EOPNOTSUPP;
 }
 EXPORT_SYMBOL_GPL(bcm_phy_get_sset_count);
 
-void bcm_phy_get_strings(struct phy_device *phydev, u8 *data)
+void bcm_phy_get_strings(struct phy_device *phydev, u32 stringset, u8 *data)
 {
unsigned int i;
 
+   if (stringset != ETH_SS_PHY_STATS)
+   return;
+
for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
strlcpy(data + i * ETH_GSTRING_LEN,
bcm_phy_hw_stats[i].string, ETH_GSTRING_LEN);
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index 7c73808cbbde..bebcfe106283 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -42,8 +42,8 @@ int bcm_phy_downshift_get(struct phy_device *phydev, u8 
*count);
 
 int bcm_phy_downshift_set(struct phy_device *phydev, u8 count);
 
-int bcm_phy_get_sset_count(struct phy_device *phydev);
-void bcm_phy_get_strings(struct phy_device *phydev, u8 *data);
+int bcm_phy_get_sset_count(struct phy_device *phydev, int sset);
+void bcm_phy_get_strings(struct phy_device *phydev, u32 stringset, u8 *data);
 void bcm_phy_get_stats(struct phy_device *phydev, u64 *shadow,
   struct ethtool_stats *stats, u64 *data);
 
diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 29b1c88b55cc..1835af147eea 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -587,6 +587,9 @@ static void bcm7xxx_28nm_get_phy_stats(struct phy_device 
*phydev,
 static int bcm7xxx_28nm_probe(struct phy_device *phydev)
 {
struct bcm7xxx_phy_priv *priv;
+   int count;
+
+   count = bcm_phy_get_sset_count(phydev, ETH_SS_PHY_STATS);
 
priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
@@ -594,8 +597,7 @@ static int bcm7xxx_28nm_probe(struct phy_device *phydev)
 
phydev->priv = priv;
 
-   priv->stats = devm_kcalloc(&phydev->mdio.dev,
-  bcm_phy_get_sset_count(phydev), sizeof(u64),
+   priv->stats = devm_kcalloc(&phydev->mdio.dev, count, sizeof(u64),
   GFP_KERNEL);
if (!priv->stats)
return -ENOMEM;
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 3bb6b66dc7bf..dd909799baf0 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -547,6 +547,9 @@ struct bcm53xx_phy_priv {
 static int bcm53xx_phy_probe(struct phy_device *phydev)
 {
struct bcm53xx_phy_priv *priv;
+   int count;
+
+   count = bcm_phy_get_sset_count(phydev, ETH_SS_PHY_STATS);
 
priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
@@ -554,8 +557,7 @@ 

[PATCH ethtool 2/2] ethtool: Add support for PHY test modes

2018-04-27 Thread Florian Fainelli
Add two new commands:

--get-phy-tests which fetches the test modes supported by a given
  network device's PHY interface
--set-phy-test which enters one of the modes listed above and
  optionally passes a set of test-specific data

Signed-off-by: Florian Fainelli 
---
 ethtool.c | 115 ++
 1 file changed, 115 insertions(+)

diff --git a/ethtool.c b/ethtool.c
index 3289e0f6e8ec..f02cd3560197 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4854,6 +4854,118 @@ static int do_reset(struct cmd_context *ctx)
return 0;
 }
 
+static int do_gphytest(struct cmd_context *ctx)
+{
+   struct ethtool_gstrings *strings;
+   struct ethtool_phy_test test;
+   unsigned int i;
+   int max_len = 0, cur_len, rc;
+
+   if (ctx->argc != 0)
+   exit_bad_args();
+
+   strings = get_stringset(ctx, ETH_SS_PHY_TESTS, 0, 1);
+   if (!strings) {
+   perror("Cannot get PHY tests strings");
+   return 1;
+   }
+   if (strings->len == 0) {
+   fprintf(stderr, "No PHY tests defined\n");
+   rc = 1;
+   goto err;
+   }
+
+   /* Find longest string and align all strings accordingly */
+   for (i = 0; i < strings->len; i++) {
+   cur_len = strlen((const char*)strings->data +
+i * ETH_GSTRING_LEN);
+   if (cur_len > max_len)
+   max_len = cur_len;
+   }
+
+   printf("PHY tests %s:\n", ctx->devname);
+   for (i = 0; i < strings->len; i++) {
+   memset(&test, 0, sizeof(test));
+   test.cmd = ETHTOOL_GPHYTEST;
+   test.mode = i;
+
+   rc = send_ioctl(ctx, &test);
+   if (rc < 0)
+   continue;
+
+   fprintf(stdout, " %.*s (Test data: %s)\n",
+  max_len,
+  (const char *)strings->data + i * ETH_GSTRING_LEN,
+  test.len ? "Yes" : "No");
+   }
+
+   rc = 0;
+
+err:
+   free(strings);
+   return rc;
+}
+
+static int do_sphytest(struct cmd_context *ctx)
+{
+   struct ethtool_gstrings *strings;
+   struct ethtool_phy_test gtest;
+   struct ethtool_phy_test *stest;
+   unsigned int i;
+   int rc;
+
+   if (ctx->argc < 1)
+   exit_bad_args();
+
+   strings = get_stringset(ctx, ETH_SS_PHY_TESTS, 0, 1);
+   if (!strings) {
+   perror("Cannot get PHY test modes");
+   return 1;
+   }
+
+   if (strings->len == 0) {
+   fprintf(stderr, "No PHY tests defined\n");
+   rc = 1;
+   goto err;
+   }
+
+   for (i = 0; i < strings->len; i++) {
+   if (!strcmp(ctx->argp[0],
+   (const char *)strings->data + i * ETH_GSTRING_LEN))
+   break;
+   }
+
+   if (i == strings->len)
+   exit_bad_args();
+
+   memset(&gtest, 0, sizeof(gtest));
+   gtest.cmd = ETHTOOL_GPHYTEST;
+   gtest.mode = i;
+   rc = send_ioctl(ctx, &gtest);
+   if (rc < 0) {
+   rc = 1;
+   goto err;
+   }
+
+   stest = calloc(1, sizeof(*stest) + gtest.len);
+   if (!stest) {
+   perror("Unable to allocate memory");
+   rc = 1;
+   goto err;
+   }
+
+   stest->cmd = ETHTOOL_SPHYTEST;
+   stest->len = gtest.len;
+   stest->mode = i;
+
+   rc = send_ioctl(ctx, stest);
+   free(stest);
+err:
+   free(strings);
+   return rc;
+}
+
+
 static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on)
 {
if (ctx->argc < 2)
@@ -5223,6 +5335,9 @@ static const struct option {
{ "--show-fec", 1, do_gfec, "Show FEC settings"},
{ "--set-fec", 1, do_sfec, "Set FEC settings",
  " [ encoding auto|off|rs|baser ]\n"},
+   { "--get-phy-tests", 1, do_gphytest,"Get PHY test mode(s)" },
+   { "--set-phy-test", 1, do_sphytest, "Set PHY test mode",
+ " [ test options ]\n" },
{ "-h|--help", 0, show_usage, "Show this help" },
{ "--version", 0, do_version, "Show version number" },
{}
-- 
2.14.1



[PATCH ethtool 1/2] ethtool-copy.h: Sync with net-next

2018-04-27 Thread Florian Fainelli
This brings support for PHY test modes (not accepted yet)

Signed-off-by: Florian Fainelli 
---
 ethtool-copy.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 8cc61e9ab40b..42fb94129da5 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -572,6 +572,7 @@ enum ethtool_stringset {
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
ETH_SS_PHY_TUNABLES,
+   ETH_SS_PHY_TESTS,
 };
 
 /**
@@ -1296,6 +1297,25 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_FEC_RS (1 << ETHTOOL_FEC_RS_BIT)
 #define ETHTOOL_FEC_BASER  (1 << ETHTOOL_FEC_BASER_BIT)
 
+/**
+ * struct ethtool_phy_test - Ethernet PHY test mode
+ * @cmd: Command number = %ETHTOOL_GPHYTEST or %ETHTOOL_SPHYTEST
+ * @flags: A bitmask of flags from  ethtool_test_flags.  Some
+ *  flags may be set by the user on entry; others may be set by
+ *  the driver on return.
+ * @mode: PHY test mode to enter. The index should be a valid test mode
+ * obtained through ethtool_get_strings with %ETH_SS_PHY_TESTS
+ * @len: The length of the test specific array @data
+ * @data: Array of test specific results to be interpreted with @mode
+ */
+struct ethtool_phy_test {
+__u32   cmd;
+__u32   flags;
+__u32   mode;
+__u32   len;
+__u8data[0];
+};
+
 /* CMDs currently supported */
 #define ETHTOOL_GSET   0x0001 /* DEPRECATED, Get settings.
* Please use ETHTOOL_GLINKSETTINGS
@@ -1391,6 +1411,9 @@ enum ethtool_fec_config_bits {
 #define ETHTOOL_GFECPARAM  0x0050 /* Get FEC settings */
 #define ETHTOOL_SFECPARAM  0x0051 /* Set FEC settings */
 
+#define ETHTOOL_GPHYTEST   0x0052 /* Get PHY test mode(s) */
+#define ETHTOOL_SPHYTEST   0x0053 /* Set PHY test mode */
+
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
 #define SPARC_ETH_SSET ETHTOOL_SSET
-- 
2.14.1



Re: [PATCH net-next v2 0/6] mlxsw: SPAN: Support routes pointing at bridges

2018-04-27 Thread David Miller
From: Ido Schimmel 
Date: Fri, 27 Apr 2018 18:11:05 +0300

> Changes from v1 to v2:
> 
> - Change the suite of bridge accessor functions to br_vlan_pvid_rtnl(),
>   br_vlan_info_rtnl(), br_fdb_find_port_rtnl().

Please address Stephen Hemminger's feedback, otherwise this series
looks good to go.

Thanks.


Re: [net-next v2] ipv6: sr: Add documentation for seg_flowlabel sysctl

2018-04-27 Thread David Miller
From: Ahmed Abdelsalam 
Date: Fri, 27 Apr 2018 17:51:48 +0200

> This patch adds a documentation for seg_flowlabel sysctl into
> Documentation/networking/ip-sysctl.txt
> 
> Signed-off-by: Ahmed Abdelsalam 

Applied, thank you.


Re: [PATCH net 0/2] sfc: more ARFS fixes

2018-04-27 Thread David Miller
From: Edward Cree 
Date: Fri, 27 Apr 2018 15:07:19 +0100

> A couple more bits of breakage in my recent ARFS and async filters work.
> Patch #1 in particular fixes a bug that leads to memory trampling and
>  consequent crashes.

Series applied, thanks Edward.


Re: [PATCH] drivers: net: replace UINT64_MAX with U64_MAX

2018-04-27 Thread David Miller
From: Jisheng Zhang 
Date: Fri, 27 Apr 2018 16:18:58 +0800

> U64_MAX is well defined now while the UINT64_MAX is not, so we fall
> back to drivers' own definition as below:
> 
>   #ifndef UINT64_MAX
>   #define UINT64_MAX (u64)(~((u64)0))
>   #endif
> 
> I believe this is in one phy driver then copied and pasted to other phy
> drivers.
> 
> Replace the UINT64_MAX with U64_MAX to clean up the source code.
> 
> Signed-off-by: Jisheng Zhang 

Looks good, applied to net-next, thanks.


Re: [bpf-next PATCH v2 3/3] bpf: selftest additions for SOCKHASH

2018-04-27 Thread Alexei Starovoitov
On Fri, Apr 27, 2018 at 04:24:43PM -0700, John Fastabend wrote:
> This runs existing SOCKMAP tests with SOCKHASH map type. To do this
> we push programs into include file and build two BPF programs. One
> for SOCKHASH and one for SOCKMAP.
> 
> We then run the entire test suite with each type.
> 
> Signed-off-by: John Fastabend 
> ---
>  tools/testing/selftests/bpf/Makefile |3 
>  tools/testing/selftests/bpf/test_sockhash_kern.c |4 
>  tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
>  tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
> --
>  tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
> ++
>  5 files changed, 368 insertions(+), 346 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

Looks like it was mainly a rename of test_sockmap_kern.c into .h
but commit doesn't show it as such.
Can you redo it with 'git mv' ?



Re: [bpf-next PATCH v2 2/3] bpf: sockmap, add hash map support

2018-04-27 Thread Alexei Starovoitov
On Fri, Apr 27, 2018 at 04:24:38PM -0700, John Fastabend wrote:
> Sockmap is currently backed by an array and enforces keys to be
> four bytes. This works well for many use cases and was originally
> modeled after devmap which also uses four bytes keys. However,
> this has become limiting in larger use cases where a hash would
> be more appropriate. For example users may want to use the 5-tuple
> of the socket as the lookup key.
> 
> To support this add hash support.
> 
> Signed-off-by: John Fastabend 
> ---
>  include/linux/bpf.h|8 +
>  include/linux/bpf_types.h  |1 
>  include/uapi/linux/bpf.h   |6 
>  kernel/bpf/core.c  |1 
>  kernel/bpf/sockmap.c   |  494 
> +++-
>  kernel/bpf/verifier.c  |   14 +
>  net/core/filter.c  |   58 +
>  tools/bpf/bpftool/map.c|1 
>  tools/include/uapi/linux/bpf.h |6 
>  9 files changed, 570 insertions(+), 19 deletions(-)

please split tools/* update into separate commit.

Also add man-page style documentation for new helpers to uapi/bpf.h



Re: [PATCH net-next v2 4/5] ipv6: sr: Add seg6local action End.BPF

2018-04-27 Thread Alexei Starovoitov
On Fri, Apr 27, 2018 at 10:59:19AM -0400, David Miller wrote:
> From: Mathieu Xhonneux 
> Date: Tue, 24 Apr 2018 18:44:15 +0100
> 
> > This patch adds the End.BPF action to the LWT seg6local infrastructure.
> > This action works like any other seg6local End action, meaning that an IPv6
> > header with SRH is needed, whose DA has to be equal to the SID of the
> > action. It will also advance the SRH to the next segment, the BPF program
> > does not have to take care of this.
> 
> I'd like to see some BPF developers review this change.
> 
> But on my side I wonder if, instead of validating the whole thing afterwards,
> we should make the helpers accessible by the eBPF program validate the changes
> as they are made.

Looking at the code I don't think it's possible to keep it valid all the time
while building, so seg6_validate_srh() after the program run seems necessary.

I think the whole set should be targeting bpf-next tree.
Please fix kbuild errors, rebase and document new helper in man-page style.
Things like:
+   test_btf_haskv.o test_btf_nokv.o test_lwt_seg6local.o
+>>> selftests/bpf: test for seg6local End.BPF action
should be fixed properly.



Re: [PATCH net-next] net: sch: prio: Set bands to default on delete instead of noop

2018-04-27 Thread David Miller
From: Nogah Frankel 
Date: Thu, 26 Apr 2018 16:32:36 +0300

> When a band is created, it is set to the default qdisc, which is
> "invisible" pfifo.
> However, if a band is set to a qdisc that is later being deleted, it will
> be set to noop qdisc. This can cause a packet loss, while there is no clear
> user indication for it. ("invisible" qdisc are not being shown by default).
> This patch sets a band to the default qdisc, rather then the noop qdisc, on
> delete operation.
> 
> Signed-off-by: Nogah Frankel 

Like Cong, I'm worried this will break something.  The code has
behaved this way for 2 decades or longer.

If you want to put another qdisc there, and thus not drop any traffic,
modify the qdisc to a new one instead of performing a delete operation.


Re: [PATCH v2] net/mlx4_en: fix potential use-after-free with dma_unmap_page

2018-04-27 Thread David Miller
From: Sarah Newman 
Date: Wed, 25 Apr 2018 21:00:34 -0700

> When swiotlb is in use, calling dma_unmap_page means that
> the original page mapped with dma_map_page must still be valid
> as swiotlb will copy data from its internal cache back to the
> originally requested DMA location. When GRO is enabled,
> all references to the original frag may be put before
> mlx4_en_free_frag is called, meaning the page has been freed
> before the call to dma_unmap_page in mlx4_en_free_frag.
> 
> To fix, unmap the page as soon as possible.
> 
> This can be trivially detected by doing the following:
> 
> Compile the kernel with DEBUG_PAGEALLOC
> Run the kernel as a Xen Dom0
> Leave GRO enabled on the interface
> Run a 10 second or more test with iperf over the interface.
> 
> Signed-off-by: Sarah Newman 

Tariq, I assume I will get this from you in the next set of
changes you submit to me.

Thanks.


Re: [PATCH bpf-next v7 05/10] bpf/verifier: improve register value range tracking with ARSH

2018-04-27 Thread Alexei Starovoitov
On Wed, Apr 25, 2018 at 12:29:05PM -0700, Yonghong Song wrote:
> When helpers like bpf_get_stack returns an int value
> and later on used for arithmetic computation, the LSH and ARSH
> operations are often required to get proper sign extension into
> 64-bit. For example, without this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0)
> With this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
> With better range of "R8", later on when "R8" is added to other register,
> e.g., a map pointer or scalar-value register, the better register
> range can be derived and verifier failure may be avoided.
> 
> In our later example,
> ..
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0)
> return 0;
> ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> ..
> Without improving ARSH value range tracking, the register representing
> "max_len - usize" will have smin_value equal to S64_MIN and will be
> rejected by verifier.
> 
> Signed-off-by: Yonghong Song 
> ---
>  include/linux/tnum.h  |  4 +++-
>  kernel/bpf/tnum.c | 10 ++
>  kernel/bpf/verifier.c | 41 +
>  3 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/tnum.h b/include/linux/tnum.h
> index 0d2d3da..c7dc2b5 100644
> --- a/include/linux/tnum.h
> +++ b/include/linux/tnum.h
> @@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max);
>  /* Arithmetic and logical ops */
>  /* Shift a tnum left (by a fixed shift) */
>  struct tnum tnum_lshift(struct tnum a, u8 shift);
> -/* Shift a tnum right (by a fixed shift) */
> +/* Shift (rsh) a tnum right (by a fixed shift) */
>  struct tnum tnum_rshift(struct tnum a, u8 shift);
> +/* Shift (arsh) a tnum right (by a fixed min_shift) */
> +struct tnum tnum_arshift(struct tnum a, u8 min_shift);
>  /* Add two tnums, return @a + @b */
>  struct tnum tnum_add(struct tnum a, struct tnum b);
>  /* Subtract two tnums, return @a - @b */
> diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c
> index 1f4bf68..938d412 100644
> --- a/kernel/bpf/tnum.c
> +++ b/kernel/bpf/tnum.c
> @@ -43,6 +43,16 @@ struct tnum tnum_rshift(struct tnum a, u8 shift)
>   return TNUM(a.value >> shift, a.mask >> shift);
>  }
>  
> +struct tnum tnum_arshift(struct tnum a, u8 min_shift)
> +{
> + /* if a.value is negative, arithmetic shifting by minimum shift
> +  * will have larger negative offset compared to more shifting.
> +  * If a.value is nonnegative, arithmetic shifting by minimum shift
> +  * will have larger positive offset compare to more shifting.
> +  */
> + return TNUM((s64)a.value >> min_shift, (s64)a.mask >> min_shift);
> +}
> +
>  struct tnum tnum_add(struct tnum a, struct tnum b)
>  {
>   u64 sm, sv, sigma, chi, mu;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6e3f859..573807f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2974,6 +2974,47 @@ static int adjust_scalar_min_max_vals(struct 
> bpf_verifier_env *env,
>   /* We may learn something more from the var_off */
>   __update_reg_bounds(dst_reg);
>   break;
> + case BPF_ARSH:
> + if (umax_val >= insn_bitness) {
> + /* Shifts greater than 31 or 63 are undefined.
> +  * This includes shifts by a negative number.
> +  */
> + mark_reg_unknown(env, regs, insn->dst_reg);
> + break;
> + }
> +
> + /* BPF_ARSH is an arithmetic shift. The new range of
> +  * smin_value and smax_value should take the sign
> +  * into consideration.
> +  *
> +  * For example, if smin_value = -16, umin_val = 0
> +  * and umax_val = 2, the new smin_value should be
> +  * -16 >> 0 = -16 since -16 >> 2 = -4.
> +  * If smin_value = 16, umin_val = 0 and umax_val = 2,
> +  * the new smin_value should be 16 >> 2 = 4.
> +  *
> +  * Now suppose smax_value = -4, umin_val = 0 and
> +  * umax_val = 2, the new smax_value should be
> +  * -4 >> 2 = -1. If smax_value = 32 with the same
> +  * umin_val/umax_val, the new smax_value should remain 32.
> +  */
> + if (dst_reg->smin_value < 0)
> + dst_reg->smin_value 

Re: [PATCH] net: support compat 64-bit time in {s,g}etsockopt

2018-04-27 Thread David Miller
From: Lance Richardson 
Date: Wed, 25 Apr 2018 10:21:54 -0400

> For the x32 ABI, struct timeval has two 64-bit fields. However
> the kernel currently interprets the user-space values used for
> the SO_RCVTIMEO and SO_SNDTIMEO socket options as having a pair
> of 32-bit fields.
> 
> When the seconds portion of the requested timeout is less than 2**32,
> the seconds portion of the effective timeout is correct but the
> microseconds portion is zero.  When the seconds portion of the
> requested timeout is zero and the microseconds portion is non-zero,
> the kernel interprets the timeout as zero (never timeout).
> 
> Fix by using 64-bit time for SO_RCVTIMEO/SO_SNDTIMEO as required
> for the ABI.
> 
> The code included below demonstrates the problem.
 ...
> Fixes: 515c7af85ed9 ("x32: Use compat shims for {g,s}etsockopt")
> Reported-by: Gopal RajagopalSai 
> Signed-off-by: Lance Richardson 

Really nice commit message and test case.

Applied and queued up for -stable, thank you.


Re: [PATCH] bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform

2018-04-27 Thread Daniel Borkmann
On 04/28/2018 12:48 AM, Alexei Starovoitov wrote:
> On Thu, Apr 26, 2018 at 05:57:49PM +0800, Wang YanQing wrote:
>> All the testcases for BPF_PROG_TYPE_PERF_EVENT program type in
>> test_verifier(kselftest) report below errors on x86_32:
>> "
>> 172/p unpriv: spill/fill of different pointers ldx FAIL
>> Unexpected error message!
>> 0: (bf) r6 = r10
>> 1: (07) r6 += -8
>> 2: (15) if r1 == 0x0 goto pc+3
>> R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
>> 3: (bf) r2 = r10
>> 4: (07) r2 += -76
>> 5: (7b) *(u64 *)(r6 +0) = r2
>> 6: (55) if r1 != 0x0 goto pc+1
>> R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 
>> fp-8=fp
>> 7: (7b) *(u64 *)(r6 +0) = r1
>> 8: (79) r1 = *(u64 *)(r6 +0)
>> 9: (79) r1 = *(u64 *)(r1 +68)
>> invalid bpf_context access off=68 size=8
>>
>> 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
>> Failed to load prog 'Permission denied'!
>> 0: (b7) r0 = 0
>> 1: (71) r0 = *(u8 *)(r1 +68)
>> invalid bpf_context access off=68 size=1
>>
>> 379/p check bpf_perf_event_data->sample_period half load permitted FAIL
>> Failed to load prog 'Permission denied'!
>> 0: (b7) r0 = 0
>> 1: (69) r0 = *(u16 *)(r1 +68)
>> invalid bpf_context access off=68 size=2
>>
>> 380/p check bpf_perf_event_data->sample_period word load permitted FAIL
>> Failed to load prog 'Permission denied'!
>> 0: (b7) r0 = 0
>> 1: (61) r0 = *(u32 *)(r1 +68)
>> invalid bpf_context access off=68 size=4
>>
>> 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
>> Failed to load prog 'Permission denied'!
>> 0: (b7) r0 = 0
>> 1: (79) r0 = *(u64 *)(r1 +68)
>> invalid bpf_context access off=68 size=8
>> "
>>
>> This patch fix it, the fix isn't only necessary for x86_32, it will fix the
>> same problem for other platforms too, if their size of bpf_user_pt_regs_t
>> can't divide exactly into 8.
>>
>> Signed-off-by: Wang YanQing 
>> ---
>>  Hi all!
>>  After mainline accept this patch, then we need to submit a sync patch
>>  to update the tools/include/uapi/linux/bpf_perf_event.h.
>>
>>  Thanks.
>>
>>  include/uapi/linux/bpf_perf_event.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/bpf_perf_event.h 
>> b/include/uapi/linux/bpf_perf_event.h
>> index eb1b9d2..ff4c092 100644
>> --- a/include/uapi/linux/bpf_perf_event.h
>> +++ b/include/uapi/linux/bpf_perf_event.h
>> @@ -12,7 +12,7 @@
>>  
>>  struct bpf_perf_event_data {
>>  bpf_user_pt_regs_t regs;
>> -__u64 sample_period;
>> +__u64 sample_period __attribute__((aligned(8)));
> 
> I don't think this necessary.
> imo it's a bug in pe_prog_is_valid_access
> that should have allowed 8-byte access to 4-byte aligned sample_period.
> The access rewritten by pe_prog_convert_ctx_access anyway,
> no alignment issues as far as I can see.

Right, good point. Wang, could you give the below a test run:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 56ba0f2..95b9142 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -833,8 +833,14 @@ static bool pe_prog_is_valid_access(int off, int size, 
enum bpf_access_type type
return false;
if (type != BPF_READ)
return false;
-   if (off % size != 0)
-   return false;
+   if (off % size != 0) {
+   if (sizeof(long) != 4)
+   return false;
+   if (size != 8)
+   return false;
+   if (off % size != 4)
+   return false;
+   }

switch (off) {
case bpf_ctx_range(struct bpf_perf_event_data, sample_period):


[bpf-next PATCH v2 2/3] bpf: sockmap, add hash map support

2018-04-27 Thread John Fastabend
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.
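
For illustration only (not something this patch adds): a sockops program
could then key such a map with a 5-tuple-style struct. This assumes a
bpf_sock_hash_update() declaration in bpf_helpers.h mirroring the existing
bpf_sock_map_update() one, apart from the void * key:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct sock_key {
	__u32 family;
	__u32 sip4;
	__u32 dip4;
	__u32 sport;
	__u32 dport;
};

struct bpf_map_def SEC("maps") sock_hash = {
	.type		= BPF_MAP_TYPE_SOCKHASH,
	.key_size	= sizeof(struct sock_key),
	.value_size	= sizeof(int),
	.max_entries	= 1024,
};

SEC("sockops")
int bpf_add_to_sockhash(struct bpf_sock_ops *skops)
{
	/* Byte order of the fields only needs to be consistent between the
	 * program doing the updates and the one doing the lookups.
	 */
	struct sock_key key = {
		.family	= skops->family,
		.sip4	= skops->local_ip4,
		.dip4	= skops->remote_ip4,
		.sport	= skops->local_port,
		.dport	= skops->remote_port,
	};

	if (skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
	    skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
		bpf_sock_hash_update(skops, &sock_hash, &key, BPF_ANY);

	return 0;
}

char _license[] SEC("license") = "GPL";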

Signed-off-by: John Fastabend 
---
 include/linux/bpf.h|8 +
 include/linux/bpf_types.h  |1 
 include/uapi/linux/bpf.h   |6 
 kernel/bpf/core.c  |1 
 kernel/bpf/sockmap.c   |  494 +++-
 kernel/bpf/verifier.c  |   14 +
 net/core/filter.c  |   58 +
 tools/bpf/bpftool/map.c|1 
 tools/include/uapi/linux/bpf.h |6 
 9 files changed, 570 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..add768a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -661,6 +661,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map 
*map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && 
defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 
key)
@@ -668,6 +669,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct 
bpf_map *map, u32 key)
return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+   void *key)
+{
+   return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
struct bpf_prog *prog,
u32 type)
@@ -693,6 +700,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf..3101118 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..5cb983d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1835,7 +1836,10 @@ struct bpf_stack_build_id {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(sock_hash_update),   \
+   FN(msg_redirect_hash),  \
+   FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..5917cc1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1782,6 +1782,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8bda881..08eb3a5 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -60,6 +60,28 @@ struct bpf_stab {
struct bpf_sock_progs progs;
 };
 
+struct bucket {
+   struct hlist_head head;
+   raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+   struct bpf_map map;
+   struct bucket *buckets;
+   atomic_t count;
+   u32 n_buckets;
+   u32 elem_size;
+   struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+   struct rcu_head rcu;
+   struct hlist_node hash_node;
+   u32 hash;
+   struct sock *sk;
+   char key[0];
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -67,6 +89,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
struct list_head list;

[bpf-next PATCH v2 1/3] bpf: sockmap, refactor sockmap routines to work with hashmap

2018-04-27 Thread John Fastabend
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.

Most of the fallout comes from three changes,

  - Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
u32.
  - Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.

Signed-off-by: John Fastabend 
---
 include/linux/filter.h |3 -
 include/net/tcp.h  |3 -
 kernel/bpf/sockmap.c   |  148 +---
 net/core/filter.c  |   31 +++---
 4 files changed, 98 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..31cdfe8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -512,9 +512,8 @@ struct sk_msg_buff {
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
struct sk_buff *skb;
struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
} header;   /* For incoming skbs */
struct {
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
void *data_end;
} bpf;
};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..8bda881 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-   struct bpf_map map;
-   struct sock **sock_map;
+struct bpf_sock_progs {
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+   struct bpf_map map;
+   struct sock **sock_map;
+   struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -456,7 +460,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff 
*md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
return ((_rc == SK_PASS) ?
-  (md->map ? __SK_REDIRECT : __SK_PASS) :
+  (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
   __SK_DROP);
 }
 
@@ -1088,7 +1092,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 * when we orphan the skb so that we don't have the possibility
 * to reference a stale map.
 */
-   TCP_SKB_CB(skb)->bpf.map = NULL;
+   TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
skb->sk = psock->sock;
bpf_compute_data_pointers(skb);
preempt_disable();
@@ -1098,7 +1102,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 
/* Moving return codes from UAPI namespace into internal namespace */
return rc == SK_PASS ?
-   (TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+   (TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP;
 }
 
@@ -1368,7 +1372,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-   struct bpf_stab *stab,
struct bpf_prog *verdict,
struct bpf_prog *parse)
 {
@@ -1446,14 +1449,13 @@ static void smap_gc_work(struct work_struct *w)
kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
- struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
struct smap_psock *psock;
 
psock = kzalloc_node(sizeof(struct smap_psock),
 GFP_ATOMIC | __GFP_NOWARN,
-stab->map.numa_node);
+node);
if (!psock)
return ERR_PTR(-ENOMEM);
 
@@ -1658,40 +1660,26 @@ static int sock_map_delete_elem(struct bpf_map *map, 
void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use 

Re: [PATCH bpf-next v7 04/10] bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals

2018-04-27 Thread Alexei Starovoitov
On Wed, Apr 25, 2018 at 12:29:04PM -0700, Yonghong Song wrote:
> In verifier function adjust_scalar_min_max_vals,
> when src_known is false and the opcode is BPF_LSH/BPF_RSH,
> early return will happen in the function. So remove
> the branch in handling BPF_LSH/BPF_RSH when src_known is false.
> 
> Signed-off-by: Yonghong Song 

Acked-by: Alexei Starovoitov 



[bpf-next PATCH v2 3/3] bpf: selftest additions for SOCKHASH

2018-04-27 Thread John Fastabend
This runs existing SOCKMAP tests with SOCKHASH map type. To do this
we push programs into include file and build two BPF programs. One
for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile |3 
 tools/testing/selftests/bpf/test_sockhash_kern.c |4 
 tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
 tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 --
 tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 ++
 5 files changed, 368 insertions(+), 346 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index b64a7a3..03f9bf3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
+   test_sockmap_kern.o test_sockhash_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c 
b/tools/testing/selftests/bpf/test_sockhash_kern.c
new file mode 100644
index 000..3bf4ad4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
+#include "./test_sockmap_kern.h"
diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 29c022d..df7afc7 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -47,7 +47,8 @@
 #define S1_PORT 1
 #define S2_PORT 10001
 
-#define BPF_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKHASH_FILENAME "test_sockhash_kern.o"
 #define CG_PATH "/sockmap"
 
 /* global sockets */
@@ -1260,9 +1261,8 @@ static int test_start_end(int cgrp)
BPF_PROG_TYPE_SK_MSG,
 };
 
-static int populate_progs(void)
+static int populate_progs(char *bpf_file)
 {
-   char *bpf_file = BPF_FILENAME;
struct bpf_program *prog;
struct bpf_object *obj;
int i = 0;
@@ -1306,11 +1306,11 @@ static int populate_progs(void)
return 0;
 }
 
-static int test_suite(void)
+static int __test_suite(char *bpf_file)
 {
int cg_fd, err;
 
-   err = populate_progs();
+   err = populate_progs(bpf_file);
if (err < 0) {
fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
return err;
@@ -1347,17 +1347,30 @@ static int test_suite(void)
 
 out:
printf("Summary: %i PASSED %i FAILED\n", passed, failed);
+   cleanup_cgroup_environment();
close(cg_fd);
return err;
 }
 
+static int test_suite(void)
+{
+   int err;
+
+   err = __test_suite(BPF_SOCKMAP_FILENAME);
+   if (err)
+   goto out;
+   err = __test_suite(BPF_SOCKHASH_FILENAME);
+out:
+   return err;
+}
+
 int main(int argc, char **argv)
 {
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int iov_count = 1, length = 1024, rate = 1;
struct sockmap_options options = {0};
int opt, longindex, err, cg_fd = 0;
-   char *bpf_file = BPF_FILENAME;
+   char *bpf_file = BPF_SOCKMAP_FILENAME;
int test = PING_PONG;
 
if (setrlimit(RLIMIT_MEMLOCK, &r)) {
@@ -1438,7 +1451,7 @@ int main(int argc, char **argv)
return -1;
}
 
-   err = populate_progs();
+   err = populate_progs(bpf_file);
if (err) {
fprintf(stderr, "populate program: (%s) %s\n",
bpf_file, strerror(errno));
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.c 
b/tools/testing/selftests/bpf/test_sockmap_kern.c
index 33de97e..31ef954 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.c
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.c
@@ -1,340 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "bpf_helpers.h"
-#include "bpf_endian.h"
-
-/* Sockmap sample program connects a client and a backend together
- * using 

[bpf-next PATCH v2 0/3] Hash support for sock

2018-04-27 Thread John Fastabend
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.

This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

v2: fix file stats and add v2 tag

---

John Fastabend (3):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: selftest additions for SOCKHASH


 include/linux/bpf.h  |8 
 include/linux/bpf_types.h|1 
 include/linux/filter.h   |3 
 include/net/tcp.h|3 
 include/uapi/linux/bpf.h |6 
 kernel/bpf/core.c|1 
 kernel/bpf/sockmap.c |  638 +++---
 kernel/bpf/verifier.c|   14 
 net/core/filter.c|   89 ++-
 tools/bpf/bpftool/map.c  |1 
 tools/include/uapi/linux/bpf.h   |6 
 tools/testing/selftests/bpf/Makefile |3 
 tools/testing/selftests/bpf/test_sockhash_kern.c |4 
 tools/testing/selftests/bpf/test_sockmap.c   |   27 +
 tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
 tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
 16 files changed, 1034 insertions(+), 450 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

--
Signature


Re: [PATCH bpf-next v7 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper

2018-04-27 Thread Alexei Starovoitov
On Wed, Apr 25, 2018 at 12:29:03PM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code blow,
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0 || usize > max_len)
> return 0;
> The verifier may have the following errors:
> 52: (85) call bpf_get_stack#65
>  R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>  R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>  R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>  R9_w=inv800 R10=fp0,call_-1
> 53: (bf) r8 = r0
> 54: (bf) r1 = r8
> 55: (67) r1 <<= 32
> 56: (bf) r2 = r1
> 57: (77) r2 >>= 32
> 58: (25) if r2 > 0x31f goto pc+33
>  R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
>  umax_value=18446744069414584320,
>  var_off=(0x0; 0x))
>  R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
>  R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>  R8=inv(id=0) R9=inv800 R10=fp0,call_-1
> 59: (1f) r9 -= r8
> 60: (c7) r1 s>>= 32
> 61: (bf) r2 = r7
> 62: (0f) r2 += r1
> math between map_value pointer and register with unbounded
> min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
> 
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
> 52: (85) call bpf_get_stack#65
>  R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>  R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>  R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>  R9_w=inv800 R10=fp0,call_-1
> 53: (b7) r1 = 0
> 54: (bf) r8 = r0
> 55: (67) r8 <<= 32
> 56: (c7) r8 s>>= 32
> 57: (6d) if r1 s> r8 goto pc+24
>  R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
>  R1=inv0 R6=ctx(id=0,off=0,imm=0)
>  R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>  R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
>  R10=fp0,call_-1
> 58: (bf) r2 = r7
> 59: (0f) r2 += r8
> 60: (1f) r9 -= r8
> 61: (bf) r1 = r6
> 
> Signed-off-by: Yonghong Song 

Acked-by: Alexei Starovoitov 



[bpf-next PATCH 0/3] Hash support for sock

2018-04-27 Thread John Fastabend
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets need to be referenced using longer keys.

This series adds support for a sockhash map type which reuses almost
all the sockmap code except it needed a few special add/remove
handlers.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

---

John Fastabend (3):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: selftest additions for SOCKHASH


 include/linux/bpf.h  |8 
 include/linux/bpf_types.h|1 
 include/linux/filter.h   |3 
 include/net/tcp.h|3 
 include/uapi/linux/bpf.h |6 
 kernel/bpf/core.c|1 
 kernel/bpf/sockmap.c |  638 +++---
 kernel/bpf/verifier.c|   14 
 net/core/filter.c|   89 ++-
 tools/bpf/bpftool/map.c  |1 
 tools/include/uapi/linux/bpf.h   |6 
 tools/testing/selftests/bpf/Makefile |3 
 tools/testing/selftests/bpf/test_sockhash_kern.c |4 
 tools/testing/selftests/bpf/test_sockmap.c   |   27 +
 tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
 tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
 16 files changed, 1034 insertions(+), 450 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

--
Signature


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-27 Thread Mikulas Patocka


On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:

> 2. Ability to control this from a separate config
>option.
> 
>It's still not that clear to me why is this such a
>hard requirement.  If a distro wants to force specific
>boot time options, why isn't CONFIG_CMDLINE sufficient?

So, try this and turn it on with CONFIG_CMDLINE. But I'm not a 
blogger and I will not write a blog post about it as James Bottomley 
suggests :-)
- so very few users will use it. 


fault-injection: introduce kvmalloc fallback options

This patch introduces a fault-injection option "kvmalloc_fallback". This
option makes kvmalloc randomly fall back to vmalloc.

Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code. This option helps to test for these bugs.
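
To make the bug class concrete, an illustrative snippet (not taken from any
of the drivers mentioned above): memory obtained from kvmalloc must be
released with kvfree and must not be handed to the DMA-API, because once the
fallback triggers it may be vmalloc memory:

#include <linux/mm.h>
#include <linux/slab.h>

static void *buf;

static int example_alloc(size_t size)
{
	buf = kvmalloc(size, GFP_KERNEL);	/* may be kmalloc or vmalloc memory */
	return buf ? 0 : -ENOMEM;
}

static void example_free(void)
{
	/* Calling kfree(buf) here is the bug this option exposes: with the
	 * fallback forced, buf can be vmalloc memory and kfree() corrupts it.
	 */
	kvfree(buf);
}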

Signed-off-by: Mikulas Patocka 

---
 Documentation/fault-injection/fault-injection.txt |7 +
 mm/util.c |   30 ++
 2 files changed, 37 insertions(+)

Index: linux-2.6/Documentation/fault-injection/fault-injection.txt
===
--- linux-2.6.orig/Documentation/fault-injection/fault-injection.txt
2018-04-28 01:01:25.0 +0200
+++ linux-2.6/Documentation/fault-injection/fault-injection.txt 2018-04-28 
01:01:25.0 +0200
@@ -15,6 +15,12 @@ o fail_page_alloc
 
   injects page allocation failures. (alloc_pages(), get_free_pages(), ...)
 
+o kvmalloc_fallback
+
+  makes the function kvmalloc randomly fall back to vmalloc. This could be used
+  to detect bugs such as using DMA-API on the result of kvmalloc or freeing
+  the result of kvmalloc with free.
+
 o fail_futex
 
   injects futex deadlock and uaddr fault errors.
@@ -167,6 +173,7 @@ use the boot option:
 
failslab=
fail_page_alloc=
+   kvmalloc_fallback=
fail_make_request=
fail_futex=
mmc_core.fail_request=,,,
Index: linux-2.6/mm/util.c
===
--- linux-2.6.orig/mm/util.c2018-04-28 01:01:25.0 +0200
+++ linux-2.6/mm/util.c 2018-04-28 01:03:25.0 +0200
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/fault-inject.h>
 
 #include 
 #include 
@@ -377,6 +378,29 @@ unsigned long vm_mmap(struct file *file,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+#ifdef CONFIG_FAULT_INJECTION
+
+static DECLARE_FAULT_ATTR(kvmalloc_fallback);
+
+static int __init setup_kvmalloc_fallback(char *str)
+{
+   kvmalloc_fallback.verbose = 0;
+   return setup_fault_attr(&kvmalloc_fallback, str);
+}
+
+__setup("kvmalloc_fallback=", setup_kvmalloc_fallback);
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+static int __init kvmalloc_fallback_debugfs_init(void)
+{
+   fault_create_debugfs_attr("kvmalloc_fallback", NULL, &kvmalloc_fallback);
+   return 0;
+}
+late_initcall(kvmalloc_fallback_debugfs_init);
+#endif
+
+#endif
+
 /**
  * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
  * failure, fall back to non-contiguous (vmalloc) allocation.
@@ -404,6 +428,11 @@ void *kvmalloc_node(size_t size, gfp_t f
 */
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_FAULT_INJECTION
+   if (should_fail(&kvmalloc_fallback, size))
+   goto do_vmalloc;
+#endif
+
/*
 * We want to attempt a large physically contiguous block first because
 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +456,7 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
 
+do_vmalloc: __maybe_unused
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
 }


Re: [bpf-next PATCH 0/3] Hash support for sock

2018-04-27 Thread John Fastabend
On 04/27/2018 03:54 PM, Alexei Starovoitov wrote:
> On Fri, Apr 27, 2018 at 10:51 PM, John Fastabend
>  wrote:
>> In the original sockmap implementation we got away with using an
>> array similar to devmap. However, unlike devmap where an ifindex
>> has a nice 1:1 function into the map we have found some use cases
>> with sockets need to be referenced using longer keys.
>>
>> This series adds support for a sockhash map type which reuses almost
>> all the sockmap code except it needed a few special add/remove
>> handlers.
>>
>> To test this we duplicate all the sockmap testing except swap out
>> the sockmap with a sockhash.
>>
>> ---
>>
>> John Fastabend (3):
>>   bpf: sockmap, refactor sockmap routines to work with hashmap
>>   bpf: sockmap, add hash map support
>>   bpf: selftest additions for SOCKHASH
>>
>>
>>  tools/bpf/bpftool/map.c  |1
>>  tools/include/uapi/linux/bpf.h   |6
>>  tools/testing/selftests/bpf/Makefile |3
>>  tools/testing/selftests/bpf/test_sockhash_kern.c |4
>>  tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
>>  tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
>> --
>>  tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
>> ++
>>  7 files changed, 374 insertions(+), 347 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
>>  create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h
> 
> something wrong here.
> patch 1 changes include/linux/filter.h
> but it's not included in the above.
> Please fix and resubmit
> 

Strange, I had to create a new branch to fix it. Anyway, thanks,
and v2 is coming with the correct stats.

.John


Re: [bpf-next PATCH 0/3] Hash support for sock

2018-04-27 Thread Alexei Starovoitov
On Fri, Apr 27, 2018 at 10:51 PM, John Fastabend
 wrote:
> In the original sockmap implementation we got away with using an
> array similar to devmap. However, unlike devmap where an ifindex
> has a nice 1:1 function into the map we have found some use cases
> with sockets need to be referenced using longer keys.
>
> This series adds support for a sockhash map type which reuses almost
> all the sockmap code except it needed a few special add/remove
> handlers.
>
> To test this we duplicate all the sockmap testing except swap out
> the sockmap with a sockhash.
>
> ---
>
> John Fastabend (3):
>   bpf: sockmap, refactor sockmap routines to work with hashmap
>   bpf: sockmap, add hash map support
>   bpf: selftest additions for SOCKHASH
>
>
>  tools/bpf/bpftool/map.c  |1
>  tools/include/uapi/linux/bpf.h   |6
>  tools/testing/selftests/bpf/Makefile |3
>  tools/testing/selftests/bpf/test_sockhash_kern.c |4
>  tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
>  tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 
> --
>  tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 
> ++
>  7 files changed, 374 insertions(+), 347 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

something wrong here.
patch 1 changes include/linux/filter.h
but it's not included in the above.
Please fix and resubmit


[bpf-next PATCH 0/3] Hash support for sock

2018-04-27 Thread John Fastabend
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap, where an ifindex
maps nicely 1:1 into the map, we have found some use cases where
sockets need to be referenced using longer keys.

This series adds support for a sockhash map type which reuses almost
all the sockmap code except it needed a few special add/remove
handlers.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

---

John Fastabend (3):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: selftest additions for SOCKHASH


 tools/bpf/bpftool/map.c  |1 
 tools/include/uapi/linux/bpf.h   |6 
 tools/testing/selftests/bpf/Makefile |3 
 tools/testing/selftests/bpf/test_sockhash_kern.c |4 
 tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
 tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 --
 tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 ++
 7 files changed, 374 insertions(+), 347 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

--
Signature


[bpf-next PATCH 3/3] bpf: selftest additions for SOCKHASH

2018-04-27 Thread John Fastabend
This runs the existing SOCKMAP tests with the SOCKHASH map type. To do
this we push the programs into an include file and build two BPF
programs: one for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile |3 
 tools/testing/selftests/bpf/test_sockhash_kern.c |4 
 tools/testing/selftests/bpf/test_sockmap.c   |   27 +-
 tools/testing/selftests/bpf/test_sockmap_kern.c  |  340 --
 tools/testing/selftests/bpf/test_sockmap_kern.h  |  340 ++
 5 files changed, 368 insertions(+), 346 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index b64a7a3..03f9bf3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o
+   test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
+   test_sockmap_kern.o test_sockhash_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c 
b/tools/testing/selftests/bpf/test_sockhash_kern.c
new file mode 100644
index 000..3bf4ad4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
+#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
+#include "./test_sockmap_kern.h"
diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 29c022d..df7afc7 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -47,7 +47,8 @@
 #define S1_PORT 1
 #define S2_PORT 10001
 
-#define BPF_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
+#define BPF_SOCKHASH_FILENAME "test_sockhash_kern.o"
 #define CG_PATH "/sockmap"
 
 /* global sockets */
@@ -1260,9 +1261,8 @@ static int test_start_end(int cgrp)
BPF_PROG_TYPE_SK_MSG,
 };
 
-static int populate_progs(void)
+static int populate_progs(char *bpf_file)
 {
-   char *bpf_file = BPF_FILENAME;
struct bpf_program *prog;
struct bpf_object *obj;
int i = 0;
@@ -1306,11 +1306,11 @@ static int populate_progs(void)
return 0;
 }
 
-static int test_suite(void)
+static int __test_suite(char *bpf_file)
 {
int cg_fd, err;
 
-   err = populate_progs();
+   err = populate_progs(bpf_file);
if (err < 0) {
fprintf(stderr, "ERROR: (%i) load bpf failed\n", err);
return err;
@@ -1347,17 +1347,30 @@ static int test_suite(void)
 
 out:
printf("Summary: %i PASSED %i FAILED\n", passed, failed);
+   cleanup_cgroup_environment();
close(cg_fd);
return err;
 }
 
+static int test_suite(void)
+{
+   int err;
+
+   err = __test_suite(BPF_SOCKMAP_FILENAME);
+   if (err)
+   goto out;
+   err = __test_suite(BPF_SOCKHASH_FILENAME);
+out:
+   return err;
+}
+
 int main(int argc, char **argv)
 {
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int iov_count = 1, length = 1024, rate = 1;
struct sockmap_options options = {0};
int opt, longindex, err, cg_fd = 0;
-   char *bpf_file = BPF_FILENAME;
+   char *bpf_file = BPF_SOCKMAP_FILENAME;
int test = PING_PONG;
 
	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
@@ -1438,7 +1451,7 @@ int main(int argc, char **argv)
return -1;
}
 
-   err = populate_progs();
+   err = populate_progs(bpf_file);
if (err) {
fprintf(stderr, "populate program: (%s) %s\n",
bpf_file, strerror(errno));
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.c 
b/tools/testing/selftests/bpf/test_sockmap_kern.c
index 33de97e..31ef954 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.c
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.c
@@ -1,340 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "bpf_helpers.h"
-#include "bpf_endian.h"
-
-/* Sockmap sample program connects a client and a backend together
- * using 

[bpf-next PATCH 2/3] bpf: sockmap, add hash map support

2018-04-27 Thread John Fastabend
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap, which also uses four-byte keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example, users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/map.c|1 +
 tools/include/uapi/linux/bpf.h |6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc6..add768a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -661,6 +661,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map 
*map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && 
defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 
key)
@@ -668,6 +669,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct 
bpf_map *map, u32 key)
return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+   void *key)
+{
+   return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
struct bpf_prog *prog,
u32 type)
@@ -693,6 +700,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf..3101118 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a93..5cb983d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1835,7 +1836,10 @@ struct bpf_stack_build_id {
FN(msg_pull_data),  \
FN(bind),   \
FN(xdp_adjust_tail),\
-   FN(skb_get_xfrm_state),
+   FN(skb_get_xfrm_state), \
+   FN(sock_hash_update),   \
+   FN(msg_redirect_hash),  \
+   FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..5917cc1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1782,6 +1782,7 @@ void bpf_user_rnd_init_once(void)
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
+const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8bda881..08eb3a5 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -60,6 +60,28 @@ struct bpf_stab {
struct bpf_sock_progs progs;
 };
 
+struct bucket {
+   struct hlist_head head;
+   raw_spinlock_t lock;
+};
+
+struct bpf_htab {
+   struct bpf_map map;
+   struct bucket *buckets;
+   atomic_t count;
+   u32 n_buckets;
+   u32 elem_size;
+   struct bpf_sock_progs progs;
+};
+
+struct htab_elem {
+   struct rcu_head rcu;
+   struct hlist_node hash_node;
+   u32 hash;
+   struct sock *sk;
+   char key[0];
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -67,6 +89,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
struct list_head list;
struct sock **entry;
+   struct htab_elem *hash_link;
+   struct bpf_htab *htab;
 };
 
 struct smap_psock {
@@ -195,6 +219,12 @@ static void bpf_tcp_release(struct sock *sk)
rcu_read_unlock();
 }
 
+static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+   atomic_dec(&htab->count);
+  

[bpf-next PATCH 1/3] bpf: sockmap, refactor sockmap routines to work with hashmap

2018-04-27 Thread John Fastabend
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper code to
work for both sockmap bpf map types: the array-backed type that is
currently supported, and the new hash-backed bpf map type
sockhash.

Most of the fallout comes from three changes:

  - Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
u32.
  - Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..31cdfe8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -512,9 +512,8 @@ struct sk_msg_buff {
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
struct sk_buff *skb;
struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
} header;   /* For incoming skbs */
struct {
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
void *data_end;
} bpf;
};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..8bda881 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-   struct bpf_map map;
-   struct sock **sock_map;
+struct bpf_sock_progs {
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+   struct bpf_map map;
+   struct sock **sock_map;
+   struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -456,7 +460,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff 
*md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
return ((_rc == SK_PASS) ?
-  (md->map ? __SK_REDIRECT : __SK_PASS) :
+  (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
   __SK_DROP);
 }
 
@@ -1088,7 +1092,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 * when we orphan the skb so that we don't have the possibility
 * to reference a stale map.
 */
-   TCP_SKB_CB(skb)->bpf.map = NULL;
+   TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
skb->sk = psock->sock;
bpf_compute_data_pointers(skb);
preempt_disable();
@@ -1098,7 +1102,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 
/* Moving return codes from UAPI namespace into internal namespace */
return rc == SK_PASS ?
-   (TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+   (TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP;
 }
 
@@ -1368,7 +1372,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-   struct bpf_stab *stab,
struct bpf_prog *verdict,
struct bpf_prog *parse)
 {
@@ -1446,14 +1449,13 @@ static void smap_gc_work(struct work_struct *w)
kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
- struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
struct smap_psock *psock;
 
psock = kzalloc_node(sizeof(struct smap_psock),
 GFP_ATOMIC | __GFP_NOWARN,
-stab->map.numa_node);
+node);
if (!psock)
return ERR_PTR(-ENOMEM);
 
@@ -1658,40 +1660,26 @@ static int sock_map_delete_elem(struct bpf_map *map, 
void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use READ_ONCE and xchg operations
  */
-static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
-   struct bpf_map *map,
-   void *key, u64 flags)
+

Re: [PATCH] bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform

2018-04-27 Thread Alexei Starovoitov
On Thu, Apr 26, 2018 at 05:57:49PM +0800, Wang YanQing wrote:
> All the testcases for BPF_PROG_TYPE_PERF_EVENT program type in
> test_verifier(kselftest) report below errors on x86_32:
> "
> 172/p unpriv: spill/fill of different pointers ldx FAIL
> Unexpected error message!
> 0: (bf) r6 = r10
> 1: (07) r6 += -8
> 2: (15) if r1 == 0x0 goto pc+3
> R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
> 3: (bf) r2 = r10
> 4: (07) r2 += -76
> 5: (7b) *(u64 *)(r6 +0) = r2
> 6: (55) if r1 != 0x0 goto pc+1
> R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 
> fp-8=fp
> 7: (7b) *(u64 *)(r6 +0) = r1
> 8: (79) r1 = *(u64 *)(r6 +0)
> 9: (79) r1 = *(u64 *)(r1 +68)
> invalid bpf_context access off=68 size=8
> 
> 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
> Failed to load prog 'Permission denied'!
> 0: (b7) r0 = 0
> 1: (71) r0 = *(u8 *)(r1 +68)
> invalid bpf_context access off=68 size=1
> 
> 379/p check bpf_perf_event_data->sample_period half load permitted FAIL
> Failed to load prog 'Permission denied'!
> 0: (b7) r0 = 0
> 1: (69) r0 = *(u16 *)(r1 +68)
> invalid bpf_context access off=68 size=2
> 
> 380/p check bpf_perf_event_data->sample_period word load permitted FAIL
> Failed to load prog 'Permission denied'!
> 0: (b7) r0 = 0
> 1: (61) r0 = *(u32 *)(r1 +68)
> invalid bpf_context access off=68 size=4
> 
> 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
> Failed to load prog 'Permission denied'!
> 0: (b7) r0 = 0
> 1: (79) r0 = *(u64 *)(r1 +68)
> invalid bpf_context access off=68 size=8
> "
> 
> This patch fix it, the fix isn't only necessary for x86_32, it will fix the
> same problem for other platforms too, if their size of bpf_user_pt_regs_t
> can't divide exactly into 8.
> 
> Signed-off-by: Wang YanQing 
> ---
>  Hi all!
>  After mainline accept this patch, then we need to submit a sync patch
>  to update the tools/include/uapi/linux/bpf_perf_event.h.
> 
>  Thanks.
> 
>  include/uapi/linux/bpf_perf_event.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf_perf_event.h 
> b/include/uapi/linux/bpf_perf_event.h
> index eb1b9d2..ff4c092 100644
> --- a/include/uapi/linux/bpf_perf_event.h
> +++ b/include/uapi/linux/bpf_perf_event.h
> @@ -12,7 +12,7 @@
>  
>  struct bpf_perf_event_data {
>   bpf_user_pt_regs_t regs;
> - __u64 sample_period;
> + __u64 sample_period __attribute__((aligned(8)));

I don't think this necessary.
imo it's a bug in pe_prog_is_valid_access
that should have allowed 8-byte access to 4-byte aligned sample_period.
The access rewritten by pe_prog_convert_ctx_access anyway,
no alignment issues as far as I can see.



arch/x86/net/bpf_jit_comp conflicts. was: [tip:x86/cleanups] x86/bpf: Clean up non-standard comments, to make the code more readable

2018-04-27 Thread Alexei Starovoitov

On 4/27/18 5:13 AM, Daniel Borkmann wrote:

Hi Ingo,

On 04/27/2018 01:00 PM, tip-bot for Ingo Molnar wrote:

Commit-ID:  5f26c50143f58f256535bee8d93a105f36d4d2da
Gitweb: https://git.kernel.org/tip/5f26c50143f58f256535bee8d93a105f36d4d2da
Author: Ingo Molnar 
AuthorDate: Fri, 27 Apr 2018 11:54:40 +0200
Committer:  Ingo Molnar 
CommitDate: Fri, 27 Apr 2018 12:42:04 +0200

x86/bpf: Clean up non-standard comments, to make the code more readable

So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
noticed the weird and inconsistent comment style it mistakenly learned from
the networking code:

 /* Multi-line comment ...
  * ... looks like this.
  */

Fix this to use the standard comment style specified in 
Documentation/CodingStyle
and used in arch/x86/ as well:

 /*
  * Multi-line comment ...
  * ... looks like this.
  */

Also, to quote Linus's ... more explicit views about this:


  > But no, the networking code picked *none* of the above sane formats.
  > Instead, it picked these two models that are just half-arsed
  > shit-for-brains:
  >
  >  (no)
  >  /* This is disgusting drug-induced
  >* crap, and should die
  >*/
  >
  >   (no-no-no)
  >   /* This is also very nasty
  >* and visually unbalanced */
  >
  > Please. The networking code actually has the *worst* possible comment
  > style. You can literally find that (no-no-no) style, which is just
  > really horribly disgusting and worse than the otherwise fairly similar
  > (d) in pretty much every way.

Also improve the comments and some other details while at it:

 - Don't mix same-line and previous-line comment style on otherwise
   identical code patterns within the same function,

 - capitalize 'BPF' and x86 register names consistently,

 - capitalize sentences consistently,

 - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,

 - use more consistent punctuation,

 - use standard coding style in macros as well,

 - fix typos and a few other minor details.

Consistent coding style is not optional, at least in arch/x86/.

No change in functionality.


Thanks for the cleanup, looks fine to me!


same here. thanks for the cleanup!


( In case this commit causes conflicts with pending development code
  I'll be glad to help resolve any conflicts! )


Any objections if we would simply route this via bpf-next tree, otherwise
this will indeed cause really ugly merge conflicts throughout the JIT with
pending work.


right. would be much better to route this patch via bpf-next.
Though all the changes are cleanups in comments I'm pretty sure
they will conflict with other changes we're doing.

Ingo,
could you please drop this patch from tip tree and resend it to us?
I cannot find the original patch in any public mailing list.
Only in tip-bot notification.

Personally I don't care whether bpf jit code uses networking
or non-networking style of comments, but will be happy to enforce
non-networking for this file in the future, since that seems to be the
preference.

Thanks



Greetings

2018-04-27 Thread Miss.Zeliha Omer Faruk



Hello Dear

Greetings to you, please I have a very important business proposal for our
mutual benefit, please let me know if you are interested i have asked you
before.

Best Regards,
Miss. Zeliha ömer Faruk
Caddesi Kristal Kule Binasi
No:215



Re: [PATCH 2/2] bpf: btf: remove a couple conditions

2018-04-27 Thread Martin KaFai Lau
On Fri, Apr 27, 2018 at 11:31:36PM +0300, Dan Carpenter wrote:
> On Fri, Apr 27, 2018 at 10:21:17PM +0200, Daniel Borkmann wrote:
> > On 04/27/2018 09:39 PM, Dan Carpenter wrote:
> > > On Fri, Apr 27, 2018 at 10:55:46AM -0700, Martin KaFai Lau wrote:
> > >> On Fri, Apr 27, 2018 at 10:20:25AM -0700, Martin KaFai Lau wrote:
> > >>> On Fri, Apr 27, 2018 at 05:04:59PM +0300, Dan Carpenter wrote:
> >  We know "err" is zero so we can remove these and pull the code in one
> >  indent level.
> > 
> >  Signed-off-by: Dan Carpenter 
> > >>> Thanks for the simplification!
> > >>>
> > >>> Acked-by: Martin KaFai Lau 
> > >> btw, it should be for bpf-next.  Please tag the subject with bpf-next 
> > >> when
> > >> you respin. Thanks!
> > 
> > Dan, thanks a lot for your fixes! Please respin with addressing Martin's
> > feedback when you get a chance.
> > 
> 
> My understanding is that he'd prefer we just ignore the static checker
> warning since it's a false positive.
Right, I think patch 1 is not needed.  I would prefer to use a comment
in those cases.

> Should I instead initialize the
> size to zero or something just to silence it?
> 
> regards,
> dan carpenter
> 


IP_PKTINFO broken by acf568ee859f0 (xfrm: Reinject transport-mode packets through tasklet)

2018-04-27 Thread Maxime Bizon

Hello Herbert,

That patch just went into stable 4.14 and is causing a regression on my
setup.

Basically, IP_PKTINFO does not work anymore on transport-mode packets,
because skb->cb is now used to store the finish callback.

Was that expected or is it an unforeseen side effect ?

Thanks,

-- 
Maxime




[PATCH net-next] erspan: auto detect truncated packets.

2018-04-27 Thread William Tu
Currently the truncated bit is set only when the mirrored packet
is larger than mtu.  For certain cases, the packet might already
have been truncated before being sent to the erspan tunnel.  In this
case, the patch detects whether the IP header's total length is larger
than the actual skb->len.  If true, this indicates that the
mirrored packet is truncated, and the erspan truncate bit is set.

I tested the patch using bpf_skb_change_tail helper function to
shrink the packet size and send to erspan tunnel.
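
For anyone wanting to reproduce that test setup, a minimal sketch of a
tc BPF program along those lines is below; the section name, the size
cap and the bpf_helpers.h declarations are assumptions, only
bpf_skb_change_tail() itself is the real helper referred to above:

/* Illustrative only: section name, cap size and helper declarations
 * are assumptions; bpf_skb_change_tail() is the real helper.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("action")
int shrink_pkt(struct __sk_buff *skb)
{
	const __u32 cap = 64;	/* arbitrary cap for the test */

	if (skb->len > cap)
		bpf_skb_change_tail(skb, cap, 0);

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";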

Reported-by: Xiaoyan Jin 
Signed-off-by: William Tu 
---
 net/ipv4/ip_gre.c  | 6 ++
 net/ipv6/ip6_gre.c | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 9c169bb2444d..dfe5b22f6ed4 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -578,6 +578,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
int tunnel_hlen;
int version;
__be16 df;
+   int nhoff;
 
tun_info = skb_tunnel_info(skb);
if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
@@ -605,6 +606,11 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct 
net_device *dev,
truncate = true;
}
 
+   nhoff = skb_network_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IP) &&
+   (ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
+   truncate = true;
+
if (version == 1) {
erspan_build_header(skb, ntohl(tunnel_id_to_key32(key->tun_id)),
ntohl(md->u.index), truncate, true);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 69727bc168cb..ac7ce85df667 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -896,6 +896,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
struct flowi6 fl6;
int err = -EINVAL;
__u32 mtu;
+   int nhoff;
 
	if (!ip6_tnl_xmit_ctl(t, &t->parms.laddr, &t->parms.raddr))
goto tx_err;
@@ -908,6 +909,11 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
truncate = true;
}
 
+   nhoff = skb_network_header(skb) - skb_mac_header(skb);
+   if (skb->protocol == htons(ETH_P_IP) &&
+   (ntohs(ip_hdr(skb)->tot_len) > skb->len - nhoff))
+   truncate = true;
+
if (skb_cow_head(skb, dev->needed_headroom))
goto tx_err;
 
-- 
2.7.4



Re: [PATCH 2/2] bpf: btf: remove a couple conditions

2018-04-27 Thread Dan Carpenter
On Fri, Apr 27, 2018 at 10:21:17PM +0200, Daniel Borkmann wrote:
> On 04/27/2018 09:39 PM, Dan Carpenter wrote:
> > On Fri, Apr 27, 2018 at 10:55:46AM -0700, Martin KaFai Lau wrote:
> >> On Fri, Apr 27, 2018 at 10:20:25AM -0700, Martin KaFai Lau wrote:
> >>> On Fri, Apr 27, 2018 at 05:04:59PM +0300, Dan Carpenter wrote:
>  We know "err" is zero so we can remove these and pull the code in one
>  indent level.
> 
>  Signed-off-by: Dan Carpenter 
> >>> Thanks for the simplification!
> >>>
> >>> Acked-by: Martin KaFai Lau 
> >> btw, it should be for bpf-next.  Please tag the subject with bpf-next when
> >> you respin. Thanks!
> 
> Dan, thanks a lot for your fixes! Please respin with addressing Martin's
> feedback when you get a chance.
> 

My understanding is that he'd prefer we just ignore the static checker
warning since it's a false positive.  Should I instead initialize the
size to zero or something just to silence it?

regards,
dan carpenter



r8169 doesn't work after boot until `transmit queue 0 timed out`

2018-04-27 Thread ojab //
Oh hai!

I've created bugzilla ticket about this, but I'm not sure if anyone
reads it, so duplicating here.

I have a new motherboard (ASUS A320-K) with a 10ec:8168 Realtek network
card installed and it doesn't work (i.e. `tcpdump` shows outgoing
packets but no packets are actually transmitted, and `tcpdump` doesn't
show incoming packets while they are transmitted from the other side);
then after 200-300 seconds I see [full stacktrace [1] and lspci [2]
output are attached to the bugzilla ticket]

[  256.996145] [ cut here ]
[  256.997574] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  256.998992] WARNING: CPU: 6 PID: 0 at dev_watchdog+0x1f2/0x200
…

[  257.012243] RIP: 0010:dev_watchdog+0x1f2/0x200
…
[  257.032044]  
[  257.033829]  ? pfifo_fast_init+0x150/0x150
[  257.035618]  call_timer_fn+0x2b/0x120
[  257.037400]  run_timer_softirq+0x2f4/0x410
[  257.039170]  ? pfifo_fast_init+0x150/0x150
[  257.040931]  ? timerqueue_add+0x52/0x80
[  257.042694]  ? __hrtimer_run_queues+0x161/0x2e0
[  257.044462]  __do_softirq+0x111/0x32c
[  257.046223]  irq_exit+0x85/0x90
[  257.047966]  smp_apic_timer_interrupt+0x6c/0x120
[  257.049720]  apic_timer_interrupt+0xf/0x20
[  257.051475]  

and everything starts working normally. How can I make it work right after boot?

The issue is reproducible in linux-4.16.5 & linux-4.17-rc2 with
rtl_nic fw from linux-firmware git master.

[1] https://bugzilla.kernel.org/attachment.cgi?id=275627
[2] https://bugzilla.kernel.org/attachment.cgi?id=275629


//wbr ojab


Re: [PATCH 2/2] bpf: btf: remove a couple conditions

2018-04-27 Thread Daniel Borkmann
On 04/27/2018 09:39 PM, Dan Carpenter wrote:
> On Fri, Apr 27, 2018 at 10:55:46AM -0700, Martin KaFai Lau wrote:
>> On Fri, Apr 27, 2018 at 10:20:25AM -0700, Martin KaFai Lau wrote:
>>> On Fri, Apr 27, 2018 at 05:04:59PM +0300, Dan Carpenter wrote:
 We know "err" is zero so we can remove these and pull the code in one
 indent level.

 Signed-off-by: Dan Carpenter 
>>> Thanks for the simplification!
>>>
>>> Acked-by: Martin KaFai Lau 
>> btw, it should be for bpf-next.  Please tag the subject with bpf-next when
>> you respin. Thanks!

Dan, thanks a lot for your fixes! Please respin with addressing Martin's
feedback when you get a chance.

Thanks,
Daniel


[PATCH net-next] net: core: Assert the size of netdev_featres_t

2018-04-27 Thread Florian Fainelli
We have about 53 netdev_features_t bits defined and counting; add a
build-time check to catch when a u64 type will not be enough and we
will have to convert that to a bitmap. This is done in
register_netdevice() for convenience.

Signed-off-by: Florian Fainelli 
---
 include/linux/netdevice.h | 6 ++
 net/core/dev.c| 1 +
 2 files changed, 7 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 366c32891158..4326bc6b27d1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4121,6 +4121,12 @@ const char *netdev_drivername(const struct net_device 
*dev);
 
 void linkwatch_run_queue(void);
 
+static inline void netdev_features_size_check(void)
+{
+   BUILD_BUG_ON(sizeof(netdev_features_t) * BITS_PER_BYTE <
+NETDEV_FEATURE_COUNT);
+}
+
 static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
  netdev_features_t f2)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 0a2d46424069..23e6c1aa78c6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7881,6 +7881,7 @@ int register_netdevice(struct net_device *dev)
int ret;
struct net *net = dev_net(dev);
 
+   netdev_features_size_check();
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
 
-- 
2.14.1



Re: [PATCH net-next v2 3/7] dt-bindings: net: add DT bindings for Microsemi Ocelot Switch

2018-04-27 Thread Rob Herring
On Thu, Apr 26, 2018 at 09:59:27PM +0200, Alexandre Belloni wrote:
> DT bindings for the Ethernet switch found on Microsemi Ocelot platforms.
> 
> Cc: Rob Herring 
> Cc: James Hogan 
> Signed-off-by: Alexandre Belloni 
> ---
>  .../devicetree/bindings/net/mscc-ocelot.txt   | 82 +++
>  1 file changed, 82 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mscc-ocelot.txt

Reviewed-by: Rob Herring 


Re: [PATCH net-next v2 1/7] dt-bindings: net: add DT bindings for Microsemi MIIM

2018-04-27 Thread Rob Herring
On Thu, Apr 26, 2018 at 09:59:25PM +0200, Alexandre Belloni wrote:
> DT bindings for the Microsemi MII Management Controller found on Microsemi
> SoCs
> 
> Cc: Rob Herring 
> Reviewed-by: Florian Fainelli 
> Signed-off-by: Alexandre Belloni 
> ---
>  .../devicetree/bindings/net/mscc-miim.txt | 26 +++
>  1 file changed, 26 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mscc-miim.txt

Reviewed-by: Rob Herring 



[PATCH iproute2-next v7] Add support for cake qdisc

2018-04-27 Thread Toke Høiland-Jørgensen
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring loss, ecn markings, and latency variation.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht 
Signed-off-by: Toke Høiland-Jørgensen 
---
Changelog:
v7:
  - Move the target/interval presets to a table and check that only
one is passed.

v6:
  - Identical to v5 because apparently I don't git so well... :/

v5:
  - Print the SPLIT_GSO flag
  - Switch to print_u64() for JSON output
  - Fix a format string for mpu option output

v4:
  - Switch stats parsing to use nested netlink attributes
  - Tweaks to JSON stats output keys

v3:
  - Remove accidentally included test flag

v2:
  - Updated netlink config ABI
  - Remove diffserv-llt mode
  - Various tweaks and clean-ups of stats output
 man/man8/tc-cake.8 | 632 ++
 man/man8/tc.8  |   1 +
 tc/Makefile|   1 +
 tc/q_cake.c| 748 +
 4 files changed, 1382 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index ..dff2e360
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "27 April 2018" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a 
variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to 
virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the 

[PATCH net] MAINTAINERS: add myself as SCTP co-maintainer

2018-04-27 Thread Marcelo Ricardo Leitner
Signed-off-by: Marcelo Ricardo Leitner 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 
92be777d060a7df333a17d69e71bfd01760fa8f2..5bac32b545607933ea41fe9a56cc96d57ee7a094
 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12507,6 +12507,7 @@ F:  drivers/scsi/st_*.h
 SCTP PROTOCOL
 M: Vlad Yasevich 
 M: Neil Horman 
+M: Marcelo Ricardo Leitner 
 L: linux-s...@vger.kernel.org
 W: http://lksctp.sourceforge.net
 S: Maintained
--
2.14.3



[PATCH net-next] net: phy: Fix modular PHYLIB build

2018-04-27 Thread Florian Fainelli
After commit c59530d0d5dc ("net: Move PHY statistics code into PHY
library helpers") we made net/core/ethtool.c reference symbols which are
part of the library which can be modular. David introduced a temporary
fix with 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
which would prevent such modularity.

This is not desirable of course, so instead, just inline the functions
into include/linux/phy.h to keep both options available.

Fixes: c59530d0d5dc ("net: Move PHY statistics code into PHY library helpers")
Fixes: 1ecd6e8ad996 ("phy: Temporary build fix after phylib changes.")
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/Kconfig |  3 ++-
 drivers/net/phy/phy.c   | 48 ---
 include/linux/phy.h | 50 +
 3 files changed, 40 insertions(+), 61 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 7c5e8c1e9370..edb8b9ab827f 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -9,6 +9,7 @@ menuconfig MDIO_DEVICE
 
 config MDIO_BUS
tristate
+   default m if PHYLIB=m
default MDIO_DEVICE
help
  This internal symbol is used for link time dependencies and it
@@ -170,7 +171,7 @@ config PHYLINK
  autonegotiation modes.
 
 menuconfig PHYLIB
-   bool "PHY Device support and infrastructure"
+   tristate "PHY Device support and infrastructure"
depends on NETDEVICES
select MDIO_DEVICE
help
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index a98ed12c0009..05c1e8ef15e6 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1277,51 +1277,3 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
return phy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
-
-int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
-{
-   if (!phydev->drv)
-   return -EIO;
-
-   mutex_lock(&phydev->lock);
-   phydev->drv->get_strings(phydev, data);
-   mutex_unlock(&phydev->lock);
-
-   return 0;
-}
-EXPORT_SYMBOL(phy_ethtool_get_strings);
-
-int phy_ethtool_get_sset_count(struct phy_device *phydev)
-{
-   int ret;
-
-   if (!phydev->drv)
-   return -EIO;
-
-   if (phydev->drv->get_sset_count &&
-   phydev->drv->get_strings &&
-   phydev->drv->get_stats) {
-   mutex_lock(&phydev->lock);
-   ret = phydev->drv->get_sset_count(phydev);
-   mutex_unlock(&phydev->lock);
-
-   return ret;
-   }
-
-   return -EOPNOTSUPP;
-}
-EXPORT_SYMBOL(phy_ethtool_get_sset_count);
-
-int phy_ethtool_get_stats(struct phy_device *phydev,
- struct ethtool_stats *stats, u64 *data)
-{
-   if (!phydev->drv)
-   return -EIO;
-
-   mutex_lock(&phydev->lock);
-   phydev->drv->get_stats(phydev, stats, data);
-   mutex_unlock(&phydev->lock);
-
-   return 0;
-}
-EXPORT_SYMBOL(phy_ethtool_get_stats);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 6ca81395c545..073235e70442 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1066,27 +1066,53 @@ int phy_ethtool_nway_reset(struct net_device *ndev);
 #if IS_ENABLED(CONFIG_PHYLIB)
 int __init mdio_bus_init(void);
 void mdio_bus_exit(void);
-int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data);
-int phy_ethtool_get_sset_count(struct phy_device *phydev);
-int phy_ethtool_get_stats(struct phy_device *phydev,
- struct ethtool_stats *stats, u64 *data);
-#else
-int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+#endif
+
+/* Inline function for use within net/core/ethtool.c (built-in) */
+static inline int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
 {
-   return -EOPNOTSUPP;
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   phydev->drv->get_strings(phydev, data);
+   mutex_unlock(&phydev->lock);
+
+   return 0;
 }
 
-int phy_ethtool_get_sset_count(struct phy_device *phydev)
+static inline int phy_ethtool_get_sset_count(struct phy_device *phydev)
 {
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   if (phydev->drv->get_sset_count &&
+   phydev->drv->get_strings &&
+   phydev->drv->get_stats) {
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->get_sset_count(phydev);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+   }
+
return -EOPNOTSUPP;
 }
 
-int phy_ethtool_get_stats(struct phy_device *phydev,
- struct ethtool_stats *stats, u64 *data)
+static inline int phy_ethtool_get_stats(struct phy_device *phydev,
+   struct ethtool_stats *stats, u64 *data)
 {
-   return -EOPNOTSUPP;
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   

Re: [PATCH 2/2] bpf: btf: remove a couple conditions

2018-04-27 Thread Dan Carpenter
On Fri, Apr 27, 2018 at 10:55:46AM -0700, Martin KaFai Lau wrote:
> On Fri, Apr 27, 2018 at 10:20:25AM -0700, Martin KaFai Lau wrote:
> > On Fri, Apr 27, 2018 at 05:04:59PM +0300, Dan Carpenter wrote:
> > > We know "err" is zero so we can remove these and pull the code in one
> > > indent level.
> > > 
> > > Signed-off-by: Dan Carpenter 
> > Thanks for the simplification!
> > 
> > Acked-by: Martin KaFai Lau 
> btw, it should be for bpf-next.  Please tag the subject with bpf-next when
> you respin. Thanks!
>

I'm working against linux-next.  For networking, I have a separate tree
which I use to figure out if it's in net or net-next.  It's kind of a
headache (but obviously networking is the largest subtree so it's
required).

Is there an automated way to tie a Fixes tag from linux-next to a
subtree?

regards,
dan carpenter



[PATCH iproute2-next v6] Add support for cake qdisc

2018-04-27 Thread Toke Høiland-Jørgensen
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring loss, ecn markings, and latency variation.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht 
Signed-off-by: Toke Høiland-Jørgensen 
---
Changelog:
v6:
  - Move the target/interval presets to a table and check that only
one is passed.

v5:
  - Print the SPLIT_GSO flag
  - Switch to print_u64() for JSON output
  - Fix a format string for mpu option output

v4:
  - Switch stats parsing to use nested netlink attributes
  - Tweaks to JSON stats output keys

v3:
  - Remove accidentally included test flag

v2:
  - Updated netlink config ABI
  - Remove diffserv-llt mode
  - Various tweaks and clean-ups of stats output
 man/man8/tc-cake.8 | 632 ++
 man/man8/tc.8  |   1 +
 tc/Makefile|   1 +
 tc/q_cake.c| 739 +
 4 files changed, 1373 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index ..dff2e360
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "27 April 2018" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a 
variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to 
virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will 
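
As a rough illustration of the deficit-mode idea described in the man
page text above, here is a conceptual sketch (stand-alone C, not CAKE's
actual shaper): the next transmit time is advanced by each packet's
serialization delay and clamped to the current time, so idle periods
never bank up a burst the way token-bucket credit does.

/* Conceptual sketch, not CAKE's shaper. */
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct shaper {
	uint64_t rate_bps;	/* configured rate, bits per second */
	uint64_t time_next_ns;	/* earliest departure time for the next packet */
};

static bool shaper_may_send(const struct shaper *s, uint64_t now_ns)
{
	return now_ns >= s->time_next_ns;
}

static void shaper_charge(struct shaper *s, uint64_t now_ns, uint32_t len_bytes)
{
	uint64_t serial_ns = (uint64_t)len_bytes * 8 * NSEC_PER_SEC / s->rate_bps;

	/* Clamp to "now": an idle period does not bank credit the way a
	 * token bucket does, so there is no initial burst.
	 */
	if (s->time_next_ns < now_ns)
		s->time_next_ns = now_ns;
	s->time_next_ns += serial_ns;
}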

[PATCH net-next v5] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-27 Thread Toke Høiland-Jørgensen
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring loss, ecn markings, and latency
  variation.

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 
100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 500b
 capacity estimate: 100Mbit
 min/max network layer size:       65535 /       0
 min/max overhead-adjusted size:   65535 /       0
 average network hdr offset:           0

                  Bulk  Best Effort        Voice
  thresh      6250Kbit      100Mbit       25Mbit
  target         5.0ms        5.0ms        5.0ms
  interval     100.0ms      100.0ms      100.0ms
  pk_delay         0us          0us          0us
  av_delay         0us          0us          0us
  sp_delay         0us          0us          0us
  pkts               0            0            0
  bytes              0            0            0
  way_inds           0            0            0
  way_miss           0            0            0
  way_cols           0            0            0
  drops              0            0            0
  marks              0            0            0
  ack_drop           0            0            0
  sp_flows           0            0            0
  bk_flows           0            0            0
  un_flows           0            0            0
  max_len            0            0            0
  quantum          300         1514          762

Tested-by: Pete Heist 
Tested-by: Georgios Amanakis 
Signed-off-by: Dave Taht 
Signed-off-by: Toke Høiland-Jørgensen 
---
Changelog
v5:
  - Refactor ACK filter code and hopefully fix the safety issues
properly this time.

v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
complaint).
  - Several fixes to the ACK filter:
- Check pskb_may_pull() before deref of transport headers.
- Don't run ACK filter logic on split GSO packets
- Fix TCP sequence number compare to deal with wraparounds

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
while writing the paper.

 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig  |   11 +
 net/sched/Makefile |1 +
 net/sched/sch_cake.c   | 2680 
 4 files changed, 2797 insertions(+)
 create mode 100644 

Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device

2018-04-27 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:53:01PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/27/2018 10:45 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudr...@intel.com wrote:

[...]

>> 
>> No changes in v9?
>
>I listed v9 updates at the start of the message.

Hmm, odd. I expected that at the end, in the changelog among other Vs
changes.

Will review this patchset tomorrow.
Thanks!


Re: [PATCH net] vhost: Use kzalloc() to allocate vhost_msg_node

2018-04-27 Thread Michael S. Tsirkin
On Fri, Apr 27, 2018 at 06:11:31PM +0200, Dmitry Vyukov wrote:
> On Fri, Apr 27, 2018 at 6:05 PM, Michael S. Tsirkin  wrote:
> > On Fri, Apr 27, 2018 at 11:45:02AM -0400, Kevin Easton wrote:
> >> The struct vhost_msg within struct vhost_msg_node is copied to userspace,
> >> so it should be allocated with kzalloc() to ensure all structure padding
> >> is zeroed.
> >>
> >> Signed-off-by: Kevin Easton 
> >> Reported-by: syzbot+87cfa083e727a2247...@syzkaller.appspotmail.com
> >
> > Does it help if a patch naming the padding is applied,
> > and then we init just the relevant field?
> > Just curious.
> 
> Yes, it would help.

How about a Tested-by tag then?

> >> ---
> >>  drivers/vhost/vhost.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index f3bd8e9..1b84dcff 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -2339,7 +2339,7 @@ EXPORT_SYMBOL_GPL(vhost_disable_notify);
> >>  /* Create a new message. */
> >>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
> >>  {
> >> - struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
> >> + struct vhost_msg_node *node = kzalloc(sizeof *node, GFP_KERNEL);
> >>   if (!node)
> >>   return NULL;
> >>   node->vq = vq;
> >> --
> >> 2.8.1
> >
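
For illustration, a minimal sketch of the padding concern discussed in this
thread. The struct layout and the copy_msg_to_user() helper below are invented
for the example (this is not the real struct vhost_msg); the point is only that
kmalloc() leaves compiler-inserted padding uninitialized, while kzalloc()
zeroes the whole allocation:

#include <linux/slab.h>
#include <linux/uaccess.h>

/* Hypothetical layout, for illustration only. */
struct msg_like {
	u32 type;	/* 4 bytes */
	/* the compiler inserts 4 bytes of padding here */
	u64 payload;	/* 8-byte aligned */
};

static int copy_msg_to_user(void __user *dst)
{
	struct msg_like *m = kmalloc(sizeof(*m), GFP_KERNEL);

	if (!m)
		return -ENOMEM;
	m->type = 1;
	m->payload = 42;
	/* Both named fields are initialized, but the 4 padding bytes still
	 * hold stale heap contents and copy_to_user() copies them verbatim.
	 * Allocating with kzalloc() instead zeroes the whole object,
	 * padding included, which is what the patch does.
	 */
	if (copy_to_user(dst, m, sizeof(*m))) {
		kfree(m);
		return -EFAULT;
	}
	kfree(m);
	return 0;
}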


Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support

2018-04-27 Thread Björn Töpel
2018-04-27 21:06 GMT+02:00 Willem de Bruijn :
> On Fri, Apr 27, 2018 at 2:12 PM, Björn Töpel  wrote:
>> 2018-04-27 19:16 GMT+02:00 Willem de Bruijn 
>> :
>>> On Fri, Apr 27, 2018 at 8:17 AM, Björn Töpel  wrote:
 From: Björn Töpel 

 This patch set introduces a new address family called AF_XDP that is
 optimized for high performance packet processing and, in upcoming
 patch sets, zero-copy semantics. In this v2 version, we have removed
 all zero-copy related code in order to make it smaller, simpler and
 hopefully more review friendly. This patch set only supports copy-mode
 for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
 for RX using the XDP_DRV path. Zero-copy support requires XDP and
 driver changes that Jesper Dangaard Brouer is working on. Some of his
 work has already been accepted. We will publish our zero-copy support
 for RX and TX on top of his patch sets at a later point in time.
>>>
 Changes from V1:

 * Fixes to bugs spotted by Will in his review
 * Implemented the performance optimization to BPF_MAP_TYPE_XSKMAP
   suggested by Will
>>>
>>> An xsk may only exist in one map at a time. Is this somehow assured?
>>>
>>
>> Actually this is *not* the case. An xsk may reside in many maps, and
>> multiple times in the same map. So it's not assured at all. :-)
>
> Then can multiple call_rcu operations on xs->rcu race?

Hmm, right. xsk_flush in the rcu callback might race. I'd rather not
constrain the socket to only reside in one map... I'll try to fix the
flush, and go for the constraint as a last resort. IOW, I'll spin a v3
next week!

Thanks for finding this!


Björn
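
For readers following along, a minimal sketch of the hazard being discussed.
All names below (struct xsk_like, xsk_like_free, slot_clear) are invented for
illustration and are not the xskmap code; it only shows why letting one object
sit in several map slots is dangerous when each slot removal schedules an RCU
callback on the object's single embedded rcu_head:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct xsk_like {
	struct rcu_head rcu;	/* only one rcu_head per object */
	/* ... flush state, rings, etc. ... */
};

static void xsk_like_free(struct rcu_head *head)
{
	kfree(container_of(head, struct xsk_like, rcu));
}

/* Called when a map slot that referenced @xs is cleared or replaced. */
static void slot_clear(struct xsk_like *xs)
{
	/* If the same object is referenced from two slots and both slots are
	 * cleared, call_rcu() runs twice on the same rcu_head before the
	 * first callback fires, corrupting the RCU callback list; even with
	 * the free deferred, two callbacks can race on the same flush state.
	 * The usual fixes are a per-object reference count, or a guarantee
	 * that the object lives in at most one slot at a time.
	 */
	call_rcu(&xs->rcu, xsk_like_free);
}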


Re: [PATCH] ptp_pch: use helpers function for converting between ns and timespec

2018-04-27 Thread David Miller
From: YueHaibing 
Date: Fri, 27 Apr 2018 15:36:18 +0800

> use ns_to_timespec64() and timespec64_to_ns() instead of open coding
> 
> Signed-off-by: YueHaibing 

Applied to net-next, thanks.


Re: [PATCH v5 net-next 3/3] lan78xx: Modify error messages

2018-04-27 Thread Sergei Shtylyov
Hello!

On 04/27/2018 09:47 PM, Raghuram Chary J wrote:

> Modify the error messages when phy registration fails.
> 
> Signed-off-by: Raghuram Chary J 
> ---
>  drivers/net/usb/lan78xx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 54f8db887e3d..4b930c9faa16 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -2100,14 +2100,14 @@ static struct phy_device *lan7801_phy_init(struct 
> lan78xx_net *dev)
>   ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfff0,
>ksz9031rnx_fixup);
>   if (ret < 0) {
> - netdev_err(dev->net, "fail to register fixup\n");
> - netdev_err(dev->net, "fail to register fixup for PHY_KSZ9031RNX\n");

   Could correct "fail" to "failed", while at it.

>   return NULL;
>   }
>   /* external PHY fixup for LAN8835 */
>   ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfff0,
>lan8835_fixup);
>   if (ret < 0) {
> - netdev_err(dev->net, "fail to register fixup\n");
> - netdev_err(dev->net, "fail to register fixup for PHY_LAN8835\n");

   Likewise.

[...]

MBR, Sergei


Re: [PATCH] net: systemport: fix spelling mistake: "asymetric" -> "asymmetric"

2018-04-27 Thread Florian Fainelli
On 04/27/2018 12:09 PM, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in netdev_warn warning message
> 
> Signed-off-by: Colin Ian King 

Acked-by: Florian Fainelli 

Thanks Colin!
-- 
Florian


Re: [PATCH v3] net: qrtr: Expose tunneling endpoint to user space

2018-04-27 Thread David Miller
From: Bjorn Andersson 
Date: Thu, 26 Apr 2018 22:28:49 -0700

> This implements a misc character device named "qrtr-tun" for the purpose
> of allowing user space applications to implement endpoints in the qrtr
> network.
> 
> This allows more advanced (and dynamic) testing of the qrtr code as well
> as opens up the ability of tunneling qrtr over a network or USB link.
> 
> Signed-off-by: Bjorn Andersson 

Applied to net-next, thank you.


[PATCH] net: systemport: fix spelling mistake: "asymetric" -> "asymmetric"

2018-04-27 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in netdev_warn warning message

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index effc651a2a2f..54ad1171e75e 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2178,7 +2178,7 @@ static int bcm_sysport_map_queues(struct net_device *dev,
 
if (priv->per_port_num_tx_queues &&
priv->per_port_num_tx_queues != num_tx_queues)
-   netdev_warn(slave_dev, "asymetric number of per-port queues\n");
+   netdev_warn(slave_dev, "asymmetric number of per-port queues\n");
 
priv->per_port_num_tx_queues = num_tx_queues;
 
-- 
2.17.0



Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support

2018-04-27 Thread Willem de Bruijn
On Fri, Apr 27, 2018 at 2:12 PM, Björn Töpel  wrote:
> 2018-04-27 19:16 GMT+02:00 Willem de Bruijn :
>> On Fri, Apr 27, 2018 at 8:17 AM, Björn Töpel  wrote:
>>> From: Björn Töpel 
>>>
>>> This patch set introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and, in upcoming
>>> patch sets, zero-copy semantics. In this v2 version, we have removed
>>> all zero-copy related code in order to make it smaller, simpler and
>>> hopefully more review friendly. This patch set only supports copy-mode
>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>> work has already been accepted. We will publish our zero-copy support
>>> for RX and TX on top of his patch sets at a later point in time.
>>
>>> Changes from V1:
>>>
>>> * Fixes to bugs spotted by Will in his review
>>> * Implemented the performance optimization to BPF_MAP_TYPE_XSKMAP
>>>   suggested by Will
>>
>> An xsk may only exist in one map at a time. Is this somehow assured?
>>
>
> Actually this is *not* the case. An xsk may reside in many maps, and
> multiple times in the same map. So it's not assured at all. :-)

Then can multiple call_rcu operations on xs->rcu race?


Re: [PATCH net-next 03/13] sctp: remove an if() that is always true

2018-04-27 Thread Neil Horman
On Fri, Apr 27, 2018 at 03:13:49PM -0300, Marcelo Ricardo Leitner wrote:
> On Fri, Apr 27, 2018 at 06:50:50AM -0400, Neil Horman wrote:
> > On Thu, Apr 26, 2018 at 04:58:52PM -0300, Marcelo Ricardo Leitner wrote:
> > > As noticed by Xin Long, the if() here is always true as PMTU can never
> > > be 0.
> > >
> > > Reported-by: Xin Long 
> > > Signed-off-by: Marcelo Ricardo Leitner 
> > > ---
> > >  net/sctp/associola.c | 6 ++
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> > > index 
> > > b3aa95222bd52113295cb246c503c903bdd5c353..c5ed09cfa8423b17546e3d45f6d06db03af66384
> > >  100644
> > > --- a/net/sctp/associola.c
> > > +++ b/net/sctp/associola.c
> > > @@ -1397,10 +1397,8 @@ void sctp_assoc_sync_pmtu(struct sctp_association 
> > > *asoc)
> > >   pmtu = t->pathmtu;
> > >   }
> > >
> > > - if (pmtu) {
> > > - asoc->pathmtu = pmtu;
> > > - asoc->frag_point = sctp_frag_point(asoc, pmtu);
> > > - }
> > > + asoc->pathmtu = pmtu;
> > > + asoc->frag_point = sctp_frag_point(asoc, pmtu);
> > >
> > Can you double check this?  Looking at it, it seems far fetched, but if 
> > someone
> 
> Sure.
> 
> > sends a crafted icmp dest unreach message to the host, pmtu_pending might be
> > able to get set for an association (which may have no transports established
> > yet), and if so, on the first packet send sctp_assoc_sync_pmtu can be 
> > called,
> > leading to a fall through in the loop over all transports, and pmtu being 
> > zero.
> > It seems like a far fetched set of circumstances, I know, but if it can 
> > happen,
> > I think you might see a crash in sctp_frag_point due to an underflow of the 
> > frag
> > value
> 
> If I got you right, this situation would not happen because when
> handling the icmp it will check if there is a transport and ignore it
> otherwise.
> 
Yup, you're right.  sctp_err_lookup returns NULL if there is no transport
associated with the icmp message

Thanks
Neil

>   Marcelo
> 
> >
> > Neil
> >
> > >   pr_debug("%s: asoc:%p, pmtu:%d, frag_point:%d\n", __func__, asoc,
> > >asoc->pathmtu, asoc->frag_point);
> > > --
> > > 2.14.3
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 


Re: [RFC PATCH] MAINTAINERS: add davem in NETWORKING DRIVERS

2018-04-27 Thread David Miller
From: Vivien Didelot 
Date: Thu, 26 Apr 2018 19:47:35 -0400

> "./scripts/get_maintainer.pl -f" does not actually show us David as the
> maintainer of drivers/net directories such as team, bonding, phy or dsa.
> Adding him in an M: entry of NETWORKING DRIVERS fixes this.
> 
> Signed-off-by: Vivien Didelot 

Yeah, might as well.

Applied, thanks.


Re: [PATCH net-next 0/7] selftests: Add tests for mirroring to gretap

2018-04-27 Thread David Miller
From: Petr Machata 
Date: Fri, 27 Apr 2018 01:17:05 +0200

> This suite tests GRE-encapsulated mirroring. The general topology that
> most of the tests use is as follows, but each test defines details of
> the topology based on its needs, and some tests actually use a somewhat
> different topology.

Always good to see more networking tests.

Series applied, thanks.


[PATCH V2 net-next 2/2] selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Willem de Bruijn 
Reviewed-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..3f6a27efbe5cf
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef TCP_INQ
+#define TCP_INQ 35
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+   while ((c = getopt(argc, argv, "46p:")) != -1) {
+   switch (c) {
+   case '4':
+

[PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.
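
As a quick illustration of the interface described above, here is a minimal
userspace sketch (error handling trimmed; recv_with_inq() is an invented
helper, and the TCP_INQ/TCP_CM_INQ fallback defines simply mirror this patch
for systems whose headers predate it):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef TCP_INQ
#define TCP_INQ 35
#define TCP_CM_INQ TCP_INQ
#endif

/* Read with recvmsg() and pick the TCP_CM_INQ control message, which carries
 * the number of bytes still queued on the socket, out of the ancillary data. */
static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
{
	char control[CMSG_SPACE(sizeof(int))];
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;
	ssize_t ret;

	*inq = -1;		/* -1 means "no hint received" */
	ret = recvmsg(fd, &msg, 0);
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
		if (cm->cmsg_level == SOL_TCP && cm->cmsg_type == TCP_CM_INQ)
			memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
	return ret;
}

/* Opt in once per socket before the first read:
 *	int one = 1;
 *	setsockopt(fd, SOL_TCP, TCP_INQ, &one, sizeof(one));
 */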

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Willem de Bruijn 
Reviewed-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 include/linux/tcp.h  |  2 +-
 include/net/tcp.h|  8 
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 27 +++
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max_t(int, 0,
+READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_INQ 35 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
goto out;
 
+   has_cmsg = tp->recvmsg_inq;
timeo = sock_rcvtimeo(sk, nonblock);
 
/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (TCP_SKB_CB(skb)->has_rxtstamp) {
tcp_update_recv_tstamps(skb, &tss);
  

Re: [PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
On Fri, Apr 27, 2018 at 2:50 PM, Soheil Hassas Yeganeh
 wrote:
> From: Soheil Hassas Yeganeh 
>
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Willem de Bruijn 
> Reviewed-by: Eric Dumazet 
> Reviewed-by: Neal Cardwell 

Really sorry about the wrong patch subject. I'll send a V2 with the
corrected subject momentarily.

> ---
>  tools/testing/selftests/net/Makefile  |   3 +-
>  tools/testing/selftests/net/tcp_inq.c | 189 ++
>  2 files changed, 191 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/net/tcp_inq.c
>
> diff --git a/tools/testing/selftests/net/Makefile 
> b/tools/testing/selftests/net/Makefile
> index df9102ec7b7af..0a1821f8dfb18 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
> pmtu.sh udpgso.sh
>  TEST_PROGS += udpgso_bench.sh
>  TEST_GEN_FILES =  socket
>  TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
> -TEST_GEN_FILES += tcp_mmap
> +TEST_GEN_FILES += tcp_mmap tcp_inq
>  TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
>  TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
>  TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
> @@ -18,3 +18,4 @@ include ../lib.mk
>
>  $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
>  $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
> +$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
> diff --git a/tools/testing/selftests/net/tcp_inq.c 
> b/tools/testing/selftests/net/tcp_inq.c
> new file mode 100644
> index 0..3f6a27efbe5cf
> --- /dev/null
> +++ b/tools/testing/selftests/net/tcp_inq.c
> @@ -0,0 +1,189 @@
> +/*
> + * Copyright 2018 Google Inc.
> + * Author: Soheil Hassas Yeganeh (soh...@google.com)
> + *
> + * Simple example on how to use TCP_INQ and TCP_CM_INQ.
> + *
> + * License (GPLv2):
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
> + * more details.
> + */
> +#define _GNU_SOURCE
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#ifndef TCP_INQ
> +#define TCP_INQ 35
> +#endif
> +
> +#ifndef TCP_CM_INQ
> +#define TCP_CM_INQ TCP_INQ
> +#endif
> +
> +#define BUF_SIZE 8192
> +#define CMSG_SIZE 32
> +
> +static int family = AF_INET6;
> +static socklen_t addr_len = sizeof(struct sockaddr_in6);
> +static int port = 4974;
> +
> +static void setup_loopback_addr(int family, struct sockaddr_storage 
> *sockaddr)
> +{
> +   struct sockaddr_in6 *addr6 = (void *) sockaddr;
> +   struct sockaddr_in *addr4 = (void *) sockaddr;
> +
> +   switch (family) {
> +   case PF_INET:
> +   memset(addr4, 0, sizeof(*addr4));
> +   addr4->sin_family = AF_INET;
> +   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +   addr4->sin_port = htons(port);
> +   break;
> +   case PF_INET6:
> +   memset(addr6, 0, sizeof(*addr6));
> +   addr6->sin6_family = AF_INET6;
> +   addr6->sin6_addr = in6addr_loopback;
> +   addr6->sin6_port = htons(port);
> +   break;
> +   default:
> +   error(1, 0, "illegal family");
> +   }
> +}
> +
> +void *start_server(void *arg)
> +{
> +   int server_fd = (int)(unsigned long)arg;
> +   struct sockaddr_in addr;
> +   socklen_t addrlen = sizeof(addr);
> +   char *buf;
> +   int fd;
> +   int r;
> +
> +   buf = malloc(BUF_SIZE);
> +
> +   for (;;) {
> +   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
> +   if (fd == -1) {
> +   perror("accept");
> +   break;
> +   }
> +   do {
> +   r = send(fd, buf, BUF_SIZE, 0);
> +   } while (r < 0 && errno == EINTR);
> +   if (r < 0)
> +   perror("send");
> +   if (r != BUF_SIZE)
> +   fprintf(stderr, "can only send %d bytes\n", r);
> +   /* TCP_INQ can overestimate in-queue by one byte if we send
> +* the FIN packet. Sleep for 1 second, so that the client
> +* likely invoked recvmsg().
> +*/
> +   sleep(1);
> +   close(fd);
> +   }
> +
> +   free(buf);
> +   close(server_fd);

[PATCH] DT: net: can: rcar_canfd: document R8A77980 bindings

2018-04-27 Thread Sergei Shtylyov
Document the R-Car V3H (R8A77980) SoC support in the R-Car CAN-FD bindings.

Signed-off-by: Sergei Shtylyov 

---
The patch is against the 'linux-can-next.git' repo plus the R8A77970 bindings
patch posted yesterday. Although I wouldn't object if they're both merged to
the 'linux-can.git' repo instead. :-)

 Documentation/devicetree/bindings/net/can/rcar_canfd.txt |1 +
 1 file changed, 1 insertion(+)

Index: linux-can-next/Documentation/devicetree/bindings/net/can/rcar_canfd.txt
===
--- linux-can-next.orig/Documentation/devicetree/bindings/net/can/rcar_canfd.txt
+++ linux-can-next/Documentation/devicetree/bindings/net/can/rcar_canfd.txt
@@ -7,6 +7,7 @@ Required properties:
   - "renesas,r8a7795-canfd" for R8A7795 (R-Car H3) compatible controller.
   - "renesas,r8a7796-canfd" for R8A7796 (R-Car M3) compatible controller.
   - "renesas,r8a77970-canfd" for R8A77970 (R-Car V3M) compatible controller.
+  - "renesas,r8a77980-canfd" for R8A77980 (R-Car V3H) compatible controller.
 
   When compatible with the generic version, nodes must list the
   SoC-specific version corresponding to the platform first, followed by the


Re: [PATCH net-next v2 00/14] bnxt_en: Net-next updates.

2018-04-27 Thread David Miller
From: Michael Chan 
Date: Thu, 26 Apr 2018 17:44:30 -0400

> This series has 3 main features.  The first is to add mqprio TC to
> hardware queue mapping to avoid reprogramming hardware CoS queue
> watermarks during run-time.  The second is DIM improvements from
> Andy Gospo.  The third is some improvements to VF resource allocations
> when supporting large numbers of VFs with more limited resources.
> 
> There are some additional minor improvements and a new function level
> discard counter.
> 
> v2: Fixed EEPROM typo noted by Andrew Lunn.

Series applied, thanks Michael.


[PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Willem de Bruijn 
Reviewed-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh 
pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c 
b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0..3f6a27efbe5cf
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soh...@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef TCP_INQ
+#define TCP_INQ 35
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+   struct sockaddr_in6 *addr6 = (void *) sockaddr;
+   struct sockaddr_in *addr4 = (void *) sockaddr;
+
+   switch (family) {
+   case PF_INET:
+   memset(addr4, 0, sizeof(*addr4));
+   addr4->sin_family = AF_INET;
+   addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+   addr4->sin_port = htons(port);
+   break;
+   case PF_INET6:
+   memset(addr6, 0, sizeof(*addr6));
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_addr = in6addr_loopback;
+   addr6->sin6_port = htons(port);
+   break;
+   default:
+   error(1, 0, "illegal family");
+   }
+}
+
+void *start_server(void *arg)
+{
+   int server_fd = (int)(unsigned long)arg;
+   struct sockaddr_in addr;
+   socklen_t addrlen = sizeof(addr);
+   char *buf;
+   int fd;
+   int r;
+
+   buf = malloc(BUF_SIZE);
+
+   for (;;) {
+   fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+   if (fd == -1) {
+   perror("accept");
+   break;
+   }
+   do {
+   r = send(fd, buf, BUF_SIZE, 0);
+   } while (r < 0 && errno == EINTR);
+   if (r < 0)
+   perror("send");
+   if (r != BUF_SIZE)
+   fprintf(stderr, "can only send %d bytes\n", r);
+   /* TCP_INQ can overestimate in-queue by one byte if we send
+* the FIN packet. Sleep for 1 second, so that the client
+* likely invoked recvmsg().
+*/
+   sleep(1);
+   close(fd);
+   }
+
+   free(buf);
+   close(server_fd);
+   pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+   struct sockaddr_storage listen_addr, addr;
+   int c, one = 1, inq = -1;
+   pthread_t server_thread;
+   char cmsgbuf[CMSG_SIZE];
+   struct iovec iov[1];
+   struct cmsghdr *cm;
+   struct msghdr msg;
+   int server_fd, fd;
+   char *buf;
+
+   while ((c = getopt(argc, argv, "46p:")) != -1) {
+   switch (c) {
+   case '4':
+

[PATCH net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-27 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Willem de Bruijn 
Reviewed-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 include/linux/tcp.h  |  2 +-
 include/net/tcp.h|  8 
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 27 +++
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
unused:2;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
-   unused1 : 1,
+   recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
repair  : 1,
frto: 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8  repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max_t(int, 0,
+READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_INQ 35 /* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ TCP_INQ
 
 struct tcp_repair_opt {
__u32   opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
u32 peek_seq;
u32 *seq;
unsigned long used;
-   int err;
+   int err, inq;
int target; /* Read at least this many bytes */
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
struct scm_timestamping tss;
bool has_tss = false;
+   bool has_cmsg;
 
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (sk->sk_state == TCP_LISTEN)
goto out;
 
+   has_cmsg = tp->recvmsg_inq;
timeo = sock_rcvtimeo(sk, nonblock);
 
/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int nonblock,
if (TCP_SKB_CB(skb)->has_rxtstamp) {
tcp_update_recv_tstamps(skb, &tss);
  

Re: [PATCH net-next] hv_netvsc: simplify receive side calling arguments

2018-04-27 Thread David Miller
From: Stephen Hemminger 
Date: Thu, 26 Apr 2018 14:34:25 -0700

> The calls up from the napi poll reading the receive ring had many
> places where an argument was being recreated. I.e the caller already
> had the value and wasn't passing it, then the callee would use
> known relationship to determine the same value. Simpler and faster
> to just pass arguments needed.
> 
> Also, add const in a couple places where message is being only read.
> 
> Signed-off-by: Stephen Hemminger 

Applied, thanks Stephen.


RE: [PATCH v1 net-next] lan78xx: Lan7801 Support for Fixed PHY

2018-04-27 Thread RaghuramChary.Jallipalli
> >* Minor cleanup
> 
> This patch doesn't apply cleanly to net-next, please respin.

Apologies. Have sent updated version v5.

Thanks,
-Raghu


[PATCH v5 net-next 0/3] lan78xx updates along with Fixed phy Support

2018-04-27 Thread Raghuram Chary J
These series of patches handle few modifications in driver
and adds support for fixed phy.

Raghuram Chary J (3):
  lan78xx: Lan7801 Support for Fixed PHY
  lan78xx: Remove DRIVER_VERSION for lan78xx driver
  lan78xx: Modify error messages

 drivers/net/usb/Kconfig   |   1 +
 drivers/net/usb/lan78xx.c | 110 --
 2 files changed, 79 insertions(+), 32 deletions(-)

-- 
2.16.2



[PATCH v5 net-next 3/3] lan78xx: Modify error messages

2018-04-27 Thread Raghuram Chary J
Modify the error messages when phy registration fails.

Signed-off-by: Raghuram Chary J 
---
 drivers/net/usb/lan78xx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 54f8db887e3d..4b930c9faa16 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2100,14 +2100,14 @@ static struct phy_device *lan7801_phy_init(struct 
lan78xx_net *dev)
ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfff0,
 ksz9031rnx_fixup);
if (ret < 0) {
-   netdev_err(dev->net, "fail to register fixup\n");
+   netdev_err(dev->net, "fail to register fixup for PHY_KSZ9031RNX\n");
return NULL;
}
/* external PHY fixup for LAN8835 */
ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfff0,
 lan8835_fixup);
if (ret < 0) {
-   netdev_err(dev->net, "fail to register fixup\n");
+   netdev_err(dev->net, "fail to register fixup for PHY_LAN8835\n");
return NULL;
}
/* add more external PHY fixup here if needed */
-- 
2.16.2



[PATCH v5 net-next 2/3] lan78xx: Remove DRIVER_VERSION for lan78xx driver

2018-04-27 Thread Raghuram Chary J
Remove driver version info from the lan78xx driver.

Signed-off-by: Raghuram Chary J 
---
 drivers/net/usb/lan78xx.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 81dfd10c3b92..54f8db887e3d 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -44,7 +44,6 @@
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
 #define DRIVER_DESC"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME"lan78xx"
-#define DRIVER_VERSION "1.0.6"
 
 #define TX_TIMEOUT_JIFFIES (5 * HZ)
 #define THROTTLE_JIFFIES   (HZ / 8)
@@ -1503,7 +1502,6 @@ static void lan78xx_get_drvinfo(struct net_device *net,
struct lan78xx_net *dev = netdev_priv(net);
 
strncpy(info->driver, DRIVER_NAME, sizeof(info->driver));
-   strncpy(info->version, DRIVER_VERSION, sizeof(info->version));
usb_make_path(dev->udev, info->bus_info, sizeof(info->bus_info));
 }
 
-- 
2.16.2



[PATCH v5 net-next 1/3] lan78xx: Lan7801 Support for Fixed PHY

2018-04-27 Thread Raghuram Chary J
Adding Fixed PHY support to the lan78xx driver.

Signed-off-by: Raghuram Chary J 
---
v0->v1:
   * Remove driver version #define
   * Modify netdev_info to netdev_dbg
   * Move lan7801 specific to new routine and add switch case
   * Minor cleanup

v1->v2:
   * Removed fixedphy variable and used phy_is_pseudo_fixed_link() check.
v2->v3:
   * Revert driver version, debug statment changes for separate patch.
   * Modify lan7801 specific routine with return type struct phy_device.
v3->v4:
   * Modify lan7801 specific routine by removing phydev arg and get phydev.
v4->v5:
   * Patched in latest net-next.
   * Also handled error condition for fixedphy.
---
 drivers/net/usb/Kconfig   |   1 +
 drivers/net/usb/lan78xx.c | 104 +-
 2 files changed, 77 insertions(+), 28 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index f28bd74ac275..418b0904cecb 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -111,6 +111,7 @@ config USB_LAN78XX
select MII
select PHYLIB
select MICROCHIP_PHY
+   select FIXED_PHY
help
  This option adds support for Microchip LAN78XX based USB 2
  & USB 3 10/100/1000 Ethernet adapters.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index c59f8afd0d73..81dfd10c3b92 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include "lan78xx.h"
@@ -2063,52 +2063,91 @@ static int ksz9031rnx_fixup(struct phy_device *phydev)
return 1;
 }
 
-static int lan78xx_phy_init(struct lan78xx_net *dev)
+static struct phy_device *lan7801_phy_init(struct lan78xx_net *dev)
 {
+   u32 buf;
int ret;
-   u32 mii_adv;
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
struct phy_device *phydev;
 
phydev = phy_find_first(dev->mdiobus);
if (!phydev) {
-   netdev_err(dev->net, "no PHY found\n");
-   return -EIO;
-   }
-
-   if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
-   (dev->chipid == ID_REV_CHIP_ID_7850_)) {
-   phydev->is_internal = true;
-   dev->interface = PHY_INTERFACE_MODE_GMII;
-
-   } else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+   netdev_dbg(dev->net, "PHY Not Found!! Registering Fixed PHY\n");
+   phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
+   NULL);
+   if (IS_ERR(phydev)) {
+   netdev_err(dev->net, "No PHY/fixed_PHY found\n");
+   return NULL;
+   }
+   netdev_dbg(dev->net, "Registered FIXED PHY\n");
+   dev->interface = PHY_INTERFACE_MODE_RGMII;
+   ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+   MAC_RGMII_ID_TXC_DELAY_EN_);
+   ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+   ret = lan78xx_read_reg(dev, HW_CFG, &buf);
+   buf |= HW_CFG_CLK125_EN_;
+   buf |= HW_CFG_REFCLK25_EN_;
+   ret = lan78xx_write_reg(dev, HW_CFG, buf);
+   } else {
if (!phydev->drv) {
netdev_err(dev->net, "no PHY driver found\n");
-   return -EIO;
+   return NULL;
}
-
dev->interface = PHY_INTERFACE_MODE_RGMII;
-
/* external PHY fixup for KSZ9031RNX */
ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfff0,
 ksz9031rnx_fixup);
if (ret < 0) {
netdev_err(dev->net, "fail to register fixup\n");
-   return ret;
+   return NULL;
}
/* external PHY fixup for LAN8835 */
ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfff0,
 lan8835_fixup);
if (ret < 0) {
netdev_err(dev->net, "fail to register fixup\n");
-   return ret;
+   return NULL;
}
/* add more external PHY fixup here if needed */
 
phydev->is_internal = false;
-   } else {
-   netdev_err(dev->net, "unknown ID found\n");
-   ret = -EIO;
-   goto error;
+   }
+   return phydev;
+}
+
+static int lan78xx_phy_init(struct lan78xx_net *dev)
+{
+   int ret;
+   u32 mii_adv;
+   struct phy_device *phydev;
+
+   switch (dev->chipid) {
+   case ID_REV_CHIP_ID_7801_:
+   phydev = lan7801_phy_init(dev);
+   

Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally

2018-04-27 Thread Eddie Chapman

On 27/04/18 19:07, Thomas Deutschmann wrote:

Hi Greg,

First, we need to cherry-pick another patch:
  

 From 52a589d51f1008f62569bf89e95b26221ee76690 Mon Sep 17 00:00:00 2001
From: Xin Long 
Date: Mon, 25 Dec 2017 14:43:58 +0800
Subject: [PATCH] geneve: update skb dst pmtu on tx path

Commit a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") has fixed
a performance issue caused by the change of lower dev's mtu for vxlan.

The same thing needs to be done for geneve as well.

Note that geneve cannot adjust its mtu according to lower dev's mtu
when creating it. The performance is very low later when netperfing
over it without fixing the mtu manually. This patch could also avoid
this issue.

Signed-off-by: Xin Long 
Signed-off-by: David S. Miller 


Oops, I completely missed that the coreos patch doesn't have the geneve 
hunk that is in the original 4.15 patch. I don't load the geneve module 
on my box, which is why no problems surfaced on my machine.


Thanks Thomas for the correct instructions. Ignore my message Greg, I'll 
drop back into the shadows where I belong, sorry for the noise!


Re: [PATCH net-next 00/13] sctp: refactor MTU handling

2018-04-27 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Thu, 26 Apr 2018 16:58:49 -0300

> Currently MTU handling is spread over SCTP stack. There are multiple
> places doing same/similar calculations and updating them is error prone
> as one spot can easily be left out.
> 
> This patchset converges it into a more concise and consistent code. In
> general, it moves MTU handling from functions with bigger objectives,
> such as sctp_assoc_add_peer(), to specific functions.
> 
> It's also a preparation for the next patchset, which removes the
> duplication between sctp_make_op_error_space and
> sctp_make_op_error_fixed and relies on sctp_mtu_payload introduced here.
> 
> More details on each patch.

Series applied, thanks!


Re: Suggestions on iterating eBPF maps

2018-04-27 Thread Chenbo Feng
resend with  plain text

On Fri, Apr 27, 2018 at 11:22 AM Chenbo Feng  wrote:

> Hi net-next,

> When doing eBPF tools user-space development I noticed that the map
iterating process in user space has some small flaws. If we want to dump
the whole map, the only way I know of now is to use a null key to start the
iteration and keep calling bpf_get_next_key and bpf_lookup_elem for each
new key/value pair until we reach the end of the map. I noticed the
recently added bpftool uses a similar approach.

> The overhead of repeating syscalls is acceptable, but the race problem that
comes with this iteration process is a little annoying. If the current key
we are using gets deleted before we do the syscall to get the next key, the
next key returned will start from the beginning of the map again, and some
entries will be dumped again, depending on the position of the deleted key.
If the racing problem is within the same userspace process, it can easily be
fixed by adding some read/write locks. However, if multiple processes are
reading the map through a pinned fd while one process is editing map entries
or the kernel program is deleting entries, it becomes harder to get a
consistent and correct map dump.
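
For reference, the iteration being described looks roughly like the sketch
below, using the libbpf wrappers from tools/lib/bpf. The key/value types, the
dump_map() helper and the assumption that the fd comes from something like
bpf_obj_get() on a pinned path are placeholders for this example:

#include <bpf/bpf.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Walk every entry of a map given its fd. */
static void dump_map(int map_fd)
{
	uint32_t key, next_key;
	uint64_t value;
	void *prev = NULL;	/* NULL asks the kernel for the first key */

	while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &value) == 0)
			printf("key %u -> %llu\n", next_key,
			       (unsigned long long)value);
		/* If 'next_key' is deleted by another thread, another
		 * process, or the kernel program before the next
		 * get_next_key call, the walk silently restarts from the
		 * beginning of the map and earlier entries are reported
		 * again -- the race described above. */
		key = next_key;
		prev = &key;
	}
	if (errno != ENOENT)	/* ENOENT simply means "no more keys" */
		perror("bpf_map_get_next_key");
}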

> We are wondering if there is already an implementation we didn't notice in
the mainline kernel that helps improve this iteration process and addresses
the racing problem mentioned above? If not, what can be done to address the
issue? One thing we came up with is to use a single-entry bpf map as a
cross-process lock to prevent multiple userspace processes from reading or
writing the other maps at the same time. But I don't know how safe this
solution is, since there will still be a race to read the lock map value and
set up the lock.

> Thanks
> Chenbo Feng

