[PATCH iproute2 V5 0/3] tc: Support for ip tunnel metadata set/unset/classify

2016-12-02 Thread Amir Vadai
Hi,

This short series adds support for matching and setting metadata for ip tunnel
shared devices using the TC subsystem, introduced in kernel 4.9 [1].

Applied and tested on top of commit b6c7fc61faab ("ss: print new tcp_info
fields: busy, rwnd-limited, sndbuf-limited times")


Example usage:

$ tc filter add dev vxlan0 protocol ip parent : \
flower \
  enc_src_ip 11.11.0.2 \
  enc_dst_ip 11.11.0.1 \
  enc_key_id 11 \
  dst_ip 11.11.11.1 \
action mirred egress redirect dev vnet0

$ tc filter add dev net0 protocol ip parent : \
flower \
  ip_proto 1 \
  dst_ip 11.11.11.2 \
action tunnel_key set \
  src_ip 11.11.0.1 \
  dst_ip 11.11.0.2 \
  id 11 \
action mirred egress redirect dev vxlan0

[1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'")

Thanks,
Amir

Changes from V4:
- Fix rebase conflicts for net-next

Changes from V3:
- Fix bad wording in the man page about the use of the 'unset' operation

Changes from V2:
- Use const where needed
- Don't lose return value
- Introduce rta_getattr_be16() and rta_getattr_be32()
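
  For readers without the patch at hand: the new getters presumably just wrap
  the existing host-order getters with a byte-order conversion. A minimal
  sketch of the assumed shape (not copied from the patch itself):

  #include <arpa/inet.h>      /* ntohs(), ntohl() */
  #include <libnetlink.h>     /* struct rtattr, rta_getattr_u16()/u32() */

  /* Assumed shape: fetch a 16/32-bit attribute stored in network byte
   * order and return it converted to host byte order. */
  static inline __u16 rta_getattr_be16(const struct rtattr *rta)
  {
          return ntohs(rta_getattr_u16(rta));
  }

  static inline __u32 rta_getattr_be32(const struct rtattr *rta)
  {
          return ntohl(rta_getattr_u32(rta));
  }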

Changes from V1:
- Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log
  and the man page tc-tunnel_key to reflect the fact that the 'unset'
  operation is not mandatory, and to describe when it might be needed.
- Rename the 'release' operation to 'unset'

Amir Vadai (3):
  libnetlink: Introduce rta_getattr_be*()
  tc/cls_flower: Classify packet in ip tunnels
  tc/act_tunnel: Introduce ip tunnel action

 bridge/fdb.c |   4 +-
 include/libnetlink.h |   9 ++
 include/linux/tc_act/tc_tunnel_key.h |  42 ++
 ip/iplink_geneve.c   |   2 +-
 ip/iplink_vxlan.c|   2 +-
 man/man8/tc-flower.8 |  17 ++-
 man/man8/tc-tunnel_key.8 | 112 +++
 tc/Makefile  |   1 +
 tc/f_flower.c|  84 +++-
 tc/m_tunnel_key.c| 258 +++
 10 files changed, 522 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/tc_act/tc_tunnel_key.h
 create mode 100644 man/man8/tc-tunnel_key.8
 create mode 100644 tc/m_tunnel_key.c

-- 
2.10.2



Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size

2016-12-02 Thread Ivan Khoronzhuk
On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
> Add optional property "descs_pool_size" to specify buffer descriptor's
> pool size. The "descs_pool_size" should define total number of CPDMA
> CPPI descriptors to be used for both ingress/egress packets
> processing. If not specified - the default value 256 will be used
> which will allow to place descriptor's pool into the internal CPPI
> RAM on most of TI SoC.
> 
> Signed-off-by: Grygorii Strashko 
> ---
>  Documentation/devicetree/bindings/net/cpsw.txt | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt 
> b/Documentation/devicetree/bindings/net/cpsw.txt
> index 5ad439f..b99d196 100644
> --- a/Documentation/devicetree/bindings/net/cpsw.txt
> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
> @@ -35,6 +35,11 @@ Optional properties:
> For example in dra72x-evm, pcf gpio has to be
> driven low so that cpsw slave 0 and phy data
> lines are connected via mux.
> +- descs_pool_size: total number of CPDMA CPPI descriptors to be used for
> +   both ingress/egress packets processing. if not
> +   specified the default value 256 will be used which
> +   will allow to place descriptors pool into the
> +   internal CPPI RAM.
Does it describe h/w? Why not a module parameter, or even something like the
ethtool ring-entries setting?

>  
>  Slave Properties:
> -- 
> 2.10.1
> 


Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Hannes Frederic Sowa
On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
> On Thu, 1 Dec 2016 13:51:32 -0800
> Tom Herbert  wrote:
> 
 The technical plenary at the last IETF in Seoul a couple of weeks ago was
 exclusively focussed on DDOS in light of the recent attack against
 Dyn. There were speakers from Cloudflare and Dyn. The Cloudflare
 presentation by Nick Sullivan
 (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
 alluded to some implementation of DDOS mitigation. In particular, on
 slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> 
> slide 14
> 
 numbers he gave were based on iptables+BPF and that was a whole
 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic
 and that's also when I introduced XDP to the whole IETF :-) ). If that's
 the best we can do, the Internet is in a world of hurt. DDOS mitigation
 alone is probably a sufficient motivation to look at XDP. We need
 something that drops bad packets as quickly as possible when under
 attack, we need this to be integrated into the stack, we need it to be
 programmable to deal with the increasing savvy of attackers, and we
 don't want to be forced to be dependent on HW solutions. This is why
 we created XDP!  
> 
> The 1.2Mpps number is a bit low, but we are unfortunately in that
> ballpark.
> 
>>> I totally understand that. But in my reply to David in this thread I
>>> mentioned DNS apex processing as being problematic, which is actually
>>> referred to in your linked slide deck on page 9 ("What do floods look
>>> like"), and the problem of parsing DNS packets in XDP due to string
>>> processing and looping inside eBPF.
> 
> That is a weak argument. You do realize CloudFlare actually uses eBPF to
> do this exact filtering, and (so far) eBPF for parsing DNS has been
> sufficient for them.

You are talking about this code on the following slides (I actually
transcribed it for you here and disassembled):

l0: ld #0x14                  ; A = 20 (8-byte UDP header + 12-byte DNS header)
l1: ldxb 4*([0]&0xf)          ; X = IP header length
l2: add x                     ; A = IP header length + 20
l3: tax                       ; X = offset of the DNS query name
l4: ld [x+0]
l5: jeq #0x7657861, l6, l13   ; "\x07exa"
l6: ld [x+4]
l7: jeq #0x6d706c65, l8, l13  ; "mple"
l8: ld [x+8]
l9: jeq #0x3636f6d, l10, l13  ; "\x03com"
l10:ldb [x+12]
l11:jeq #0, l12, l13          ; root label: the query name is "example.com"
l12:ret #0x1                  ; match
l13:ret #0                    ; no match

You can offload this to u32 in hardware if that is what you want.

The reason this works is because of netfilter, which allows them to
dynamically generate BPF programs and insert and delete them from
chains, do intersection or unions of them.

If you have a freestanding program like in XDP the complexity space is a
different one and not comparable to this at all.
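
For comparison, a freestanding XDP program is a small restricted-C object
attached at the driver hook that returns a per-packet verdict. A minimal
sketch (libbpf-style section annotations assumed; not code from this thread):

/* Minimal XDP sketch: drop every packet at the earliest possible point.
 * A real DDOS filter would parse headers first and drop only matches. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>    /* SEC(); libbpf conventions assumed */

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;        /* other verdicts: XDP_PASS, XDP_TX, XDP_ABORTED */
}

char _license[] SEC("license") = "GPL";

The interesting part for this discussion is everything that would have to
happen before that return statement: parsing and string matching under the
verifier's constraints.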

Bye,
Hannes



[PATCH iproute2/net-next] ss: initialise variables outside of for loop

2016-12-02 Thread Simon Horman
Initialise loop variables outside of for loops. GCC flags declarations
inside the for statement as out of spec unless C99 or C11 mode is used.

With this change the entire tree appears to compile cleanly with -Wall.

$ gcc --version
gcc (Debian 4.9.2-10) 4.9.2
...
$ make
...
ss.c: In function ‘unix_show_sock’:
ss.c:3128:4: error: ‘for’ loop initial declarations are only allowed in C99 or 
C11 mode
...

Signed-off-by: Simon Horman 
---
 misc/ss.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 839781ee29bc..ce0b9d3d993d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -3124,10 +3124,12 @@ static int unix_show_sock(const struct sockaddr_nl 
*addr, struct nlmsghdr *nlh,
 
memcpy(name, RTA_DATA(tb[UNIX_DIAG_NAME]), len);
name[len] = '\0';
-   if (name[0] == '\0')
-   for (int i = 0; i < len; i++)
+   if (name[0] == '\0') {
+   int i;
+   for (i = 0; i < len; i++)
if (name[i] == '\0')
name[i] = '@';
+   }
stat.name = &name[0];
memcpy(stat.local.data, &stat.name, sizeof(stat.name));
}
-- 
2.7.0.rc3.207.g0ac5344



Re: [PATCH net v3] tipc: check minimum bearer MTU

2016-12-02 Thread Ying Xue

On 12/02/2016 04:33 PM, Michal Kubecek wrote:

Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek 
Reported-by: Qian Zhang (张谦) 
---


Thanks, it looks nice to me.

Acked-by: Ying Xue 



Re: Initial thoughts on TXDP

2016-12-02 Thread Jesper Dangaard Brouer

On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert  wrote:
> On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal  wrote:
> > Tom Herbert  wrote:  
[...]
> >>   - Call into TCP/IP stack with page data directly from driver-- no
> >> skbuff allocation or interface. This is essentially provided by the
> >> XDP API although we would need to generalize the interface to call
> >> stack functions (I previously posted patches for that). We will also
> >> need a new action, XDP_HELD?, that indicates the XDP function held the
> >> packet (put on a socket for instance).  
> >
> > Seems this will not work at all with the planned page pool thing when
> > pages start to be held indefinitely.

It is quite the opposite: the page pool supports pages being held for
longer times than drivers do today.  The current driver page recycle
tricks cannot, as they depend on the page refcnt being decremented quickly
(while pages are still mapped in their recycle queue).

> > You can also never get even close to userspace offload stacks once you
> > need/do this; allocations in hotpath are too expensive.

Yes. It is important to understand that once the number of outstanding
pages gets large, the driver recycle stops working.  Meaning the page
allocations start to go through the page allocator.  I've documented[1]
that the bare alloc+free cost[2] (231 cycles order-0/4K) is higher than
the 10G wirespeed budget (201 cycles).

Thus, the driver recycle tricks are nice for benchmarking, as they hide
the page allocator overhead. But this optimization might disappear for
Tom's and Eric's more real-world use-cases, e.g. 10,000 sockets.
The page pool doesn't have these issues.

[1] 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
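
The 201-cycle budget quoted above is straightforward arithmetic; a small
self-contained check (the 3 GHz clock is an assumption, not something stated
in this mail):

#include <stdio.h>

int main(void)
{
        /* Minimum-size Ethernet frame (64 bytes incl. FCS) plus 20 bytes of
         * per-packet overhead (preamble, SFD, inter-frame gap). */
        double bits_per_pkt = (64 + 20) * 8;    /* 672 bits on the wire */
        double pps = 10e9 / bits_per_pkt;       /* ~14.88 Mpps at 10 Gbit/s */
        double ns_per_pkt = 1e9 / pps;          /* ~67.2 ns per packet */
        double cycles = ns_per_pkt * 3.0;       /* ~201 cycles at an assumed 3 GHz */

        printf("%.2f Mpps, %.1f ns/packet, ~%.1f cycles/packet\n",
               pps / 1e6, ns_per_pkt, cycles);
        return 0;
}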

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2] sh_eth: remove unchecked interrupts

2016-12-02 Thread Geert Uytterhoeven
Hi Chris,

On Thu, Dec 1, 2016 at 7:53 PM, Chris Brandt  wrote:
> On 12/1/2016, Sergei Shtylyov wrote:
>>
>> On 12/01/2016 05:42 PM, Geert Uytterhoeven wrote:
>>
>> >> --- a/drivers/net/ethernet/renesas/sh_eth.c
>> >> +++ b/drivers/net/ethernet/renesas/sh_eth.c
>> >> @@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
>> >>
>> >> .ecsr_value = ECSR_ICD,
>> >> .ecsipr_value   = ECSIPR_ICDIP,
>> >> -   .eesipr_value   = 0xff7f009f,
>> >> +   .eesipr_value   = 0xe77f009f,
>> >
>> > Comment not directly related to the merits of this patch: the EESIPR
>> > bit definitions seem to be identical to those for EESR, so we can get
>> > rid of the hardcoded values here?
>>
>> Do you mean using the #define's? We have EESIPR bits also declared,
>> see enum DMAC_IM_BIT,

Yes, that's what I meant.
Unfortunately the DMAC_IM_BIT enum doesn't cover all bits.

> Is your idea to get rid of .eesipr_value for devices that have values
> that are the same as .eesr_err_check?
>
>
> For example in sh_eth_dev_init():
>
> sh_eth_modify(ndev, EESR, 0, 0);
> mdp->irq_enabled = true;
> -   sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
> +   if (mdp->cd->eesipr_value)
> +   sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR);
> +   else
> +   sh_eth_write(ndev, mdp->cd->eesr_err_check, EESIPR);

No, my intention was to just get rid of the hardcoded values when
initializing .eesipr_value.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.

2016-12-02 Thread Pavel Machek
Hi!

> >Well, if you have a workload that sends and receive packets, it tends
> >to work ok, as you do tx_clean() in stmmac_poll(). My workload is not
> >like that -- it is "sending packets at 3MB/sec, receiving none". So
> >the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled,
> >and then we run out of transmit descriptors, and then 40msec passes,
> >and then we clean them. Bad.
> >
> >And that's why low-res timers do not cut it.
> 
> in that case, I expect that the tuning of the driver could help you.
> I mean, by using ethtool, it could be enough to set the IC bit on all
> the descriptors. You should touch the tx_coal_frames.
> 
> Then you can use ethtool -S to monitor the status.

Yes, I did something similar. Unfortunately that meant a crash within
minutes, at least with the 4.4 kernel. (If you know what was fixed between
4.4 and 4.9, that would be helpful.)

> We had experimented this tuning on STB IP where just datagrams
> had to send externally. To be honest, although we had seen
> better results w/o any timer, we kept this approach enabled
> because the timer was fast enough to cover our tests on SH4 boxes.

Please reply to David, and explain how it is supposed to
work... because right now it does not. 40 msec delays are not
acceptable in default configuration.

> >>In the ring, some descriptors can raise the irq (according to a
> >>threshold) and set the IC bit. In this path, the NAPI  poll will be
> >>scheduled.
> >
> >Not NAPI poll but stmmac_tx_timer(), right?
> 
> in the xmit according the the threshold the timer is started or the
> interrupt is set inside the descriptor.
> Then stmmac_tx_clean will be always called and, if you see the flow,
> no irqlock protection is needed!

Agreed that no irqlock protection is needed if we rely on napi and timers.

> >>Concerning the lock protection, we had reviewed long time ago and
> >>IIRC, no raise condition should be present. Open to review it,
> >>again!
...
> >There's nothing that protect stmmac_poll() from running concurently
> >with stmmac_dma_interrupt(), right?
> 
> This is not necessary.

dma_interrupt accesses the shared priv->xstats; the variables are of type
unsigned long (not atomic_t), yet they are accessed from interrupt
context and from stmmac_ethtool without any locking. That can result
in broken statistics AFAICT.

Please take another look. As far as I can tell, you can have two CPUs
at #1 and #2 in the code at the same time. It looks like napi_... has
some atomic operations inside, so that looks safe at first glance. But
I'm not sure whether they also include enough memory barriers to make it
safe...?


static void stmmac_dma_interrupt(struct stmmac_priv *priv)
{
...
status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats);
if (likely((status & handle_rx)) || (status & handle_tx)) {
if (likely(napi_schedule_prep(&priv->napi))) {
#1
stmmac_disable_dma_irq(priv);
__napi_schedule(&priv->napi);
}
}


static int stmmac_poll(struct napi_struct *napi, int budget)
{
struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi);
int work_done = 0;

priv->xstats.napi_poll++;
stmmac_tx_clean(priv);

work_done = stmmac_rx(priv, budget);
if (work_done < budget) {
napi_complete(napi);
#2
stmmac_enable_dma_irq(priv);
}
return work_done;
}
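
One common way to make such counters safe without adding a lock is to make
them atomic (or per-CPU, folded at read time). A minimal illustration with
made-up names, not the real stmmac fields:

#include <linux/atomic.h>

/* Illustration only -- not the stmmac stats layout.  Counters that may be
 * bumped from IRQ/NAPI context on one CPU while ethtool reads them on
 * another can use atomic_long_t to avoid lost or torn updates. */
struct demo_xstats {
        atomic_long_t napi_poll;
        atomic_long_t tx_clean;
};

static inline void demo_count_napi_poll(struct demo_xstats *x)
{
        atomic_long_inc(&x->napi_poll);         /* safe from any context */
}

static inline unsigned long demo_read_napi_poll(struct demo_xstats *x)
{
        return atomic_long_read(&x->napi_poll); /* e.g. the ethtool path */
}

Per-CPU counters with u64_stats_sync are the other common pattern when the
counters sit on a hot path.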


Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


net/can: warning in raw_setsockopt/__alloc_pages_slowpath

2016-12-02 Thread Andrey Konovalov
Hi!

I've got the following error report while running the syzkaller fuzzer.

A reproducer is attached.

On commit d8e435f3ab6fea2ea324dce72b51dd7761747523 (Nov 26).

[ cut here ]
WARNING: CPU: 0 PID: 4009 at mm/page_alloc.c:3511
__alloc_pages_slowpath+0x3d4/0x1bf0
Modules linked in:
CPU: 0 PID: 4009 Comm: a.out Not tainted 4.9.0-rc6+ #54
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88006832f8a8 81c73b14  
 842c6320  88006832f8f0 8123dc57
 880067d86000 0db7 842c6320 0db7
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [] __warn+0x1a7/0x1f0 kernel/panic.c:550
 [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [] __alloc_pages_slowpath+0x3d4/0x1bf0 mm/page_alloc.c:3511
 [] __alloc_pages_nodemask+0x5c2/0x710 mm/page_alloc.c:3781
 [] alloc_pages_current+0xf4/0x400 mm/mempolicy.c:2072
 [< inline >] alloc_pages ./include/linux/gfp.h:469
 [] kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
 [] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
 [< inline >] kmalloc_large ./include/linux/slab.h:422
 [] __kmalloc_track_caller+0x227/0x2a0 mm/slub.c:4233
 [] memdup_user+0x2c/0xa0 mm/util.c:137
 [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506
 [< inline >] SYSC_setsockopt net/socket.c:1757
 [] SyS_setsockopt+0x154/0x240 net/socket.c:1736
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
arch/x86/entry/entry_64.S:209
---[ end trace bc80556cca970089 ]---
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_setsockopt
#define __NR_setsockopt 54
#endif
#ifndef __NR_socket
#define __NR_socket 41
#endif

#define _GNU_SOURCE

/* Header set below is a reconstruction covering what the reproducer
 * uses; the original list was stripped in the archive. */
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>

const int kFailStatus = 67;
const int kErrorStatus = 68;
const int kRetryStatus = 69;

__attribute__((noreturn)) void fail(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  exit(kFailStatus);
}

__attribute__((noreturn)) void exitf(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  exit(kRetryStatus);
}

static int flag_debug;

void debug(const char* msg, ...)
{
  if (!flag_debug)
return;
  va_list args;
  va_start(args, msg);
  vfprintf(stdout, msg, args);
  va_end(args);
  fflush(stdout);
}

__thread int skip_segv;
__thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* uctx)
{
  if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED))
_longjmp(segv_env, 1);
  exit(sig);
}

static void install_segv_handler()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_NODEFER | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...)\
  {\
__atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST);   \
if (_setjmp(segv_env) == 0) {  \
  __VA_ARGS__; \
}  \
__atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST);   \
  }

static uintptr_t execute_syscall(int nr, uintptr_t a0, uintptr_t a1,
 uintptr_t a2, uintptr_t a3,
 uintptr_t a4, uintptr_t a5,
 uintptr_t a6, uintptr_t a7,
 uintptr_t a8)
{
  switch (nr) {
  default:
return syscall(nr, a0, a1, a2, a3, a4, a5);
  }
}

static void setup_main_process(uint64_t pid)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8);
  syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8);
  install_segv_handler();

  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir)
fail("failed to mkdtemp");
  if (chmod(tmpdir, 0777))
fail("failed to chmod");
  if (chdir(tmpdir))
fail("failed to chdir");
}

static void loop();

static void sandbox_common()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  setsid();

  struct rlimit rlim;
  rlim.rlim_cur = rlim.rlim_max = 128 << 20;
  setrlimit(RLIMIT_AS, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;

arp_filter and IPv6 ND

2016-12-02 Thread Saku Ytti
Hey,

net.ipv4.conf.all.arp_filter appears not to have an IPv6 counterpart.
Or am I missing something? That is, Linux does answer ND queries for
unrelated interfaces by default, and I can't seem to find a way to turn
that off.

Is it proper maintainership to accept changes to a single protocol,
without mandating support for the other protocol that has the same
behavioural characteristics?

It is good that some parts of ARP and ND share common code in Linux
(neighbour.c), unlike in BSD where everything seems to be
self-contained.
I wish even more of ARP/ND were common, because there is still a lot of
common behavioural code in the ARP/ND code itself, which requires double
maintenance and is implemented by different people at different times,
leading to different sets of bugs and behaviour for the same intended
behaviour.

For example, this feature should be protocol agnostic: a developer should
only need to implement it once in the higher-level behavioural code,
without minding which IP AFI it is for. Obviously that does not
exclude the ability to configure it on/off per AFI via sysctl.

Thanks!
-- 
  ++ytti


Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error

2016-12-02 Thread Guilherme G. Piccoli
On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote:
> Whenever the igb driver detects the result of a read operation returns
> a value composed only by F's (like 0x), it will detach the
> net_device, clear the hw_addr pointer and warn to the user that adapter's
> link is lost - those steps happen on igb_rd32().
> 
> In case a PCI error happens on Power architecture, there's a recovery
> mechanism called EEH, that will reset the PCI slot and call driver's
> handlers to reset the adapter and network functionality as well.
> 
> We observed that once hw_addr is NULL after the error is detected on
> igb_rd32(), it's never assigned back, so in the process of resetting
> the network functionality we got a NULL pointer dereference in both
> igb_configure_tx_ring() and igb_configure_rx_ring(). In order to avoid
> such bug, this patch re-assigns the hw_addr value in the slot_reset
> handler.
> 
> Reported-by: Anthony H. Thai 
> Reported-by: Harsha Thyagaraja 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index edc9a6a..136ee9e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -7878,6 +7878,11 @@ static pci_ers_result_t igb_io_slot_reset(struct 
> pci_dev *pdev)
>   pci_enable_wake(pdev, PCI_D3hot, 0);
>   pci_enable_wake(pdev, PCI_D3cold, 0);
> 
> + /* In case of PCI error, adapter lose its HW address
> +  * so we should re-assign it here.
> +  */
> + hw->hw_addr = adapter->io_addr;
> +
>   igb_reset(adapter);
>   wr32(E1000_WUS, ~0);
>   result = PCI_ERS_RESULT_RECOVERED;
> 

Ping?

Sorry to annoy, any news on this?
Thanks in advance!

Cheers,


Guilherme



Re: Initial thoughts on TXDP

2016-12-02 Thread Jesper Dangaard Brouer
On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa  wrote:

> Side note:
> 
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> >  320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
base performance/cost much (at least for small objects, if < 1024).

Do notice the base SLUB alloc+free cost is fairly high (compared to a
201-cycle budget). Especially for networking, as the free side is very
likely to hit a slow path.  The SLUB fast path is 53 cycles, and the slow
path around 100 cycles (data from [1]).  I've tried to address this with
the kmem_cache bulk APIs, which reduce the cost to approx 30 cycles.
(Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511
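
For reference, the bulk APIs referred to are kmem_cache_alloc_bulk() and
kmem_cache_free_bulk(); a minimal usage sketch (the cache and the batch
size are arbitrary examples):

#include <linux/errno.h>
#include <linux/slab.h>

#define DEMO_BATCH      16

/* Sketch: allocate and free a batch of objects in one call each, which
 * amortizes the per-object SLUB fast-path cost mentioned above. */
static int demo_bulk_roundtrip(struct kmem_cache *cache)
{
        void *objs[DEMO_BATCH];
        int got;

        /* Returns the number of objects allocated, or 0 on failure. */
        got = kmem_cache_alloc_bulk(cache, GFP_ATOMIC, DEMO_BATCH, objs);
        if (!got)
                return -ENOMEM;

        /* ... use the objects ... */

        kmem_cache_free_bulk(cache, got, objs);
        return 0;
}

The ~30 cycle figure above refers to this kind of batched pattern.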

> >> >  
> > It's the zero'ing of three cache lines. I believe we talked about that
> > as netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes (3 cache-lines).

> 
> Jesper and me played with that again very recently:
> 
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
> 
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by gcc builtin but plain movq's. Probably the cost model
> for __builtin_memset in gcc is wrong?

Yes, I believe so.
 
> When Jesper is free we wanted to benchmark this and maybe come up with an
> arch-specific way of clearing if it turns out to really improve throughput.
> 
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was, that on my skylake CPU (i7-6700K @4.00GHz),
Hannes's hand-optimized MOVQ ASM-code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.

Talking to Alex and John during netdev, and reading up on the Intel arch,
I thought that this CPU should be able to perform 16 bytes per cycle.
The CPU can do it as the rep-stos show this once the size gets large
enough.

On this CPU the memset rep stos starts to win around 512 bytes:

 192/35 =  5.5 bytes/cycle
 256/36 =  7.1 bytes/cycle
 512/40 = 12.8 bytes/cycle
 768/46 = 16.7 bytes/cycle
1024/52 = 19.7 bytes/cycle
2048/84 = 24.4 bytes/cycle
4096/148= 27.7 bytes/cycle

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath

2016-12-02 Thread Marc Kleine-Budde
On 12/02/2016 01:43 PM, Andrey Konovalov wrote:
> Hi!
> 
> I've got the following error report while running the syzkaller fuzzer.
> 
> A reproducer is attached.
> 
> On commit d8e435f3ab6fea2ea324dce72b51dd7761747523 (Nov 26).
> 
> [ cut here ]
> WARNING: CPU: 0 PID: 4009 at mm/page_alloc.c:3511
> __alloc_pages_slowpath+0x3d4/0x1bf0
> Modules linked in:
> CPU: 0 PID: 4009 Comm: a.out Not tainted 4.9.0-rc6+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  88006832f8a8 81c73b14  
>  842c6320  88006832f8f0 8123dc57
>  880067d86000 0db7 842c6320 0db7
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
>  [] __warn+0x1a7/0x1f0 kernel/panic.c:550
>  [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
>  [] __alloc_pages_slowpath+0x3d4/0x1bf0 mm/page_alloc.c:3511
>  [] __alloc_pages_nodemask+0x5c2/0x710 mm/page_alloc.c:3781
>  [] alloc_pages_current+0xf4/0x400 mm/mempolicy.c:2072
>  [< inline >] alloc_pages ./include/linux/gfp.h:469
>  [] kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
>  [] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
>  [< inline >] kmalloc_large ./include/linux/slab.h:422
>  [] __kmalloc_track_caller+0x227/0x2a0 mm/slub.c:4233
>  [] memdup_user+0x2c/0xa0 mm/util.c:137
>  [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506

We should add a check for a sensible optlen

> static int raw_setsockopt(struct socket *sock, int level, int optname,
> char __user *optval, unsigned int optlen)
> {
>   struct sock *sk = sock->sk;
>   struct raw_sock *ro = raw_sk(sk);
>   struct can_filter *filter = NULL;  /* dyn. alloc'ed filters */
>   struct can_filter sfilter; /* single filter */
>   struct net_device *dev = NULL;
>   can_err_mask_t err_mask = 0;
>   int count = 0;
>   int err = 0;
> 
>   if (level != SOL_CAN_RAW)
>   return -EINVAL;
> 
>   switch (optname) {
> 
>   case CAN_RAW_FILTER:
>   if (optlen % sizeof(struct can_filter) != 0)
>   return -EINVAL;

here...

if (optlen > 64 * sizeof(struct can_filter))
return -EINVAL;

> 
>   count = optlen / sizeof(struct can_filter);
> 
>   if (count > 1) {
>   /* filter does not fit into dfilter => alloc space */
>   filter = memdup_user(optval, optlen);
>   if (IS_ERR(filter))
>   return PTR_ERR(filter);
>   } else if (count == 1) {
>   if (copy_from_user(&sfilter, optval, sizeof(sfilter)))
>   return -EFAULT;
>   }
> 
>   lock_sock(sk);
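
Pulled out as a standalone helper, the combined check would look roughly
like this (the 64-filter cap is only the example value above, not
necessarily the final limit):

#include <linux/can.h>          /* struct can_filter */
#include <linux/errno.h>

/* Sketch: validate CAN_RAW_FILTER optlen before memdup_user(), so a
 * huge user-supplied length can never trigger a huge allocation. */
static int raw_filter_optlen_ok(unsigned int optlen)
{
        if (optlen % sizeof(struct can_filter) != 0)
                return -EINVAL;
        if (optlen > 64 * sizeof(struct can_filter))    /* example cap */
                return -EINVAL;
        return 0;
}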

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



signature.asc
Description: OpenPGP digital signature


Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.

2016-12-02 Thread Giuseppe CAVALLARO

On 12/2/2016 1:32 PM, Pavel Machek wrote:

Hi!


Well, if you have a workload that sends and receive packets, it tends
to work ok, as you do tx_clean() in stmmac_poll(). My workload is not
like that -- it is "sending packets at 3MB/sec, receiving none". So
the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled,
and then we run out of transmit descriptors, and then 40msec passes,
and then we clean them. Bad.

And that's why low-res timers do not cut it.


in that case, I expect that the tuning of the driver could help you.
I mean, by using ethtool, it could be enough to set the IC bit on all
the descriptors. You should touch the tx_coal_frames.

Then you can use ethtool -S to monitor the status.


Yes, I did something similar. Unfortunately that meant a crash within
minutes, at least with the 4.4 kernel. (If you know what was fixed between
4.4 and 4.9, that would be helpful.)


4.4 has no GMAC4 support.
Alex, do you remember any patches to fix that?


We had experimented this tuning on STB IP where just datagrams
had to send externally. To be honest, although we had seen
better results w/o any timer, we kept this approach enabled
because the timer was fast enough to cover our tests on SH4 boxes.


Please reply to David, and explain how it is supposed to
work... because right now it does not. 40 msec delays are not
acceptable in default configuration.


I mean that on UP and SMP systems this scheme helped to improve
performance while saving CPU on my side, and it has been
tested for a long time (~4 years).
I tested something similar to yours, where unidirectional traffic
with limited throughput was needed, and I can confirm that
tuning/removing the coalesce parameters helped. The tuning I decided
to keep in the driver was suitable in several use cases, and if you now
have problems or want to review it, I can just confirm that
there are no problems on my side. If you want to simplify the logic
around the tx process and remove the timer in the official driver, I can
accept that. I will just ask you to double check that the throughput and
CPU usage when requesting max throughput (better if on a GiGa setup) show
no regressions.
Otherwise we could start thinking about an adaptive scheme, if feasible.


In the ring, some descriptors can raise the irq (according to a
threshold) and set the IC bit. In this path, the NAPI  poll will be
scheduled.


Not NAPI poll but stmmac_tx_timer(), right?


in the xmit according the the threshold the timer is started or the
interrupt is set inside the descriptor.
Then stmmac_tx_clean will be always called and, if you see the flow,
no irqlock protection is needed!


Agreed that no irqlock protection is needed if we rely on napi and timers.


ok




Concerning the lock protection, we had reviewed long time ago and
IIRC, no raise condition should be present. Open to review it,
again!

...

There's nothing that protect stmmac_poll() from running concurently
with stmmac_dma_interrupt(), right?


This is not necessary.


dma_interrupt accesses shared priv->xstats; variables are of type
unsigned long (not atomic_t), yet they are accesssed from interrupt
context and from stmmac_ethtool without any locking. That can result
in broken statistics AFAICT.


OK, we can check this; patches are welcome. I'd prefer to
remove xstats from critical parts of the code like the ISR (that
comes from the old history of the driver).



Please take another look. As far as I can tell, you can have two cpus
at #1 and #2 in the code, at the same time. It looks like napi_... has
some atomic opertions inside so that looks safe at the first look. But
I'm not sure if they also include enough memory barriers to make it
safe...?


Although I have never reproduced related issues on SMP platforms
due to reordering of memory operations, as said above, a review of this
is welcome, especially if you are seeing problems when managing NAPI.

FYI, until now the only memory barriers you will see in the driver are
around the OWN_BIT setting.


static void stmmac_dma_interrupt(struct stmmac_priv *priv)
{
...
status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats);
if (likely((status & handle_rx)) || (status & handle_tx)) {
if (likely(napi_schedule_prep(&priv->napi))) {
#1
stmmac_disable_dma_irq(priv);
__napi_schedule(&priv->napi);
}
}


static int stmmac_poll(struct napi_struct *napi, int budget)
{
struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi);
int work_done = 0;

priv->xstats.napi_poll++;
stmmac_tx_clean(priv);

work_done = stmmac_rx(priv, budget);
if (work_done < budget) {
napi_complete(napi);
#2
stmmac_enable_dma_irq(priv);
}


hmm, I have to check (and refresh my memory) but the driver
uses the napi_schedule_prep.

Regards

Peppe


return work_done;
}


Best regards,
 

Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-12-02 Thread Robert Shearman

On 01/12/16 18:29, Alexander Duyck wrote:

On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman  wrote:

On 29/11/16 23:14, Alexander Duyck wrote:

On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman 


Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman 



So I am not entirely convinced this is a regression, I was wondering
what your numbers looked like before the patch you are saying this
fixes?  I recall I fixed a number of issues leading up to this so I am
just curious.



At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks
for index >= tnode_child_length from tnode_get_child") which is the parent
of 5405afd1a306:

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m3.451s
user	0m0.184s
sys	0m2.004s


At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking
value for suffix length"):

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m48.624s
user	0m0.360s
sys	0m46.960s

So does that warrant a fixes tag for this patch?


Yes, please include it in the patch description next time.

I've been giving it thought and the fact is the patch series has
merit.  I just think it was being a bit too aggressive in terms of
some of the changes.  So placing any changes in put_child seem to be
the one piece in this that make this a bit ugly.


Does that imply the current updating of the parent's slen value should 
be removed from put_child then?


I started out without doing the changes in put_child, but that meant the 
changes to push the slen up through the trie were spread all over the 
place. This seemed like the cleaner solution.



+   }
+}
+
 /* Add a child at position i overwriting the old value.
  * Update the value of full_children and empty_children.
+ *
+ * The suffix length of the parent node and the rest of the tree is
+ * updated (if required) when adding/replacing a node, but is caller's
+ * responsibility on removal.
  */
 static void put_child(struct key_vector *tn, unsigned long i,
  struct key_vector *n)
@@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned
long i,
else if (!wasfull && isfull)
tn_info(tn)->full_children++;

-   if (n && (tn->slen < n->slen))
-   tn->slen = n->slen;
+   if (n)
+   node_push_suffix(tn, n);



This is just creating redundant work if we have to perform a resize.



The only real redundant work is the assignment of slen in tn, since the
propagation up the trie has to happen regardless and if a subsequent resize
results in changes to the trie then the propagation will stop at tn's parent
since the suffix length of the tn's parent will not be less than tn's suffix
length.



This isn't going to work.  We want to avoid trying to push the suffix
when we place a child.  There are scenarios where we are placing
children in nodes that don't have parents yet because we are doing
rebalances and such.  I suspect this could be pretty expensive on a
resize call.


It's not expensive because the new parents that are being built up are 
orphaned from the rest of the trie, so the push up won't proceed into 
the rest of the trie until the new node is inserted into the trie.



We want to pull the suffix pushing out of the resize completely.  The
problem is this is going to cause us to start pushing suffixes for
every node moved as a part of resize.


This will mean that nodes will have to be visited multiple times, which 
could well be more expensive than doing it during the resize.







rcu_assign_pointer(tn->tnode[i], n);
 }


...


-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
*l)
+static void node_set_suffix(struct key_vector *tp, unsigned char slen)
 {
-   while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-   if (update_suffix(tp) > l->slen)
+   if (slen > tp->slen)
+   tp->slen = slen;
+}
+
+static void node_pull_suffix(struct key_vector *tn)
+{
+   struct key_vector *tp;
+   unsigned char slen;
+
+   slen = update_suffix(tn);
+   tp = node_parent(tn);
+   while ((tp->slen > tp->pos) && (tp->slen > slen)) {
+   if (update_suffix(tp) > slen)
break;
tp = node_parent(tp);
}
 }





Actually I realized that there is a bug here.  The check for
update_suffix(tp) > slen can cause us to stop prematurely if I am not
mistaken.  What we should probably be doing is pulling out tp->slen
and if the result of update_suffix is the same value then we can break
the loop.  Otherwise we can stop early if a "second runner up" shows
up that is supposed to be pushed up the trie.


Excellent point. This also

Aw: Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.

2016-12-02 Thread Lino Sanfilippo
Hi,


> 
> There's nothing that protect stmmac_poll() from running concurently
> with stmmac_dma_interrupt(), right?
> 

could it be that there is also another locking-related issue?
The tx completion handler takes the xmit_lock in case the
netif queue is stopped. This is AFAICS unnecessary, since both the
xmit and the completion handler are already synchronized by the private
tx lock. But it is IMHO also dangerous:

In the xmit handler we have the locking order
1. xmit_lock
2. private tx lock

while in the completion handler it's the reverse:

1. private tx lock
2. xmit lock.

Do I miss something?
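
Schematically the concern is the classic ABBA pattern; the sketch below is
only an illustration, the lock names are placeholders and not the stmmac
symbols:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_xmit_lock); /* stands in for the netif xmit lock  */
static DEFINE_SPINLOCK(demo_priv_lock); /* stands in for the private tx lock  */

static void demo_xmit_path(void)
{
        spin_lock(&demo_xmit_lock);     /* 1. xmit lock        */
        spin_lock(&demo_priv_lock);     /* 2. private tx lock  */
        /* ... queue a descriptor ... */
        spin_unlock(&demo_priv_lock);
        spin_unlock(&demo_xmit_lock);
}

static void demo_completion_path(void)
{
        spin_lock(&demo_priv_lock);     /* 1. private tx lock  */
        spin_lock(&demo_xmit_lock);     /* 2. xmit lock -> ABBA: can deadlock
                                         *    against demo_xmit_path() above */
        /* ... wake the queue ... */
        spin_unlock(&demo_xmit_lock);
        spin_unlock(&demo_priv_lock);
}

If both paths can really run concurrently, lockdep should flag exactly this
inversion.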


Regards,
Lino


Re: arp_filter and IPv6 ND

2016-12-02 Thread Hannes Frederic Sowa
On 02.12.2016 13:51, Saku Ytti wrote:
> net.ipv4.conf.all.arp_filter appears not to have IPv6 counter part.
> Or am I missing something? That is Linux does answer to ND queries for
> unrelated interfaces by default, and I can't seem to find way to turn
> that off.

May I ask why you want to turn it off?

In IPv6 this depends on the scope. In IPv4 this concept doesn't really
exist.

Please notice that in IPv4 arp_filter does not necessarily mean that the
system is operating in strong end system mode but you end up in an
hybrid clone where arp is acting strong but routing not and thus you
also have to add fib rules to simulate that.

> Is it proper maintainership to accept changes to single protocol,
> without mandating the support for other protocol having same
> behavioural characteristics?
> 
> It is good that some parts for ARP and ND have common code in linux
> (neighbour.c) unlike in BSD where everything seems to be
> self-contained.
>
> I'd wish that even more of ARP/ND would common, because there are
> still lot of common behavioural code in ARP/ND code itself, which
> requires double maintenance and are implemented by different people at
> different times, so leads to different set of bugs and behaviour for
> same intended behaviour.
> 
> For example this feature should be protocol agnostic, developer should
> only need to develop it once for the higher level behavioural code,
> without minding which IP AFI it is for. Obviously that does not
> exclude ability to sysctl configure it on/off per AFI.




Re: [PATCH] stmmac: reduce code duplication getting basic descriptors

2016-12-02 Thread Alexandre Torgue

Hi Pavel,

On 11/28/2016 01:17 PM, Pavel Machek wrote:


Remove code duplication getting basic descriptors.


I agree with your patch; it will make the code easier to understand.
After fixing the kbuild issue you can add my Acked-by.

Regards
Alex



Signed-off-by: Pavel Machek 

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index f7133d0..ed20668 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -929,6 +929,17 @@ static int stmmac_set_bfsize(int mtu, int bufsize)
return ret;
 }

+static inline struct dma_desc *stmmac_tx_desc(struct stmmac_priv *priv, int i)
+{
+   struct dma_desc *p;
+   if (priv->extend_desc)
+   p = &((priv->dma_etx + i)->basic);
+   else
+   p = priv->dma_tx + i;
+   return p;
+}
+
+
 /**
  * stmmac_clear_descriptors - clear descriptors
  * @priv: driver private structure
@@ -1078,11 +1089,7 @@ static int init_dma_desc_rings(struct net_device *dev, 
gfp_t flags)

/* TX INITIALIZATION */
for (i = 0; i < DMA_TX_SIZE; i++) {
-   struct dma_desc *p;
-   if (priv->extend_desc)
-   p = &((priv->dma_etx + i)->basic);
-   else
-   p = priv->dma_tx + i;
+   struct dma_desc *p = stmmac_tx_desc(priv, i);

if (priv->synopsys_id >= DWMAC_CORE_4_00) {
p->des0 = 0;
@@ -1129,12 +1136,7 @@ static void dma_free_tx_skbufs(struct stmmac_priv *priv)
int i;

for (i = 0; i < DMA_TX_SIZE; i++) {
-   struct dma_desc *p;
-
-   if (priv->extend_desc)
-   p = &((priv->dma_etx + i)->basic);
-   else
-   p = priv->dma_tx + i;
+   struct dma_desc *p = stmmac_tx_desc(priv, i);

if (priv->tx_skbuff_dma[i].buf) {
if (priv->tx_skbuff_dma[i].map_as_page)
@@ -1314,14 +1316,9 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv)

while (entry != priv->cur_tx) {
struct sk_buff *skb = priv->tx_skbuff[entry];
-   struct dma_desc *p;
+   struct dma_desc *p = stmmac_tx_desc(priv, entry);
int status;

-   if (priv->extend_desc)
-   p = (struct dma_desc *)(priv->dma_etx + entry);
-   else
-   p = priv->dma_tx + entry;
-
status = priv->hw->desc->tx_status(&priv->dev->stats,
  &priv->xstats, p,
  priv->ioaddr);
@@ -2227,11 +2224,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, 
struct net_device *dev)

csum_insertion = (skb->ip_summed == CHECKSUM_PARTIAL);

-   if (likely(priv->extend_desc))
-   desc = (struct dma_desc *)(priv->dma_etx + entry);
-   else
-   desc = priv->dma_tx + entry;
-
+   desc = stmmac_tx_desc(priv, entry);
first = desc;

priv->tx_skbuff[first_entry] = skb;
@@ -2254,10 +2247,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, 
struct net_device *dev)

entry = STMMAC_GET_ENTRY(entry, DMA_TX_SIZE);

-   if (likely(priv->extend_desc))
-   desc = (struct dma_desc *)(priv->dma_etx + entry);
-   else
-   desc = priv->dma_tx + entry;
+   desc = stmmac_tx_desc(priv, entry);

des = skb_frag_dma_map(priv->device, frag, 0, len,
   DMA_TO_DEVICE);



[PATCH 3/6] net: stmmac: stmmac_platform: fix parsing of DT binding

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with DT")
changed the parsing of the DT binding.

Before 64c3b252e9fc, snps,fixed-burst and snps,mixed-burst were parsed
regardless if the property snps,pbl existed or not.
After the commit, fixed burst and mixed burst are only parsed if
snps,pbl exists. Now when snps,aal has been added, it too is only
parsed if snps,pbl exists.

Since the DT binding does not specify that fixed burst, mixed burst
or aal depend on snps,pbl being specified, undo changes introduced
by 64c3b252e9fc.

The issue commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with
DT") tries to address is solved in another way:
The databook specifies that all values other than
1, 2, 4, 8, 16, or 32 results in undefined behavior,
so snps,pbl = <0> is invalid.

If pbl is 0 after parsing, set pbl to DEFAULT_DMA_PBL.
This handles the case where the property is omitted, and also handles
the case where the property is specified without any data.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  3 +++
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 27 +++---
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 27a03f7ee095..9ba2eb4c22fe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1585,6 +1585,9 @@ static int stmmac_init_dma_engine(struct stmmac_priv 
*priv)
return -EINVAL;
}
 
+   if (!priv->plat->dma_cfg->pbl)
+   priv->plat->dma_cfg->pbl = DEFAULT_DMA_PBL;
+
if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
atds = 1;
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index b46c892c7079..05b8c33effd5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -303,21 +303,20 @@ stmmac_probe_config_dt(struct platform_device *pdev, 
const char **mac)
plat->force_sf_dma_mode = 1;
}
 
-   if (of_find_property(np, "snps,pbl", NULL)) {
-   dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg),
-  GFP_KERNEL);
-   if (!dma_cfg) {
-   of_node_put(plat->phy_node);
-   return ERR_PTR(-ENOMEM);
-   }
-   plat->dma_cfg = dma_cfg;
-   of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl);
-   dma_cfg->aal = of_property_read_bool(np, "snps,aal");
-   dma_cfg->fixed_burst =
-   of_property_read_bool(np, "snps,fixed-burst");
-   dma_cfg->mixed_burst =
-   of_property_read_bool(np, "snps,mixed-burst");
+   dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg),
+  GFP_KERNEL);
+   if (!dma_cfg) {
+   of_node_put(plat->phy_node);
+   return ERR_PTR(-ENOMEM);
}
+   plat->dma_cfg = dma_cfg;
+
+   of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl);
+
+   dma_cfg->aal = of_property_read_bool(np, "snps,aal");
+   dma_cfg->fixed_burst = of_property_read_bool(np, "snps,fixed-burst");
+   dma_cfg->mixed_burst = of_property_read_bool(np, "snps,mixed-burst");
+
plat->force_thresh_dma_mode = of_property_read_bool(np, 
"snps,force_thresh_dma_mode");
if (plat->force_thresh_dma_mode) {
plat->force_sf_dma_mode = 0;
-- 
2.1.4



[PATCH 2/6] net: stmmac: simplify the common DMA init API

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

Use struct stmmac_dma_cfg *dma_cfg as an argument rather
than using all the struct members as individual arguments.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/common.h|  4 ++--
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c | 13 +++--
 drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c  |  7 ---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c| 14 --
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c   |  6 +-
 5 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 6d2de4e01f6d..3023ec7ae83e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -411,8 +411,8 @@ extern const struct stmmac_desc_ops ndesc_ops;
 struct stmmac_dma_ops {
/* DMA core initialization */
int (*reset)(void __iomem *ioaddr);
-   void (*init)(void __iomem *ioaddr, int pbl, int fb, int mb,
-int aal, u32 dma_tx, u32 dma_rx, int atds);
+   void (*init)(void __iomem *ioaddr, struct stmmac_dma_cfg *dma_cfg,
+u32 dma_tx, u32 dma_rx, int atds);
/* Configure the AXI Bus Mode Register */
void (*axi)(void __iomem *ioaddr, struct stmmac_axi *axi);
/* Dump DMA registers */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
index 990746955216..01d0d0f315e5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
@@ -82,8 +82,9 @@ static void dwmac1000_dma_axi(void __iomem *ioaddr, struct 
stmmac_axi *axi)
writel(value, ioaddr + DMA_AXI_BUS_MODE);
 }
 
-static void dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb,
-  int aal, u32 dma_tx, u32 dma_rx, int atds)
+static void dwmac1000_dma_init(void __iomem *ioaddr,
+  struct stmmac_dma_cfg *dma_cfg,
+  u32 dma_tx, u32 dma_rx, int atds)
 {
u32 value = readl(ioaddr + DMA_BUS_MODE);
 
@@ -99,20 +100,20 @@ static void dwmac1000_dma_init(void __iomem *ioaddr, int 
pbl, int fb, int mb,
 */
value |= DMA_BUS_MODE_MAXPBL;
value &= ~DMA_BUS_MODE_PBL_MASK;
-   value |= (pbl << DMA_BUS_MODE_PBL_SHIFT);
+   value |= (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT);
 
/* Set the Fixed burst mode */
-   if (fb)
+   if (dma_cfg->fixed_burst)
value |= DMA_BUS_MODE_FB;
 
/* Mixed Burst has no effect when fb is set */
-   if (mb)
+   if (dma_cfg->mixed_burst)
value |= DMA_BUS_MODE_MB;
 
if (atds)
value |= DMA_BUS_MODE_ATDS;
 
-   if (aal)
+   if (dma_cfg->aal)
value |= DMA_BUS_MODE_AAL;
 
writel(value, ioaddr + DMA_BUS_MODE);
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c
index 61f54c99a7de..e5664da382f3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c
@@ -32,11 +32,12 @@
 #include "dwmac100.h"
 #include "dwmac_dma.h"
 
-static void dwmac100_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb,
- int aal, u32 dma_tx, u32 dma_rx, int atds)
+static void dwmac100_dma_init(void __iomem *ioaddr,
+ struct stmmac_dma_cfg *dma_cfg,
+ u32 dma_tx, u32 dma_rx, int atds)
 {
/* Enable Application Access by writing to DMA CSR0 */
-   writel(DMA_BUS_MODE_DEFAULT | (pbl << DMA_BUS_MODE_PBL_SHIFT),
+   writel(DMA_BUS_MODE_DEFAULT | (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT),
   ioaddr + DMA_BUS_MODE);
 
/* Mask interrupts by writing to CSR7 */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index 577316de6ba8..0946546d6dcd 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -97,27 +97,29 @@ static void dwmac4_dma_init_channel(void __iomem *ioaddr, 
int pbl,
writel(dma_rx_phy, ioaddr + DMA_CHAN_RX_BASE_ADDR(channel));
 }
 
-static void dwmac4_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb,
-   int aal, u32 dma_tx, u32 dma_rx, int atds)
+static void dwmac4_dma_init(void __iomem *ioaddr,
+   struct stmmac_dma_cfg *dma_cfg,
+   u32 dma_tx, u32 dma_rx, int atds)
 {
u32 value = readl(ioaddr + DMA_SYS_BUS_MODE);
int i;
 
/* Set the Fixed burst mode */
-   if (fb)
+   if (dma_cfg->fixed_burst)
value |= DMA_SYS_BUS_FB;
 
/* Mixed Burst has no effect when fb is set */
-   if (mb)

[PATCH 4/6] net: stmmac: dwmac1000: fix define DMA_BUS_MODE_RPBL_MASK

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

DMA_BUS_MODE_RPBL_MASK is really 6 bits,
just like DMA_BUS_MODE_PBL_MASK.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index ff3e5ab39bd0..52b9407a8a39 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -225,7 +225,7 @@ enum rx_tx_priority_ratio {
 
 #define DMA_BUS_MODE_FB0x0001  /* Fixed burst */
 #define DMA_BUS_MODE_MB0x0400  /* Mixed burst */
-#define DMA_BUS_MODE_RPBL_MASK 0x003e  /* Rx-Programmable Burst Len */
+#define DMA_BUS_MODE_RPBL_MASK 0x007e  /* Rx-Programmable Burst Len */
 #define DMA_BUS_MODE_RPBL_SHIFT17
 #define DMA_BUS_MODE_USP   0x0080
 #define DMA_BUS_MODE_MAXPBL0x0100
-- 
2.1.4



[PATCH 1/6] net: stmmac: return error if no DMA configuration is found

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

All drivers except pci glue layer calls stmmac_probe_config_dt.
stmmac_probe_config_dt does a kzalloc dma_cfg.

pci glue layer does kzalloc dma_cfg explicitly, so all current
drivers does a kzalloc dma_cfg.

Return an error if no DMA configuration is found, that way
we can assume that the DMA configuration always exists.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 1f9ec02fa7f8..04f88df7da49 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1577,16 +1577,12 @@ static void stmmac_check_ether_addr(struct stmmac_priv 
*priv)
  */
 static int stmmac_init_dma_engine(struct stmmac_priv *priv)
 {
-   int pbl = DEFAULT_DMA_PBL, fixed_burst = 0, aal = 0;
-   int mixed_burst = 0;
int atds = 0;
int ret = 0;
 
-   if (priv->plat->dma_cfg) {
-   pbl = priv->plat->dma_cfg->pbl;
-   fixed_burst = priv->plat->dma_cfg->fixed_burst;
-   mixed_burst = priv->plat->dma_cfg->mixed_burst;
-   aal = priv->plat->dma_cfg->aal;
+   if (!priv->plat->dma_cfg) {
+   dev_err(priv->device, "DMA configuration not found\n");
+   return -EINVAL;
}
 
if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
@@ -1598,8 +1594,12 @@ static int stmmac_init_dma_engine(struct stmmac_priv 
*priv)
return ret;
}
 
-   priv->hw->dma->init(priv->ioaddr, pbl, fixed_burst, mixed_burst,
-   aal, priv->dma_tx_phy, priv->dma_rx_phy, atds);
+   priv->hw->dma->init(priv->ioaddr,
+   priv->plat->dma_cfg->pbl,
+   priv->plat->dma_cfg->fixed_burst,
+   priv->plat->dma_cfg->mixed_burst,
+   priv->plat->dma_cfg->aal,
+   priv->dma_tx_phy, priv->dma_rx_phy, atds);
 
if (priv->synopsys_id >= DWMAC_CORE_4_00) {
priv->rx_tail_addr = priv->dma_rx_phy +
-- 
2.1.4



[PATCH 5/6] net: stmmac: add support for independent DMA pbl for tx/rx

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

GMAC and newer supports independent programmable burst lengths for
DMA tx/rx. Add new optional devicetree properties representing this.

To be backwards compatible, snps,pbl will still be valid, but
snps,txpbl/snps,rxpbl will override the value in snps,pbl if set.

If the IP is synthesized to use the AXI interface, there is a register
and a matching DT property inside the optional stmmac-axi-config DT node
for controlling burst lengths, named snps,blen.
However, using this register, it is not possible to control tx and rx
independently. Also, this register is not available if the IP was
synthesized with, e.g., the AHB interface.

Signed-off-by: Niklas Cassel 
---
 Documentation/devicetree/bindings/net/stmmac.txt  |  6 +-
 Documentation/networking/stmmac.txt   | 19 +--
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c   | 12 ++--
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  | 12 +++-
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c |  2 ++
 include/linux/stmmac.h|  2 ++
 6 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/stmmac.txt 
b/Documentation/devicetree/bindings/net/stmmac.txt
index b95ff998ba73..8080038ff1b2 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -34,7 +34,11 @@ Optional properties:
   platforms.
 - tx-fifo-depth: See ethernet.txt file in the same directory
 - rx-fifo-depth: See ethernet.txt file in the same directory
-- snps,pbl Programmable Burst Length
+- snps,pbl Programmable Burst Length (tx and rx)
+- snps,txpbl   Tx Programmable Burst Length. Only for GMAC and newer.
+   If set, DMA tx will use this value rather than snps,pbl.
+- snps,rxpbl   Rx Programmable Burst Length. Only for GMAC and newer.
+   If set, DMA rx will use this value rather than snps,pbl.
 - snps,aal Address-Aligned Beats
 - snps,fixed-burst Program the DMA to use the fixed burst mode
 - snps,mixed-burst Program the DMA to use the mixed burst mode
diff --git a/Documentation/networking/stmmac.txt 
b/Documentation/networking/stmmac.txt
index e226f8925c9e..82c8e496b4bb 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -154,7 +154,8 @@ Where:
o pbl: the Programmable Burst Length is maximum number of beats to
be transferred in one DMA transaction.
GMAC also enables the 4xPBL by default.
-   o fixed_burst/mixed_burst/burst_len
+   o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx.
+   o fixed_burst/mixed_burst/aal
  o clk_csr: fixed CSR Clock range selection.
  o has_gmac: uses the GMAC core.
  o enh_desc: if sets the MAC will use the enhanced descriptor structure.
@@ -206,16 +207,22 @@ tuned according to the HW capabilities.
 
 struct stmmac_dma_cfg {
int pbl;
+   int txpbl;
+   int rxpbl;
int fixed_burst;
-   int burst_len_supported;
+   int mixed_burst;
+   bool aal;
 };
 
 Where:
- o pbl: Programmable Burst Length
+ o pbl: Programmable Burst Length (tx and rx)
+ o txpbl: Transmit Programmable Burst Length. Only for GMAC and newer.
+If set, DMA tx will use this value rather than pbl.
+ o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer.
+If set, DMA rx will use this value rather than pbl.
  o fixed_burst: program the DMA to use the fixed burst mode
- o burst_len: this is the value we put in the register
- supported values are provided as macros in
- linux/stmmac.h header file.
+ o mixed_burst: program the DMA to use the mixed burst mode
+ o aal: Address-Aligned Beats
 
 ---
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
index 01d0d0f315e5..1dd34fb4c1a9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
@@ -87,20 +87,20 @@ static void dwmac1000_dma_init(void __iomem *ioaddr,
   u32 dma_tx, u32 dma_rx, int atds)
 {
u32 value = readl(ioaddr + DMA_BUS_MODE);
+   int txpbl = dma_cfg->txpbl ?: dma_cfg->pbl;
+   int rxpbl = dma_cfg->rxpbl ?: dma_cfg->pbl;
 
/*
 * Set the DMA PBL (Programmable Burst Length) mode.
 *
 * Note: before stmmac core 3.50 this mode bit was 4xPBL, and
 * post 3.5 mode bit acts as 8*PBL.
-*
-* This configuration doesn't take care about the Separate PBL
-* so only the bits: 13-8 are programmed with the PBL passed from the
-* platform.
 */
value |= DMA_BUS_MODE_MAXPBL;
-   value &= ~DMA_BUS_MODE_PBL_MASK;
-   value |= (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT);
+   value |= DMA_BUS_MODE_USP;
+   val

[PATCH 6/6] net: stmmac: allow configuring lower pbl values

2016-12-02 Thread Niklas Cassel
From: Niklas Cassel 

The driver currently always sets the PBLx8/PBLx4 bit, which means that
the pbl values configured via the pbl/txpbl/rxpbl DT properties are
always multiplied by 8/4 in the hardware.

In order to allow the DT to configure lower pbl values, while at the
same time not changing behavior of any existing device trees using the
pbl/txpbl/rxpbl settings, add a property to disable the multiplication
of the pbl by 8/4 in the hardware.

Suggested-by: Rabin Vincent 
Signed-off-by: Niklas Cassel 
---
 Documentation/devicetree/bindings/net/stmmac.txt  | 2 ++
 Documentation/networking/stmmac.txt   | 5 -
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c   | 3 ++-
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  | 3 ++-
 drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c  | 2 ++
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 1 +
 include/linux/stmmac.h| 1 +
 7 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/stmmac.txt 
b/Documentation/devicetree/bindings/net/stmmac.txt
index 8080038ff1b2..128da752fec9 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -39,6 +39,8 @@ Optional properties:
If set, DMA tx will use this value rather than snps,pbl.
 - snps,rxpbl   Rx Programmable Burst Length. Only for GMAC and newer.
If set, DMA rx will use this value rather than snps,pbl.
+- snps,no-pbl-x8   Don't multiply the pbl/txpbl/rxpbl values by 8.
+   For core rev < 3.50, don't multiply the values by 4.
 - snps,aal Address-Aligned Beats
 - snps,fixed-burst Program the DMA to use the fixed burst mode
 - snps,mixed-burst Program the DMA to use the mixed burst mode
diff --git a/Documentation/networking/stmmac.txt 
b/Documentation/networking/stmmac.txt
index 82c8e496b4bb..d3376c5fbcf0 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -153,8 +153,9 @@ Where:
  o dma_cfg: internal DMA parameters
o pbl: the Programmable Burst Length is maximum number of beats to
be transferred in one DMA transaction.
-   GMAC also enables the 4xPBL by default.
+   GMAC also enables the 4xPBL by default. (8xPBL for GMAC 3.50 and newer)
o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx.
+   o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default.
o fixed_burst/mixed_burst/aal
  o clk_csr: fixed CSR Clock range selection.
  o has_gmac: uses the GMAC core.
@@ -209,6 +210,7 @@ struct stmmac_dma_cfg {
int pbl;
int txpbl;
int rxpbl;
+   bool pblx8;
int fixed_burst;
int mixed_burst;
bool aal;
@@ -220,6 +222,7 @@ Where:
 If set, DMA tx will use this value rather than pbl.
  o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer.
 If set, DMA rx will use this value rather than pbl.
+ o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default.
  o fixed_burst: program the DMA to use the fixed burst mode
  o mixed_burst: program the DMA to use the mixed burst mode
  o aal: Address-Aligned Beats
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
index 1dd34fb4c1a9..1d313af647b4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
@@ -96,7 +96,8 @@ static void dwmac1000_dma_init(void __iomem *ioaddr,
 * Note: before stmmac core 3.50 this mode bit was 4xPBL, and
 * post 3.5 mode bit acts as 8*PBL.
 */
-   value |= DMA_BUS_MODE_MAXPBL;
+   if (dma_cfg->pblx8)
+   value |= DMA_BUS_MODE_MAXPBL;
value |= DMA_BUS_MODE_USP;
value &= ~(DMA_BUS_MODE_PBL_MASK | DMA_BUS_MODE_RPBL_MASK);
value |= (txpbl << DMA_BUS_MODE_PBL_SHIFT);
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index 0bf47825bfeb..0f7110d19a4a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -82,7 +82,8 @@ static void dwmac4_dma_init_channel(void __iomem *ioaddr,
 * on each channel
 */
value = readl(ioaddr + DMA_CHAN_CONTROL(channel));
-   value = value | DMA_BUS_MODE_PBL;
+   if (dma_cfg->pblx8)
+   value = value | DMA_BUS_MODE_PBL;
writel(value, ioaddr + DMA_CHAN_CONTROL(channel));
 
value = readl(ioaddr + DMA_CHAN_TX_CONTROL(channel));
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index 56c8a2342c14..a2831773431a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -81,6 +81,7 @@ 

[PATCH] net: phy: dp83848: Support ethernet pause frames

2016-12-02 Thread Jesper Nilsson
According to the documentation, the PHYs supported by this driver
can also support pause frames. Announce this to be so.
Tested with a TI83822I.

Signed-off-by: Jesper Nilsson 
---
 drivers/net/phy/dp83848.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
index 800b39f..6e4117f 100644
--- a/drivers/net/phy/dp83848.c
+++ b/drivers/net/phy/dp83848.c
@@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
.phy_id = _id,  \
.phy_id_mask= 0xfff0,   \
.name   = _name,\
-   .features   = PHY_BASIC_FEATURES,   \
+   .features   = (PHY_BASIC_FEATURES | \
+   SUPPORTED_Pause | SUPPORTED_Asym_Pause),\
.flags  = PHY_HAS_INTERRUPT,\
\
.soft_reset = genphy_soft_reset,\
-- 
2.1.4


/^JN - Jesper Nilsson
-- 
   Jesper Nilsson -- jesper.nils...@axis.com


Re: [WIP] net+mlx4: auto doorbell

2016-12-02 Thread Eric Dumazet
On Thu, 2016-12-01 at 09:04 -0800, Eric Dumazet wrote:
> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> 
> > I think you misunderstood my concept[1].  I don't want to stop the
> > queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue; it
> > is just indicating that someone needs to flush/ring the doorbell.  Maybe it
> > needs another name, because it also indicates that the driver can see
> > that its TX queue is so busy that we don't need to call it immediately.
> > The qdisc layer can then choose to enqueue instead of doing direct xmit.
> 
> But driver ndo_start_xmit() does not have a pointer to qdisc.
> 
> Also the concept of 'queue busy' just because we queued one packet is a
> bit flaky.
> 
> > 
> > When qdisc layer or trafgen/af_packet see this indication it knows it
> > should/must flush the queue when it don't have more work left.  Perhaps
> > through net_tx_action(), by registering itself and e.g. if qdisc_run()
> > is called and queue is empty then check if queue needs a flush. I would
> > also allow driver to flush and clear this bit.
> 
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> > 
> > I just see it as an extension of your solution, as we still need the
> > driver to figure out then the doorbell/flush can be delayed.
> > p.s. don't be discouraged by this feedback, I'm just very excited and
> > happy that your are working on a solution in this area. As this is a
> > problem area that I've not been able to solve myself for the last
> > approx 2 years. Keep up the good work!
> 
> Do not worry, I appreciate the feedbacks ;)
> 
> BTW, if you are doing tests on mlx4 40Gbit, would you check the
> following quick/dirty hack, using lots of low-rate flows ?
> 
> mlx4 has a really hard time transmitting small TSO packets (2 or 3 MSS)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 12ea3405f442..96940666abd3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct  net_device 
> *dev,
> queue_work(priv->mdev->workqueue, &priv->vxlan_del_task);
>  }
>  
> +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments 
> are segmented */
> +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644);
> +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of 
> small TSO packets");
> +
> +
>  static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
> struct net_device *dev,
> netdev_features_t features)
> @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct 
> sk_buff *skb,
> (udp_hdr(skb)->dest != priv->vxlan_port))
> features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
> }
> +   if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
> +   features &= NETIF_F_GSO_MASK;


Sorry, stupid typo.
This should be "features &= ~NETIF_F_GSO_MASK;" of course

>  
> return features;
>  }
> 
> 
> 
> 
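For reference, with Eric's correction applied the hack's feature check would read roughly as below. This is a sketch only; mlx4_en_gso_check() is a made-up helper name and mlx4_gso_segs_min is the experimental module parameter from the quoted diff, not an upstream knob.

/* Force software segmentation of TSO packets that carry fewer than
 * mlx4_gso_segs_min segments by clearing the GSO feature bits for
 * this skb (note the '~' that was missing in the first diff).
 */
static netdev_features_t mlx4_en_gso_check(struct sk_buff *skb,
					    netdev_features_t features)
{
	if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min)
		features &= ~NETIF_F_GSO_MASK;

	return features;
}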




Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.

2016-12-02 Thread Alexandre Torgue

Hi Pavel and Peppe,

On 12/02/2016 02:51 PM, Giuseppe CAVALLARO wrote:

On 12/2/2016 1:32 PM, Pavel Machek wrote:

Hi!


Well, if you have a workload that sends and receives packets, it tends
to work ok, as you do tx_clean() in stmmac_poll(). My workload is not
like that -- it is "sending packets at 3MB/sec, receiving none". So
the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled,
and then we run out of transmit descriptors, and then 40msec passes,
and then we clean them. Bad.

And that's why low-res timers do not cut it.


in that case, I expect that the tuning of the driver could help you.
I mean, by using ethtool, it could be enough to set the IC bit on all
the descriptors. You should touch the tx_coal_frames.

Then you can use ethtool -S to monitor the status.


Yes, I did something similar. Unfortunately that meant a crash within
minutes, at least with the 4.4 kernel. (If you know what was fixed between
4.4 and 4.9, that would be helpful).


4.4 has no GMAC4 support.
Alex, do you remember any patches to fix that?


No sorry Peppe.

Pavel,

Sorry but I'm a little bit confused. I've been dropped into some mails without
the history. I see a cleanup, a coalescing issue and a TSO question.

What is your main issue? Are you working on gmac4 or 3.x ?
Can you refresh a little bit the story please ?

Regards
Alex



We had experimented with this tuning on an STB IP where just datagrams
had to be sent externally. To be honest, although we had seen
better results w/o any timer, we kept this approach enabled
because the timer was fast enough to cover our tests on SH4 boxes.


Please reply to David, and explain how it is supposed to
work... because right now it does not. 40 msec delays are not
acceptable in default configuration.


I mean that on UP and SMP systems this schema helped
to improve the performance, saving CPU on my side, and this has been
tested for a long time (~4 years).
I tested something similar to yours where unidirectional traffic
with limited throughput was needed and I can confirm that
tuning/removing the coalesce parameters helped. The tuning I decided
to keep in the driver was suitable in several use cases and if now
you have problems or you want to review it I can just confirm that
there are no problems on my side. If you want to simplify the logic
around the tx process and remove the timer in the official driver I can
accept that. I will just ask you to double check that the throughput and
CPU usage when requesting max throughput (better if on a GiGa setup) have
no regressions.
Otherwise we could start thinking about adaptive schema if feasible.


In the ring, some descriptors can raise the irq (according to a
threshold) and set the IC bit. In this path, the NAPI  poll will be
scheduled.


Not NAPI poll but stmmac_tx_timer(), right?


in the xmit, according to the threshold, the timer is started or the
interrupt is set inside the descriptor.
Then stmmac_tx_clean will always be called and, if you see the flow,
no irqlock protection is needed!


Agreed that no irqlock protection is needed if we rely on napi and
timers.


ok




Concerning the lock protection, we reviewed it a long time ago and
IIRC, no race condition should be present. Open to review it,
again!

...

There's nothing that protect stmmac_poll() from running concurently
with stmmac_dma_interrupt(), right?


This is not necessary.


dma_interrupt accesses shared priv->xstats; variables are of type
unsigned long (not atomic_t), yet they are accesssed from interrupt
context and from stmmac_ethtool without any locking. That can result
in broken statistics AFAICT.


ok, we can check this and patches are welcome; I'd prefer to
remove xstats from critical parts of the code like the ISR (that
comes from the old history of the driver).
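One possible direction, sketched purely for illustration (the type and helper names below are made up, not the driver's actual fields):

#include <linux/u64_stats_sync.h>

/* Wrap the counters in a u64_stats_sync so an ethtool reader gets a
 * consistent snapshot without adding locks to the hot path.
 */
struct stmmac_pkt_stats {
	struct u64_stats_sync	syncp;
	u64			rx_pkt_n;
	u64			tx_pkt_n;
};

static inline void stmmac_count_rx_pkt(struct stmmac_pkt_stats *s)
{
	u64_stats_update_begin(&s->syncp);
	s->rx_pkt_n++;
	u64_stats_update_end(&s->syncp);
}

static inline u64 stmmac_read_rx_pkt(const struct stmmac_pkt_stats *s)
{
	unsigned int start;
	u64 val;

	do {
		start = u64_stats_fetch_begin(&s->syncp);
		val = s->rx_pkt_n;
	} while (u64_stats_fetch_retry(&s->syncp, start));

	return val;
}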



Please take another look. As far as I can tell, you can have two cpus
at #1 and #2 in the code, at the same time. It looks like napi_... has
some atomic opertions inside so that looks safe at the first look. But
I'm not sure if they also include enough memory barriers to make it
safe...?


I have never reproduced related issues on SMP platforms
due to reordering of memory operations but, as said above, a review
of this is welcome, especially if you are seeing problems when managing NAPI.

FYI, the only memory barriers you will see in the driver are about the
OWN_BIT setting, till now.


static void stmmac_dma_interrupt(struct stmmac_priv *priv)
{
...
status = priv->hw->dma->dma_interrupt(priv->ioaddr,
&priv->xstats);
if (likely((status & handle_rx)) || (status & handle_tx)) {
if (likely(napi_schedule_prep(&priv->napi))) {
#1
stmmac_disable_dma_irq(priv);
__napi_schedule(&priv->napi);
}
}


static int stmmac_poll(struct napi_struct *napi, int budget)
{
struct stmmac_priv *priv = container_of(napi, struct
stmmac_priv, napi);
int work_done = 0;

priv->xstats.napi_poll++;
stmmac_tx_clean(priv);

work_done

Re: [PATCH] net: phy: dp83848: Support ethernet pause frames

2016-12-02 Thread Andrew F. Davis
On 12/02/2016 08:22 AM, Jesper Nilsson wrote:
> According to the documentation, the PHYs supported by this driver
> can also support pause frames. Announce this to be so.
> Tested with a TI83822I.
> 

Looks like all PHYs supported by this driver do, so:

Acked-by: Andrew F. Davis 

> Signed-off-by: Jesper Nilsson 
> ---
>  drivers/net/phy/dp83848.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
> index 800b39f..6e4117f 100644
> --- a/drivers/net/phy/dp83848.c
> +++ b/drivers/net/phy/dp83848.c
> @@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
>   .phy_id = _id,  \
>   .phy_id_mask= 0xfff0,   \
>   .name   = _name,\
> - .features   = PHY_BASIC_FEATURES,   \
> + .features   = (PHY_BASIC_FEATURES | \
> + SUPPORTED_Pause | SUPPORTED_Asym_Pause),\

Aligning these may look nicer though.

>   .flags  = PHY_HAS_INTERRUPT,\
>   \
>   .soft_reset = genphy_soft_reset,\
> 


Re: Initial thoughts on TXDP

2016-12-02 Thread Edward Cree
On 01/12/16 23:46, Tom Herbert wrote:
> The only time we
> _really_ need to allocate an skbuff is when we need to put the packet onto a
> queue. All the other use cases are really just to pass a structure
> containing a packet from function to function. For that purpose we
> should be able to just pass a much smaller structure in a stack
> argument and only allocate an skbuff when we need to enqueue. In cases
> where we don't ever queue a packet we might never need to allocate any
> skbuff
Now this intrigues me, because one of the objections to bundling (vs GRO)
was the memory usage of all those SKBs.  IIRC we already do a 'GRO-like'
coalescing when packets reach a TCP socket anyway (or at least in some
cases, not sure if all the different ways we can enqueue a TCP packet for
RX do it), but if we could carry the packets from NIC to socket without
SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
memory (the packet-pages are all already allocated on the NIC RX ring).
Possibly combine the two, so that rather than having potentially four
versions of each function (skb, skbundle, void*, void* bundle) you just
have the two 'ends'.

-Ed
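To make the "much smaller structure" idea concrete, it could look something like the sketch below. None of these names exist in the kernel; this is purely illustrative of a stack-passed descriptor that defers skb allocation until the packet actually has to be queued.

/* Minimal per-packet state carried from the NIC RX ring through the
 * early receive path, in place of a full sk_buff.
 */
struct rx_pkt_desc {
	void			*data;		/* start of packet data in the RX buffer */
	unsigned int		len;		/* frame length */
	unsigned int		headroom;	/* bytes available in front of data */
	struct page		*page;		/* backing page, still owned by the RX ring */
	struct net_device	*dev;		/* ingress device */
};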


Re: [PATCH net v2] tcp: warn on bogus MSS and try to amend it

2016-12-02 Thread Eric Dumazet
On Fri, 2016-12-02 at 08:55 -0200, Marcelo Ricardo Leitner wrote:
> There have been some reports lately about TCP connection stalls caused
> by NIC drivers that aren't setting gso_size on aggregated packets on rx
> path. This causes TCP to assume that the MSS is actually the size of the
> aggregated packet, which is invalid.
> 
> Although the proper fix is to be done at each driver, it's often hard
> and cumbersome for one to debug, come to such root cause and report/fix
> it.
> 
> This patch amends this situation in two ways. First, it adds a warning
> when this situation occurs, so it gives a hint to those trying to
> debug this. It also limits the maximum probed MSS to the advertised MSS,
> as it should never be any higher than that.
> 
> The result is that the connection may not have the best performance ever
> but it shouldn't stall, and the admin will have a hint on what to look
> for.
> 
> Tested with virtio by forcing gso_size to 0.
> 
> Cc: Jonathan Maxwell 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
> v2: Updated msg as suggested by David.
> 
>  net/ipv4/tcp_input.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 
> a27b9c0e27c08b4e4aeaff3d0bfdf3ae561ba4d8..fd619eb93749b6de56a41669248b337c051d9fe2
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -144,7 +144,10 @@ static void tcp_measure_rcv_mss(struct sock *sk, const 
> struct sk_buff *skb)
>*/
>   len = skb_shinfo(skb)->gso_size ? : skb->len;
>   if (len >= icsk->icsk_ack.rcv_mss) {
> - icsk->icsk_ack.rcv_mss = len;
> + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> +tcp_sk(sk)->advmss);
> + if (icsk->icsk_ack.rcv_mss != len)
> + pr_warn_once("Driver has suspect GRO implementation, 
> TCP performance may be compromised.\n");
>   } else {
>   /* Otherwise, we make more careful check taking into account,
>* that SACKs block is variable.


skb->dev is indeed NULL, but it might be worth getting back the device
using skb->skb_iif maybe ?
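A sketch of what that could look like (illustrative only, not part of the posted patch; it assumes skb->skb_iif still holds a valid ingress ifindex at this point):

/* Name the suspect device in the warning by resolving skb->skb_iif. */
static void tcp_warn_bogus_gro(const struct sock *sk, const struct sk_buff *skb)
{
	const struct net_device *dev;

	rcu_read_lock();
	dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
	pr_warn_once("%s: driver has suspect GRO implementation, TCP performance may be compromised.\n",
		     dev ? dev->name : "unknown");
	rcu_read_unlock();
}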





Re: [PATCH] net: phy: dp83848: Support ethernet pause frames

2016-12-02 Thread Jesper Nilsson
On Fri, Dec 02, 2016 at 08:35:23AM -0600, Andrew F. Davis wrote:
> On 12/02/2016 08:22 AM, Jesper Nilsson wrote:
> > According to the documentation, the PHYs supported by this driver
> > can also support pause frames. Announce this to be so.
> > Tested with a TI83822I.
> > 
> 
> Looks like all PHYs supported by this driver do, so:
> 
> Acked-by: Andrew F. Davis 
> 
> > Signed-off-by: Jesper Nilsson 
> > ---
> >  drivers/net/phy/dp83848.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
> > index 800b39f..6e4117f 100644
> > --- a/drivers/net/phy/dp83848.c
> > +++ b/drivers/net/phy/dp83848.c
> > @@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
> > .phy_id = _id,  \
> > .phy_id_mask= 0xfff0,   \
> > .name   = _name,\
> > -   .features   = PHY_BASIC_FEATURES,   \
> > +   .features   = (PHY_BASIC_FEATURES | \
> > +   SUPPORTED_Pause | SUPPORTED_Asym_Pause),\
> 
> Aligning these may look nicer though.

Agreed, will send a v2.

> > .flags  = PHY_HAS_INTERRUPT,\
> > \
> > .soft_reset = genphy_soft_reset,\
> > 

/^JN - Jesper Nilsson
-- 
   Jesper Nilsson -- jesper.nils...@axis.com


[PATCH v2] net: phy: dp83848: Support ethernet pause frames

2016-12-02 Thread Jesper Nilsson
According to the documentation, the PHYs supported by this driver
can also support pause frames. Announce this to be so.
Tested with a TI83822I.

Acked-by: Andrew F. Davis 
Signed-off-by: Jesper Nilsson 
---
 drivers/net/phy/dp83848.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
index 800b39f..320d0dc 100644
--- a/drivers/net/phy/dp83848.c
+++ b/drivers/net/phy/dp83848.c
@@ -88,7 +88,9 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
.phy_id = _id,  \
.phy_id_mask= 0xfff0,   \
.name   = _name,\
-   .features   = PHY_BASIC_FEATURES,   \
+   .features   = (PHY_BASIC_FEATURES | \
+  SUPPORTED_Pause |\
+  SUPPORTED_Asym_Pause),   \
.flags  = PHY_HAS_INTERRUPT,\
\
.soft_reset = genphy_soft_reset,\
-- 
2.1.4


/^JN - Jesper Nilsson
-- 
   Jesper Nilsson -- jesper.nils...@axis.com


Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath

2016-12-02 Thread Oliver Hartkopp



On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:

On 12/02/2016 01:43 PM, Andrey Konovalov wrote:




 [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506


We should add a check for a sensible optlen


static int raw_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
{
struct sock *sk = sock->sk;
struct raw_sock *ro = raw_sk(sk);
struct can_filter *filter = NULL;  /* dyn. alloc'ed filters */
struct can_filter sfilter; /* single filter */
struct net_device *dev = NULL;
can_err_mask_t err_mask = 0;
int count = 0;
int err = 0;

if (level != SOL_CAN_RAW)
return -EINVAL;

switch (optname) {

case CAN_RAW_FILTER:
if (optlen % sizeof(struct can_filter) != 0)
return -EINVAL;


here...

if (optlen > 64 * sizeof(struct can_filter))
return -EINVAL;



Agreed.

But what is sensible here?
64 filters is way too small IMO.

When thinking about picking a bunch of single CAN IDs I would tend to 
something like 512 filters.


Regards,
Oliver



count = optlen / sizeof(struct can_filter);

if (count > 1) {
/* filter does not fit into dfilter => alloc space */
filter = memdup_user(optval, optlen);
if (IS_ERR(filter))
return PTR_ERR(filter);
} else if (count == 1) {
if (copy_from_user(&sfilter, optval, sizeof(sfilter)))
return -EFAULT;
}

lock_sock(sk);


Marc



Re: [PATCH] net/rtnetlink: fix attribute name in nlmsg_size() comments

2016-12-02 Thread David Miller
From: Tobias Klauser 
Date: Wed, 30 Nov 2016 14:30:37 +0100

> Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE}
> instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments in nlmsg_size().
> 
> Cc: Eric Dumazet 
> Signed-off-by: Tobias Klauser 

Applied, thanks.


Re: [PATCH 0/2] net: Add support for SGMII PCS on Altera TSE MAC

2016-12-02 Thread David Miller
From: Neill Whillans 
Date: Wed, 30 Nov 2016 13:41:03 +

> These patches were created as part of work to add support for SGMII
> PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6
> git tree.
> 
> The first patch in the series adds support for the VSC8572 dual-port
> Gigabit Ethernet transceiver, used in integration testing.
> 
> The second patch adds support for the SGMII PCS functionality to the
> Altera TSE driver.

Series applied, thanks.


Re: DSA vs. SWITCHDEV ?

2016-12-02 Thread Andrew Lunn
On Thu, Dec 01, 2016 at 04:38:47PM -0500, Murali Karicheri wrote:
> Hi Andrew,
> On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> > Hi Murali 
> > 
> >> 2. Switch mode where it implements a simple Ethernet switch. Currently
> >>it doesn't have address learning capability, but in future it
> >>can.
> > 
> > If it does not have address learning capabilities, does it act like a
> > plain old hub? What comes in one port goes out all others?
> 
> Thanks for the response!
> 
> Yes. It is a plain hub. It replicates frames to both ports. So we need to
> run a bridge layer for address learning in software.

Hi Murali

It would be good if you start thinking about all the different
directions. It is not just host to port A and host to port B. What
about port A to Port B? Can it do that in hardware? 

> I think not. I see we have a non-Linux implementation that does address
> learning in software using a hash table and looks up the MAC for each packet
> to see which port it needs to be sent to.

I think I need to read more about the switch. I'm starting to wonder
if it has enough intelligence to be usable. Switchdev is about pushing
configuration down into the switch. It does not, however, sound like
there is that much which is configurable in this switch.

  Andrew


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-02 Thread Jesper Dangaard Brouer
On Thu, 01 Dec 2016 23:17:48 +0100
Paolo Abeni  wrote:

> On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > small socket queues)
> > 
> > On Wed, 30 Nov 2016 16:35:20 +
> > Mel Gorman  wrote:
> >   
> > > > I don't quite get why you are setting the socket recv size
> > > > (with -- -s and -S) to such a small number, size + 256.
> > > > 
> > > 
> > > Maybe I missed something at the time I wrote that but why would it
> > > need to be larger?  
> > 
> > Well, to me it is quite obvious that we need some queue to avoid packet
> > drops.  We have two processes netperf and netserver, that are sending
> > packets between each-other (UDP_STREAM mostly netperf -> netserver).
> > These PIDs are getting scheduled and migrated between CPUs, and thus
> > do not get executed equally fast, thus a queue is needed to absorb the
> > fluctuations.
> > 
> > The network stack is even partly catching your config "mistake" and
> > increases the socket queue size, so we can at minimum handle one max frame
> > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > 
> > Hopefully for localhost testing a small queue should hopefully not
> > result in packet drops.  Testing... ups, this does result in packet
> > drops.
> > 
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> > 
> >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > 
> >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> >  Socket  Message  Elapsed      Messages
> >  Size    Size     Time         Okay Errors   Throughput
> >  bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >    4608    1024   60.00     50024301      0    6829.98
> >    2560            60.00     46133211           6298.72
> > 
> >  Dropped packets: 50024301-46133211=3891090
> > 
> > To get a better drop indication, during this I run a command, to get
> > system-wide network counters from the last second, so below numbers are
> > per second.
> > 
> >  $ nstat > /dev/null && sleep 1  && nstat
> >  #kernel
> >  IpInReceives                    885162             0.0
> >  IpInDelivers                    885161             0.0
> >  IpOutRequests                   885162             0.0
> >  UdpInDatagrams                  776105             0.0
> >  UdpInErrors                     109056             0.0
> >  UdpOutDatagrams                 885160             0.0
> >  UdpRcvbufErrors                 109056             0.0
> >  IpExtInOctets                   931190476          0.0
> >  IpExtOutOctets                  931189564          0.0
> >  IpExtInNoECTPkts                885162             0.0
> > 
> > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > process didn't empty it's queue fast enough see [2].
> > 
> > Although upstream changes are coming in this area, [2] is replaced with
> > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > 
> > Retesting with kernel 4.7.0-baseline+ ... show something else. 
> > To Paolo, you might want to look into this.  And it could also explain why
> > I've not see the mentioned speedup by mm-change, as I've been testing
> > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.  
> 
> Thank you for reporting this.
> 
> It seems that the commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict while checking the rcvbuf. 
> 
> For very small values of rcvbuf, it allows a single skb to be enqueued,
> while previously we allowed 2 of them to enter the queue, even if the
> first one's truesize exceeded rcvbuf, as in your test-case.
> 
> Can you please try the following patch ?

Sure, it looks much better with this patch.


$ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf \
    -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to
127.0.0.1 (127.0.0.1) port 15895 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  4608    1024   60.00     50191555      0    6852.82
  2560            60.00     50189872           6852.59

Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec

$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    885417             0.0
IpInDelivers                    885416             0.0
IpOutRequests                   885417             0.0
UdpInDatagrams                  885382             0.0
UdpInErrors                     29                 0.0
UdpOutDatagrams

Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath

2016-12-02 Thread Marc Kleine-Budde
On 12/02/2016 04:11 PM, Oliver Hartkopp wrote:
> 
> 
> On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:
>> On 12/02/2016 01:43 PM, Andrey Konovalov wrote:
> 
> 
>>>  [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506
>>
>> We should add a check for a sensible optlen
>>
>>> static int raw_setsockopt(struct socket *sock, int level, int optname,
>>>   char __user *optval, unsigned int optlen)
>>> {
>>> struct sock *sk = sock->sk;
>>> struct raw_sock *ro = raw_sk(sk);
>>> struct can_filter *filter = NULL;  /* dyn. alloc'ed filters */
>>> struct can_filter sfilter; /* single filter */
>>> struct net_device *dev = NULL;
>>> can_err_mask_t err_mask = 0;
>>> int count = 0;
>>> int err = 0;
>>>
>>> if (level != SOL_CAN_RAW)
>>> return -EINVAL;
>>>
>>> switch (optname) {
>>>
>>> case CAN_RAW_FILTER:
>>> if (optlen % sizeof(struct can_filter) != 0)
>>> return -EINVAL;
>>
>> here...
>>
>>  if (optlen > 64 * sizeof(struct can_filter))
>>  return -EINVAL;
>>
> 
> Agreed.
> 
> But what is sensible here?
> 64 filters is way too small IMO.
> 
> When thinking about picking a bunch of single CAN IDs I would tend to 
> something like 512 filters.

Ok - 64 was just an arbitrarily chosen value for demonstration purposes :)

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |
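As a sketch of the bound check being discussed (CAN_RAW_FILTER_MAX is a placeholder constant here, with 512 being the ballpark Oliver suggests rather than an agreed value):

#include <linux/can.h>
#include <linux/errno.h>

#define CAN_RAW_FILTER_MAX 512	/* hypothetical upper bound */

/* Validate the optlen passed to setsockopt(CAN_RAW_FILTER) before any
 * allocation is attempted.
 */
static int raw_check_filter_optlen(unsigned int optlen)
{
	if (optlen % sizeof(struct can_filter) != 0)
		return -EINVAL;
	if (optlen > CAN_RAW_FILTER_MAX * sizeof(struct can_filter))
		return -EINVAL;
	return 0;
}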





Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3

2016-12-02 Thread Paolo Abeni
On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 23:17:48 +0100
> Paolo Abeni  wrote:
> 
> > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> > > small socket queues)
> > > 
> > > On Wed, 30 Nov 2016 16:35:20 +
> > > Mel Gorman  wrote:
> > >   
> > > > > I don't quite get why you are setting the socket recv size
> > > > > (with -- -s and -S) to such a small number, size + 256.
> > > > > 
> > > > 
> > > > Maybe I missed something at the time I wrote that but why would it
> > > > need to be larger?  
> > > 
> > > Well, to me it is quite obvious that we need some queue to avoid packet
> > > drops.  We have two processes netperf and netserver, that are sending
> > > packets between each-other (UDP_STREAM mostly netperf -> netserver).
> > > These PIDs are getting scheduled and migrated between CPUs, and thus
> > > do not get executed equally fast, thus a queue is needed to absorb the
> > > fluctuations.
> > > 
> > > The network stack is even partly catching your config "mistake" and
> > > increases the socket queue size, so we can at minimum handle one max frame
> > > (due to the skb "truesize" concept, approx PAGE_SIZE + overhead).
> > > 
> > > Hopefully for localhost testing a small queue should hopefully not
> > > result in packet drops.  Testing... ups, this does result in packet
> > > drops.
> > > 
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > > 
> > >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
> > >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > > 
> > >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
> > >   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > >  Socket  Message  Elapsed      Messages
> > >  Size    Size     Time         Okay Errors   Throughput
> > >  bytes   bytes    secs            #      #   10^6bits/sec
> > > 
> > >    4608    1024   60.00     50024301      0    6829.98
> > >    2560            60.00     46133211           6298.72
> > > 
> > >  Dropped packets: 50024301-46133211=3891090
> > > 
> > > To get a better drop indication, during this I run a command, to get
> > > system-wide network counters from the last second, so below numbers are
> > > per second.
> > > 
> > >  $ nstat > /dev/null && sleep 1  && nstat
> > >  #kernel
> > >  IpInReceives                    885162             0.0
> > >  IpInDelivers                    885161             0.0
> > >  IpOutRequests                   885162             0.0
> > >  UdpInDatagrams                  776105             0.0
> > >  UdpInErrors                     109056             0.0
> > >  UdpOutDatagrams                 885160             0.0
> > >  UdpRcvbufErrors                 109056             0.0
> > >  IpExtInOctets                   931190476          0.0
> > >  IpExtOutOctets                  931189564          0.0
> > >  IpExtInNoECTPkts                885162             0.0
> > > 
> > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > > process didn't empty it's queue fast enough see [2].
> > > 
> > > Although upstream changes are coming in this area, [2] is replaced with
> > > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > > 
> > > Retesting with kernel 4.7.0-baseline+ ... show something else. 
> > > To Paolo, you might want to look into this.  And it could also explain why
> > > I've not see the mentioned speedup by mm-change, as I've been testing
> > > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.  
> > 
> > Thank you for reporting this.
> > 
> > It seems that the commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict while checking the rcvbuf. 
> > 
> > For very small values of rcvbuf, it allows a single skb to be enqueued,
> > while previously we allowed 2 of them to enter the queue, even if the
> > first one's truesize exceeded rcvbuf, as in your test-case.
> > 
> > Can you please try the following patch ?
> 
> Sure, it looks much better with this patch.

Thank you for testing. I'll send a formal patch to David soon.

BTW I see a nice performance improvement compared to 4.7...

Cheers,

Paolo
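The relaxation Paolo describes could be sketched like this (an illustration of the idea only, not the formal patch he mentions):

/* Accept the first skb unconditionally so tiny SO_RCVBUF values can
 * still queue at least one full frame; only enforce the limit strictly
 * once the receive queue already holds data.
 */
static bool udp_rmem_would_overflow(const struct sock *sk,
				    const struct sk_buff *skb)
{
	int rmem = atomic_read(&sk->sk_rmem_alloc);

	if (rmem == 0)
		return false;

	return rmem + (int)skb->truesize > sk->sk_rcvbuf;
}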



Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op

2016-12-02 Thread Andrew Lunn
On Thu, Dec 01, 2016 at 03:41:20PM -0500, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn  writes:
> 
> >> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h 
> >> b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> index ab52c37..9e51405 100644
> >> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> >> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
> >>int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
> >> u16 val);
> >>  
> >> +  /* Switch Software Reset */
> >> +  int (*reset)(struct mv88e6xxx_chip *chip);
> >> +
> >
> > Hi Vivien
> >
> > In my huge patch series of 6390, I've been using a g1_ prefix for
> > functionality which is in global 1, g2_ for global 2, etc.  This has
> > worked for everything so far with the exception of setting which
> > reserved MAC addresses should be sent to the CPU. Most devices have it
> > in g2, but 6390 has it in g1.
> >
> > Please could you add the prefix.
> 
> I don't understand. It looks like you are talking about the second part
> of the comment I made on your RFC patchset, about the Rsvd2CPU feature:

Hi Vivien

I mean

+   /* Switch Software Reset */
+   int (*g1_reset)(struct mv88e6xxx_chip *chip);
+

We have a collection of function pointers with port_ prefix, another
collection with stats_, and a third with ppu_, etc. And then we have
some which do not fit a specific category. Those i have prefixed with
g1_ or g2_. I think we should have some prefix, and that is my
suggestion.

Andrew


Re: [PATCH net 0/7] net: stmmac: fix probe error handling and phydev leaks

2016-12-02 Thread David Miller
From: Johan Hovold 
Date: Wed, 30 Nov 2016 15:29:48 +0100

> This series fixes a number of issues with the stmmac-driver probe error
> handling, which for example left clocks enabled after probe failures.
> 
> The final patch fixes a failure to deregister and free any fixed-link
> PHYs that were registered during probe on probe errors and on driver
> unbind. It also fixes a related of-node leak on late probe errors.
> 
> This series depends on the of_phy_deregister_fixed_link() helper that
> was just merged to net.
> 
> As mentioned earlier, one staging driver also suffers from a similar
> leak and can be fixed up once the above mentioned helper hits mainline.
> 
> Note that these patches have only been compile tested.

Series applied, thanks.


Re: [PATCH net-next V2 0/7] Mellanox 100G mlx5 updates 2016-11-29

2016-12-02 Thread David Miller
From: Saeed Mahameed 
Date: Wed, 30 Nov 2016 17:59:36 +0200

> The following series from Tariq and Roi, provides some critical fixes
> and updates for the mlx5e driver.
> 
> From Tariq: 
>  - Fix driver coherent memory huge allocation issues by fragmenting
>completion queues, in a way that is transparent to the netdev driver by
>providing a new buffer type "mlx5_frag_buf" with the same access API.
>  - Create UMR MKey per RQ to have better scalability.
> 
> From Roi:
>  - Some fixes for the encap-decap support and tc flower added lately to the
>mlx5e driver.
> 
> v1->v2:
>  - Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by 
> Eric.
> 
> This series was generated against commit:
> 31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")

Series applied, thanks.


[PATCH net] geneve: avoid use-after-free of skb->data

2016-12-02 Thread Sabrina Dubroca
geneve{,6}_build_skb can end up doing a pskb_expand_head(), which
makes the ip_hdr(skb) reference we stashed earlier stale. Since it's
only needed as an argument to ip_tunnel_ecn_encap(), move this
directly in the function call.

Fixes: 08399efc6319 ("geneve: ensure ECN info is handled properly in all tx/rx 
paths")
Signed-off-by: Sabrina Dubroca 
---
 drivers/net/geneve.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 532695c8209b..bab65647b80f 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -859,7 +859,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
struct geneve_dev *geneve = netdev_priv(dev);
struct geneve_sock *gs4;
struct rtable *rt = NULL;
-   const struct iphdr *iip; /* interior IP header */
int err = -EINVAL;
struct flowi4 fl4;
__u8 tos, ttl;
@@ -890,8 +889,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
skb_reset_mac_header(skb);
 
-   iip = ip_hdr(skb);
-
if (info) {
const struct ip_tunnel_key *key = &info->key;
u8 *opts = NULL;
@@ -911,7 +908,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
if (unlikely(err))
goto tx_error;
 
-   tos = ip_tunnel_ecn_encap(key->tos, iip, skb);
+   tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
ttl = key->ttl;
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
} else {
@@ -920,7 +917,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
if (unlikely(err))
goto tx_error;
 
-   tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, iip, skb);
+   tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
ttl = geneve->ttl;
if (!ttl && IN_MULTICAST(ntohl(fl4.daddr)))
ttl = 1;
@@ -952,7 +949,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
 {
struct geneve_dev *geneve = netdev_priv(dev);
struct dst_entry *dst = NULL;
-   const struct iphdr *iip; /* interior IP header */
struct geneve_sock *gs6;
int err = -EINVAL;
struct flowi6 fl6;
@@ -982,8 +978,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
skb_reset_mac_header(skb);
 
-   iip = ip_hdr(skb);
-
if (info) {
const struct ip_tunnel_key *key = &info->key;
u8 *opts = NULL;
@@ -1004,7 +998,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
if (unlikely(err))
goto tx_error;
 
-   prio = ip_tunnel_ecn_encap(key->tos, iip, skb);
+   prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
ttl = key->ttl;
label = info->key.label;
} else {
@@ -1014,7 +1008,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, 
struct net_device *dev,
goto tx_error;
 
prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel),
-  iip, skb);
+  ip_hdr(skb), skb);
ttl = geneve->ttl;
if (!ttl && ipv6_addr_is_multicast(&fl6.daddr))
ttl = 1;
-- 
2.10.2



[PATCH net-next] sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead

2016-12-02 Thread Edward Cree
Logically, EFX_BUG_ON_PARANOID can never be correct.  For, BUG_ON should
 only be used if it is not possible to continue without potential harm;
 and since the non-DEBUG driver will continue regardless (as the BUG_ON is
 compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
 or the newly defined EFX_WARN_ON_ONCE_PARANOID.

Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/ef10.c   |  2 +-
 drivers/net/ethernet/sfc/efx.c|  2 +-
 drivers/net/ethernet/sfc/ethtool.c|  4 ++--
 drivers/net/ethernet/sfc/farch.c  |  6 ++---
 drivers/net/ethernet/sfc/mcdi.h   |  4 ++--
 drivers/net/ethernet/sfc/mcdi_mon.c   |  4 ++--
 drivers/net/ethernet/sfc/mcdi_port.c  |  2 +-
 drivers/net/ethernet/sfc/net_driver.h | 22 +--
 drivers/net/ethernet/sfc/ptp.c|  2 +-
 drivers/net/ethernet/sfc/rx.c |  8 +++
 drivers/net/ethernet/sfc/siena.c  |  2 +-
 drivers/net/ethernet/sfc/tx.c | 12 +-
 drivers/net/ethernet/sfc/tx_tso.c | 41 +--
 13 files changed, 55 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 0f58ea8..de2947c 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -2100,7 +2100,7 @@ static int efx_ef10_tx_tso_desc(struct efx_tx_queue 
*tx_queue,
u32 seqnum;
u32 mss;
 
-   EFX_BUG_ON_PARANOID(tx_queue->tso_version != 2);
+   EFX_WARN_ON_ONCE_PARANOID(tx_queue->tso_version != 2);
 
mss = skb_shinfo(skb)->gso_size;
 
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index da7028d..5a5dcad 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -355,7 +355,7 @@ static int efx_probe_eventq(struct efx_channel *channel)
/* Build an event queue with room for one event per tx and rx buffer,
 * plus some extra for link state events and MCDI completions. */
entries = roundup_pow_of_two(efx->rxq_entries + efx->txq_entries + 128);
-   EFX_BUG_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE);
+   EFX_WARN_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE);
channel->eventq_mask = max(entries, EFX_MIN_EVQ_SIZE) - 1;
 
return efx_nic_probe_eventq(channel);
diff --git a/drivers/net/ethernet/sfc/ethtool.c 
b/drivers/net/ethernet/sfc/ethtool.c
index ca29d3d..f644216 100644
--- a/drivers/net/ethernet/sfc/ethtool.c
+++ b/drivers/net/ethernet/sfc/ethtool.c
@@ -333,12 +333,12 @@ static int efx_ethtool_fill_self_tests(struct efx_nic 
*efx,
  "core", 0, "registers", NULL);
 
if (efx->phy_op->run_tests != NULL) {
-   EFX_BUG_ON_PARANOID(efx->phy_op->test_name == NULL);
+   EFX_WARN_ON_PARANOID(efx->phy_op->test_name == NULL);
 
for (i = 0; true; ++i) {
const char *name;
 
-   EFX_BUG_ON_PARANOID(i >= EFX_MAX_PHY_TESTS);
+   EFX_WARN_ON_PARANOID(i >= EFX_MAX_PHY_TESTS);
name = efx->phy_op->test_name(efx, i);
if (name == NULL)
break;
diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c
index 91aa3ec..e4ca216 100644
--- a/drivers/net/ethernet/sfc/farch.c
+++ b/drivers/net/ethernet/sfc/farch.c
@@ -177,7 +177,7 @@ efx_init_special_buffer(struct efx_nic *efx, struct 
efx_special_buffer *buffer)
dma_addr_t dma_addr;
int i;
 
-   EFX_BUG_ON_PARANOID(!buffer->buf.addr);
+   EFX_WARN_ON_PARANOID(!buffer->buf.addr);
 
/* Write buffer descriptors to NIC */
for (i = 0; i < buffer->entries; i++) {
@@ -332,7 +332,7 @@ void efx_farch_tx_write(struct efx_tx_queue *tx_queue)
txd = efx_tx_desc(tx_queue, write_ptr);
++tx_queue->write_count;
 
-   EFX_BUG_ON_PARANOID(buffer->flags & EFX_TX_BUF_OPTION);
+   EFX_WARN_ON_ONCE_PARANOID(buffer->flags & EFX_TX_BUF_OPTION);
 
/* Create TX descriptor ring entry */
BUILD_BUG_ON(EFX_TX_BUF_CONT != 1);
@@ -2041,7 +2041,7 @@ efx_farch_filter_from_gen_spec(struct 
efx_farch_filter_spec *spec,
__be32 rhost, host1, host2;
__be16 rport, port1, port2;
 
-   EFX_BUG_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX));
+   EFX_WARN_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX));
 
if (gen_spec->ether_type != htons(ETH_P_IP))
return -EPROTONOSUPPORT;
diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h
index c9aeb07..4472107 100644
--- a/drivers/net/ethernet/sfc/mcdi.h
+++ b/drivers/net/ethernet/sfc/mcdi.h
@@ -129,14 +129,14 @@ struct efx_mcdi_data {
 
 static inline struct efx_mcdi_iface *efx_mcdi(struct efx_nic *efx)
 {
-   EFX_B
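(The message is cut short by the archive here.)  For context, PARANOID-style macros of this kind are conventionally defined roughly as below; this is a sketch, not the exact definitions from net_driver.h:

/* Compile to a warning in DEBUG builds and to nothing otherwise, so a
 * failed check can never take the non-DEBUG driver down.
 */
#ifdef DEBUG
#define EFX_WARN_ON_PARANOID(x)		WARN_ON(x)
#define EFX_WARN_ON_ONCE_PARANOID(x)	WARN_ON_ONCE(x)
#else
#define EFX_WARN_ON_PARANOID(x)		do {} while (0)
#define EFX_WARN_ON_ONCE_PARANOID(x)	do {} while (0)
#endif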

Re: arp_filter and IPv6 ND

2016-12-02 Thread Saku Ytti
On 2 December 2016 at 16:08, Hannes Frederic Sowa
 wrote:

Hey,

> May I ask why you want to turn it off?

Certainly. I don't want the device to answer with a link address for an L3
address it does not have on the link. In my case it triggers this bug
https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop

In this particular case, for one reason or another my Cisco device
would have an ND entry for the Linux loopback pointing to an interface on a
completely different network.  That by itself would just be weird, but
combined with the weird behaviour of the Cisco it actually causes the loopback
route advertised by BGP not to be installed. If the ND entry didn't
exist, the BGP route would be installed.
I don't really even know why the ND entry exists, all I can think of
is that Linux must have sent a gratuitous reply, because I don't see why
Cisco would have tried to discover it.

Expected behaviour is that the loopback/128 BGP route resolves to
on-link next-hop, and on-link next hop is then ND'd. Observed
behaviour is that loopback/128 BGP route also appears in ND cache.

> In IPv6 this depends on the scope. In IPv4 this concept doesn't really
> exist.
>
> Please notice that in IPv4 arp_filter does not necessarily mean that the
> system is operating in strong end system mode but you end up in an
> hybrid clone where arp is acting strong but routing not and thus you
> also have to add fib rules to simulate that.

It's just very peculiar behaviour to have ARP or ND entries on an
interface where the given subnet does not exist; it causes
difficult-to-troubleshoot problems and is surprising/unexpected
behaviour.
Of course a well-behaving device wouldn't accept such replies, because
it could itself be an attack vector (imagine me telling you 8.8.8.8 is on
the link, or worse, your bank).

I'm curious, why does this behaviour exist? When is this desirable?
I've never seen any other device than Linux behave like this, and
whenever I've heard about the problem, I've only seen surprised faces that
it does behave like this.
-- 
  ++ytti


RE: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-12-02 Thread Duyck, Alexander H
> -Original Message-
> From: Robert Shearman [mailto:rshea...@brocade.com]
> Sent: Friday, December 2, 2016 5:54 AM
> To: Alexander Duyck 
> Cc: David Miller ; Netdev ;
> Duyck, Alexander H 
> Subject: Re: [PATCH net] fib_trie: Avoid expensive update of suffix length 
> when
> not required
> 
> On 01/12/16 18:29, Alexander Duyck wrote:
> > On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman
>  wrote:
> >> On 29/11/16 23:14, Alexander Duyck wrote:
> >>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman
> >>> 
> 
>  Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix
>  length")
>  Signed-off-by: Robert Shearman 
> >>>
> >>>
> >>> So I am not entirely convinced this is a regression, I was wondering
> >>> what your numbers looked like before the patch you are saying this
> >>> fixes?  I recall I fixed a number of issues leading up to this so I
> >>> am just curious.
> >>
> >>
> >> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie:
> Remove
> >> checks for index >= tnode_child_length from tnode_get_child") which
> >> is the parent of 5405afd1a306:
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real0m3.451s
> >> user0m0.184s
> >> sys 0m2.004s
> >>
> >>
> >> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add
> >> tracking value for suffix length"):
> >>
> >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File
> >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists
> >> RTNETLINK answers: File exists
> >>
> >> real0m48.624s
> >> user0m0.360s
> >> sys 0m46.960s
> >>
> >> So does that warrant a fixes tag for this patch?
> >
> > Yes, please include it in the patch description next time.
> >
> > I've been giving it thought and the fact is the patch series has
> > merit.  I just think it was being a bit too aggressive in terms of
> > some of the changes.  So placing any changes in put_child seem to be
> > the one piece in this that make this a bit ugly.
> 
> Does that imply the current updating of the parent's slen value should be
> removed from put_child then?

No, that is where it should be.  We want to leave the put_child code as is.  
That way all the code for halving and inflating nodes works correctly.

The only reason for having the update_suffix code update the parent was because 
before it was propagating the suffix length for increases as well.  Since we 
aren't using it for that anymore there isn't much point in having update_suffix 
touching the parent in the code below.
 
> I started out without doing the changes in put_child, but that meant the 
> changes
> to push the slen up through the trie were spread all over the place. This 
> seemed
> like the cleaner solution.

It actually makes it much messier.  The put_child call is used all over the 
place in many situations where it doesn't make sense to go through updating the 
entire trie.  The big issue is you can't walk up the trie if you are working on 
assembling a node off on the side that you plan to insert into the trie later.  
The only places we need to worry about updating the suffix are when we have 
added a new leaf/suffix, or when we have deleted a leaf/suffix.  That is why in 
the patches I submitted I only update in those two places.

>  +   }
>  +}
>  +
>   /* Add a child at position i overwriting the old value.
>    * Update the value of full_children and empty_children.
>  + *
>  + * The suffix length of the parent node and the rest of the tree
>  + is
>  + * updated (if required) when adding/replacing a node, but is
>  + caller's
>  + * responsibility on removal.
>    */
>   static void put_child(struct key_vector *tn, unsigned long i,
>    struct key_vector *n) @@ -447,8 +461,8 @@
>  static void put_child(struct key_vector *tn, unsigned long i,
>  else if (!wasfull && isfull)
>  tn_info(tn)->full_children++;
> 
>  -   if (n && (tn->slen < n->slen))
>  -   tn->slen = n->slen;
>  +   if (n)
>  +   node_push_suffix(tn, n);
> >>>
> >>>
> >>> This is just creating redundant work if we have to perform a resize.
> >>
> >>
> >> The only real redundant work is the assignment of slen in tn, since
> >> the propagation up the trie has to happen regardless and if a
> >> subsequent resize results in changes to the trie then the propagation
> >> will stop at tn's parent since the suffix length of the tn's parent
> >> will not be less than tn's suffix length.
> >
> >
> > This isn't going to work.  We want to avoid trying to push the suffix
> > when we place a child.  There are scenarios where we are placing
> > children in nodes that don't have parents yet because we are doing
> > rebalances and such.  I suspect this could be pr
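(The message is cut short by the archive here.)  For readers following the thread, the upward suffix propagation being discussed can be sketched as follows. It is simplified and assumes node_parent() returns NULL above the root, which is not exactly how the real fib_trie walk terminates:

/* Push a leaf's suffix length towards the root, stopping as soon as an
 * ancestor already advertises a suffix at least that long.
 */
static void push_suffix_up(struct key_vector *tn, unsigned char slen)
{
	while (tn && tn->slen < slen) {
		tn->slen = slen;
		tn = node_parent(tn);
	}
}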

Re: [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue

2016-12-02 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Wed, 30 Nov 2016 14:01:08 -0500

> From: Soheil Hassas Yeganeh 
> 
> Only when ICMP packets are enqueued onto the error queue,
> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
> in sock_dequeue_err_skb), a subsequent error queue read
> would set sk_err to the next error on the queue, or 0 if empty.
> As no error types other than ICMP set this field, sk_err should
> not be modified upon dequeuing them.
> 
> Only for ICMP errors, reset the (racy) sk_err. Some applications,
> like traceroute, rely on it and go into a futile busy POLLERR
> loop otherwise.
> 
> In principle, sk_err has to be set while an ICMP error is queued.
> Testing is_icmp_err_skb(skb_next) approximates this without
> requiring a full queue walk. Applications that receive both ICMP
> and other errors cannot rely on this legacy behavior, as other
> errors do not set sk_err in the first place.
> 
> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Willem de Bruijn 

Applied, thanks.


Re: stmmac: turn coalescing / NAPI off in stmmac

2016-12-02 Thread Giuseppe CAVALLARO

Hi Pavel

On 12/2/2016 11:42 AM, Pavel Machek wrote:

Hi!


Anyway... since you asked. I believe I have a way to disable NAPI / tx
coalescing in the driver. Unfortunately, locking is missing on the rx
path, and needs to be extended to _irqsave variant on tx path.


I have just replied to a previous thread about that...


Yeah, please reply to David's mail where he describes why it can't
work.


let me to re-check the mails :-) I can try to provide you
more details about what I experimented




So patch currently looks like this (hand edited, can't be
applied, got it working few hours ago). Does it look acceptable?

I'd prefer this to go after the patch that pulls the common code into a single
place, so that only a single place needs to be patched. Plus I guess I should
add ifdefs, so that more advanced NAPI / tx coalescing code can be
reactivated when it is fixed. Trivial fixes can go on top. Does that
sound like a plan?


Hmm, what I find strange is that this code has been running for a
long time on several platforms and chip versions. No race conditions
or lock protection problems have been found (also probing the lock
mechanisms).


Well, it works better for me when I disable CONFIG_SMP. It is normal
that locking problems are hard to reproduce :-(.


can you share the test with me? Maybe I can try to reproduce it on an ARM box.
Are you using a 3.x or 4.x GMAC?


Pavel, my apologies if I missed some problems. If you can
(as D. Miller asked) send us a cover letter + all the patches,
I will try to reply soon. I can also do some tests if you ask
me to! I could run on 3.x and 4.x but I cannot promise you
benchmarks.


Actually... I have questions here. David normally pulls from you (can
I have an address of your git tree?).


No, I send the patches to the mailing list.



Could you apply these to your git?

[PATCH] stmmac ethernet: unify locking
[PATCH] stmmac: simplify flag assignment
[PATCH] stmmac: cleanup documenation, make it match reality

They are rather trivial and independent; I'm not sure what the cover
letter would say, besides "simple fixes".

Then I can re-do the reset on top of that...


Which tree do you want patches against?

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ?


I think that bug fixing should be on top of net.git, but I'll let Miller
decide.


Hmm. It is "only" a performance problem (40msec delays).. I guess
-next is better target.


ok, maybe if you resend these with a cover letter I can try to
contribute to the review (in case I have missed some detail).

Best Regards
Peppe



Best regards,
Pavel





Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.

2016-12-02 Thread Giuseppe CAVALLARO

Hi Alex

On 12/2/2016 3:26 PM, Alexandre Torgue wrote:

4.4 has no GMAC4 support.
Alex, do you remember any patches to fix that?


No sorry Peppe.

Pavel,

Sorry, but I'm a little bit confused. I was dropped into some mails without
the history. I see a cleanup, a coalescing issue and a TSO question.
What is your main issue? Are you working on gmac4 or 3.x?
Can you refresh the story a little bit, please?


let me try to sum up; please Pavel, feel free to correct me.

There are some open points about the tx mitigation scheme
that we are trying to detail and eventually tune or change
(while keeping the same performance on other use-cases).

In particular, the test case that is raising the problem is
a unicast tx benchmark.
I suggested Pavel tune the coalescing (IC bit settings) via
ethtool and monitor the stats, but he is getting problems (maybe
due to locking).

IIUC the problems are mainly on newer kernels and not on 4.4, where
gmac4 support is missing. Please Pavel, could you confirm?

Then, there are some open points about lock protection
for xstats, and Pavel is getting some problems on SMP.
I do think that we need to review that. This could also
improve the code in critical parts.

Also, there is another discussion about the lock
protection on NAPI still ongoing. It is not clear to me
whether Pavel is seeing strange behavior in this case.

Regards
Peppe


Re: [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller

2016-12-02 Thread David Miller
From: Daniel Borkmann 
Date: Wed, 30 Nov 2016 22:16:06 +0100

> After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
> need to hold rcu_read_lock() already to make sure BPF program doesn't
> get released in the background.
> 
> Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
> Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
> in XDP supported drivers and to keep the typecheck on the context intact.
> For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
> just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
> out of the helper. When the driver gets atomic replace support, this will
> move to call-sites eventually.
> 
> mlx5 needs actual fixing as it has the same issue as described already in
> 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
> that is, we're under RCU bh at this time, BPF programs are released via
> call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
> read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
> reset.
> 
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 
> ---
>  ( Also here net-next is just fine, imho. )

Applied, thanks.


[PATCH] pull request for net: batman-adv 2016-12-02

2016-12-02 Thread Simon Wunderlich
Hi David,

here is another bugfix which we would like to see integrated into net,
if this is possible now.

Please pull or let me know of any problem!

Thank you,
  Simon

The following changes since commit e13258f38e927b61cdb5f4ad25309450d3b127d1:

  batman-adv: Detect missing primaryif during tp_send as error (2016-11-04 
12:27:39 +0100)

are available in the git repository at:

  git://git.open-mesh.org/linux-merge.git tags/batadv-net-for-davem-20161202

for you to fetch changes up to c2d0f48a13e53b4747704c9e692f5e765e52041a:

  batman-adv: Check for alloc errors when preparing TT local data (2016-12-02 
10:46:59 +0100)


Here is another batman-adv bugfix:

 - fix checking for failed allocation of TVLV blocks in TT local data,
   by Sven Eckelmann


Sven Eckelmann (1):
  batman-adv: Check for alloc errors when preparing TT local data

 net/batman-adv/translation-table.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


[PATCH net-next] net_sched: gen_estimator: account for timer drifts

2016-12-02 Thread Eric Dumazet
From: Eric Dumazet 

Under heavy stress, timer used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.

Lets remember what was the last scheduled jiffies so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.

Signed-off-by: Eric Dumazet 
---
 net/core/gen_estimator.c |   25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index cad8e791f28e..0993844faeea 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -78,8 +78,7 @@
 
 #define EST_MAX_INTERVAL   5
 
-struct gen_estimator
-{
+struct gen_estimator {
struct list_headlist;
struct gnet_stats_basic_packed  *bstats;
struct gnet_stats_rate_est64*rate_est;
@@ -96,8 +95,8 @@ struct gen_estimator
struct rcu_head head;
 };
 
-struct gen_estimator_head
-{
+struct gen_estimator_head {
+   unsigned long   next_jiffies;
struct timer_list   timer;
struct list_headlist;
 };
@@ -146,8 +145,15 @@ static void est_timer(unsigned long arg)
spin_unlock(e->stats_lock);
}
 
-   if (!list_empty(&elist[idx].list))
-   mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
+   if (!list_empty(&elist[idx].list)) {
+   elist[idx].next_jiffies += ((HZ/4) << idx);
+
+   if (unlikely(time_after_eq(jiffies, elist[idx].next_jiffies))) {
+   /* Ouch... timer was delayed. */
+   elist[idx].next_jiffies = jiffies + 1;
+   }
+   mod_timer(&elist[idx].timer, elist[idx].next_jiffies);
+   }
rcu_read_unlock();
 }
 
@@ -251,9 +257,10 @@ int gen_new_estimator(struct gnet_stats_basic_packed 
*bstats,
setup_timer(&elist[idx].timer, est_timer, idx);
}
 
-   if (list_empty(&elist[idx].list))
-   mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
-
+   if (list_empty(&elist[idx].list)) {
+   elist[idx].next_jiffies = jiffies + ((HZ/4) << idx);
+   mod_timer(&elist[idx].timer, elist[idx].next_jiffies);
+   }
list_add_rcu(&est->list, &elist[idx].list);
gen_add_node(est);
spin_unlock_bh(&est_tree_lock);
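
The drift-compensation idea in the patch, shown as a tiny standalone program:
keep the intended absolute deadline, advance it by a fixed period, and only
re-anchor just past "now" when the timer has fallen behind. This is a toy
model, not kernel code:

#include <stdio.h>

/* Advance an absolute deadline by one period; if the caller is already
 * past it (the timer was delayed), re-anchor just after "now" instead of
 * letting the schedule drift permanently. */
static unsigned long next_deadline(unsigned long now, unsigned long deadline,
                                   unsigned long period)
{
    deadline += period;
    if (now >= deadline)        /* analogous to time_after_eq(jiffies, ...) */
        deadline = now + 1;
    return deadline;
}

int main(void)
{
    unsigned long period = 64;          /* e.g. (HZ/4) << idx */
    unsigned long deadline = 1000;

    /* Timer fired a little late (1003): the schedule keeps its cadence. */
    deadline = next_deadline(1003, deadline, period);
    printf("slightly late: %lu\n", deadline);  /* 1064, not 1067 */

    /* Timer fired very late (1200): re-anchor instead of firing in the past. */
    deadline = next_deadline(1200, deadline, period);
    printf("badly delayed: %lu\n", deadline);  /* 1201 */
    return 0;
}
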




[PATCH] batman-adv: Check for alloc errors when preparing TT local data

2016-12-02 Thread Simon Wunderlich
From: Sven Eckelmann 

batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the
new TVLV block. The caller is informed about this problem with the returned
length of 0. Not checking this value results in an invalid memory access
when either tt_data or tt_change is accessed.

Reported-by: Dan Carpenter 
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann 
Signed-off-by: Simon Wunderlich 
---
 net/batman-adv/translation-table.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/batman-adv/translation-table.c 
b/net/batman-adv/translation-table.c
index 7f66309..0dc85eb 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -3282,7 +3282,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv 
*bat_priv,
 &tvlv_tt_data,
 &tt_change,
 &tt_len);
-   if (!tt_len)
+   if (!tt_len || !tvlv_len)
goto unlock;
 
/* Copy the last orig_node's OGM buffer */
@@ -3300,7 +3300,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv 
*bat_priv,
 &tvlv_tt_data,
 &tt_change,
 &tt_len);
-   if (!tt_len)
+   if (!tt_len || !tvlv_len)
goto out;
 
/* fill the rest of the tvlv with the real TT entries */
-- 
2.10.2



Re: [PATCH 1/1] ax25: Fix segfault when receiving an iframe with net2kiss loaded

2016-12-02 Thread David Miller
From: Basil Gunn 
Date: Wed, 30 Nov 2016 11:15:25 -0800

> AX.25 uses sock_queue_rcv_skb() to queue an iframe received packet.
> This routine writes NULL to the socket buffer device structure
> pointer. The socket buffer is subsequently serviced by
> __netif_receiv_skb_core() which dereferences the device structure
> pointer & segfaults.
> 
> The fix puts the ax25 device structure pointer back in the socket
> buffer struct after sock_queue_rcv_skb() is called.

You cannot do this.

We have no reference count held on the device object, therefore once
access to this SKB leaves the receive packet processing path and thus
the RCU protected region, we must set skb->dev to NULL.

Queueing the SKB to the socket causes the SKB to leave the receive
processing path.

You will have to find another way to propagate this device pointer
to the code that needs it.


[PATCH net-next] udp: be less conservative with sock rmem accounting

2016-12-02 Thread Paolo Abeni
Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper, this causes a performance regression
for some (small) values of rcvbuf.

This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.

Reported-by: Jesper Dangaard Brouer 
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..16d88ba 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1205,14 +1205,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct 
sk_buff *skb)
 * queue is full; always allow at least a packet
 */
rmem = atomic_read(&sk->sk_rmem_alloc);
-   if (rmem && (rmem + size > sk->sk_rcvbuf))
+   if (rmem > sk->sk_rcvbuf)
goto drop;
 
/* we drop only if the receive buf is full and the receive
 * queue contains some other skb
 */
rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-   if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+   if (rmem > (size + sk->sk_rcvbuf))
goto uncharge_drop;
 
spin_lock(&list->lock);
-- 
1.8.3.1
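
The relaxed accounting rule being restored can be pictured with a small
standalone sketch: refuse early only when the accounted memory alone already
exceeds the limit, and after charging allow an overshoot of at most one
packet's size. Names and numbers are illustrative only:

#include <stdio.h>
#include <stdbool.h>

struct rx_acct {
    long rmem_alloc;   /* stand-in for sk_rmem_alloc */
    long rcvbuf;       /* stand-in for sk_rcvbuf */
};

/* Returns true if a packet of 'size' bytes may be queued. */
static bool try_charge(struct rx_acct *a, long size)
{
    /* Cheap check without charging: only refuse outright when the buffer
     * is already over its limit. */
    if (a->rmem_alloc > a->rcvbuf)
        return false;

    /* Charge, then allow the total to exceed rcvbuf by at most one
     * packet's worth -- the "less conservative" rule from the patch. */
    a->rmem_alloc += size;
    if (a->rmem_alloc > a->rcvbuf + size) {
        a->rmem_alloc -= size;   /* uncharge and drop */
        return false;
    }
    return true;
}

int main(void)
{
    struct rx_acct a = { .rmem_alloc = 0, .rcvbuf = 4096 };
    long pkt = 1500;
    int queued = 0;

    while (try_charge(&a, pkt))
        queued++;
    printf("queued=%d rmem=%ld\n", queued, a.rmem_alloc); /* queued=3 rmem=4500 */
    return 0;
}
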



Re: arp_filter and IPv6 ND

2016-12-02 Thread Hannes Frederic Sowa
Hello,

On 02.12.2016 16:42, Saku Ytti wrote:
> On 2 December 2016 at 16:08, Hannes Frederic Sowa
>  wrote:
> 
> Hey,
> 
>> May I ask why you want to turn it off?
> 
> Certainly. I don't want device to answer with link address for L3
> address it does not have on the link. In my case it triggers this bug
> https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop

Okay, that should not happen.

Redirects and neighbor advertisements are the only way you can
announce prefixes on-link. Unfortunately, historically we automatically
add device routes for prefixes, too. We can't change this anymore, but
this is wrong.

> In this particular case, for one reason or another my Cisco device
> would have an ND entry for the Linux loopback pointing to an interface with
> a completely different network.  That itself would be just weird, but
> combined with the weird behaviour of Cisco it actually causes the loopback
> route advertised by BGP not to be installed. If the ND entry didn't
> exist, the BGP route would be installed.

H... A loopback route advertised by BGP? Do you use a filter to get rid
of that on your AS border? So you probably don't use an IGP? Do you use
the next-hop-self attribute on your neighbor in that direction? BGP in
general doesn't lead to ND entry installs; protocols like IS-IS afair
can short-circuit here.

Hmm, I would keep the Loopback announcements out of the BGP.

> I don't really even know why the ND entry exists; all I can think of
> is that Linux must have sent a gratuitous reply, because I don't see why
> Cisco would have tried to discover it.
>
> Expected behaviour is that the loopback/128 BGP route resolves to
> on-link next-hop, and on-link next hop is then ND'd. Observed
> behaviour is that loopback/128 BGP route also appears in ND cache.

Yep, exactly.

>> In IPv6 this depends on the scope. In IPv4 this concept doesn't really
>> exist.
>>
>> Please notice that in IPv4 arp_filter does not necessarily mean that the
>> system is operating in strong end system mode, but you end up in a
>> hybrid mode where arp is acting strong but routing is not, and thus you
>> also have to add fib rules to simulate that.
> 
> It's just very peculiar behaviour to have ARP or ND entries on an
> interface where the given subnet does not exist; it causes
> difficult-to-troubleshoot problems and is surprising/unexpected
> behaviour.

For enterprise and cloud stuff it is certainly very surprising, as some
isolations don't work as expected. OTOH it is really easy to build up
home networks and things are more plug and play.

> Of course a well-behaving device wouldn't accept such replies, because
> it could itself be an attack vector (imagine me telling you 8.8.8.8 is on
> the link, or worse, your bank).
> 
> I'm curious, why does this behaviour exist? When is this desirable?
> I've never seen any device other than Linux behave like this, and
> whenever I've heard about the problem, I've only seen surprised faces that
> it does behave like this.

I don't feel comfortable to answer that, just some thoughts...

Some RFCs require that for some router implementations (CPE), on the
other hand weak end model in Linux was probably inherited by IPv4. The
addition of duplicate address detection (which of course only makes
sense in strong end systems) to IPv6, basically shows that IPv6 is more
or less designed to be a strong end system model.

Anyway, a patch to suppress ndisc requests on those interfaces will
probably be accepted.

For unicast reverse-path filtering, e.g., there is actually no sysctl available
anymore; instead you are supposed to install a netfilter rule, which
automatically takes care of this.

Bye,
Hannes



Re: [PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing

2016-12-02 Thread Grygorii Strashko


On 12/02/2016 05:03 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:27PM -0600, Grygorii Strashko wrote:
>> The currently processing cpdma descriptor with EOQ flag set may
>> contain two values in Next Descriptor Pointer field:
>> - valid pointer: means CPDMA missed addition of new desc in queue;
> It shouldn't happen in normal circumstances, right?

it might happen, because desc push competes with desc pop.
You can check stats values:
chan->stats.misqueued
chan->stats.requeue
 under different types of net-loads.

TRM:
"
If the pNext pointer is initially NULL, and more packets need to be queued for 
transmit, the software
application may alter this pointer to point to a newly appended descriptor. The 
EMAC will use the new
pointer value and proceed to the next descriptor unless the pNext value has 
already been read. In this
latter case, the transmitter will halt on the transmit channel in question, and 
the software application may
restart it at that time. The software can detect this case by checking for an 
end of queue (EOQ) condition
flag on the updated packet descriptor when it is returned by the EMAC.
"


> So, why does it happen only for egress channels? And does that mean
> there is some resynchronization between the submit and process functions,
> or is this a h/w issue?

No hw issues. This patch just removes one unnecessary I/O access.

> 
>> - null: no more descriptors in queue.
>> In the later case, it's not required to write to HDP register, but now
>> CPDMA does it.
>>
>> Hence, add additional check for Next Descriptor Pointer != null in
>> cpdma_chan_process() function before writing in HDP register.
>>
>> Signed-off-by: Grygorii Strashko 
>> ---
>>  drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c 
>> b/drivers/net/ethernet/ti/davinci_cpdma.c
>> index 0924014..379314f 100644
>> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
>> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
>> @@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan 
>> *chan)
>>  chan->count--;
>>  chan->stats.good_dequeue++;
>>  
>> -if (status & CPDMA_DESC_EOQ) {
>> +if ((status & CPDMA_DESC_EOQ) && chan->head) {
>>  chan->stats.requeue++;
>>  chan_write(chan, hdp, desc_phys(pool, chan->head));
>>  }
>> -- 
>> 2.10.1
>>

-- 
regards,
-grygorii


Re: [PATCH net-next v4 0/4] bpf: BPF for lightweight tunnel encapsulation

2016-12-02 Thread David Miller
From: Thomas Graf 
Date: Wed, 30 Nov 2016 17:10:07 +0100

> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure.

Nice work, applied, thanks Thomas.



Re: [PATCH net-next] net/mlx5e: skip loopback selftest with !CONFIG_INET

2016-12-02 Thread David Miller
From: Arnd Bergmann 
Date: Wed, 30 Nov 2016 22:05:39 +0100

> When CONFIG_INET is disabled, the new selftest results in a link
> error:
> 
> drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function 
> `mlx5e_test_loopback':
> en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to 
> `ip_send_check'
> en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to 
> `udp4_hwcsum'
> 
> This hides the specific test in that configuration.
> 
> Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
> Signed-off-by: Arnd Bergmann 

Applied, thanks Arnd.


Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error

2016-12-02 Thread Keller, Jacob E
On Fri, 2016-12-02 at 10:55 -0200, Guilherme G. Piccoli wrote:
> On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote:
> > Whenever the igb driver detects the result of a read operation
> > returns
> > a value composed only by F's (like 0x), it will detach the
> > net_device, clear the hw_addr pointer and warn to the user that
> > adapter's
> > link is lost - those steps happen on igb_rd32().
> > 
> > In case a PCI error happens on Power architecture, there's a
> > recovery
> > mechanism called EEH, that will reset the PCI slot and call
> > driver's
> > handlers to reset the adapter and network functionality as well.
> > 
> > We observed that once hw_addr is NULL after the error is detected
> > on
> > igb_rd32(), it's never assigned back, so in the process of
> > resetting
> > the network functionality we got a NULL pointer dereference in both
> > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to
> > avoid
> > such bug, this patch re-assigns the hw_addr value in the slot_reset
> > handler.
> > 
> > Reported-by: Anthony H. Thai 
> > Reported-by: Harsha Thyagaraja 
> > Signed-off-by: Guilherme G. Piccoli 
> > ---
> >  drivers/net/ethernet/intel/igb/igb_main.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index edc9a6a..136ee9e 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -7878,6 +7878,11 @@ static pci_ers_result_t
> > igb_io_slot_reset(struct pci_dev *pdev)
> >     pci_enable_wake(pdev, PCI_D3hot, 0);
> >     pci_enable_wake(pdev, PCI_D3cold, 0);
> > 
> > +   /* In case of PCI error, adapter lose its HW
> > address
> > +    * so we should re-assign it here.
> > +    */
> > +   hw->hw_addr = adapter->io_addr;
> > +
> >     igb_reset(adapter);
> >     wr32(E1000_WUS, ~0);
> >     result = PCI_ERS_RESULT_RECOVERED;
> > 
> 
> Ping?
> 
> Sorry to annoy, any news on this?
> Thanks in advance!
> 
> Cheers,
> 
> 
> Guilherme
> 

This seems reasonable. It's similar to what the fm10k driver does under
this circumstance.

Thanks,
Jake

Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Tom Herbert
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
 wrote:
> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>> On Thu, 1 Dec 2016 13:51:32 -0800
>> Tom Herbert  wrote:
>>
> The technical plenary at last IETF on Seoul a couple of weeks ago was
> exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>
>> slide 14
>>
> numbers he gave were based on iptables+BPF and that was a whole
> 1.2Mpps -- somehow that seems ridiculously low to me (I said so at the mic
> and that's also when I introduced XDP to whole IETF :-) ). If that's
> the best we can do the Internet is in a world hurt. DDOS mitigation
> alone is probably a sufficient motivation to look at XDP. We need
> something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!
>>
>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>> ballpark.
>>
 I totally understand that. But in my reply to David in this thread I
 mentioned DNS apex processing as being problematic which is actually
 being referred in your linked slide deck on page 9 ("What do floods look
 like") and the problematic of parsing DNS packets in XDP due to string
 processing and looping inside eBPF.
>>
>> That is a weak argument. You do realize CloudFlare actually uses eBPF to
>> do this exact filtering, and (so far) eBPF for parsing DNS has been
>> sufficient for them.
>
> You are talking about this code on the following slides (I actually
> transcribed it for you here and disassembled):
>
> l0: ld #0x14
> l1: ldxb 4*([0]&0xf)
> l2: add x
> l3: tax
> l4: ld [x+0]
> l5: jeq #0x7657861, l6, l13
> l6: ld [x+4]
> l7: jeq #0x6d706c65, l8, l13
> l8: ld [x+8]
> l9: jeq #0x3636f6d, l10, l13
> l10:ldb [x+12]
> l11:jeq #0, l12, l13
> l12:ret #0x1
> l13:ret #0
>
> You can offload this to u32 in hardware if that is what you want.
>
> The reason this works is because of netfilter, which allows them to
> dynamically generate BPF programs and insert and delete them from
> chains, do intersection or unions of them.
>
> If you have a freestanding program like in XDP the complexity space is a
> different one and not comparable to this at all.
>
I don't understand this comment about complexity especially in regards
to the idea of offloading u32 to hardware. Relying on hardware to do
anything always leads to more complexity than an equivalent SW
implementation for the same functionality. The only reason we ever use
a hardware mechanism is if it gives *significantly* better
performance. If the performance difference isn't there then doing
things in SW is going to be the better path (as we see in XDP).

Tom

> Bye,
> Hannes
>


Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN

2016-12-02 Thread Baicar, Tyler

Hello Sasha,

Were you able to reproduce this issue?

Do you have a patch fixing the close function inconsistencies that you 
mentioned which I could try out?


Thanks,
Tyler

On 11/21/2016 1:40 PM, Baicar, Tyler wrote:

On 11/17/2016 6:31 AM, Neftin, Sasha wrote:

On 11/13/2016 10:34 AM, Neftin, Sasha wrote:

On 11/11/2016 12:35 AM, Baicar, Tyler wrote:

Hello Sasha,

On 11/9/2016 11:19 PM, Neftin, Sasha wrote:

On 11/9/2016 11:41 PM, Tyler Baicar wrote:

Move IRQ free code so that it will happen regardless of the
__E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
if the __E1000_DOWN bit is cleared. This is not sufficient because
it is possible for __E1000_DOWN to be set without releasing the IRQ.
In such a situation, we will hit a kernel bug later in e1000_remove
because the IRQ still has action since it was never freed. A
secondary bus reset can cause this case to happen.

Signed-off-by: Tyler Baicar 
---
   drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 7017281..36cfcb0 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
 if (!test_bit(__E1000_DOWN, &adapter->state)) {
   e1000e_down(adapter, true);
-e1000_free_irq(adapter);
 /* Link status message must follow this format */
   pr_info("%s NIC Link is Down\n", adapter->netdev->name);
   }
   +e1000_free_irq(adapter);
+
   napi_disable(&adapter->napi);
 e1000e_free_tx_resources(adapter->tx_ring);


I would not recommend inserting this change. This change is related to the
driver state machine; we are afraid of a lot of synchronization problems and
issues.
We need to keep e1000_free_irq in the loop and check for 'test_bit' ready.
What do you mean here? There is no loop. If __E1000_DOWN is set 
then we

will never free the IRQ.

Another point: does your SW properly back up the PCIe configuration
space before executing the secondary bus reset?
After a secondary bus reset, the link needs to recover and go back 
to a

working state after 1 second.

 From the callstack, the issue is happening while removing the 
endpoint

from the system, before applying the secondary bus reset.

The order of events is
1. remove the drivers
2. cause a secondary bus reset
3. wait 1 second

Actually, this is too much; usually the link is up in less than 100ms. You can
check the Data Link Layer indication.

4. recover the link

callstack:
free_msi_irqs+0x6c/0x1a8
pci_disable_msi+0xb0/0x148
e1000e_reset_interrupt_capability+0x60/0x78
e1000_remove+0xc8/0x180
pci_device_remove+0x48/0x118
__device_release_driver+0x80/0x108
device_release_driver+0x2c/0x40
pci_stop_bus_device+0xa0/0xb0
pci_stop_bus_device+0x3c/0xb0
pci_stop_root_bus+0x54/0x80
acpi_pci_root_remove+0x28/0x64
acpi_bus_trim+0x6c/0xa4
acpi_device_hotplug+0x19c/0x3f4
acpi_hotplug_work_fn+0x28/0x3c
process_one_work+0x150/0x460
worker_thread+0x50/0x4b8
kthread+0xd4/0xe8
ret_from_fork+0x10/0x50

Thanks,
Tyler


Hello Tyler,
Okay, we need to consult more about this suggestion.
May I ask what setup you run? Is it a NIC or on-board LAN? I would
like to try to reproduce this issue in our lab too.
Also, is the same issue observed with the same scenario on other NICs too?
Sasha
___
Intel-wired-lan mailing list
intel-wired-...@lists.osuosl.org
http://lists.osuosl.org/mailman/listinfo/intel-wired-lan


Hello Tyler,
I see some inconsistent implementations of the __*_close methods in our
drivers. Do you have any igb NIC to check if the same problem persists there?
Thanks,
Sasha

Hello Sasha,

I couldn't find an igb NIC to test with, but I did find another e1000e 
card that does not cause the same issue. That card is:


0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit 
Network Connection

Subsystem: Intel Corporation Gigabit CT Desktop Adapter
Physical Slot: 5-1
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
SERR- 
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 299
Region 0: Memory at c01001c (32-bit, non-prefetchable) 
[size=128K]
Region 1: Memory at c010010 (32-bit, non-prefetchable) 
[size=512K]

Region 2: I/O ports at 1000 [size=32]
Region 3: Memory at c01001e (32-bit, non-prefetchable) 
[size=16K]

Expansion ROM at c010018 [disabled] [size=256K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)

Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit

Re: [PATCH] net: atarilance: use %8ph for printing hex string

2016-12-02 Thread David Miller
From: Rasmus Villemoes 
Date: Wed, 30 Nov 2016 23:02:54 +0100

> This is already using the %pM printf extension; might as well also use
> %ph to make the code smaller.
> 
> Signed-off-by: Rasmus Villemoes 

Applied, thank you.


Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath

2016-12-02 Thread Oliver Hartkopp



On 12/02/2016 04:42 PM, Marc Kleine-Budde wrote:

On 12/02/2016 04:11 PM, Oliver Hartkopp wrote:



On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote:

On 12/02/2016 01:43 PM, Andrey Konovalov wrote:




 [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506


We should add a check for a sensible optlen


static int raw_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
{
struct sock *sk = sock->sk;
struct raw_sock *ro = raw_sk(sk);
struct can_filter *filter = NULL;  /* dyn. alloc'ed filters */
struct can_filter sfilter; /* single filter */
struct net_device *dev = NULL;
can_err_mask_t err_mask = 0;
int count = 0;
int err = 0;

if (level != SOL_CAN_RAW)
return -EINVAL;

switch (optname) {

case CAN_RAW_FILTER:
if (optlen % sizeof(struct can_filter) != 0)
return -EINVAL;


here...

if (optlen > 64 * sizeof(struct can_filter))
return -EINVAL;



Agreed.

But what is sensible here?
64 filters is way too small IMO.

When thinking about picking a bunch of single CAN IDs I would tend to
something like 512 filters.


Ok - 64 was just an arbitrarily chosen value for demonstration purposes :)



:-)

Would you like to send a patch?

Regards,
Oliver
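
A hedged sketch of what such a bound check could look like, using Oliver's
512 single-ID filters as the limit (the exact constant is still open in this
thread, and the struct definition below is a local stand-in for the one in
<linux/can.h>):

#include <stdio.h>
#include <errno.h>

/* Local stand-ins; the real definitions live in <linux/can.h>. */
typedef unsigned int canid_t;
struct can_filter { canid_t can_id; canid_t can_mask; };

#define RAW_FILTER_MAX 512   /* illustrative upper bound from the thread */

/* Validate the optlen passed with a CAN_RAW_FILTER setsockopt call. */
static int check_filter_optlen(unsigned int optlen)
{
    if (optlen % sizeof(struct can_filter) != 0)
        return -EINVAL;                       /* not a whole number of filters */
    if (optlen > RAW_FILTER_MAX * sizeof(struct can_filter))
        return -EINVAL;                       /* refuse huge allocations */
    return 0;
}

int main(void)
{
    printf("%d\n", check_filter_optlen(16 * sizeof(struct can_filter)));     /*  0  */
    printf("%d\n", check_filter_optlen(100000 * sizeof(struct can_filter))); /* -22 */
    return 0;
}
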



Re: [PATCH/RFC iproute2/net-next 2/3] tc: flower: introduce enum flower_endpoint

2016-12-02 Thread Jiri Pirko
Fri, Dec 02, 2016 at 10:59:44AM CET, simon.hor...@netronome.com wrote:
>Introduce enum flower_endpoint and use it instead of a bool
>as the type for parametrising source and destination.
>
>This is intended to improve readability and provide some type
>checking of endpoint parameters.
>
>Signed-off-by: Simon Horman 
>---
> tc/f_flower.c | 22 ++
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 615e8f27bed2..42253067b43d 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -23,6 +23,11 @@
> #include "tc_util.h"
> #include "rt_names.h"
> 
>+enum flower_endpoint {
>+  flower_src,
>+  flower_dst

FLOWER_ENDPOINT_SRC,
FLOWER_ENDPOINT_DST,

>+};
>+
> static void explain(void)
> {
>   fprintf(stderr,
>@@ -160,29 +165,30 @@ static int flower_parse_ip_addr(char *str, __be16 
>eth_type,
>   return 0;
> }
> 
>-static int flower_port_attr_type(__u8 ip_proto, bool is_src)
>+static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint)
> {
>   if (ip_proto == IPPROTO_TCP)
>-  return is_src ? TCA_FLOWER_KEY_TCP_SRC :
>+  return endpoint == flower_src ? TCA_FLOWER_KEY_TCP_SRC :
>   TCA_FLOWER_KEY_TCP_DST;
>   else if (ip_proto == IPPROTO_UDP)
>-  return is_src ? TCA_FLOWER_KEY_UDP_SRC :
>+  return endpoint == flower_src ? TCA_FLOWER_KEY_UDP_SRC :
>   TCA_FLOWER_KEY_UDP_DST;
>   else if (ip_proto == IPPROTO_SCTP)
>-  return is_src ? TCA_FLOWER_KEY_SCTP_SRC :
>+  return endpoint == flower_src ? TCA_FLOWER_KEY_SCTP_SRC :
>   TCA_FLOWER_KEY_SCTP_DST;
>   else
>   return -1;
> }
> 
>-static int flower_parse_port(char *str, __u8 ip_proto, bool is_src,
>+static int flower_parse_port(char *str, __u8 ip_proto,
>+   enum flower_endpoint endpoint,
>struct nlmsghdr *n)
> {
>   int ret;
>   int type;
>   __be16 port;
> 
>-  type = flower_port_attr_type(ip_proto, is_src);
>+  type = flower_port_attr_type(ip_proto, endpoint);
>   if (type < 0)
>   return -1;
> 
>@@ -340,14 +346,14 @@ static int flower_parse_opt(struct filter_util *qu, char 
>*handle,
>   }
>   } else if (matches(*argv, "dst_port") == 0) {
>   NEXT_ARG();
>-  ret = flower_parse_port(*argv, ip_proto, false, n);
>+  ret = flower_parse_port(*argv, ip_proto, flower_dst, n);
>   if (ret < 0) {
>   fprintf(stderr, "Illegal \"dst_port\"\n");
>   return -1;
>   }
>   } else if (matches(*argv, "src_port") == 0) {
>   NEXT_ARG();
>-  ret = flower_parse_port(*argv, ip_proto, true, n);
>+  ret = flower_parse_port(*argv, ip_proto, flower_src, n);
>   if (ret < 0) {
>   fprintf(stderr, "Illegal \"src_port\"\n");
>   return -1;
>-- 
>2.7.0.rc3.207.g0ac5344
>
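
Roughly what the naming nitpick amounts to, shown as a compilable toy:
uppercase enumerators plus the attribute-selection helper keyed on them. The
attribute values are placeholders, not the real TCA_FLOWER_KEY_* numbers:

#include <stdio.h>

/* Stand-ins for the TCA_FLOWER_KEY_* attribute values. */
enum { KEY_TCP_SRC = 1, KEY_TCP_DST, KEY_UDP_SRC, KEY_UDP_DST };
enum { PROTO_TCP = 6, PROTO_UDP = 17 };

enum flower_endpoint {
    FLOWER_ENDPOINT_SRC,
    FLOWER_ENDPOINT_DST,
};

/* Pick the netlink attribute for a given L4 protocol and endpoint. */
static int flower_port_attr_type(unsigned char ip_proto,
                                 enum flower_endpoint endpoint)
{
    if (ip_proto == PROTO_TCP)
        return endpoint == FLOWER_ENDPOINT_SRC ? KEY_TCP_SRC : KEY_TCP_DST;
    if (ip_proto == PROTO_UDP)
        return endpoint == FLOWER_ENDPOINT_SRC ? KEY_UDP_SRC : KEY_UDP_DST;
    return -1;   /* no port attribute for this protocol */
}

int main(void)
{
    printf("%d\n", flower_port_attr_type(PROTO_UDP, FLOWER_ENDPOINT_DST)); /* 4 */
    return 0;
}
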


Re: [PATCH/RFC iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code

2016-12-02 Thread Jiri Pirko
Fri, Dec 02, 2016 at 10:59:45AM CET, simon.hor...@netronome.com wrote:
>Support matching on ICMP type and code.
>
>Example usage:
>
>tc qdisc add dev eth0 ingress
>
>tc filter add dev eth0 protocol ip parent : flower \
>   indev eth0 ip_proto icmp type 8 code 0 action drop
>
>tc filter add dev eth0 protocol ipv6 parent : flower \
>   indev eth0 ip_proto icmpv6 type 128 code 0 action drop
>
>Signed-off-by: Simon Horman 
>---
> man/man8/tc-flower.8 | 20 ---
> tc/f_flower.c| 96 
> 2 files changed, 105 insertions(+), 11 deletions(-)
>
>diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>index a401293fed50..c01ace6249dd 100644
>--- a/man/man8/tc-flower.8
>+++ b/man/man8/tc-flower.8
>@@ -29,7 +29,7 @@ flower \- flow based traffic control filter
> .IR PRIORITY " | "
> .BR vlan_eth_type " { " ipv4 " | " ipv6 " | "
> .IR ETH_TYPE " } | "
>-.BR ip_proto " { " tcp " | " udp " | " sctp " | "
>+.BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
> .IR IP_PROTO " } | { "
> .BR dst_ip " | " src_ip " } { "
> .IR ipv4_address " | " ipv6_address " } | { "
>@@ -94,7 +94,7 @@ or an unsigned 16bit value in hexadecimal format.
> Match on layer four protocol.
> .I IP_PROTO
> may be
>-.BR tcp ", " udp ", " sctp
>+.BR tcp ", " udp ", " sctp ", " icmp ", " icmpv6
> or an unsigned 8bit value in hexadecimal format.
> .TP
> .BI dst_ip " ADDRESS"
>@@ -112,6 +112,13 @@ option of tc filter.
> Match on layer 4 protocol source or destination port number. Only available 
> for
> .BR ip_proto " values " udp ", " tcp  " and " sctp
> which have to be specified in beforehand.
>+.TP
>+.BI type " NUMBER"
>+.TQ
>+.BI code " NUMBER"
>+Match on ICMP type or code. Only available for
>+.BR ip_proto " values " icmp  " and " icmpv6
>+which have to be specified in beforehand.
> .SH NOTES
> As stated above where applicable, matches of a certain layer implicitly depend
> on the matches of the next lower layer. Precisely, layer one and two matches
>@@ -120,13 +127,16 @@ have no dependency, layer three matches
> (\fBip_proto\fR, \fBdst_ip\fR and \fBsrc_ip\fR)
> depend on the
> .B protocol
>-option of tc filter
>-and finally layer four matches
>+option of tc filter, layer four port matches
> (\fBdst_port\fR and \fBsrc_port\fR)
> depend on
> .B ip_proto
> being set to
>-.BR tcp ", " udp " or " sctp.
>+.BR tcp ", " udp " or " sctp,
>+and finally ICMP matches (\fBcode\fR and \fBtype\fR) depend on
>+.B ip_proto
>+being set to
>+.BR icmp " or " icmpv6.
> .P
> There can be only used one mask per one prio. If user needs to specify 
> different
> mask, he has to use different prio.
>diff --git a/tc/f_flower.c b/tc/f_flower.c
>index 42253067b43d..59f6f1ea26e6 100644
>--- a/tc/f_flower.c
>+++ b/tc/f_flower.c
>@@ -28,6 +28,11 @@ enum flower_endpoint {
>   flower_dst
> };
> 
>+enum flower_icmp_field {
>+  flower_icmp_type,
>+  flower_icmp_code

FLOWER_ICMP_FIELD_TYPE,
FLOWER_ICMP_FIELD_CODE,

>+};
>+
> static void explain(void)
> {
>   fprintf(stderr,
>@@ -42,11 +47,13 @@ static void explain(void)
>   "   vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] 
> |\n"
>   "   dst_mac MAC-ADDR |\n"
>   "   src_mac MAC-ADDR |\n"
>-  "   ip_proto [tcp | udp | sctp | IP-PROTO ] 
>|\n"
>+  "   ip_proto [tcp | udp | sctp | icmp | 
>icmpv6 | IP-PROTO ] |\n"
>   "   dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
>   "   src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n"
>   "   dst_port PORT-NUMBER |\n"
>-  "   src_port PORT-NUMBER }\n"
>+  "   src_port PORT-NUMBER |\n"
>+  "   type ICMP-TYPE |\n"
>+  "   code ICMP-CODE }\n"
>   "   FILTERID := X:Y:Z\n"
>   "   ACTION-SPEC := ... look at individual actions\n"
>   "\n"
>@@ -95,16 +102,23 @@ static int flower_parse_ip_proto(char *str, __be16 
>eth_type, int type,
>   int ret;
>   __u8 ip_proto;
> 
>-  if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6)) {
>-  fprintf(stderr, "Illegal \"eth_type\" for ip proto\n");
>-  return -1;
>-  }
>+  if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6))
>+  goto err;
>+
>   if (matches(str, "tcp") == 0) {
>   ip_proto = IPPROTO_TCP;
>   } else if (matches(str, "udp") == 0) {
>   ip_proto = IPPROTO_UDP;
>   } else if (matches(str, "sctp") == 0) {
>   ip_proto = IPPROTO_SCTP;
>+  } else if (matches(str, "icmp") == 0) {
>+  if (eth_type != htons(ETH_P_IP))
>+  goto err;
>+  ip_proto = IPPROTO_ICMP;

Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP

2016-12-02 Thread Jiri Pirko
Fri, Dec 02, 2016 at 10:59:42AM CET, simon.hor...@netronome.com wrote:
>Add support for matching on ICMP type and code to flower. This is modeled
>on existing support for matching on L4 ports.
>
>The second patch provides a minor cleanup which is in keeping with
>the style used in the last patch.
>
>This is marked as an RFC to match the same designation given to the
>corresponding kernel patches.

Looks nice, I only have those 2 enum nitpicks.

Thanks.

>
>Based on iproute2/net-next with the following applied:
>* [[PATCH iproute2/net-next v2] 0/4] tc: flower: SCTP and other port fixes
>
>Simon Horman (3):
>  tc: flower: update headers for TCA_FLOWER_KEY_ICMP*
>  tc: flower: introduce enum flower_endpoint
>  tc: flower: support matching on ICMP type and code
>
> include/linux/pkt_cls.h |  10 
> man/man8/tc-flower.8|  20 ++--
> tc/f_flower.c   | 118 ++--
> 3 files changed, 129 insertions(+), 19 deletions(-)
>
>-- 
>2.7.0.rc3.207.g0ac5344
>


Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP

2016-12-02 Thread Jiri Pirko
Fri, Dec 02, 2016 at 10:52:30AM CET, simon.hor...@netronome.com wrote:
>Hi,
>
>this series add supports for matching on ICMP type and code to cls_flower.
>This is modeled on existing support for matching on L4 ports. The updates
>to the dissector are intended to allow for code and storage re-use.

Looks fine to me. Thanks!


>
>Simon Horman (2):
>  flow dissector: ICMP support
>  net/sched: cls_flower: Support matching on ICMP type and code
>
> drivers/net/bonding/bond_main.c |  6 +++--
> include/linux/skbuff.h  |  5 +
> include/net/flow_dissector.h| 50 ++---
> include/uapi/linux/pkt_cls.h| 10 +
> net/core/flow_dissector.c   | 34 +---
> net/sched/cls_flow.c|  4 ++--
> net/sched/cls_flower.c  | 42 ++
> 7 files changed, 141 insertions(+), 10 deletions(-)
>
>-- 
>2.7.0.rc3.207.g0ac5344
>


Re: [PATCH 2/2] net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler

2016-12-02 Thread David Miller
From: Lino Sanfilippo 
Date: Wed, 30 Nov 2016 23:48:32 +0100

> The driver already uses its private lock for synchronization between xmit
> and xmit completion handler making the additional use of the xmit_lock
> unnecessary.
> Furthermore the driver does not set NETIF_F_LLTX resulting in xmit to be
> called with the xmit_lock held and then taking the private lock while xmit
> completion handler does the reverse, first take the private lock, then the
> xmit_lock.
> Fix these issues by not taking the xmit_lock in the tx completion handler.
> 
> Signed-off-by: Lino Sanfilippo 

Yeah that could be a nasty deadlock, in fact.

Applied, thanks.


Re: Initial thoughts on TXDP

2016-12-02 Thread Tom Herbert
On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree  wrote:
> On 01/12/16 23:46, Tom Herbert wrote:
>> The only time we
>> _really_ need to allocate an skbuff is when we need to put the packet onto a
>> queue. All the other use cases are really just to pass a structure
>> containing a packet from function to function. For that purpose we
>> should be able to just pass a much smaller structure in a stack
>> argument and only allocate an skbuff when we need to enqueue. In cases
>> where we don't ever queue a packet we might never need to allocate any
>> skbuff
> Now this intrigues me, because one of the objections to bundling (vs GRO)
> was the memory usage of all those SKBs.  IIRC we already do a 'GRO-like'
> coalescing when packets reach a TCP socket anyway (or at least in some
> cases, not sure if all the different ways we can enqueue a TCP packet for
> RX do it), but if we could carry the packets from NIC to socket without
> SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
> memory (the packet-pages are all already allocated on the NIC RX ring).
> Possibly combine the two, so that rather than having potentially four
> versions of each function (skb, skbundle, void*, void* bundle) you just
> have the two 'ends'.
>
Yep, seems like a good idea to incorporate bundling into TXDP from the get-go.

Tom

> -Ed
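
A toy standalone illustration of the idea in this thread -- pass packets
between functions as a small stack value (which can also be carried in
lists/bundles) and only pay for a heavier, allocated object when a packet
actually has to sit on a queue. Everything here is invented for the sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Small, stack-passed descriptor: just points at the ring buffer data. */
struct pkt_ref {
    void *data;
    unsigned int len;
};

/* Heavier object, only allocated when the packet really has to sit on a
 * queue (the analogue of allocating an skb at enqueue time). */
struct queued_pkt {
    struct queued_pkt *next;
    unsigned int len;
    unsigned char data[];
};

static struct queued_pkt *enqueue(struct queued_pkt **head, struct pkt_ref p)
{
    struct queued_pkt *q = malloc(sizeof(*q) + p.len);

    if (!q)
        return NULL;
    q->len = p.len;
    memcpy(q->data, p.data, p.len);
    q->next = *head;
    *head = q;
    return q;
}

/* Fast path: consume in place, never allocate. */
static int deliver_inline(struct pkt_ref p)
{
    return p.len;   /* pretend the payload was processed here */
}

int main(void)
{
    char ring_buf[64] = "hello";
    struct pkt_ref p = { ring_buf, 6 };
    struct queued_pkt *q = NULL;

    printf("inline: %d bytes\n", deliver_inline(p));
    enqueue(&q, p);              /* slow path: now pay for an allocation */
    printf("queued: %u bytes\n", q ? q->len : 0);
    free(q);
    return 0;
}
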


Re: [PATCH net] packet: fix race condition in packet_set_ring

2016-12-02 Thread David Miller
From: Eric Dumazet 
Date: Wed, 30 Nov 2016 14:55:36 -0800

> From: Philip Pettersson 
> 
> When packet_set_ring creates a ring buffer it will initialize a
> struct timer_list if the packet version is TPACKET_V3. This value
> can then be raced by a different thread calling setsockopt to
> set the version to TPACKET_V1 before packet_set_ring has finished.
> 
> This leads to a use-after-free on a function pointer in the
> struct timer_list when the socket is closed as the previously
> initialized timer will not be deleted.
> 
> The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
> changing the packet version while also taking the lock at the start
> of packet_set_ring.
> 
> Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
> Signed-off-by: Philip Pettersson 
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks.
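
In miniature, the shape of the fix: the version change and the ring setup
serialize on the same socket-level lock, so the version can no longer flip
while version-dependent state (such as the V3 timer) is being initialised.
A pthread sketch with invented names, not the af_packet code (build with
-pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sk_lock = PTHREAD_MUTEX_INITIALIZER;
static int pkt_version = 1;     /* think TPACKET_V1 */
static int timer_armed;         /* set only for the V3-style ring */

/* Analogue of the version setsockopt: serialised on sk_lock. */
static void set_version(int v)
{
    pthread_mutex_lock(&sk_lock);
    pkt_version = v;
    pthread_mutex_unlock(&sk_lock);
}

/* Analogue of the ring setup: samples the version and initialises the
 * matching state under the same lock, so the two can't interleave. */
static void set_ring(void)
{
    pthread_mutex_lock(&sk_lock);
    if (pkt_version == 3)
        timer_armed = 1;        /* would also be torn down under the lock */
    pthread_mutex_unlock(&sk_lock);
}

int main(void)
{
    set_version(3);
    set_ring();
    printf("version=%d timer_armed=%d\n", pkt_version, timer_armed);
    return 0;
}
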


Re: [PATCH 1/2] net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers

2016-12-02 Thread David Miller
From: Lino Sanfilippo 
Date: Wed, 30 Nov 2016 23:48:31 +0100

> An explicit dma sync for device directly after mapping as well as an
> explicit dma sync for cpu directly before unmapping is unnecessary and
> costly on the hotpath. So remove these calls.
> 
> Signed-off-by: Lino Sanfilippo 

Applied.


Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Jesper Dangaard Brouer

On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:

> In light of DPDKs existence it make a lot more sense to me to provide
> a). a faster mmap based interface (possibly AF_PACKET based) that allows
> to map nic directly into userspace, detaching tx/rx queue from kernel.
> 
> John Fastabend sent something like this last year as a proof of
> concept, iirc it was rejected because register space got exposed directly
> to userspace.  I think we should re-consider merging netmap
> (or something conceptually close to its design).

I'm actually working in this direction, of zero-copy RX mapping packets
into userspace.  This work is mostly related to page_pool, and I only
plan to use XDP as a filter for selecting packets going to userspace,
as this choice need to be taken very early.

My design is here:
 
https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

This is mostly about changing the memory model in the drivers, to allow
for safely mapping pages to userspace.  (An efficient queue mechanism is
not covered).  People often overlook that netmap's efficiency *also* comes
from introducing pre-mapping memory/pages to userspace.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op

2016-12-02 Thread David Miller
From: Vivien Didelot 
Date: Wed, 30 Nov 2016 17:59:27 -0500

> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>   int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>u16 val);
>  
> + /* Switch Software Reset */
> + int (*reset)(struct mv88e6xxx_chip *chip);
> +

I think Andrew's request to name this method "g1_reset" is reasonable, please
respin with that change.

Thanks.


Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size

2016-12-02 Thread Grygorii Strashko


On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote:
> On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote:
>> Add optional property "descs_pool_size" to specify buffer descriptor's
>> pool size. The "descs_pool_size" should define total number of CPDMA
>> CPPI descriptors to be used for both ingress/egress packets
>> processing. If not specified - the default value 256 will be used
>> which will allow to place descriptor's pool into the internal CPPI
>> RAM on most of TI SoC.
>>
>> Signed-off-by: Grygorii Strashko 
>> ---
>>  Documentation/devicetree/bindings/net/cpsw.txt | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt 
>> b/Documentation/devicetree/bindings/net/cpsw.txt
>> index 5ad439f..b99d196 100644
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -35,6 +35,11 @@ Optional properties:
>>For example in dra72x-evm, pcf gpio has to be
>>driven low so that cpsw slave 0 and phy data
>>lines are connected via mux.
>> +- descs_pool_size   : total number of CPDMA CPPI descriptors to be used for
>> +  both ingress/egress packets processing. if not
>> +  specified the default value 256 will be used which
>> +  will allow to place descriptors pool into the
>> +  internal CPPI RAM.
> Does it describe h/w? Why not a module parameter? Or even something like
> ethtool num ring entries?
> 

It can be a module parameter too. In general this is expected to be
a one-time boot setting only.

- OR
So, do you propose to use 
   ethtool -g ethX

   ethtool -G ethX [rx N] [tx N]
?

Now cpdma has one pool for all RX/TX channels, so changing these settings
by ethtool will require: pause interfaces, reallocate the cpdma pool,
re-arrange buffers between channels, resume interfaces. Correct?

What do you think - can we move forward with one pool, or is it better to have
two (Rx and Tx)?

Wouldn't it be reasonable to still have a DT (or module) parameter to avoid
cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)?

How about the cpdma re-allocation policy (with the expectation that it
shouldn't happen too often)?
- increasing of Rx, Tx will grow total number of physically allocated buffers 
(total_desc_num)
- decreasing of Rx, Tx will just change number of available buffers (no memory 
re-allocation)

- OR 
Can we move forward with the current patch (total number of CPDMA CPPI
descriptors defined in DT) and add ethtool -G ethX [rx N] [tx N], which will
allow re-splitting descs between RX and TX?



-- 
regards,
-grygorii
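
One way to read the last option: keep the total pool size as a one-time DT or
module setting and let an ethtool -G style operation only re-split it between
RX and TX. A toy sketch of that split, with the 256 default taken from the
binding text and everything else invented:

#include <stdio.h>

#define DESCS_POOL_DEFAULT 256   /* default when the DT property is absent */

/* Re-split a fixed descriptor pool between RX and TX, as an
 * "ethtool -G ethX rx N tx N" style operation might do.  Returns -1 when
 * the request does not fit into the pool. */
static int split_pool(unsigned int total, unsigned int rx, unsigned int *tx)
{
    if (rx == 0 || rx >= total)
        return -1;
    *tx = total - rx;
    return 0;
}

int main(void)
{
    unsigned int tx;

    if (!split_pool(DESCS_POOL_DEFAULT, 64, &tx))
        printf("rx=64 tx=%u of %u\n", tx, DESCS_POOL_DEFAULT); /* tx=192 */
    return 0;
}
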


Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)

2016-12-02 Thread David Miller
From: Erik Nordmark 
Date: Wed, 30 Nov 2016 15:39:19 -0800

> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>   if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>   if (dad) {
> + if (nonce != 0 && ifp->dad_nonce == nonce) {
> + u8 *np = (u8 *)&nonce;
> + /* Matching nonce if looped back */
> + ND_PRINTK(2, notice,
> +   "%s: IPv6 DAD loopback for 
> address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +   ifp->idev->dev->name,
> +   &ifp->addr,
> +   np[0], np[1], np[2], np[3],
> +   np[4], np[5]);

I know you said you'd leave this, but I'd actually like to ask that you
use %pM here to save some kernel size.

Thank you.


Re: [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output

2016-12-02 Thread David Miller
From: Eli Cooper 
Date: Thu,  1 Dec 2016 10:05:10 +0800

> When xfrm is applied to TSO/GSO packets, it follows this path:
> 
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> where skb_gso_segment() relies on skb->protocol to function properly.
> 
> This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
> fixing a bug where GSO packets sent through a sit tunnel are dropped
> when xfrm is involved.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Eli Cooper 

Applied.


Re: [PATCH net v2 2/3] ipv6: Set skb->protocol properly for local output

2016-12-02 Thread David Miller
From: Eli Cooper 
Date: Thu,  1 Dec 2016 10:05:11 +0800

> When xfrm is applied to TSO/GSO packets, it follows this path:
> 
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> where skb_gso_segment() relies on skb->protocol to function properly.
> 
> This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
> fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
> when xfrm is involved.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Eli Cooper 

Applied.


Re: [PATCH net v2 3/3] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"

2016-12-02 Thread David Miller
From: Eli Cooper 
Date: Thu,  1 Dec 2016 10:05:12 +0800

> This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
> ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").
> 
> skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
> dst_output() is called. It is no longer necessary to do it for each tunnel.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Eli Cooper 

Applied.


Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op

2016-12-02 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> + /* Switch Software Reset */
> + int (*g1_reset)(struct mv88e6xxx_chip *chip);
> +
>
> We have a collection of function pointers with port_ prefix, another
> collection with stats_, and a third with ppu_, etc. And then we have
> some which do not fit a specific category. Those i have prefixed with
> g1_ or g2_. I think we should have some prefix, and that is my
> suggestion.

I disagree. There's only one entry point to issue a switch software
reset, so .reset is enough.

I'll use this opportunity to give a bit of detail about mv88e6xxx/ so that
things get written down at least once somewhere:

global1.c implements accessors to "Global 1 Registers" features and are
prefixed with mv88e6xxx_g1_; port.c implements accessors to "Port
Registers" features and are prefixed with mv88e6xxx_port_, and so
on. (where xxx can be a model if there's conflict due to a redefinition
of the same register)

If a feature is not present or if there's more than one way to access
it, these accessors are bundled in the per-chip mv88e6xxx_ops structure
for disambiguation.

chip.c implements support for a single chip by aggregating and nicely
wrapping these operations. It provides a generic API for Marvell
switches, used to implement net/dsa routines.

  Here's a couple of examples. Setting the switch MAC can be done in Global
  1, or Global 2 depending on the model. Thus .set_switch_mac can be
  mv88e6xxx_g1_set_switch_mac or mv88e6xxx_g2_set_switch_mac.

  Setting the port's speed is always in the same Port register, but its
  layout varies with the model. Thus .port_set_speed can be
  mv88e6185_port_set_speed or mv88e6352_port_set_speed.

Thanks,

Vivien
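
A compressed, compilable illustration of the convention described above:
accessors named after the register bank they touch, bundled into a per-chip
ops structure so callers never care which bank implements the feature. Names
are shortened and the chip type is left opaque for the sketch:

#include <stdio.h>

struct chip;    /* opaque in this sketch */

/* Accessors named after the register bank they touch ... */
static int g1_set_switch_mac(struct chip *c)  { (void)c; return 1; }
static int g2_set_switch_mac(struct chip *c)  { (void)c; return 2; }
static int g1_reset(struct chip *c)           { (void)c; return 0; }

/* ... bundled per chip model when the same feature lives in different
 * banks, so callers only ever use ops->set_switch_mac(). */
struct chip_ops {
    int (*set_switch_mac)(struct chip *c);
    int (*reset)(struct chip *c);
};

static const struct chip_ops mv6185_ops = {
    .set_switch_mac = g1_set_switch_mac,
    .reset          = g1_reset,
};

static const struct chip_ops mv6352_ops = {
    .set_switch_mac = g2_set_switch_mac,
    .reset          = g1_reset,
};

int main(void)
{
    printf("%d %d\n", mv6185_ops.set_switch_mac(NULL),
                      mv6352_ops.set_switch_mac(NULL)); /* 1 2 */
    return 0;
}
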


Re: [PATCH v2] cpsw: ethtool: add support for getting/setting EEE registers

2016-12-02 Thread Florian Fainelli
On 12/02/2016 01:11 AM, Giuseppe CAVALLARO wrote:
> Hi Florian
> sorry for my delay.
> 
> On 11/24/2016 7:23 PM, Florian Fainelli wrote:
>> +Peppe,
>>
>> Le 24/11/2016 à 07:38, Andrew Lunn a écrit :
 As for enabling advertising and correct working of cpsw do you mean it
 would be better to disable EEE in any PHY on cpsw initialization as
 long as cpsw doesn't provide support for EEE?

 We observe some strange behavior with our gigabit PHYs and a link
 partner in a EEE-capable unmanaged NetGear switch. Disabling
 advertising seems to help. Though we're still investigating the issue.
>>>
>>> Hi Florian
>>>
>>> Am i right in saying, a PHY should not advertise EEE until the MAC
>>> driver calls phy_init_eee(), indicating the MAC supports EEE?
>>
>> You would think so, but I don't see how this could possibly work if that
>> was not the case already, see below.
>>
>>>
>>> If so, it looks like we need to change a few of the PHY drivers, in
>>> particular, the bcm-*.c.
>>
>> The first part that bcm-phy-lib.c does is make sure that EEE is enabled
>> such that this gets reflected in MDIO_PCS_EEE_ABLE, without this, we
>> won't be able to pass the first test in phy_init_eee(). The second part
>> is to advertise EEE such that this gets reflected in MDIO_AN_EEE_ADV,
>> also to make sure that we can pass the second check in phy_init_eee().
>>
>> Now, looking at phy_init_eee(), and what stmmac does (and bcmgenet,
>> copied after stmmac), we need to somehow, have EEE advertised for
>> phy_init_eee() to succeed, prepare the MAC to support EEE, and finally
>> conclude with a call to phy_ethtool_set_eee(), which writes to the
>> MDIO_AN_EEE_ADV register, and concludes the EEE auto-negotiated process.
>> Since we already have EEE advertised, we are essentially just checking
>> that the EEE advertised settings and the LP advertised settings actually
>> do match, so it sounds like the final call to phy_ethtool_set_eee() is
>> potentially useless if the resolved advertised and link partner
>> advertised settings already match...
>>
>> So it sounds like at least, the first time you try to initialize EEE, we
>> should start with EEE not advertised, and then, if we have EEE enabled
>> at some point, and we re-negotiate the link parameters, somehow
>> phy_init_eee() does a right job for that.
>>
>> Peppe, any thoughts on this?
> 
> I share what you say.
> 
> In sum, the EEE management inside the stmmac is:
> 
> - the driver looks at its own HW cap register to see if EEE is supported
> 
> (indeed the user could keep EEE disabled if it is buggy on some HW;
>  + Alex, Fabrice: we had some patches to propose for this, where we
>  called phy_ethtool_set_eee to disable the feature at the phy
>  level)
> 
> - then the stmmac asks the PHY layer whether the transceiver and
>   partners are EEE capable.
> 
> - If everything matches, EEE is actually initialized.
> 
> the logic above should be respected when using ethtool; hmm, I will
> check stmmac_ethtool_op_set_eee asap.
> 
> Hoping this is useful

This is definitively useful, the only part that I am struggling to
understand in phy_init_eee() is this:

eee_adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
MDIO_MMD_AN);
if (eee_adv <= 0)
goto eee_exit_err;

if we are not already advertising EEE in the PHY's MDIO_MMD_AN page, by
the time we call phy_init_eee(), then we cannot complete the EEE
configuration at the PHY level, and presumably we should abort the EEE
configuration at the MAC level.

While this condition makes sense if, e.g., you are re-negotiating the link
with your partner and EEE was already advertised, the very first time this
function is called it seems like we should skip the check, because
phy_init_eee() should actually tell us whether, as a result of a successful
check, we should be setting EEE as something we advertise?

Do you remember what was the logic behind this check when you added it?

Thanks!
-- 
Florian


Re: [PATCH net-next v3 1/2] tcp: randomize tcp timestamp offsets for each connection

2016-12-02 Thread David Miller
From: Florian Westphal 
Date: Thu,  1 Dec 2016 11:32:06 +0100

> jiffies-based timestamps allow for easy inference of the number of
> devices behind NAT translators and also make tracking of hosts simpler.
> 
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
> 
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>in addition to the ISN.
> 
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
> 
> Includes fixes from Eric Dumazet.
> 
> Signed-off-by: Florian Westphal 
> Acked-by: Eric Dumazet 
> Acked-by: Yuchung Cheng 

Applied.
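
For illustration only: a toy, self-contained sketch of the mechanism
described in the commit message above. All names below are made up; this
is not the kernel implementation. A fixed per-flow offset is derived from
the connection 4-tuple and a secret and added to the jiffies-based clock,
so the on-wire TSval differs between connections but remains monotonic
for a given quadruple.

#include <stdint.h>
#include <stdio.h>

/* toy mixer standing in for the keyed hash used by the ISN generator */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t c)
{
        a ^= b * 0x9e3779b9u;
        a = (a << 13) | (a >> 19);
        return a ^ (c * 0x85ebca6bu);
}

/* same 4-tuple and secret => same offset for the lifetime of the flow */
static uint32_t ts_offset(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport, uint32_t secret)
{
        return mix(saddr ^ daddr, ((uint32_t)sport << 16) | dport, secret);
}

int main(void)
{
        uint32_t jiffies_now = 123456; /* stand-in for tcp_time_stamp */
        uint32_t off = ts_offset(0x0a000001, 0x0a000002, 40000, 80,
                                 0xdeadbeef);

        /* value actually written to the TCP timestamp option */
        printf("on-wire TSval: %u\n", (unsigned)(jiffies_now + off));
        return 0;
}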


Re: [PATCH net-next v3 2/2] tcp: allow to turn tcp timestamp randomization off

2016-12-02 Thread David Miller
From: Florian Westphal 
Date: Thu,  1 Dec 2016 11:32:07 +0100

> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduce the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
> 
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Florian Westphal 
> Acked-by: Yuchung Cheng 

Applied.


Re: arp_filter and IPv6 ND

2016-12-02 Thread Saku Ytti
On 2 December 2016 at 18:45, Hannes Frederic Sowa
 wrote:

> next-hop-self attribute on your neighbor in that direction? BGP in
> general doesn't lead to ND entry installs, protocols like IS-IS afair
> can short circuit here.

That's the whole problem: Linux does not treat ND or ARP as an
interface-specific thing, but as a global thing. ND and ARP will happily
answer a query from any interface if any other interface has said IP.
I'm not sure why the loopback address ended up in the Cisco ND cache;
either Cisco queried for it or Linux sent a gratuitous answer. I believe
it was gratuitous.

> Hmm, I would keep the Loopback announcements out of the BGP.

It's an extremely common way to do anycast, but not interesting for the
topic at hand.

> For enterprise and cloud stuff it is certainly very surprising, as some
> isolations don't work as expected. OTOH it is really easy to build up
> home networks and things are more plug and play.

Can you give me a practical example of when the behaviour is desirable?
My imagination is failing me. I'm not arguing, I just want to understand
it, as I've never had the need myself.
I've never run into a setup which needs it, but cursory googling shows
several people with broken networks because of the behaviour. If it is
needed, I'm sure it's an esoteric setup, and perhaps a saner default
would be that extra sysctl configuration is required to get this
interface-agnostic ARP/ND behaviour.

> Some RFCs require that for some router implementations (CPE), on the
> other hand weak end model in Linux was probably inherited by IPv4. The
> addition of duplicate address detection (which of course only makes
> sense in strong end systems) to IPv6, basically shows that IPv6 is more
> or less designed to be a strong end system model.
>
> Anyway, a patch to suppress ndisc requests on those interfaces will
> probably be accepted.

Grand, not that I feel comfortable writing it. I'd rather see the
whole suppression functionality moved to neighbour.c from being AFI
specific.

-- 
  ++ytti


Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP

2016-12-02 Thread Simon Horman
On Fri, Dec 02, 2016 at 06:10:48PM +0100, Jiri Pirko wrote:
> Fri, Dec 02, 2016 at 10:52:30AM CET, simon.hor...@netronome.com wrote:
> >Hi,
> >
> >this series adds support for matching on ICMP type and code to cls_flower.
> >This is modeled on existing support for matching on L4 ports. The updates
> >to the dissector are intended to allow for code and storage re-use.
> 
> Looks fine to me. Thanks!

Thanks, I'll drop the RFC designation and repost.


Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP

2016-12-02 Thread Simon Horman
On Fri, Dec 02, 2016 at 06:10:20PM +0100, Jiri Pirko wrote:
> Fri, Dec 02, 2016 at 10:59:42AM CET, simon.hor...@netronome.com wrote:
> >Add support for matching on ICMP type and code to flower. This is modeled
> >on existing support for matching on L4 ports.
> >
> >The second patch provides a minor cleanup which is in keeping with
> >the style used in the last patch.
> >
> >This is marked as an RFC to match the same designation given to the
> >corresponding kernel patches.
> 
> Looks nice, I only have those 2 enum nitpicks.

Thanks, I'll fix those and repost.


Re: [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.

2016-12-02 Thread David Miller
From: "Rangankar, Manish" 
Date: Fri, 2 Dec 2016 07:00:39 +

> Please consider applying the qed patches 1 & 2 to net-next.

Ok, done.


Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support

2016-12-02 Thread Grygorii Strashko
Hi Richard,

On 12/02/2016 03:58 AM, Richard Cochran wrote:
> On Wed, Nov 30, 2016 at 11:17:38PM +0100, Richard Cochran wrote:
>> On Wed, Nov 30, 2016 at 02:43:57PM -0600, Grygorii Strashko wrote:
>>> Sry, but this is questionable - code for pps comes from TI internal
>>> branches (SDK releases) where it survived for a pretty long time.
> 
> Actually, there is a way to get an accurate PPS from the am335x.  See
> this recent thread:
> 
>
> https://www.mail-archive.com/linuxptp-devel@lists.sourceforge.net/msg01726.html
> 
> That is the way to go, and so, please drop this present patch.
> 

thanks for the links - it sounds very interesting.

As I understand it, people are trying to enable PPS on am335x devices with
the goal of having the PPS signal generated on some SoC pin, and therefore
they use the DMTimer.
Also, as I understand it, the Timer Load Register (TLDR) is corrected once
a second at each HW_TS_PUSH; as a result, if the frequency was corrected
during the current second, there will be some HW_TS_PUSH generation jitter
anyway.

The above solution is a bit complex for Keystone 2 SoCs, as the CPTS itself
on these SoCs has an output pin (ts_comp) which can be used for PPS signal
generation. So I think similar results can be achieved by removing the PPS
correction code from cpts_ptp_adjfreq() and updating CPTS_TS_LOAD_VAL once
a second in cpts_overflow_check().

Or did I miss something?

-- 
regards,
-grygorii


[PATCHv2 net-next 0/4] MV88E6390 batch two

2016-12-02 Thread Andrew Lunn
This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.

The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.

Similarly, the monitor control register has been reworked.

The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.

v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops

Andrew Lunn (4):
  net: dsa: mv88e6xxx: Implement mv88e6390 tag remap
  net: dsa: mv88e6xxx: Monitor and Management tables
  net: dsa: mv88e6xxx: Move the tagging protocol into info
  net: dsa: mv88e6xxx: Refactor CPU and DSA port setup

 drivers/net/dsa/mv88e6xxx/chip.c  | 339 ++
 drivers/net/dsa/mv88e6xxx/global1.c   |  69 +++
 drivers/net/dsa/mv88e6xxx/global1.h   |   4 +
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  62 +--
 drivers/net/dsa/mv88e6xxx/port.c  | 181 ++
 drivers/net/dsa/mv88e6xxx/port.h  |  15 ++
 6 files changed, 583 insertions(+), 87 deletions(-)

-- 
2.10.2



[PATCHv2 net-next 2/4] net: dsa: mv88e6xxx: Monitor and Management tables

2016-12-02 Thread Andrew Lunn
The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.

Add ops to global1 for setting the CPU port and the ingress/egress port
for both register layouts.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 68 +-
 drivers/net/dsa/mv88e6xxx/global1.c   | 69 +++
 drivers/net/dsa/mv88e6xxx/global1.h   |  4 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 13 +++
 4 files changed, 145 insertions(+), 9 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ff4bd2f74357..6e981bedd028 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2747,15 +2747,17 @@ static int mv88e6xxx_g1_setup(struct mv88e6xxx_chip *chip)
if (err)
return err;
 
-   /* Configure the upstream port, and configure it as the port to which
-* ingress and egress and ARP monitor frames are to be sent.
-*/
-   reg = upstream_port << GLOBAL_MONITOR_CONTROL_INGRESS_SHIFT |
-   upstream_port << GLOBAL_MONITOR_CONTROL_EGRESS_SHIFT |
-   upstream_port << GLOBAL_MONITOR_CONTROL_ARP_SHIFT;
-   err = mv88e6xxx_g1_write(chip, GLOBAL_MONITOR_CONTROL, reg);
-   if (err)
-   return err;
+   if (chip->info->ops->g1_set_cpu_port) {
+   err = chip->info->ops->g1_set_cpu_port(chip, upstream_port);
+   if (err)
+   return err;
+   }
+
+   if (chip->info->ops->g1_set_egress_port) {
+   err = chip->info->ops->g1_set_egress_port(chip, upstream_port);
+   if (err)
+   return err;
+   }
 
/* Disable remote management, and set the switch's DSA device number. */
err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL_2,
@@ -3184,6 +3186,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6095_ops = {
@@ -3213,6 +3217,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6123_ops = {
@@ -3227,6 +3233,8 @@ static const struct mv88e6xxx_ops mv88e6123_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6131_ops = {
@@ -3242,6 +3250,8 @@ static const struct mv88e6xxx_ops mv88e6131_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6161_ops = {
@@ -3257,6 +3267,8 @@ static const struct mv88e6xxx_ops mv88e6161_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6165_ops = {
@@ -3271,6 +3283,8 @@ static const struct mv88e6xxx_ops mv88e6165_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6171_ops = {
@@ -3287,6 +3301,8 @@ static const struct mv88e6xxx_ops mv88e6171_ops = {
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
.stats_get_stats = mv88e6095_stats_get_stats,
+   .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
+   .g1_set_egress_port = mv88e6095_g1_set_egress_port,
 };
 
 static const struct mv88e6xxx_ops mv88e6172_ops = {
@@ -3305,6 +3321,8 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
.stats_get_sset_cou

[PATCHv2 net-next 3/4] net: dsa: mv88e6xxx: Move the tagging protocol into info

2016-12-02 Thread Andrew Lunn
Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires us to use
DSA tagging. Hence separate the selection of the protocol from the
register layout.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 33 +++--
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 17 -
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6e981bedd028..80efee6f5e16 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2482,7 +2482,7 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
PORT_CONTROL_USE_TAG | PORT_CONTROL_USE_IP |
PORT_CONTROL_STATE_FORWARDING;
if (dsa_is_cpu_port(ds, port)) {
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_EDSA))
+   if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA)
reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA |
PORT_CONTROL_FORWARD_UNKNOWN_MC;
else
@@ -2611,7 +2611,7 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
/* Port Ethertype: use the Ethertype DSA Ethertype
 * value.
 */
-   if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_EDSA)) {
+   if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA) {
err = mv88e6xxx_port_write(chip, port, PORT_ETH_TYPE,
   ETH_P_EDSA);
if (err)
@@ -3637,6 +3637,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 8,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6097,
.ops = &mv88e6085_ops,
},
@@ -3651,6 +3652,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 8,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6095,
.ops = &mv88e6095_ops,
},
@@ -3679,6 +3681,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6165,
.ops = &mv88e6123_ops,
},
@@ -3693,6 +3696,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6185,
.ops = &mv88e6131_ops,
},
@@ -3707,6 +3711,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6165,
.ops = &mv88e6161_ops,
},
@@ -3721,6 +3726,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_DSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6165,
.ops = &mv88e6165_ops,
},
@@ -3735,6 +3741,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_EDSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6351,
.ops = &mv88e6171_ops,
},
@@ -3749,6 +3756,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_EDSA,
.flags = MV88E6XXX_FLAGS_FAMILY_6352,
.ops = &mv88e6172_ops,
},
@@ -3763,6 +3771,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.global1_addr = 0x1b,
.age_time_coeff = 15000,
.g1_irqs = 9,
+   .tag_protocol = DSA_TAG_PROTO_EDSA,
.flags = MV88E6XXX_FLAGS_FAMIL

[PATCHv2 net-next 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup

2016-12-02 Thread Andrew Lunn
Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and whether to forward unknown frames.

This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 205 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  20 
 drivers/net/dsa/mv88e6xxx/port.c  | 118 +++
 drivers/net/dsa/mv88e6xxx/port.h  |  13 +++
 4 files changed, 308 insertions(+), 48 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 80efee6f5e16..d1e6a0760a75 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -677,11 +677,6 @@ static int mv88e6xxx_phy_ppu_write(struct mv88e6xxx_chip *chip, int addr,
return err;
 }
 
-static bool mv88e6xxx_6065_family(struct mv88e6xxx_chip *chip)
-{
-   return chip->info->family == MV88E6XXX_FAMILY_6065;
-}
-
 static bool mv88e6xxx_6095_family(struct mv88e6xxx_chip *chip)
 {
return chip->info->family == MV88E6XXX_FAMILY_6095;
@@ -2438,6 +2433,80 @@ static int mv88e6xxx_serdes_power_on(struct mv88e6xxx_chip *chip)
return err;
 }
 
+static int mv88e6xxx_setup_port_dsa(struct mv88e6xxx_chip *chip, int port,
+   int upstream_port)
+{
+   int err;
+
+   err = chip->info->ops->port_set_frame_mode(
+   chip, port, MV88E6XXX_FRAME_MODE_DSA);
+   if (err)
+   return err;
+
+   err = chip->info->ops->port_set_egress_unknowns(
+   chip, port, port == upstream_port);
+   if (err)
+   return err;
+
+   if (chip->info->ops->port_set_ether_type)
+   return chip->info->ops->port_set_ether_type(
+   chip, port, ETH_P_EDSA);
+
+   return 0;
+}
+
+static int mv88e6xxx_setup_port_cpu(struct mv88e6xxx_chip *chip, int port)
+{
+   int err;
+
+   switch (chip->info->tag_protocol) {
+   case DSA_TAG_PROTO_EDSA:
+   err = chip->info->ops->port_set_frame_mode(
+   chip, port, MV88E6XXX_FRAME_MODE_ETHERTYPE);
+   if (err)
+   return err;
+
+   err = mv88e6xxx_port_set_egress_mode(
+   chip, port, PORT_CONTROL_EGRESS_ADD_TAG);
+   if (err)
+   return err;
+
+   if (chip->info->ops->port_set_ether_type)
+   err = chip->info->ops->port_set_ether_type(
+   chip, port, ETH_P_EDSA);
+   break;
+
+   case DSA_TAG_PROTO_DSA:
+   err = chip->info->ops->port_set_frame_mode(
+   chip, port, MV88E6XXX_FRAME_MODE_DSA);
+   if (err)
+   return err;
+
+   err = mv88e6xxx_port_set_egress_mode(
+   chip, port, PORT_CONTROL_EGRESS_UNMODIFIED);
+   break;
+   default:
+   err = -EINVAL;
+   }
+
+   if (err)
+   return err;
+
+   return chip->info->ops->port_set_egress_unknowns(chip, port, true);
+}
+
+static int mv88e6xxx_setup_port_normal(struct mv88e6xxx_chip *chip, int port)
+{
+   int err;
+
+   err = chip->info->ops->port_set_frame_mode(
+   chip, port, MV88E6XXX_FRAME_MODE_NORMAL);
+   if (err)
+   return err;
+
+   return chip->info->ops->port_set_egress_unknowns(chip, port, false);
+}
+
 static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
 {
struct dsa_switch *ds = chip->ds;
@@ -2473,44 +2542,25 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
 * If this is the upstream port for this switch, enable
 * forwarding of unknown unicasts and multicasts.
 */
-   reg = 0;
-   if (mv88e6xxx_6352_family(chip) || mv88e6xxx_6351_family(chip) ||
-   mv88e6xxx_6165_family(chip) || mv88e6xxx_6097_family(chip) ||
-   mv88e6xxx_6095_family(chip) || mv88e6xxx_6065_family(chip) ||
-   mv88e6xxx_6185_family(chip) || mv88e6xxx_6320_family(chip))
-   reg = PORT_CONTROL_IGMP_MLD_SNOOP |
+   reg = PORT_CONTROL_IGMP_MLD_SNOOP |
PORT_CONTROL_USE_TAG | PORT_CONTROL_USE_IP |
PORT_CONTROL_STATE_FORWARDING;
+   err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
+   if (err)
+   return err;
+
if (dsa_is_cpu_port(ds, port)) {
-   if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA)
-   reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA |
-   PORT_CONTROL_FORWARD_UNKNOWN_MC;
-   else
-   reg |= PORT_CONTROL_DSA_TAG;
-   reg |= PORT_CONTROL_EGRESS_ADD_TAG |
-   PORT_CONTROL_FORWARD_UNKN

[PATCHv2 net-next 1/4] net: dsa: mv88e6xxx: Implement mv88e6390 tag remap

2016-12-02 Thread Andrew Lunn
The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection register for setting a
number of different priority maps. Refactor the old code into a
function, implement the mv88e6390 version, and use an op to call the
right one.

Signed-off-by: Andrew Lunn 
---
v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 37 
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 12 +++
 drivers/net/dsa/mv88e6xxx/port.c  | 63 +++
 drivers/net/dsa/mv88e6xxx/port.h  |  2 ++
 4 files changed, 101 insertions(+), 13 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ce2f7ff8066e..ff4bd2f74357 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2617,20 +2617,10 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
if (err)
return err;
}
+   }
 
-   /* Tag Remap: use an identity 802.1p prio -> switch
-* prio mapping.
-*/
-   err = mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_0123,
-  0x3210);
-   if (err)
-   return err;
-
-   /* Tag Remap 2: use an identity 802.1p prio -> switch
-* prio mapping.
-*/
-   err = mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_4567,
-  0x7654);
+   if (chip->info->ops->port_tag_remap) {
+   err = chip->info->ops->port_tag_remap(chip, port);
if (err)
return err;
}
@@ -3189,6 +3179,7 @@ static const struct mv88e6xxx_ops mv88e6085_ops = {
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6xxx_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3217,6 +3208,7 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6xxx_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3245,6 +3237,7 @@ static const struct mv88e6xxx_ops mv88e6131_ops = {
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6xxx_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3259,6 +3252,7 @@ static const struct mv88e6xxx_ops mv88e6161_ops = {
.port_set_link = mv88e6xxx_port_set_link,
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6xxx_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3288,6 +3282,7 @@ static const struct mv88e6xxx_ops mv88e6171_ops = {
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6320_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3305,6 +3300,7 @@ static const struct mv88e6xxx_ops mv88e6172_ops = {
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
.port_set_speed = mv88e6352_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6320_g1_stats_snapshot,
.stats_get_sset_count = mv88e6095_stats_get_sset_count,
.stats_get_strings = mv88e6095_stats_get_strings,
@@ -3320,6 +3316,7 @@ static const struct mv88e6xxx_ops mv88e6175_ops = {
.port_set_duplex = mv88e6xxx_port_set_duplex,
.port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay,
.port_set_speed = mv88e6185_port_set_speed,
+   .port_tag_remap = mv88e6095_port_tag_remap,
.stats_snapshot = mv88e6320_g1_stats_snapshot,
.stats_get_sset_count

[PATCH net-next 0/2] net/sched: cls_flower: Support matching on ICMP

2016-12-02 Thread Simon Horman
Hi,

this series adds support for matching on ICMP type and code to cls_flower.
This is modeled on existing support for matching on L4 ports. The updates
to the dissector are intended to allow for code and storage re-use.

Change since RFC:
* Drop RFC designation after positive review from Jiri


Simon Horman (2):
  flow dissector: ICMP support
  net/sched: cls_flower: Support matching on ICMP type and code

 drivers/net/bonding/bond_main.c |  6 +++--
 include/linux/skbuff.h  |  5 +
 include/net/flow_dissector.h| 50 ++---
 include/uapi/linux/pkt_cls.h| 10 +
 net/core/flow_dissector.c   | 34 +---
 net/sched/cls_flow.c|  4 ++--
 net/sched/cls_flower.c  | 42 ++
 7 files changed, 141 insertions(+), 10 deletions(-)

-- 
2.7.0.rc3.207.g0ac5344



[PATCH net-next 1/2] flow dissector: ICMP support

2016-12-02 Thread Simon Horman
Allow dissection of ICMP(v6) type and code. This re-uses the transport
layer port dissection code: although ICMP is not a transport protocol and
its type and code are not ports, this allows sharing of both code and
storage.

Signed-off-by: Simon Horman 
---
 drivers/net/bonding/bond_main.c |  6 --
 include/linux/skbuff.h  |  5 +
 include/net/flow_dissector.h| 30 +++---
 net/core/flow_dissector.c   | 34 +++---
 net/sched/cls_flow.c|  4 ++--
 5 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 8029dd4912b6..a6f75cfb2bf7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3181,7 +3181,8 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
} else {
return false;
}
-   if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
+   if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
+   proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
 
return true;
@@ -3209,7 +3210,8 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
return bond_eth_hash(skb);
 
if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
-   bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
+   bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
+   flow_keys_are_icmp_any(&flow))
hash = bond_eth_hash(skb);
else
hash = (__force u32)flow.ports.ports;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c..44a8f69a9198 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1094,6 +1094,11 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
void *data, int hlen_proto);
 
+static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
+{
+   return flow_protos_are_icmp_any(skb->protocol, ip_proto);
+}
+
 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
int thoff, u8 ip_proto)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index c4f31666afd2..8880025914e3 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -2,6 +2,7 @@
 #define _NET_FLOW_DISSECTOR_H
 
 #include 
+#include 
 #include 
 #include 
 
@@ -89,10 +90,15 @@ struct flow_dissector_key_addrs {
 };
 
 /**
- * flow_dissector_key_tp_ports:
- * @ports: port numbers of Transport header
+ * flow_dissector_key_ports:
+ * @ports: port numbers of Transport header or
+ * type and code of ICMP header
+ * ports: source (high) and destination (low) port numbers
  * src: source port number
  * dst: destination port number
+ * icmp: ICMP type (high) and code (low)
+ * type: ICMP type
+ * code: ICMP code
  */
 struct flow_dissector_key_ports {
union {
@@ -101,6 +107,11 @@ struct flow_dissector_key_ports {
__be16 src;
__be16 dst;
};
+   __be16 icmp;
+   struct {
+   u8 type;
+   u8 code;
+   };
};
 };
 
@@ -188,9 +199,22 @@ struct flow_keys_digest {
 void make_flow_keys_digest(struct flow_keys_digest *digest,
   const struct flow_keys *flow);
 
+static inline bool flow_protos_are_icmp_any(__be16 n_proto, u8 ip_proto)
+{
+   return (n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP) ||
+   (n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6);
+}
+
+static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
+{
+   return flow_protos_are_icmp_any(keys->basic.n_proto,
+   keys->basic.ip_proto);
+}
+
 static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 {
-   return (keys->ports.ports || keys->tags.flow_label);
+   return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) ||
+   keys->tags.flow_label;
 }
 
 u32 flow_hash_from_keys(struct flow_keys *keys);
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1eb6f949e5b2..0584b4bb4390 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -58,6 +58,28 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
 /**
+ * skb_flow_get_be16 - extract be16 entity
+ * @skb: sk_buff to extract from
+ * @poff: offset to extract at
+ * @data: raw buffer pointer to the packet
+ * @hlen: packet header length
+ *
+ * The function will 

[PATCH net-next 2/2] net/sched: cls_flower: Support matching on ICMP type and code

2016-12-02 Thread Simon Horman
Support matching on ICMP type and code.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent : flower \
indev eth0 ip_proto icmp type 8 code 0 action drop

tc filter add dev eth0 protocol ipv6 parent : flower \
indev eth0 ip_proto icmpv6 type 128 code 0 action drop

Signed-off-by: Simon Horman 
---
 include/net/flow_dissector.h | 24 ++--
 include/uapi/linux/pkt_cls.h | 10 ++
 net/sched/cls_flower.c   | 42 ++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 8880025914e3..5540dfa18872 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -199,10 +199,30 @@ struct flow_keys_digest {
 void make_flow_keys_digest(struct flow_keys_digest *digest,
   const struct flow_keys *flow);
 
+static inline bool flow_protos_are_icmpv4(__be16 n_proto, u8 ip_proto)
+{
+   return n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP;
+}
+
+static inline bool flow_protos_are_icmpv6(__be16 n_proto, u8 ip_proto)
+{
+   return n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6;
+}
+
 static inline bool flow_protos_are_icmp_any(__be16 n_proto, u8 ip_proto)
 {
-   return (n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP) ||
-   (n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6);
+   return flow_protos_are_icmpv4(n_proto, ip_proto) ||
+   flow_protos_are_icmpv6(n_proto, ip_proto);
+}
+
+static inline bool flow_basic_key_is_icmpv4(const struct flow_dissector_key_basic *basic)
+{
+   return flow_protos_are_icmpv4(basic->n_proto, basic->ip_proto);
+}
+
+static inline bool flow_basic_key_is_icmpv6(const struct flow_dissector_key_basic *basic)
+{
+   return flow_protos_are_icmpv6(basic->n_proto, basic->ip_proto);
 }
 
 static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 86786d45ee66..58160fe80b80 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -457,6 +457,16 @@ enum {
TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK,   /* be16 */
TCA_FLOWER_KEY_ENC_UDP_DST_PORT,/* be16 */
TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,   /* be16 */
+
+   TCA_FLOWER_KEY_ICMPV4_CODE, /* u8 */
+   TCA_FLOWER_KEY_ICMPV4_CODE_MASK,/* u8 */
+   TCA_FLOWER_KEY_ICMPV4_TYPE, /* u8 */
+   TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,/* u8 */
+   TCA_FLOWER_KEY_ICMPV6_CODE, /* u8 */
+   TCA_FLOWER_KEY_ICMPV6_CODE_MASK,/* u8 */
+   TCA_FLOWER_KEY_ICMPV6_TYPE, /* u8 */
+   TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,/* u8 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e8dd09af0d0c..412efa7de226 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -355,6 +355,14 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
[TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_ENC_UDP_DST_PORT]   = { .type = NLA_U16 },
[TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK]  = { .type = NLA_U16 },
+   [TCA_FLOWER_KEY_ICMPV4_TYPE]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV4_TYPE_MASK] = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV4_CODE]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV4_CODE_MASK] = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV6_TYPE]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV6_TYPE_MASK] = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV6_CODE]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_ICMPV6_CODE_MASK] = { .type = NLA_U8 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -471,6 +479,20 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
fl_set_key_val(tb, &key->tp.dst, TCA_FLOWER_KEY_SCTP_DST,
   &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
   sizeof(key->tp.dst));
+   } else if (flow_basic_key_is_icmpv4(&key->basic)) {
+   fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
+  &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
+  sizeof(key->tp.type));
+   fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+  &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+  sizeof(key->tp.code));
+   } else if (flow_basic_key_is_icmpv6(&key->basic)) {
+   fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
+  &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
+  sizeof(key->tp.type));
+   fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV6_CODE,
+  

Re: [flamebait] xdp, well meaning but pointless

2016-12-02 Thread Hannes Frederic Sowa
On 02.12.2016 17:59, Tom Herbert wrote:
> On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa
>  wrote:
>> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote:
>>> On Thu, 1 Dec 2016 13:51:32 -0800
>>> Tom Herbert  wrote:
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>>>
>>> slide 14
>>>
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>>>
>>> The 1.2Mpps number is a bit low, but we are unfortunately in that
>>> ballpark.
>>>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic, which is actually
> referred to in your linked slide deck on page 9 ("What do floods look
> like"), and the problem of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>>>
>>> That is a weak argument. You do realize CloudFlare actually uses eBPF to
>>> do this exact filtering, and (so far) eBPF for parsing DNS has been
>>> sufficient for them.
>>
>> You are talking about this code on the following slides (I actually
>> transcribed it for you here and disassembled):
>>
>> l0: ld #0x14
>> l1: ldxb 4*([0]&0xf)
>> l2: add x
>> l3: tax
>> l4: ld [x+0]
>> l5: jeq #0x7657861, l6, l13
>> l6: ld [x+4]
>> l7: jeq #0x6d706c65, l8, l13
>> l8: ld [x+8]
>> l9: jeq #0x3636f6d, l10, l13
>> l10:ldb [x+12]
>> l11:jeq #0, l12, l13
>> l12:ret #0x1
>> l13:ret #0
>>
>> You can offload this to u32 in hardware if that is what you want.
>>
>> The reason this works is because of netfilter, which allows them to
>> dynamically generate BPF programs and insert and delete them from
>> chains, do intersection or unions of them.
>>
>> If you have a freestanding program like in XDP the complexity space is a
>> different one and not comparable to this at all.
>>
> I don't understand this comment about complexity especially in regards
> to the idea of offloading u32 to hardware. Relying on hardware to do
> anything always leads to more complexity than an equivalent SW
> implementation for the same functionality. The only reason we ever use
> a hardware mechanisms is if it gives *significantly* better
> performance. If the performance difference isn't there then doing
> things in SW is going to be the better path (as we see in XDP).

I am just wondering why the u32 filter wasn't mentioned in their slide
deck. If all Cloudflare needs are those kinds of matches, they are in
fact easier to generate than a cBPF program. It is not a good example of
what a real-world DoS filter in XDP would look like.

If you view XDP as a C function hook that can call arbitrary code in the
driver before submitting the packet to the networking stack, yep, that
is not complex at all. Depending on how those modules are maintained,
they either end up in the kernel and get updated on major changes, or
they are 3rd party and people have to update them themselves and also
depend on the driver features.

But this opens up a whole new can of worms also. I haven't really
thought this through completely, but last time the patches were nack'ed
with lots of strong opinions and I tended to agree with them. I am
revisiting this position.

Certainly you can build real-world DoS protection with this function
pointer hook and C code in the driver. In this case a user space
solution still has advantages because of maintainability, as e.g. with
netmap or dpdk you are again decoupled from the in-kernel API/ABI and
don't need to test, recompile etc. on each kernel upgrade. If the module
ends up in the kernel, those problems might also disappear.

For XDP+eBPF to provide a full DoS mitigation (protocol parsing,
sampling and dropping) solution seems to be too complex for me because
of the arguments I stated in my previous mail.

Bye,
Hannes



[PROBLEM]: IPv6 ICMP Ancillary Data IPV6_PKTINFO sendmsg() invalid argument

2016-12-02 Thread Thomas Lloyd
Hi All,

First post here; any pointers or improvements for bug reporting are
welcome.

[1.] One line summary of the problem:

sendmsg() with ancillary data of type IPV6_PKTINFO fails with invalid
argument and further possible memory corruption.

-
[2.] Full description of the problem/report:

As part of my project I am attempting to perform IPv6 neighbour
solicitations to discover the MAC address of my local neighbours.
In the included test code link-local src and dst addresses are used to
send neighbour solicitations along with specifying the network interface
in IPV6_PKTINFO ancillary data.

In my test environment I have a Ubuntu 16.04 LTS Server/Router and a
Ubuntu 16.04 LTS Desktop.
These are both hosted on a Windows 10 Host as Virtual Guests.

Router/Server Interfaces:
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group
default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: eth0:  mtu 1500 qdisc pfifo_fast state
UP group default qlen 1000
    link/ether 08:00:27:c5:0e:76 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
   valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fec5:e76/64 scope link
   valid_lft forever preferred_lft forever
3: eth1:  mtu 1500 qdisc pfifo_fast state
UP group default qlen 1000
    link/ether 08:00:27:0a:c1:80 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:1000:baba::1/64 scope global
   valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe0a:c180/64 scope link
   valid_lft forever preferred_lft forever

Eth0 is set up for NAT to a physical IPv4 network but is unused in this
test setup.
Eth1 is IPv6 and is the radvd interface attached to an internal
VirtualBox network interface.

In my client environment I have three network interfaces:

1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group
default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: enp0s3:  mtu 1500 qdisc pfifo_fast
state UP group default qlen 1000
    link/ether 08:00:27:b9:42:47 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.121/24 brd 192.168.100.255 scope global dynamic
enp0s3
   valid_lft 79320sec preferred_lft 79320sec
    inet6 fe80::5523:e7df:d12c:681/64 scope link
   valid_lft forever preferred_lft forever
3: enp0s8:  mtu 1500 qdisc pfifo_fast
state UP group default qlen 1000
    link/ether 08:00:27:1a:a7:22 brd ff:ff:ff:ff:ff:ff
4: enp0s9:  mtu 1500 qdisc pfifo_fast
state UP group default qlen 1000
    link/ether 08:00:27:35:13:22 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:1000:baba:b30a:ed85:8568:cc66/64 scope global
noprefixroute dynamic
   valid_lft 86389sec preferred_lft 14389sec
    inet6 fe80::cffe:e7ef:a03a:34cb/64 scope link
   valid_lft forever preferred_lft forever

enp0s3 is a bridged physical wired connection on the host to an IPv4
network with a router/gateway; IPv6 is active but not used.
enp0s8 is a bridged physical wifi connection on the host to another IPv4
subnet within the same network on the same router/gateway; IPv6 is active
but not used.
enp0s9 is an internal VirtualBox network connected directly to another
Ubuntu guest acting as an IPv6 stateless router (the neighbour I am
soliciting); this interface is IPv4 enabled but not used.

The expected behaviour is to use the sendmsg() system call and specify the
outgoing interface for the neighbour solicitation message using the
IPV6_PKTINFO ancillary data, as detailed in the RFC 3542 links below.

6.1.  Specifying/Receiving the Interface
https://tools.ietf.org/html/rfc3542.html#section-6.1

6.7 Summary of Outgoing Interface Selection
https://tools.ietf.org/html/rfc3542.html#section-6.7
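
For reference, a minimal, self-contained sketch of the RFC 3542 pattern
being attempted (needs CAP_NET_RAW; the interface name and destination
address below are placeholders taken from the setup above, and a real
neighbour solicitation would also target the solicited-node multicast
group and carry a source link-layer address option):

#define _GNU_SOURCE             /* for struct in6_pktinfo in glibc */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/icmp6.h>
#include <arpa/inet.h>
#include <net/if.h>

int main(void)
{
        int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
        inet_pton(AF_INET6, "fe80::a00:27ff:fe0a:c180", &dst.sin6_addr);
        dst.sin6_scope_id = if_nametoindex("enp0s9");

        struct nd_neighbor_solicit ns;
        memset(&ns, 0, sizeof(ns));
        ns.nd_ns_type = ND_NEIGHBOR_SOLICIT;
        ns.nd_ns_target = dst.sin6_addr;

        struct iovec iov = { .iov_base = &ns, .iov_len = sizeof(ns) };

        /* ancillary data carrying the outgoing interface (RFC 3542 6.1) */
        union {
                struct cmsghdr align;
                char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
        } ctrl;
        memset(&ctrl, 0, sizeof(ctrl));

        struct msghdr msg = {
                .msg_name = &dst, .msg_namelen = sizeof(dst),
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
        };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = IPPROTO_IPV6;
        cmsg->cmsg_type = IPV6_PKTINFO;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct in6_pktinfo));
        struct in6_pktinfo *pi = (struct in6_pktinfo *)CMSG_DATA(cmsg);
        pi->ipi6_ifindex = if_nametoindex("enp0s9");
        /* ipi6_addr left as :: so the kernel picks the source address */

        if (sendmsg(fd, &msg, 0) < 0)
                perror("sendmsg");      /* the reporter sees EINVAL here */

        close(fd);
        return 0;
}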

The actual behaviour is for the sendmsg() system call to fail with errno
set to 22, Invalid Argument.

A note of interest is that if a perror() call is made to report this
problem, a segfault in malloc.c is encountered. I have not performed
further investigation.
You can see this has been commented out after the sendmsg() system call
in the demo code.

Warning:
In my testing I have removed the IPV6_PKTINFO ancillary data and have
found that, if the IPv6 router is not available at boot, enp0s3 is used
by default to send the packet.
I will send another e-mail further describing the behaviour with the
subject:

[PROBLEM]: IPv6 Link-Local Routing through wrong interface

-
[3.] Keywords (i.e., modules, networking, kernel):
 networking, ipv6, link-local, ancillary

-
[4.] Kernel version (from /proc/version):
Linux version 4.4.0-51-generic (buildd@lcy01-08) (gcc version 5.4.0
20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #72-Ubuntu SMP Thu Nov 24
18:29:54

Re: [WIP] net+mlx4: auto doorbell

2016-12-02 Thread Eric Dumazet
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:
> 
> > I simply suggest we try to queue the qdisc for further servicing as we
> > do today, from net_tx_action(), but we might use a different bit, so
> > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> > before we grab it from net_tx_action(), maybe 100 usec later (time to
> > flush all skbs queued in napi_consume_skb() and maybe RX processing,
> > since most NICs handle TX completion before doing RX processing from
> > their napi poll handler.
> > 
> > Should be doable with few changes in __netif_schedule() and
> > net_tx_action(), plus some control paths that will need to take care of
> > the new bit at dismantle time, right ?
> 
> Hmm... this is silly. Code already implements a different bit.
> 
> qdisc_run() seems to run more often from net_tx_action(), I have to find
> out why.

After more analysis I believe TSQ was one of the bottlenecks.

I prepared a patch series that helped my use cases.






Re: [PATCH net-next 1/2] samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers

2016-12-02 Thread Alexei Starovoitov
On Fri, Dec 02, 2016 at 02:42:18AM -0800, Sargun Dhillon wrote:
> This patch modifies test_current_task_under_cgroup_user. The test has
> several helpers around creating a temporary environment for cgroup
> testing, and moving the current task around cgroups. This set of
> helpers can then be used in other tests.
> 
> Signed-off-by: Sargun Dhillon 

lgtm
Acked-by: Alexei Starovoitov 



<    1   2   3   4   >