[PATCH iproute2 V5 0/3] tc: Support for ip tunnel metadata set/unset/classify
Hi, This short series adds support for matching and setting metadata for ip tunnel shared device using the TC system, introduced in kernel 4.9 [1]. Applied and tested on top of commit b6c7fc61faab ("ss: print new tcp_info fields: busy, rwnd-limited, sndbuf-limited times") Example usage: $ tc filter add dev vxlan0 protocol ip parent : \ flower \ enc_src_ip 11.11.0.2 \ enc_dst_ip 11.11.0.1 \ enc_key_id 11 \ dst_ip 11.11.11.1 \ action mirred egress redirect dev vnet0 $ tc filter add dev net0 protocol ip parent : \ flower \ ip_proto 1 \ dst_ip 11.11.11.2 \ action tunnel_key set \ src_ip 11.11.0.1 \ dst_ip 11.11.0.2 \ id 11 \ action mirred egress redirect dev vxlan0 [1] - d1ba24feb466 ("Merge branch 'act_tunnel_key'") Thanks, Amir Changes from V4: - Fix rebase conflicts for net-next Changes from V3: - Fix bad wording in the man page about the use of the 'unset' operation Changes from V2: - Use const where needed - Don't lose return value - Introduce rta_getattr_be16() and rta_getattr_be32() Changes from V1: - Updated Patch 2/2 ("tc/act_tunnel: Introduce ip tunnel action") commit log and the man page tc-tunnel_key to reflect the fact that 'unset' operation is no mandatory. And describe when it might be needed. - Rename the 'release' operation to 'unset' Amir Vadai (3): libnetlink: Introduce rta_getattr_be*() tc/cls_flower: Classify packet in ip tunnels tc/act_tunnel: Introduce ip tunnel action Amir Vadai (3): libnetlink: Introduce rta_getattr_be*() tc/cls_flower: Classify packet in ip tunnels tc/act_tunnel: Introduce ip tunnel action bridge/fdb.c | 4 +- include/libnetlink.h | 9 ++ include/linux/tc_act/tc_tunnel_key.h | 42 ++ ip/iplink_geneve.c | 2 +- ip/iplink_vxlan.c| 2 +- man/man8/tc-flower.8 | 17 ++- man/man8/tc-tunnel_key.8 | 112 +++ tc/Makefile | 1 + tc/f_flower.c| 84 +++- tc/m_tunnel_key.c| 258 +++ 10 files changed, 522 insertions(+), 9 deletions(-) create mode 100644 include/linux/tc_act/tc_tunnel_key.h create mode 100644 man/man8/tc-tunnel_key.8 create mode 100644 tc/m_tunnel_key.c -- 2.10.2
Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote: > Add optional property "descs_pool_size" to specify buffer descriptor's > pool size. The "descs_pool_size" should define total number of CPDMA > CPPI descriptors to be used for both ingress/egress packets > processing. If not specified - the default value 256 will be used > which will allow to place descriptor's pool into the internal CPPI > RAM on most of TI SoC. > > Signed-off-by: Grygorii Strashko > --- > Documentation/devicetree/bindings/net/cpsw.txt | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/Documentation/devicetree/bindings/net/cpsw.txt > b/Documentation/devicetree/bindings/net/cpsw.txt > index 5ad439f..b99d196 100644 > --- a/Documentation/devicetree/bindings/net/cpsw.txt > +++ b/Documentation/devicetree/bindings/net/cpsw.txt > @@ -35,6 +35,11 @@ Optional properties: > For example in dra72x-evm, pcf gpio has to be > driven low so that cpsw slave 0 and phy data > lines are connected via mux. > +- descs_pool_size: total number of CPDMA CPPI descriptors to be used for > + both ingress/egress packets processing. if not > + specified the default value 256 will be used which > + will allow to place descriptors pool into the > + internal CPPI RAM. Does it describe h/w? Why now module parameter? or even smth like ethtool num ring entries? > > Slave Properties: > -- > 2.10.1 >
Re: [flamebait] xdp, well meaning but pointless
On 02.12.2016 11:24, Jesper Dangaard Brouer wrote: > On Thu, 1 Dec 2016 13:51:32 -0800 > Tom Herbert wrote: > The technical plenary at last IETF on Seoul a couple of weeks ago was exclusively focussed on DDOS in light of the recent attack against Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare presentation by Nick Sullivan (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) alluded to some implementation of DDOS mitigation. In particular, on slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" > > slide 14 > numbers he gave we're based in iptables+BPF and that was a whole 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic and that's also when I introduced XDP to whole IETF :-) ). If that's the best we can do the Internet is in a world hurt. DDOS mitigation alone is probably a sufficient motivation to look at XDP. We need something that drops bad packets as quickly as possible when under attack, we need this to be integrated into the stack, we need it to be programmable to deal with the increasing savvy of attackers, and we don't want to be forced to be dependent on HW solutions. This is why we created XDP! > > The 1.2Mpps number is a bit low, but we are unfortunately in that > ballpark. > >>> I totally understand that. But in my reply to David in this thread I >>> mentioned DNS apex processing as being problematic which is actually >>> being referred in your linked slide deck on page 9 ("What do floods look >>> like") and the problematic of parsing DNS packets in XDP due to string >>> processing and looping inside eBPF. > > That is a weak argument. You do realize CloudFlare actually use eBPF to > do this exact filtering, and (so-far) eBPF for parsing DNS have been > sufficient for them. You are talking about this code on the following slides (I actually transcribed it for you here and disassembled): l0: ld #0x14 l1: ldxb 4*([0]&0xf) l2: add x l3: tax l4: ld [x+0] l5: jeq #0x7657861, l6, l13 l6: ld [x+4] l7: jeq #0x6d706c65, l8, l13 l8: ld [x+8] l9: jeq #0x3636f6d, l10, l13 l10:ldb [x+12] l11:jeq #0, l12, l13 l12:ret #0x1 l13:ret #0 You can offload this to u32 in hardware if that is what you want. The reason this works is because of netfilter, which allows them to dynamically generate BPF programs and insert and delete them from chains, do intersection or unions of them. If you have a freestanding program like in XDP the complexity space is a different one and not comparable to this at all. Bye, Hannes
[PATCH iproute2/net-next] ss: initialise variables outside of for loop
Initialise for loops outside of for loops. GCC flags this as being out of spec unless C99 or C11 mode is used. With this change the entire tree appears to compile cleanly with -Wall. $ gcc --version gcc (Debian 4.9.2-10) 4.9.2 ... $ make ... ss.c: In function ‘unix_show_sock’: ss.c:3128:4: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode ... Signed-off-by: Simon Horman --- misc/ss.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 839781ee29bc..ce0b9d3d993d 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -3124,10 +3124,12 @@ static int unix_show_sock(const struct sockaddr_nl *addr, struct nlmsghdr *nlh, memcpy(name, RTA_DATA(tb[UNIX_DIAG_NAME]), len); name[len] = '\0'; - if (name[0] == '\0') - for (int i = 0; i < len; i++) + if (name[0] == '\0') { + int i; + for (i = 0; i < len; i++) if (name[i] == '\0') name[i] = '@'; + } stat.name = &name[0]; memcpy(stat.local.data, &stat.name, sizeof(stat.name)); } -- 2.7.0.rc3.207.g0ac5344
Re: [PATCH net v3] tipc: check minimum bearer MTU
On 12/02/2016 04:33 PM, Michal Kubecek wrote: Qian Zhang (张谦) reported a potential socket buffer overflow in tipc_msg_build() which is also known as CVE-2016-8632: due to insufficient checks, a buffer overflow can occur if MTU is too short for even tipc headers. As anyone can set device MTU in a user/net namespace, this issue can be abused by a regular user. As agreed in the discussion on Ben Hutchings' original patch, we should check the MTU at the moment a bearer is attached rather than for each processed packet. We also need to repeat the check when bearer MTU is adjusted to new device MTU. UDP case also needs a check to avoid overflow when calculating bearer MTU. Fixes: b97bf3fd8f6a ("[TIPC] Initial merge") Signed-off-by: Michal Kubecek Reported-by: Qian Zhang (张谦) --- Thanks, it looks nice to me. Acked-by: Ying Xue
Re: Initial thoughts on TXDP
On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert wrote: > On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote: > > Tom Herbert wrote: [...] > >> - Call into TCP/IP stack with page data directly from driver-- no > >> skbuff allocation or interface. This is essentially provided by the > >> XDP API although we would need to generalize the interface to call > >> stack functions (I previously posted patches for that). We will also > >> need a new action, XDP_HELD?, that indicates the XDP function held the > >> packet (put on a socket for instance). > > > > Seems this will not work at all with the planned page pool thing when > > pages start to be held indefinitely. It is quite the opposite, the page pool support pages are being held for longer times, than drivers today. The current driver page recycle tricks cannot, as they depend on page refcnt being decremented quickly (while pages are still mapped in their recycle queue). > > You can also never get even close to userspace offload stacks once you > > need/do this; allocations in hotpath are too expensive. Yes. It is important to understand that once the number of outstanding pages get large, the driver recycle stops working. Meaning the pages allocations start to go through the page allocator. I've documented[1] that the bare alloc+free cost[2] (231 cycles order-0/4K) is higher than the 10G wirespeed budget (201 cycles). Thus, the driver recycle tricks are nice for benchmarking, as it hides the page allocator overhead. But this optimization might disappear for Tom's and Eric's more real-world use-cases e.g. like 10.000 sockets. The page pool don't these issues. [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2] sh_eth: remove unchecked interrupts
Hi Chris, On Thu, Dec 1, 2016 at 7:53 PM, Chris Brandt wrote: > On 12/1/2016, Sergei Shtylyov wrote: >> >> On 12/01/2016 05:42 PM, Geert Uytterhoeven wrote: >> >> >> --- a/drivers/net/ethernet/renesas/sh_eth.c >> >> +++ b/drivers/net/ethernet/renesas/sh_eth.c >> >> @@ -518,7 +518,7 @@ static struct sh_eth_cpu_data r7s72100_data = { >> >> >> >> .ecsr_value = ECSR_ICD, >> >> .ecsipr_value = ECSIPR_ICDIP, >> >> - .eesipr_value = 0xff7f009f, >> >> + .eesipr_value = 0xe77f009f, >> > >> > Comment not directly related to the merits of this patch: the EESIPR >> > bit definitions seem to be identical to those for EESR, so we can get >> > rid of the hardcoded values here? >> >> Do you mean using the @define's? We have EESIPR bits also declared, >> see enum DMAC_IM_BIT, Yes, that's what I meant. Unfortunately the DMAC_IM_BIT enum doesn't cover all bits. > Is your idea to get rid of .eesipr_value for devices that have values > that are the same as .eesr_err_check? > > > For example in sh_eth_dev_init(): > > sh_eth_modify(ndev, EESR, 0, 0); > mdp->irq_enabled = true; > - sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR); > + if (mdp->cd->eesipr_value) > + sh_eth_write(ndev, mdp->cd->eesipr_value, EESIPR); > + else > + sh_eth_write(ndev, mdp->cd->eesr_err_check, EESIPR); No, my intention was to just get rid of the hardcoded values when initializing .eesipr_value. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
Hi! > >Well, if you have a workload that sends and receive packets, it tends > >to work ok, as you do tx_clean() in stmmac_poll(). My workload is not > >like that -- it is "sending packets at 3MB/sec, receiving none". So > >the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled, > >and then we run out of transmit descriptors, and then 40msec passes, > >and then we clean them. Bad. > > > >And that's why low-res timers do not cut it. > > in that case, I expect that the tuning of the driver could help you. > I mean, by using ethtool, it could be enough to set the IC bit on all > the descriptors. You should touch the tx_coal_frames. > > Then you can use ethtool -S to monitor the status. Yes, I did something similar. Unfortnunately that meant crash within minutes, at least with 4.4 kernel. (If you know what was fixed between 4.4 and 4.9, that would be helpful). > We had experimented this tuning on STB IP where just datagrams > had to send externally. To be honest, although we had seen > better results w/o any timer, we kept this approach enabled > because the timer was fast enough to cover our tests on SH4 boxes. Please reply to David, and explain how it is supposed to work... because right now it does not. 40 msec delays are not acceptable in default configuration. > >>In the ring, some descriptors can raise the irq (according to a > >>threshold) and set the IC bit. In this path, the NAPI poll will be > >>scheduled. > > > >Not NAPI poll but stmmac_tx_timer(), right? > > in the xmit according the the threshold the timer is started or the > interrupt is set inside the descriptor. > Then stmmac_tx_clean will be always called and, if you see the flow, > no irqlock protection is needed! Agreed that no irqlock protection is needed if we rely on napi and timers. > >>Concerning the lock protection, we had reviewed long time ago and > >>IIRC, no raise condition should be present. Open to review it, > >>again! ... > >There's nothing that protect stmmac_poll() from running concurently > >with stmmac_dma_interrupt(), right? > > This is not necessary. dma_interrupt accesses shared priv->xstats; variables are of type unsigned long (not atomic_t), yet they are accesssed from interrupt context and from stmmac_ethtool without any locking. That can result in broken statistics AFAICT. Please take another look. As far as I can tell, you can have two cpus at #1 and #2 in the code, at the same time. It looks like napi_... has some atomic opertions inside so that looks safe at the first look. But I'm not sure if they also include enough memory barriers to make it safe...? static void stmmac_dma_interrupt(struct stmmac_priv *priv) { ... status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats); if (likely((status & handle_rx)) || (status & handle_tx)) { if (likely(napi_schedule_prep(&priv->napi))) { #1 stmmac_disable_dma_irq(priv); __napi_schedule(&priv->napi); } } static int stmmac_poll(struct napi_struct *napi, int budget) { struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi); int work_done = 0; priv->xstats.napi_poll++; stmmac_tx_clean(priv); work_done = stmmac_rx(priv, budget); if (work_done < budget) { napi_complete(napi); #2 stmmac_enable_dma_irq(priv); } return work_done; } Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html signature.asc Description: Digital signature
net/can: warning in raw_setsockopt/__alloc_pages_slowpath
Hi! I've got the following error report while running the syzkaller fuzzer. A reproducer is attached. On commit d8e435f3ab6fea2ea324dce72b51dd7761747523 (Nov 26). [ cut here ] WARNING: CPU: 0 PID: 4009 at mm/page_alloc.c:3511 __alloc_pages_slowpath+0x3d4/0x1bf0 Modules linked in: CPU: 0 PID: 4009 Comm: a.out Not tainted 4.9.0-rc6+ #54 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 88006832f8a8 81c73b14 842c6320 88006832f8f0 8123dc57 880067d86000 0db7 842c6320 0db7 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [] __warn+0x1a7/0x1f0 kernel/panic.c:550 [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 [] __alloc_pages_slowpath+0x3d4/0x1bf0 mm/page_alloc.c:3511 [] __alloc_pages_nodemask+0x5c2/0x710 mm/page_alloc.c:3781 [] alloc_pages_current+0xf4/0x400 mm/mempolicy.c:2072 [< inline >] alloc_pages ./include/linux/gfp.h:469 [] kmalloc_order+0x1f/0x70 mm/slab_common.c:1015 [] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026 [< inline >] kmalloc_large ./include/linux/slab.h:422 [] __kmalloc_track_caller+0x227/0x2a0 mm/slub.c:4233 [] memdup_user+0x2c/0xa0 mm/util.c:137 [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506 [< inline >] SYSC_setsockopt net/socket.c:1757 [] SyS_setsockopt+0x154/0x240 net/socket.c:1736 [] entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:209 ---[ end trace bc80556cca970089 ]--- // autogenerated by syzkaller (http://github.com/google/syzkaller) #ifndef __NR_setsockopt #define __NR_setsockopt 54 #endif #ifndef __NR_socket #define __NR_socket 41 #endif #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include const int kFailStatus = 67; const int kErrorStatus = 68; const int kRetryStatus = 69; __attribute__((noreturn)) void fail(const char* msg, ...) { int e = errno; fflush(stdout); va_list args; va_start(args, msg); vfprintf(stderr, msg, args); va_end(args); fprintf(stderr, " (errno %d)\n", e); exit(kFailStatus); } __attribute__((noreturn)) void exitf(const char* msg, ...) { int e = errno; fflush(stdout); va_list args; va_start(args, msg); vfprintf(stderr, msg, args); va_end(args); fprintf(stderr, " (errno %d)\n", e); exit(kRetryStatus); } static int flag_debug; void debug(const char* msg, ...) { if (!flag_debug) return; va_list args; va_start(args, msg); vfprintf(stdout, msg, args); va_end(args); fflush(stdout); } __thread int skip_segv; __thread jmp_buf segv_env; static void segv_handler(int sig, siginfo_t* info, void* uctx) { if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED)) _longjmp(segv_env, 1); exit(sig); } static void install_segv_handler() { struct sigaction sa; memset(&sa, 0, sizeof(sa)); sa.sa_sigaction = segv_handler; sa.sa_flags = SA_NODEFER | SA_SIGINFO; sigaction(SIGSEGV, &sa, NULL); sigaction(SIGBUS, &sa, NULL); } #define NONFAILING(...)\ {\ __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST); \ if (_setjmp(segv_env) == 0) { \ __VA_ARGS__; \ } \ __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST); \ } static uintptr_t execute_syscall(int nr, uintptr_t a0, uintptr_t a1, uintptr_t a2, uintptr_t a3, uintptr_t a4, uintptr_t a5, uintptr_t a6, uintptr_t a7, uintptr_t a8) { switch (nr) { default: return syscall(nr, a0, a1, a2, a3, a4, a5); } } static void setup_main_process(uint64_t pid) { struct sigaction sa; memset(&sa, 0, sizeof(sa)); sa.sa_handler = SIG_IGN; syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8); syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8); install_segv_handler(); char tmpdir_template[] = "./syzkaller.XX"; char* tmpdir = mkdtemp(tmpdir_template); if (!tmpdir) fail("failed to mkdtemp"); if (chmod(tmpdir, 0777)) fail("failed to chmod"); if (chdir(tmpdir)) fail("failed to chdir"); } static void loop(); static void sandbox_common() { prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0); setpgrp(); setsid(); struct rlimit rlim; rlim.rlim_cur = rlim.rlim_max = 128 << 20; setrlimit(RLIMIT_AS, &rlim); rlim.rlim_cur = rlim.rlim_max = 1 << 20;
arp_filter and IPv6 ND
Hey, net.ipv4.conf.all.arp_filter appears not to have IPv6 counter part. Or am I missing something? That is Linux does answer to ND queries for unrelated interfaces by default, and I can't seem to find way to turn that off. Is it proper maintainership to accept changes to single protocol, without mandating the support for other protocol having same behavioural characteristics? It is good that some parts for ARP and ND have common code in linux (neighbour.c) unlike in BSD where everything seems to be self-contained. I'd wish that even more of ARP/ND would common, because there are still lot of common behavioural code in ARP/ND code itself, which requires double maintenance and are implemented by different people at different times, so leads to different set of bugs and behaviour for same intended behaviour. For example this feature should be protocol agnostic, developer should only need to develop it once for the higher level behavioural code, without minding which IP AFI it is for. Obviously that does not exclude ability to sysctl configure it on/off per AFI. Thanks! -- ++ytti
Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error
On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote: > Whenever the igb driver detects the result of a read operation returns > a value composed only by F's (like 0x), it will detach the > net_device, clear the hw_addr pointer and warn to the user that adapter's > link is lost - those steps happen on igb_rd32(). > > In case a PCI error happens on Power architecture, there's a recovery > mechanism called EEH, that will reset the PCI slot and call driver's > handlers to reset the adapter and network functionality as well. > > We observed that once hw_addr is NULL after the error is detected on > igb_rd32(), it's never assigned back, so in the process of resetting > the network functionality we got a NULL pointer dereference in both > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to avoid > such bug, this patch re-assigns the hw_addr value in the slot_reset > handler. > > Reported-by: Anthony H. Thai > Reported-by: Harsha Thyagaraja > Signed-off-by: Guilherme G. Piccoli > --- > drivers/net/ethernet/intel/igb/igb_main.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > b/drivers/net/ethernet/intel/igb/igb_main.c > index edc9a6a..136ee9e 100644 > --- a/drivers/net/ethernet/intel/igb/igb_main.c > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > @@ -7878,6 +7878,11 @@ static pci_ers_result_t igb_io_slot_reset(struct > pci_dev *pdev) > pci_enable_wake(pdev, PCI_D3hot, 0); > pci_enable_wake(pdev, PCI_D3cold, 0); > > + /* In case of PCI error, adapter lose its HW address > + * so we should re-assign it here. > + */ > + hw->hw_addr = adapter->io_addr; > + > igb_reset(adapter); > wr32(E1000_WUS, ~0); > result = PCI_ERS_RESULT_RECOVERED; > Ping? Sorry to annoy, any news on this? Thanks in advance! Cheers, Guilherme
Re: Initial thoughts on TXDP
On Thu, 1 Dec 2016 23:47:44 +0100 Hannes Frederic Sowa wrote: > Side note: > > On 01.12.2016 20:51, Tom Herbert wrote: > >> > E.g. "mini-skb": Even if we assume that this provides a speedup > >> > (where does that come from? should make no difference if a 32 or > >> > 320 byte buffer gets allocated). Yes, the size of the allocation from the SLUB allocator does not change base performance/cost much (at least for small objects, if < 1024). Do notice the base SLUB alloc+free cost is fairly high (compared to a 201 cycles budget). Especially for networking as the free-side is very likely to hit a slow path. SLUB fast-path 53 cycles, and slow-path around 100 cycles (data from [1]). I've tried to address this with the kmem_cache bulk APIs. Which reduce the cost to approx 30 cycles. (Something we have not fully reaped the benefit from yet!) [1] https://git.kernel.org/torvalds/c/ca257195511 > >> > > > It's the zero'ing of three cache lines. I believe we talked about that > > as netdev. Actually 4 cache-lines, but with some cleanup I believe we can get down to clearing 192 bytes 3 cache-lines. > > Jesper and me played with that again very recently: > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 > > In micro-benchmarks we saw a pretty good speed up not using the rep > stosb generated by gcc builtin but plain movq's. Probably the cost model > for __builtin_memset in gcc is wrong? Yes, I believe so. > When Jesper is free we wanted to benchmark this and maybe come up with a > arch specific way of cleaning if it turns out to really improve throughput. > > SIMD instructions seem even faster but the kernel_fpu_begin/end() kill > all the benefits. One strange thing was, that on my skylake CPU (i7-6700K @4.00GHz), Hannes's hand-optimized MOVQ ASM-code didn't go past 8 bytes per cycle, or 32 cycles for 256 bytes. Talking to Alex and John during netdev, and reading on the Intel arch, I though that this CPU should be-able-to perform 16 bytes per cycle. The CPU can do it as the rep-stos show this once the size gets large enough. On this CPU the memset rep stos starts to win around 512 bytes: 192/35 = 5.5 bytes/cycle 256/36 = 7.1 bytes/cycle 512/40 = 12.8 bytes/cycle 768/46 = 16.7 bytes/cycle 1024/52 = 19.7 bytes/cycle 2048/84 = 24.4 bytes/cycle 4096/148= 27.7 bytes/cycle -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
On 12/02/2016 01:43 PM, Andrey Konovalov wrote: > Hi! > > I've got the following error report while running the syzkaller fuzzer. > > A reproducer is attached. > > On commit d8e435f3ab6fea2ea324dce72b51dd7761747523 (Nov 26). > > [ cut here ] > WARNING: CPU: 0 PID: 4009 at mm/page_alloc.c:3511 > __alloc_pages_slowpath+0x3d4/0x1bf0 > Modules linked in: > CPU: 0 PID: 4009 Comm: a.out Not tainted 4.9.0-rc6+ #54 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 88006832f8a8 81c73b14 > 842c6320 88006832f8f0 8123dc57 > 880067d86000 0db7 842c6320 0db7 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [] dump_stack+0xb3/0x10f lib/dump_stack.c:51 > [] __warn+0x1a7/0x1f0 kernel/panic.c:550 > [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 > [] __alloc_pages_slowpath+0x3d4/0x1bf0 mm/page_alloc.c:3511 > [] __alloc_pages_nodemask+0x5c2/0x710 mm/page_alloc.c:3781 > [] alloc_pages_current+0xf4/0x400 mm/mempolicy.c:2072 > [< inline >] alloc_pages ./include/linux/gfp.h:469 > [] kmalloc_order+0x1f/0x70 mm/slab_common.c:1015 > [] kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026 > [< inline >] kmalloc_large ./include/linux/slab.h:422 > [] __kmalloc_track_caller+0x227/0x2a0 mm/slub.c:4233 > [] memdup_user+0x2c/0xa0 mm/util.c:137 > [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506 We should add a check for a sensible optlen > static int raw_setsockopt(struct socket *sock, int level, int optname, > char __user *optval, unsigned int optlen) > { > struct sock *sk = sock->sk; > struct raw_sock *ro = raw_sk(sk); > struct can_filter *filter = NULL; /* dyn. alloc'ed filters */ > struct can_filter sfilter; /* single filter */ > struct net_device *dev = NULL; > can_err_mask_t err_mask = 0; > int count = 0; > int err = 0; > > if (level != SOL_CAN_RAW) > return -EINVAL; > > switch (optname) { > > case CAN_RAW_FILTER: > if (optlen % sizeof(struct can_filter) != 0) > return -EINVAL; here... if (optlen > 64 * sizeof(struct can_filter)) return -EINVAL; > > count = optlen / sizeof(struct can_filter); > > if (count > 1) { > /* filter does not fit into dfilter => alloc space */ > filter = memdup_user(optval, optlen); > if (IS_ERR(filter)) > return PTR_ERR(filter); > } else if (count == 1) { > if (copy_from_user(&sfilter, optval, sizeof(sfilter))) > return -EFAULT; > } > > lock_sock(sk); Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | signature.asc Description: OpenPGP digital signature
Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
On 12/2/2016 1:32 PM, Pavel Machek wrote: Hi! Well, if you have a workload that sends and receive packets, it tends to work ok, as you do tx_clean() in stmmac_poll(). My workload is not like that -- it is "sending packets at 3MB/sec, receiving none". So the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled, and then we run out of transmit descriptors, and then 40msec passes, and then we clean them. Bad. And that's why low-res timers do not cut it. in that case, I expect that the tuning of the driver could help you. I mean, by using ethtool, it could be enough to set the IC bit on all the descriptors. You should touch the tx_coal_frames. Then you can use ethtool -S to monitor the status. Yes, I did something similar. Unfortnunately that meant crash within minutes, at least with 4.4 kernel. (If you know what was fixed between 4.4 and 4.9, that would be helpful). 4.4 has no GMAC4 support. Alex, do you remember any patches to fix that? We had experimented this tuning on STB IP where just datagrams had to send externally. To be honest, although we had seen better results w/o any timer, we kept this approach enabled because the timer was fast enough to cover our tests on SH4 boxes. Please reply to David, and explain how it is supposed to work... because right now it does not. 40 msec delays are not acceptable in default configuration. I mean, that on UP and SMP system this schema helped to improve the performance saving CPU on my side and this has been tested since a long time (~4 years). I tested something similar to yours where unidirectional traffic with limited throughput was needed and I can confirm you that tuning/removing coalesce parameters this helped. The tuning I decided to keep in the driver was suitable in several user cases and if now you have problems or you want to review it I can just confirm that there are no problems on my side. If you want to simply the logic around the tx process and remove timer on official driver I can accept that. I will just ask you uto double check if the throughput and CPU usage when request max throughput (better if on GiGa setup) has no regressions. Otherwise we could start thinking about adaptive schema if feasible. In the ring, some descriptors can raise the irq (according to a threshold) and set the IC bit. In this path, the NAPI poll will be scheduled. Not NAPI poll but stmmac_tx_timer(), right? in the xmit according the the threshold the timer is started or the interrupt is set inside the descriptor. Then stmmac_tx_clean will be always called and, if you see the flow, no irqlock protection is needed! Agreed that no irqlock protection is needed if we rely on napi and timers. ok Concerning the lock protection, we had reviewed long time ago and IIRC, no raise condition should be present. Open to review it, again! ... There's nothing that protect stmmac_poll() from running concurently with stmmac_dma_interrupt(), right? This is not necessary. dma_interrupt accesses shared priv->xstats; variables are of type unsigned long (not atomic_t), yet they are accesssed from interrupt context and from stmmac_ethtool without any locking. That can result in broken statistics AFAICT. ok we can check this and welcome patches and I'd prefer to remove xstats from critical part of the code like ISR (that comes from old story of the driver). Please take another look. As far as I can tell, you can have two cpus at #1 and #2 in the code, at the same time. It looks like napi_... has some atomic opertions inside so that looks safe at the first look. But I'm not sure if they also include enough memory barriers to make it safe...? Although I have never reproduced related issues on SMP platforms due to reordering of memory operations but, as said above, welcome review on this especially if you are seeing problems when manage NAPI. FYI, the only memory barrier you will see in the driver are about the OWN_BIT setting till now. static void stmmac_dma_interrupt(struct stmmac_priv *priv) { ... status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats); if (likely((status & handle_rx)) || (status & handle_tx)) { if (likely(napi_schedule_prep(&priv->napi))) { #1 stmmac_disable_dma_irq(priv); __napi_schedule(&priv->napi); } } static int stmmac_poll(struct napi_struct *napi, int budget) { struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi); int work_done = 0; priv->xstats.napi_poll++; stmmac_tx_clean(priv); work_done = stmmac_rx(priv, budget); if (work_done < budget) { napi_complete(napi); #2 stmmac_enable_dma_irq(priv); } hmm, I have to check (and refresh my memory) but the driver uses the napi_schedule_prep. Regards Peppe return work_done; } Best regards,
Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
On 01/12/16 18:29, Alexander Duyck wrote: On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman wrote: On 29/11/16 23:14, Alexander Duyck wrote: On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Signed-off-by: Robert Shearman So I am not entirely convinced this is a regression, I was wondering what your numbers looked like before the patch you are saying this fixes? I recall I fixed a number of issues leading up to this so I am just curious. At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child") which is the parent of 5405afd1a306: $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m3.451s user0m0.184s sys 0m2.004s At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking value for suffix length"): $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m48.624s user0m0.360s sys 0m46.960s So does that warrant a fixes tag for this patch? Yes, please include it in the patch description next time. I've been giving it thought and the fact is the patch series has merit. I just think it was being a bit too aggressive in terms of some of the changes. So placing any changes in put_child seem to be the one piece in this that make this a bit ugly. Does that imply the current updating of the parent's slen value should be removed from put_child then? I started out without doing the changes in put_child, but that meant the changes to push the slen up through the trie were spread all over the place. This seemed like the cleaner solution. + } +} + /* Add a child at position i overwriting the old value. * Update the value of full_children and empty_children. + * + * The suffix length of the parent node and the rest of the tree is + * updated (if required) when adding/replacing a node, but is caller's + * responsibility on removal. */ static void put_child(struct key_vector *tn, unsigned long i, struct key_vector *n) @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i, else if (!wasfull && isfull) tn_info(tn)->full_children++; - if (n && (tn->slen < n->slen)) - tn->slen = n->slen; + if (n) + node_push_suffix(tn, n); This is just creating redundant work if we have to perform a resize. The only real redundant work is the assignment of slen in tn, since the propagation up the trie has to happen regardless and if a subsequent resize results in changes to the trie then the propagation will stop at tn's parent since the suffix length of the tn's parent will not be less than tn's suffix length. This isn't going to work. We want to avoid trying to push the suffix when we place a child. There are scenarios where we are placing children in nodes that don't have parents yet because we are doing rebalances and such. I suspect this could be pretty expensive on a resize call. It's not expensive because the new parents that are being built up are orphaned from the rest of the trie, so the push up won't proceed into the rest of the trie until the new node is inserted into the trie. We want to pull the suffix pushing out of the resize completely. The problem is this is going to cause us to start pushing suffixes for every node moved as a part of resize. This will mean that nodes will have to be visited multiple times, which could well be more expensive than doing it during the resize. rcu_assign_pointer(tn->tnode[i], n); } ... -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l) +static void node_set_suffix(struct key_vector *tp, unsigned char slen) { - while ((tp->slen > tp->pos) && (tp->slen > l->slen)) { - if (update_suffix(tp) > l->slen) + if (slen > tp->slen) + tp->slen = slen; +} + +static void node_pull_suffix(struct key_vector *tn) +{ + struct key_vector *tp; + unsigned char slen; + + slen = update_suffix(tn); + tp = node_parent(tn); + while ((tp->slen > tp->pos) && (tp->slen > slen)) { + if (update_suffix(tp) > slen) break; tp = node_parent(tp); } } Actually I realized that there is a bug here. The check for update_suffix(tp) > slen can cause us to stop prematurely if I am not mistaken. What we should probably be doing is pulling out tp->slen and if the result of update_suffix is the same value then we can break the loop. Otherwise we can stop early if a "second runner up" shows up that is supposed to be pushed up the trie. Excellent point. This also
Aw: Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
Hi, > > There's nothing that protect stmmac_poll() from running concurently > with stmmac_dma_interrupt(), right? > could it be that there is also another issue concerned locking?: The tx completion handler takes the xmit_lock in case that the netif_queue is stopped. This is AFAICS unnecessary, since both xmit and completion handler are already synchronized by the private tx lock. But it is IMHO also dangerous: In the xmit handler we have the locking order 1. xmit_lock 2. private tx lock while in the completion handler its the reverse: 1. private tx lock 2. xmit lock. Do I miss something? Regards, Lino
Re: arp_filter and IPv6 ND
On 02.12.2016 13:51, Saku Ytti wrote: > net.ipv4.conf.all.arp_filter appears not to have IPv6 counter part. > Or am I missing something? That is Linux does answer to ND queries for > unrelated interfaces by default, and I can't seem to find way to turn > that off. May I ask why you want to turn it off? In IPv6 this depends on the scope. In IPv4 this concept doesn't really exist. Please notice that in IPv4 arp_filter does not necessarily mean that the system is operating in strong end system mode but you end up in an hybrid clone where arp is acting strong but routing not and thus you also have to add fib rules to simulate that. > Is it proper maintainership to accept changes to single protocol, > without mandating the support for other protocol having same > behavioural characteristics? > > It is good that some parts for ARP and ND have common code in linux > (neighbour.c) unlike in BSD where everything seems to be > self-contained. > > I'd wish that even more of ARP/ND would common, because there are > still lot of common behavioural code in ARP/ND code itself, which > requires double maintenance and are implemented by different people at > different times, so leads to different set of bugs and behaviour for > same intended behaviour. > > For example this feature should be protocol agnostic, developer should > only need to develop it once for the higher level behavioural code, > without minding which IP AFI it is for. Obviously that does not > exclude ability to sysctl configure it on/off per AFI.
Re: [PATCH] stmmac: reduce code duplication getting basic descriptors
Hi Pavel, On 11/28/2016 01:17 PM, Pavel Machek wrote: Remove code duplication getting basic descriptors. I agree with your patch, it will make code easier to understand. After fix kbuild issue you can add my Acked-by; Regards Alex Signed-off-by: Pavel Machek diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index f7133d0..ed20668 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -929,6 +929,17 @@ static int stmmac_set_bfsize(int mtu, int bufsize) return ret; } +static inline struct dma_desc *stmmac_tx_desc(struct stmmac_priv *priv, int i) +{ + struct dma_desc *p; + if (priv->extend_desc) + p = &((priv->dma_etx + i)->basic); + else + p = priv->dma_tx + i; + return p; +} + + /** * stmmac_clear_descriptors - clear descriptors * @priv: driver private structure @@ -1078,11 +1089,7 @@ static int init_dma_desc_rings(struct net_device *dev, gfp_t flags) /* TX INITIALIZATION */ for (i = 0; i < DMA_TX_SIZE; i++) { - struct dma_desc *p; - if (priv->extend_desc) - p = &((priv->dma_etx + i)->basic); - else - p = priv->dma_tx + i; + struct dma_desc *p = stmmac_tx_desc(priv, i); if (priv->synopsys_id >= DWMAC_CORE_4_00) { p->des0 = 0; @@ -1129,12 +1136,7 @@ static void dma_free_tx_skbufs(struct stmmac_priv *priv) int i; for (i = 0; i < DMA_TX_SIZE; i++) { - struct dma_desc *p; - - if (priv->extend_desc) - p = &((priv->dma_etx + i)->basic); - else - p = priv->dma_tx + i; + struct dma_desc *p = stmmac_tx_desc(priv, i); if (priv->tx_skbuff_dma[i].buf) { if (priv->tx_skbuff_dma[i].map_as_page) @@ -1314,14 +1316,9 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv) while (entry != priv->cur_tx) { struct sk_buff *skb = priv->tx_skbuff[entry]; - struct dma_desc *p; + struct dma_desc *p = stmmac_tx_desc(priv, entry); int status; - if (priv->extend_desc) - p = (struct dma_desc *)(priv->dma_etx + entry); - else - p = priv->dma_tx + entry; - status = priv->hw->desc->tx_status(&priv->dev->stats, &priv->xstats, p, priv->ioaddr); @@ -2227,11 +2224,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev) csum_insertion = (skb->ip_summed == CHECKSUM_PARTIAL); - if (likely(priv->extend_desc)) - desc = (struct dma_desc *)(priv->dma_etx + entry); - else - desc = priv->dma_tx + entry; - + desc = stmmac_tx_desc(priv, entry); first = desc; priv->tx_skbuff[first_entry] = skb; @@ -2254,10 +2247,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev) entry = STMMAC_GET_ENTRY(entry, DMA_TX_SIZE); - if (likely(priv->extend_desc)) - desc = (struct dma_desc *)(priv->dma_etx + entry); - else - desc = priv->dma_tx + entry; + desc = stmmac_tx_desc(priv, entry); des = skb_frag_dma_map(priv->device, frag, 0, len, DMA_TO_DEVICE);
[PATCH 3/6] net: stmmac: stmmac_platform: fix parsing of DT binding
From: Niklas Cassel commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with DT") changed the parsing of the DT binding. Before 64c3b252e9fc, snps,fixed-burst and snps,mixed-burst were parsed regardless if the property snps,pbl existed or not. After the commit, fixed burst and mixed burst are only parsed if snps,pbl exists. Now when snps,aal has been added, it too is only parsed if snps,pbl exists. Since the DT binding does not specify that fixed burst, mixed burst or aal depend on snps,pbl being specified, undo changes introduced by 64c3b252e9fc. The issue commit 64c3b252e9fc ("net: stmmac: fixed the pbl setting with DT") tries to address is solved in another way: The databook specifies that all values other than 1, 2, 4, 8, 16, or 32 results in undefined behavior, so snps,pbl = <0> is invalid. If pbl is 0 after parsing, set pbl to DEFAULT_DMA_PBL. This handles the case where the property is omitted, and also handles the case where the property is specified without any data. Signed-off-by: Niklas Cassel --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 +++ .../net/ethernet/stmicro/stmmac/stmmac_platform.c | 27 +++--- 2 files changed, 16 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 27a03f7ee095..9ba2eb4c22fe 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1585,6 +1585,9 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv) return -EINVAL; } + if (!priv->plat->dma_cfg->pbl) + priv->plat->dma_cfg->pbl = DEFAULT_DMA_PBL; + if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE)) atds = 1; diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index b46c892c7079..05b8c33effd5 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -303,21 +303,20 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac) plat->force_sf_dma_mode = 1; } - if (of_find_property(np, "snps,pbl", NULL)) { - dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg), - GFP_KERNEL); - if (!dma_cfg) { - of_node_put(plat->phy_node); - return ERR_PTR(-ENOMEM); - } - plat->dma_cfg = dma_cfg; - of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl); - dma_cfg->aal = of_property_read_bool(np, "snps,aal"); - dma_cfg->fixed_burst = - of_property_read_bool(np, "snps,fixed-burst"); - dma_cfg->mixed_burst = - of_property_read_bool(np, "snps,mixed-burst"); + dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg), + GFP_KERNEL); + if (!dma_cfg) { + of_node_put(plat->phy_node); + return ERR_PTR(-ENOMEM); } + plat->dma_cfg = dma_cfg; + + of_property_read_u32(np, "snps,pbl", &dma_cfg->pbl); + + dma_cfg->aal = of_property_read_bool(np, "snps,aal"); + dma_cfg->fixed_burst = of_property_read_bool(np, "snps,fixed-burst"); + dma_cfg->mixed_burst = of_property_read_bool(np, "snps,mixed-burst"); + plat->force_thresh_dma_mode = of_property_read_bool(np, "snps,force_thresh_dma_mode"); if (plat->force_thresh_dma_mode) { plat->force_sf_dma_mode = 0; -- 2.1.4
[PATCH 2/6] net: stmmac: simplify the common DMA init API
From: Niklas Cassel Use struct stmmac_dma_cfg *dma_cfg as an argument rather than using all the struct members as individual arguments. Signed-off-by: Niklas Cassel --- drivers/net/ethernet/stmicro/stmmac/common.h| 4 ++-- drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c | 13 +++-- drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c | 7 --- drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c| 14 -- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 6 +- 5 files changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 6d2de4e01f6d..3023ec7ae83e 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -411,8 +411,8 @@ extern const struct stmmac_desc_ops ndesc_ops; struct stmmac_dma_ops { /* DMA core initialization */ int (*reset)(void __iomem *ioaddr); - void (*init)(void __iomem *ioaddr, int pbl, int fb, int mb, -int aal, u32 dma_tx, u32 dma_rx, int atds); + void (*init)(void __iomem *ioaddr, struct stmmac_dma_cfg *dma_cfg, +u32 dma_tx, u32 dma_rx, int atds); /* Configure the AXI Bus Mode Register */ void (*axi)(void __iomem *ioaddr, struct stmmac_axi *axi); /* Dump DMA registers */ diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index 990746955216..01d0d0f315e5 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -82,8 +82,9 @@ static void dwmac1000_dma_axi(void __iomem *ioaddr, struct stmmac_axi *axi) writel(value, ioaddr + DMA_AXI_BUS_MODE); } -static void dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, - int aal, u32 dma_tx, u32 dma_rx, int atds) +static void dwmac1000_dma_init(void __iomem *ioaddr, + struct stmmac_dma_cfg *dma_cfg, + u32 dma_tx, u32 dma_rx, int atds) { u32 value = readl(ioaddr + DMA_BUS_MODE); @@ -99,20 +100,20 @@ static void dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, */ value |= DMA_BUS_MODE_MAXPBL; value &= ~DMA_BUS_MODE_PBL_MASK; - value |= (pbl << DMA_BUS_MODE_PBL_SHIFT); + value |= (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT); /* Set the Fixed burst mode */ - if (fb) + if (dma_cfg->fixed_burst) value |= DMA_BUS_MODE_FB; /* Mixed Burst has no effect when fb is set */ - if (mb) + if (dma_cfg->mixed_burst) value |= DMA_BUS_MODE_MB; if (atds) value |= DMA_BUS_MODE_ATDS; - if (aal) + if (dma_cfg->aal) value |= DMA_BUS_MODE_AAL; writel(value, ioaddr + DMA_BUS_MODE); diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c index 61f54c99a7de..e5664da382f3 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c @@ -32,11 +32,12 @@ #include "dwmac100.h" #include "dwmac_dma.h" -static void dwmac100_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, - int aal, u32 dma_tx, u32 dma_rx, int atds) +static void dwmac100_dma_init(void __iomem *ioaddr, + struct stmmac_dma_cfg *dma_cfg, + u32 dma_tx, u32 dma_rx, int atds) { /* Enable Application Access by writing to DMA CSR0 */ - writel(DMA_BUS_MODE_DEFAULT | (pbl << DMA_BUS_MODE_PBL_SHIFT), + writel(DMA_BUS_MODE_DEFAULT | (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT), ioaddr + DMA_BUS_MODE); /* Mask interrupts by writing to CSR7 */ diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c index 577316de6ba8..0946546d6dcd 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c @@ -97,27 +97,29 @@ static void dwmac4_dma_init_channel(void __iomem *ioaddr, int pbl, writel(dma_rx_phy, ioaddr + DMA_CHAN_RX_BASE_ADDR(channel)); } -static void dwmac4_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, - int aal, u32 dma_tx, u32 dma_rx, int atds) +static void dwmac4_dma_init(void __iomem *ioaddr, + struct stmmac_dma_cfg *dma_cfg, + u32 dma_tx, u32 dma_rx, int atds) { u32 value = readl(ioaddr + DMA_SYS_BUS_MODE); int i; /* Set the Fixed burst mode */ - if (fb) + if (dma_cfg->fixed_burst) value |= DMA_SYS_BUS_FB; /* Mixed Burst has no effect when fb is set */ - if (mb)
[PATCH 4/6] net: stmmac: dwmac1000: fix define DMA_BUS_MODE_RPBL_MASK
From: Niklas Cassel DMA_BUS_MODE_RPBL_MASK is really 6 bits, just like DMA_BUS_MODE_PBL_MASK. Signed-off-by: Niklas Cassel --- drivers/net/ethernet/stmicro/stmmac/dwmac1000.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h index ff3e5ab39bd0..52b9407a8a39 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h @@ -225,7 +225,7 @@ enum rx_tx_priority_ratio { #define DMA_BUS_MODE_FB0x0001 /* Fixed burst */ #define DMA_BUS_MODE_MB0x0400 /* Mixed burst */ -#define DMA_BUS_MODE_RPBL_MASK 0x003e /* Rx-Programmable Burst Len */ +#define DMA_BUS_MODE_RPBL_MASK 0x007e /* Rx-Programmable Burst Len */ #define DMA_BUS_MODE_RPBL_SHIFT17 #define DMA_BUS_MODE_USP 0x0080 #define DMA_BUS_MODE_MAXPBL0x0100 -- 2.1.4
[PATCH 1/6] net: stmmac: return error if no DMA configuration is found
From: Niklas Cassel All drivers except pci glue layer calls stmmac_probe_config_dt. stmmac_probe_config_dt does a kzalloc dma_cfg. pci glue layer does kzalloc dma_cfg explicitly, so all current drivers does a kzalloc dma_cfg. Return an error if no DMA configuration is found, that way we can assume that the DMA configuration always exists. Signed-off-by: Niklas Cassel --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 1f9ec02fa7f8..04f88df7da49 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1577,16 +1577,12 @@ static void stmmac_check_ether_addr(struct stmmac_priv *priv) */ static int stmmac_init_dma_engine(struct stmmac_priv *priv) { - int pbl = DEFAULT_DMA_PBL, fixed_burst = 0, aal = 0; - int mixed_burst = 0; int atds = 0; int ret = 0; - if (priv->plat->dma_cfg) { - pbl = priv->plat->dma_cfg->pbl; - fixed_burst = priv->plat->dma_cfg->fixed_burst; - mixed_burst = priv->plat->dma_cfg->mixed_burst; - aal = priv->plat->dma_cfg->aal; + if (!priv->plat->dma_cfg) { + dev_err(priv->device, "DMA configuration not found\n"); + return -EINVAL; } if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE)) @@ -1598,8 +1594,12 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv) return ret; } - priv->hw->dma->init(priv->ioaddr, pbl, fixed_burst, mixed_burst, - aal, priv->dma_tx_phy, priv->dma_rx_phy, atds); + priv->hw->dma->init(priv->ioaddr, + priv->plat->dma_cfg->pbl, + priv->plat->dma_cfg->fixed_burst, + priv->plat->dma_cfg->mixed_burst, + priv->plat->dma_cfg->aal, + priv->dma_tx_phy, priv->dma_rx_phy, atds); if (priv->synopsys_id >= DWMAC_CORE_4_00) { priv->rx_tail_addr = priv->dma_rx_phy + -- 2.1.4
[PATCH 5/6] net: stmmac: add support for independent DMA pbl for tx/rx
From: Niklas Cassel GMAC and newer supports independent programmable burst lengths for DMA tx/rx. Add new optional devicetree properties representing this. To be backwards compatible, snps,pbl will still be valid, but snps,txpbl/snps,rxpbl will override the value in snps,pbl if set. If the IP is synthesized to use the AXI interface, there is a register and a matching DT property inside the optional stmmac-axi-config DT node for controlling burst lengths, named snps,blen. However, using this register, it is not possible to control tx and rx independently. Also, this register is not available if the IP was synthesized with, e.g., the AHB interface. Signed-off-by: Niklas Cassel --- Documentation/devicetree/bindings/net/stmmac.txt | 6 +- Documentation/networking/stmmac.txt | 19 +-- drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c | 12 ++-- drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 12 +++- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 2 ++ include/linux/stmmac.h| 2 ++ 6 files changed, 35 insertions(+), 18 deletions(-) diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt index b95ff998ba73..8080038ff1b2 100644 --- a/Documentation/devicetree/bindings/net/stmmac.txt +++ b/Documentation/devicetree/bindings/net/stmmac.txt @@ -34,7 +34,11 @@ Optional properties: platforms. - tx-fifo-depth: See ethernet.txt file in the same directory - rx-fifo-depth: See ethernet.txt file in the same directory -- snps,pbl Programmable Burst Length +- snps,pbl Programmable Burst Length (tx and rx) +- snps,txpbl Tx Programmable Burst Length. Only for GMAC and newer. + If set, DMA tx will use this value rather than snps,pbl. +- snps,rxpbl Rx Programmable Burst Length. Only for GMAC and newer. + If set, DMA rx will use this value rather than snps,pbl. - snps,aal Address-Aligned Beats - snps,fixed-burst Program the DMA to use the fixed burst mode - snps,mixed-burst Program the DMA to use the mixed burst mode diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index e226f8925c9e..82c8e496b4bb 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -154,7 +154,8 @@ Where: o pbl: the Programmable Burst Length is maximum number of beats to be transferred in one DMA transaction. GMAC also enables the 4xPBL by default. - o fixed_burst/mixed_burst/burst_len + o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx. + o fixed_burst/mixed_burst/aal o clk_csr: fixed CSR Clock range selection. o has_gmac: uses the GMAC core. o enh_desc: if sets the MAC will use the enhanced descriptor structure. @@ -206,16 +207,22 @@ tuned according to the HW capabilities. struct stmmac_dma_cfg { int pbl; + int txpbl; + int rxpbl; int fixed_burst; - int burst_len_supported; + int mixed_burst; + bool aal; }; Where: - o pbl: Programmable Burst Length + o pbl: Programmable Burst Length (tx and rx) + o txpbl: Transmit Programmable Burst Length. Only for GMAC and newer. +If set, DMA tx will use this value rather than pbl. + o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer. +If set, DMA rx will use this value rather than pbl. o fixed_burst: program the DMA to use the fixed burst mode - o burst_len: this is the value we put in the register - supported values are provided as macros in - linux/stmmac.h header file. + o mixed_burst: program the DMA to use the mixed burst mode + o aal: Address-Aligned Beats --- diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index 01d0d0f315e5..1dd34fb4c1a9 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -87,20 +87,20 @@ static void dwmac1000_dma_init(void __iomem *ioaddr, u32 dma_tx, u32 dma_rx, int atds) { u32 value = readl(ioaddr + DMA_BUS_MODE); + int txpbl = dma_cfg->txpbl ?: dma_cfg->pbl; + int rxpbl = dma_cfg->rxpbl ?: dma_cfg->pbl; /* * Set the DMA PBL (Programmable Burst Length) mode. * * Note: before stmmac core 3.50 this mode bit was 4xPBL, and * post 3.5 mode bit acts as 8*PBL. -* -* This configuration doesn't take care about the Separate PBL -* so only the bits: 13-8 are programmed with the PBL passed from the -* platform. */ value |= DMA_BUS_MODE_MAXPBL; - value &= ~DMA_BUS_MODE_PBL_MASK; - value |= (dma_cfg->pbl << DMA_BUS_MODE_PBL_SHIFT); + value |= DMA_BUS_MODE_USP; + val
[PATCH 6/6] net: smmac: allow configuring lower pbl values
From: Niklas Cassel The driver currently always sets the PBLx8/PBLx4 bit, which means that the pbl values configured via the pbl/txpbl/rxpbl DT properties are always multiplied by 8/4 in the hardware. In order to allow the DT to configure lower pbl values, while at the same time not changing behavior of any existing device trees using the pbl/txpbl/rxpbl settings, add a property to disable the multiplication of the pbl by 8/4 in the hardware. Suggested-by: Rabin Vincent Signed-off-by: Niklas Cassel --- Documentation/devicetree/bindings/net/stmmac.txt | 2 ++ Documentation/networking/stmmac.txt | 5 - drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c | 3 ++- drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 3 ++- drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 2 ++ drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 1 + include/linux/stmmac.h| 1 + 7 files changed, 14 insertions(+), 3 deletions(-) diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt index 8080038ff1b2..128da752fec9 100644 --- a/Documentation/devicetree/bindings/net/stmmac.txt +++ b/Documentation/devicetree/bindings/net/stmmac.txt @@ -39,6 +39,8 @@ Optional properties: If set, DMA tx will use this value rather than snps,pbl. - snps,rxpbl Rx Programmable Burst Length. Only for GMAC and newer. If set, DMA rx will use this value rather than snps,pbl. +- snps,no-pbl-x8 Don't multiply the pbl/txpbl/rxpbl values by 8. + For core rev < 3.50, don't multiply the values by 4. - snps,aal Address-Aligned Beats - snps,fixed-burst Program the DMA to use the fixed burst mode - snps,mixed-burst Program the DMA to use the mixed burst mode diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index 82c8e496b4bb..d3376c5fbcf0 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -153,8 +153,9 @@ Where: o dma_cfg: internal DMA parameters o pbl: the Programmable Burst Length is maximum number of beats to be transferred in one DMA transaction. - GMAC also enables the 4xPBL by default. + GMAC also enables the 4xPBL by default. (8xPBL for GMAC 3.50 and newer) o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx. + o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. o fixed_burst/mixed_burst/aal o clk_csr: fixed CSR Clock range selection. o has_gmac: uses the GMAC core. @@ -209,6 +210,7 @@ struct stmmac_dma_cfg { int pbl; int txpbl; int rxpbl; + bool pblx8; int fixed_burst; int mixed_burst; bool aal; @@ -220,6 +222,7 @@ Where: If set, DMA tx will use this value rather than pbl. o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer. If set, DMA rx will use this value rather than pbl. + o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. o fixed_burst: program the DMA to use the fixed burst mode o mixed_burst: program the DMA to use the mixed burst mode o aal: Address-Aligned Beats diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index 1dd34fb4c1a9..1d313af647b4 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -96,7 +96,8 @@ static void dwmac1000_dma_init(void __iomem *ioaddr, * Note: before stmmac core 3.50 this mode bit was 4xPBL, and * post 3.5 mode bit acts as 8*PBL. */ - value |= DMA_BUS_MODE_MAXPBL; + if (dma_cfg->pblx8) + value |= DMA_BUS_MODE_MAXPBL; value |= DMA_BUS_MODE_USP; value &= ~(DMA_BUS_MODE_PBL_MASK | DMA_BUS_MODE_RPBL_MASK); value |= (txpbl << DMA_BUS_MODE_PBL_SHIFT); diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c index 0bf47825bfeb..0f7110d19a4a 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c @@ -82,7 +82,8 @@ static void dwmac4_dma_init_channel(void __iomem *ioaddr, * on each channel */ value = readl(ioaddr + DMA_CHAN_CONTROL(channel)); - value = value | DMA_BUS_MODE_PBL; + if (dma_cfg->pblx8) + value = value | DMA_BUS_MODE_PBL; writel(value, ioaddr + DMA_CHAN_CONTROL(channel)); value = readl(ioaddr + DMA_CHAN_TX_CONTROL(channel)); diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c index 56c8a2342c14..a2831773431a 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c @@ -81,6 +81,7 @@
[PATCH] net: phy: dp83848: Support ethernet pause frames
According to the documentation, the PHYs supported by this driver can also support pause frames. Announce this to be so. Tested with a TI83822I. Signed-off-by: Jesper Nilsson --- drivers/net/phy/dp83848.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c index 800b39f..6e4117f 100644 --- a/drivers/net/phy/dp83848.c +++ b/drivers/net/phy/dp83848.c @@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl); .phy_id = _id, \ .phy_id_mask= 0xfff0, \ .name = _name,\ - .features = PHY_BASIC_FEATURES, \ + .features = (PHY_BASIC_FEATURES | \ + SUPPORTED_Pause | SUPPORTED_Asym_Pause),\ .flags = PHY_HAS_INTERRUPT,\ \ .soft_reset = genphy_soft_reset,\ -- 2.1.4 /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nils...@axis.com
Re: [WIP] net+mlx4: auto doorbell
On Thu, 2016-12-01 at 09:04 -0800, Eric Dumazet wrote: > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote: > > > I think you misunderstood my concept[1]. I don't want to stop the > > queue. The new __QUEUE_STATE_FLUSH_NEEDED does not stop the queue, is > > it just indicating that someone need to flush/ring-doorbell. Maybe it > > need another name, because it also indicate that the driver can see > > that its TX queue is so busy that we don't need to call it immediately. > > The qdisc layer can then choose to enqueue instead if doing direct xmit. > > But driver ndo_start_xmit() does not have a pointer to qdisc. > > Also the concept of 'queue busy' just because we queued one packet is a > bit flaky. > > > > > When qdisc layer or trafgen/af_packet see this indication it knows it > > should/must flush the queue when it don't have more work left. Perhaps > > through net_tx_action(), by registering itself and e.g. if qdisc_run() > > is called and queue is empty then check if queue needs a flush. I would > > also allow driver to flush and clear this bit. > > net_tx_action() is not normally called, unless BQL limit is hit and/or > some qdiscs with throttling (HTB, TBF, FQ, ...) > > > > > I just see it as an extension of your solution, as we still need the > > driver to figure out then the doorbell/flush can be delayed. > > p.s. don't be discouraged by this feedback, I'm just very excited and > > happy that your are working on a solution in this area. As this is a > > problem area that I've not been able to solve myself for the last > > approx 2 years. Keep up the good work! > > Do not worry, I appreciate the feedbacks ;) > > BTW, if you are doing tests on mlx4 40Gbit, would you check the > following quick/dirty hack, using lots of low-rate flows ? > > mlx4 has really hard time to transmit small TSO packets (2 or 3 MSS) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > index 12ea3405f442..96940666abd3 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c > @@ -2631,6 +2631,11 @@ static void mlx4_en_del_vxlan_port(struct net_device > *dev, > queue_work(priv->mdev->workqueue, &priv->vxlan_del_task); > } > > +static int mlx4_gso_segs_min = 4; /* TSO packets with less than 4 segments > are segmented */ > +module_param_named(mlx4_gso_segs_min, mlx4_gso_segs_min, uint, 0644); > +MODULE_PARM_DESC(mlx4_gso_segs_min, "threshold for software segmentation of > small TSO packets"); > + > + > static netdev_features_t mlx4_en_features_check(struct sk_buff *skb, > struct net_device *dev, > netdev_features_t features) > @@ -2651,6 +2656,8 @@ static netdev_features_t mlx4_en_features_check(struct > sk_buff *skb, > (udp_hdr(skb)->dest != priv->vxlan_port)) > features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK); > } > + if (skb_is_gso(skb) && skb_shinfo(skb)->gso_segs < mlx4_gso_segs_min) > + features &= NETIF_F_GSO_MASK; Sorry, stupid typo. This should be "features &= ~NETIF_F_GSO_MASK;" of course > > return features; > } > > > >
Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
Hi Pavel and Peppe, On 12/02/2016 02:51 PM, Giuseppe CAVALLARO wrote: On 12/2/2016 1:32 PM, Pavel Machek wrote: Hi! Well, if you have a workload that sends and receive packets, it tends to work ok, as you do tx_clean() in stmmac_poll(). My workload is not like that -- it is "sending packets at 3MB/sec, receiving none". So the stmmac_tx_timer() is rescheduled and rescheduled and rescheduled, and then we run out of transmit descriptors, and then 40msec passes, and then we clean them. Bad. And that's why low-res timers do not cut it. in that case, I expect that the tuning of the driver could help you. I mean, by using ethtool, it could be enough to set the IC bit on all the descriptors. You should touch the tx_coal_frames. Then you can use ethtool -S to monitor the status. Yes, I did something similar. Unfortnunately that meant crash within minutes, at least with 4.4 kernel. (If you know what was fixed between 4.4 and 4.9, that would be helpful). 4.4 has no GMAC4 support. Alex, do you remember any patches to fix that? No sorry Peppe. Pavel, Sorry but I'm a little bit confused. I'm dropped in some mails without historic. I see cleanup, coalescence issue and TSO question. What is your main issue? Are you working on gmac4 or 3.x ? Can you refresh a little bit the story please ? Regards Alex We had experimented this tuning on STB IP where just datagrams had to send externally. To be honest, although we had seen better results w/o any timer, we kept this approach enabled because the timer was fast enough to cover our tests on SH4 boxes. Please reply to David, and explain how it is supposed to work... because right now it does not. 40 msec delays are not acceptable in default configuration. I mean, that on UP and SMP system this schema helped to improve the performance saving CPU on my side and this has been tested since a long time (~4 years). I tested something similar to yours where unidirectional traffic with limited throughput was needed and I can confirm you that tuning/removing coalesce parameters this helped. The tuning I decided to keep in the driver was suitable in several user cases and if now you have problems or you want to review it I can just confirm that there are no problems on my side. If you want to simply the logic around the tx process and remove timer on official driver I can accept that. I will just ask you uto double check if the throughput and CPU usage when request max throughput (better if on GiGa setup) has no regressions. Otherwise we could start thinking about adaptive schema if feasible. In the ring, some descriptors can raise the irq (according to a threshold) and set the IC bit. In this path, the NAPI poll will be scheduled. Not NAPI poll but stmmac_tx_timer(), right? in the xmit according the the threshold the timer is started or the interrupt is set inside the descriptor. Then stmmac_tx_clean will be always called and, if you see the flow, no irqlock protection is needed! Agreed that no irqlock protection is needed if we rely on napi and timers. ok Concerning the lock protection, we had reviewed long time ago and IIRC, no raise condition should be present. Open to review it, again! ... There's nothing that protect stmmac_poll() from running concurently with stmmac_dma_interrupt(), right? This is not necessary. dma_interrupt accesses shared priv->xstats; variables are of type unsigned long (not atomic_t), yet they are accesssed from interrupt context and from stmmac_ethtool without any locking. That can result in broken statistics AFAICT. ok we can check this and welcome patches and I'd prefer to remove xstats from critical part of the code like ISR (that comes from old story of the driver). Please take another look. As far as I can tell, you can have two cpus at #1 and #2 in the code, at the same time. It looks like napi_... has some atomic opertions inside so that looks safe at the first look. But I'm not sure if they also include enough memory barriers to make it safe...? Although I have never reproduced related issues on SMP platforms due to reordering of memory operations but, as said above, welcome review on this especially if you are seeing problems when manage NAPI. FYI, the only memory barrier you will see in the driver are about the OWN_BIT setting till now. static void stmmac_dma_interrupt(struct stmmac_priv *priv) { ... status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats); if (likely((status & handle_rx)) || (status & handle_tx)) { if (likely(napi_schedule_prep(&priv->napi))) { #1 stmmac_disable_dma_irq(priv); __napi_schedule(&priv->napi); } } static int stmmac_poll(struct napi_struct *napi, int budget) { struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi); int work_done = 0; priv->xstats.napi_poll++; stmmac_tx_clean(priv); work_done
Re: [PATCH] net: phy: dp83848: Support ethernet pause frames
On 12/02/2016 08:22 AM, Jesper Nilsson wrote: > According to the documentation, the PHYs supported by this driver > can also support pause frames. Announce this to be so. > Tested with a TI83822I. > Looks like all PHYs supported by this driver do, so: Acked-by: Andrew F. Davis > Signed-off-by: Jesper Nilsson > --- > drivers/net/phy/dp83848.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c > index 800b39f..6e4117f 100644 > --- a/drivers/net/phy/dp83848.c > +++ b/drivers/net/phy/dp83848.c > @@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl); > .phy_id = _id, \ > .phy_id_mask= 0xfff0, \ > .name = _name,\ > - .features = PHY_BASIC_FEATURES, \ > + .features = (PHY_BASIC_FEATURES | \ > + SUPPORTED_Pause | SUPPORTED_Asym_Pause),\ Aligning these may look nicer though. > .flags = PHY_HAS_INTERRUPT,\ > \ > .soft_reset = genphy_soft_reset,\ >
Re: Initial thoughts on TXDP
On 01/12/16 23:46, Tom Herbert wrote: > The only time we > _really_ to allocate an skbuf is when we need to put the packet onto a > queue. All the other use cases are really just to pass a structure > containing a packet from function to function. For that purpose we > should be able to just pass a much smaller structure in a stack > argument and only allocate an skbuff when we need to enqueue. In cases > where we don't ever queue a packet we might never need to allocate any > skbuff Now this intrigues me, because one of the objections to bundling (vs GRO) was the memory usage of all those SKBs. IIRC we already do a 'GRO-like' coalescing when packets reach a TCP socket anyway (or at least in some cases, not sure if all the different ways we can enqueue a TCP packet for RX do it), but if we could carry the packets from NIC to socket without SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra memory (the packet-pages are all already allocated on the NIC RX ring). Possibly combine the two, so that rather than having potentially four versions of each function (skb, skbundle, void*, void* bundle) you just have the two 'ends'. -Ed
Re: [PATCH net v2] tcp: warn on bogus MSS and try to amend it
On Fri, 2016-12-02 at 08:55 -0200, Marcelo Ricardo Leitner wrote: > There have been some reports lately about TCP connection stalls caused > by NIC drivers that aren't setting gso_size on aggregated packets on rx > path. This causes TCP to assume that the MSS is actually the size of the > aggregated packet, which is invalid. > > Although the proper fix is to be done at each driver, it's often hard > and cumbersome for one to debug, come to such root cause and report/fix > it. > > This patch amends this situation in two ways. First, it adds a warning > on when this situation occurs, so it gives a hint to those trying to > debug this. It also limit the maximum probed MSS to the adverised MSS, > as it should never be any higher than that. > > The result is that the connection may not have the best performance ever > but it shouldn't stall, and the admin will have a hint on what to look > for. > > Tested with virtio by forcing gso_size to 0. > > Cc: Jonathan Maxwell > Signed-off-by: Marcelo Ricardo Leitner > --- > v2: Updated msg as suggested by David. > > net/ipv4/tcp_input.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index > a27b9c0e27c08b4e4aeaff3d0bfdf3ae561ba4d8..fd619eb93749b6de56a41669248b337c051d9fe2 > 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -144,7 +144,10 @@ static void tcp_measure_rcv_mss(struct sock *sk, const > struct sk_buff *skb) >*/ > len = skb_shinfo(skb)->gso_size ? : skb->len; > if (len >= icsk->icsk_ack.rcv_mss) { > - icsk->icsk_ack.rcv_mss = len; > + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len, > +tcp_sk(sk)->advmss); > + if (icsk->icsk_ack.rcv_mss != len) > + pr_warn_once("Driver has suspect GRO implementation, > TCP performance may be compromised.\n"); > } else { > /* Otherwise, we make more careful check taking into account, >* that SACKs block is variable. skb->dev is indeed NULL, but it might be worth getting back the device using skb->skb_iif maybe ?
Re: [PATCH] net: phy: dp83848: Support ethernet pause frames
On Fri, Dec 02, 2016 at 08:35:23AM -0600, Andrew F. Davis wrote: > On 12/02/2016 08:22 AM, Jesper Nilsson wrote: > > According to the documentation, the PHYs supported by this driver > > can also support pause frames. Announce this to be so. > > Tested with a TI83822I. > > > > Looks like all PHYs supported by this driver do, so: > > Acked-by: Andrew F. Davis > > > Signed-off-by: Jesper Nilsson > > --- > > drivers/net/phy/dp83848.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c > > index 800b39f..6e4117f 100644 > > --- a/drivers/net/phy/dp83848.c > > +++ b/drivers/net/phy/dp83848.c > > @@ -88,7 +88,8 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl); > > .phy_id = _id, \ > > .phy_id_mask= 0xfff0, \ > > .name = _name,\ > > - .features = PHY_BASIC_FEATURES, \ > > + .features = (PHY_BASIC_FEATURES | \ > > + SUPPORTED_Pause | SUPPORTED_Asym_Pause),\ > > Aligning these may look nicer though. Agreed, will send a v2. > > .flags = PHY_HAS_INTERRUPT,\ > > \ > > .soft_reset = genphy_soft_reset,\ > > /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nils...@axis.com
[PATCH v2] net: phy: dp83848: Support ethernet pause frames
According to the documentation, the PHYs supported by this driver can also support pause frames. Announce this to be so. Tested with a TI83822I. Acked-by: Andrew F. Davis Signed-off-by: Jesper Nilsson --- drivers/net/phy/dp83848.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c index 800b39f..320d0dc 100644 --- a/drivers/net/phy/dp83848.c +++ b/drivers/net/phy/dp83848.c @@ -88,7 +88,9 @@ MODULE_DEVICE_TABLE(mdio, dp83848_tbl); .phy_id = _id, \ .phy_id_mask= 0xfff0, \ .name = _name,\ - .features = PHY_BASIC_FEATURES, \ + .features = (PHY_BASIC_FEATURES | \ + SUPPORTED_Pause |\ + SUPPORTED_Asym_Pause), \ .flags = PHY_HAS_INTERRUPT,\ \ .soft_reset = genphy_soft_reset,\ -- 2.1.4 /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nils...@axis.com
Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote: On 12/02/2016 01:43 PM, Andrey Konovalov wrote: [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506 We should add a check for a sensible optlen static int raw_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { struct sock *sk = sock->sk; struct raw_sock *ro = raw_sk(sk); struct can_filter *filter = NULL; /* dyn. alloc'ed filters */ struct can_filter sfilter; /* single filter */ struct net_device *dev = NULL; can_err_mask_t err_mask = 0; int count = 0; int err = 0; if (level != SOL_CAN_RAW) return -EINVAL; switch (optname) { case CAN_RAW_FILTER: if (optlen % sizeof(struct can_filter) != 0) return -EINVAL; here... if (optlen > 64 * sizeof(struct can_filter)) return -EINVAL; Agreed. But what is sensible here? 64 filters is way to small IMO. When thinking about picking a bunch of single CAN IDs I would tend to something like 512 filters. Regards, Oliver count = optlen / sizeof(struct can_filter); if (count > 1) { /* filter does not fit into dfilter => alloc space */ filter = memdup_user(optval, optlen); if (IS_ERR(filter)) return PTR_ERR(filter); } else if (count == 1) { if (copy_from_user(&sfilter, optval, sizeof(sfilter))) return -EFAULT; } lock_sock(sk); Marc
Re: [PATCH] net/rtnetlink: fix attribute name in nlmsg_size() comments
From: Tobias Klauser Date: Wed, 30 Nov 2016 14:30:37 +0100 > Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE} > instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments int nlmsg_size(). > > Cc: Eric Dumazet > Signed-off-by: Tobias Klauser Applied, thanks.
Re: [PATCH 0/2] net: Add support for SGMII PCS on Altera TSE MAC
From: Neill Whillans Date: Wed, 30 Nov 2016 13:41:03 + > These patches were created as part of work to add support for SGMII > PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6 > git tree. > > The first patch in the series adds support for the VSC8572 dual-port > Gigabit Ethernet transceiver, used in integration testing. > > The second patch adds support for the SGMII PCS functionality to the > Altera TSE driver. Series applied, thanks.
Re: DSA vs. SWTICHDEV ?
On Thu, Dec 01, 2016 at 04:38:47PM -0500, Murali Karicheri wrote: > Hi Andrew, > On 12/01/2016 12:31 PM, Andrew Lunn wrote: > > Hi Murali > > > >> 2. Switch mode where it implements a simple Ethernet switch. Currently > >>it doesn't have address learning capability, but in future it > >>can. > > > > If it does not have address learning capabilities, does it act like a > > plain old hub? What comes in one port goes out all others? > > Thanks for the response! > > Yes. It is a plain hub. it replicates frame to both ports. So need to > run a bridge layer for address learning in software. Hi Murali It would be good if you start thinking about all the different directions. It is not just host to port A and host to port B. What about port A to Port B? Can it do that in hardware? > I think not. I see we have a non Linux implementation that does address > learning in software using a hash table and look up MAC for each packet > to see which port it needs to be sent to. I think i need to read more about the switch. I'm starting to wonder if it has enough intelligence to be usable. Switchdev is about pushing configuration down into the switch. It does not however sound like there is that much which is configurable in this switch. Andrew
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Thu, 01 Dec 2016 23:17:48 +0100 Paolo Abeni wrote: > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > > small socket queues) > > > > On Wed, 30 Nov 2016 16:35:20 + > > Mel Gorman wrote: > > > > > > I don't quite get why you are setting the socket recv size > > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > > > > Maybe I missed something at the time I wrote that but why would it > > > need to be larger? > > > > Well, to me it is quite obvious that we need some queue to avoid packet > > drops. We have two processes netperf and netserver, that are sending > > packets between each-other (UDP_STREAM mostly netperf -> netserver). > > These PIDs are getting scheduled and migrated between CPUs, and thus > > does not get executed equally fast, thus a queue is need absorb the > > fluctuations. > > > > The network stack is even partly catching your config "mistake" and > > increase the socket queue size, so we minimum can handle one max frame > > (due skb "truesize" concept approx PAGE_SIZE + overhead). > > > > Hopefully for localhost testing a small queue should hopefully not > > result in packet drops. Testing... ups, this does result in packet > > drops. > > > > Test command extracted from mmtests, UDP_STREAM size 1024: > > > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \ > >-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895 > > > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) > > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET > > Socket Message Elapsed Messages > > SizeSize Time Okay Errors Throughput > > bytes bytessecs# # 10^6bits/sec > > > >46081024 60.00 50024301 06829.98 > >2560 60.00 46133211 6298.72 > > > > Dropped packets: 50024301-46133211=3891090 > > > > To get a better drop indication, during this I run a command, to get > > system-wide network counters from the last second, so below numbers are > > per second. > > > > $ nstat > /dev/null && sleep 1 && nstat > > #kernel > > IpInReceives885162 0.0 > > IpInDelivers885161 0.0 > > IpOutRequests 885162 0.0 > > UdpInDatagrams 776105 0.0 > > UdpInErrors 109056 0.0 > > UdpOutDatagrams 885160 0.0 > > UdpRcvbufErrors 109056 0.0 > > IpExtInOctets 931190476 0.0 > > IpExtOutOctets 931189564 0.0 > > IpExtInNoECTPkts885162 0.0 > > > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See > > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop > > happens kernel side in __udp_queue_rcv_skb[1], because receiving > > process didn't empty it's queue fast enough see [2]. > > > > Although upstream changes are coming in this area, [2] is replaced with > > __udp_enqueue_schedule_skb, which I actually tested with... hmm > > > > Retesting with kernel 4.7.0-baseline+ ... show something else. > > To Paolo, you might want to look into this. And it could also explain why > > I've not see the mentioned speedup by mm-change, as I've been testing > > this patch on top of net-next (at 93ba550) with Paolo's UDP changes. > > Thank you for reporting this. > > It seems that the commit 123b4a633580 ("udp: use it's own memory > accounting schema") is too strict while checking the rcvbuf. > > For very small value of rcvbuf, it allows a single skb to be enqueued, > while previously we allowed 2 of them to enter the queue, even if the > first one truesize exceeded rcvbuf, as in your test-case. > > Can you please try the following patch ? Sure, it looks much better with this patch. $ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET Socket Message Elapsed Messages SizeSize Time Okay Errors Throughput bytes bytessecs# # 10^6bits/sec 46081024 60.00 50191555 06852.82 2560 60.00 50189872 6852.59 Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec $ nstat > /dev/null && sleep 1 && nstat #kernel IpInReceives885417 0.0 IpInDelivers885416 0.0 IpOutRequests 885417 0.0 UdpInDatagrams 885382 0.0 UdpInErrors 29 0.0 UdpOutDatagrams
Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
On 12/02/2016 04:11 PM, Oliver Hartkopp wrote: > > > On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote: >> On 12/02/2016 01:43 PM, Andrey Konovalov wrote: > > >>> [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506 >> >> We should add a check for a sensible optlen >> >>> static int raw_setsockopt(struct socket *sock, int level, int optname, >>> char __user *optval, unsigned int optlen) >>> { >>> struct sock *sk = sock->sk; >>> struct raw_sock *ro = raw_sk(sk); >>> struct can_filter *filter = NULL; /* dyn. alloc'ed filters */ >>> struct can_filter sfilter; /* single filter */ >>> struct net_device *dev = NULL; >>> can_err_mask_t err_mask = 0; >>> int count = 0; >>> int err = 0; >>> >>> if (level != SOL_CAN_RAW) >>> return -EINVAL; >>> >>> switch (optname) { >>> >>> case CAN_RAW_FILTER: >>> if (optlen % sizeof(struct can_filter) != 0) >>> return -EINVAL; >> >> here... >> >> if (optlen > 64 * sizeof(struct can_filter)) >> return -EINVAL; >> > > Agreed. > > But what is sensible here? > 64 filters is way to small IMO. > > When thinking about picking a bunch of single CAN IDs I would tend to > something like 512 filters. Ok - 64 was just an arbitrary chosen value for demonstration purposes :) Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | signature.asc Description: OpenPGP digital signature
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote: > On Thu, 01 Dec 2016 23:17:48 +0100 > Paolo Abeni wrote: > > > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > > > small socket queues) > > > > > > On Wed, 30 Nov 2016 16:35:20 + > > > Mel Gorman wrote: > > > > > > > > I don't quite get why you are setting the socket recv size > > > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > > > > > > > Maybe I missed something at the time I wrote that but why would it > > > > need to be larger? > > > > > > Well, to me it is quite obvious that we need some queue to avoid packet > > > drops. We have two processes netperf and netserver, that are sending > > > packets between each-other (UDP_STREAM mostly netperf -> netserver). > > > These PIDs are getting scheduled and migrated between CPUs, and thus > > > does not get executed equally fast, thus a queue is need absorb the > > > fluctuations. > > > > > > The network stack is even partly catching your config "mistake" and > > > increase the socket queue size, so we minimum can handle one max frame > > > (due skb "truesize" concept approx PAGE_SIZE + overhead). > > > > > > Hopefully for localhost testing a small queue should hopefully not > > > result in packet drops. Testing... ups, this does result in packet > > > drops. > > > > > > Test command extracted from mmtests, UDP_STREAM size 1024: > > > > > > netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \ > > >-- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895 > > > > > > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) > > > port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET > > > Socket Message Elapsed Messages > > > SizeSize Time Okay Errors Throughput > > > bytes bytessecs# # 10^6bits/sec > > > > > >46081024 60.00 50024301 06829.98 > > >2560 60.00 46133211 6298.72 > > > > > > Dropped packets: 50024301-46133211=3891090 > > > > > > To get a better drop indication, during this I run a command, to get > > > system-wide network counters from the last second, so below numbers are > > > per second. > > > > > > $ nstat > /dev/null && sleep 1 && nstat > > > #kernel > > > IpInReceives885162 0.0 > > > IpInDelivers885161 0.0 > > > IpOutRequests 885162 0.0 > > > UdpInDatagrams 776105 0.0 > > > UdpInErrors 109056 0.0 > > > UdpOutDatagrams 885160 0.0 > > > UdpRcvbufErrors 109056 0.0 > > > IpExtInOctets 931190476 0.0 > > > IpExtOutOctets 931189564 0.0 > > > IpExtInNoECTPkts885162 0.0 > > > > > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See > > > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop > > > happens kernel side in __udp_queue_rcv_skb[1], because receiving > > > process didn't empty it's queue fast enough see [2]. > > > > > > Although upstream changes are coming in this area, [2] is replaced with > > > __udp_enqueue_schedule_skb, which I actually tested with... hmm > > > > > > Retesting with kernel 4.7.0-baseline+ ... show something else. > > > To Paolo, you might want to look into this. And it could also explain why > > > I've not see the mentioned speedup by mm-change, as I've been testing > > > this patch on top of net-next (at 93ba550) with Paolo's UDP changes. > > > > Thank you for reporting this. > > > > It seems that the commit 123b4a633580 ("udp: use it's own memory > > accounting schema") is too strict while checking the rcvbuf. > > > > For very small value of rcvbuf, it allows a single skb to be enqueued, > > while previously we allowed 2 of them to enter the queue, even if the > > first one truesize exceeded rcvbuf, as in your test-case. > > > > Can you please try the following patch ? > > Sure, it looks much better with this patch. Thank you for testing. I'll send a formal patch to David soon. BTW I see I nice performance improvement compared to 4.7... Cheers, Paolo
Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
On Thu, Dec 01, 2016 at 03:41:20PM -0500, Vivien Didelot wrote: > Hi Andrew, > > Andrew Lunn writes: > > >> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h > >> b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h > >> index ab52c37..9e51405 100644 > >> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h > >> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h > >> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops { > >>int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg, > >> u16 val); > >> > >> + /* Switch Software Reset */ > >> + int (*reset)(struct mv88e6xxx_chip *chip); > >> + > > > > Hi Vivien > > > > In my huge patch series of 6390, i've been using a g1_ prefix for > > functionality which is in global 1, g2_ for global 2, etc. This has > > worked for everything so far with the exception of setting which > > reserved MAC addresses should be sent to the CPU. Most devices have it > > in g2, but 6390 has it in g1. > > > > Please could you add the prefix. > > I don't understand. It looks like you are talking about the second part > of the comment I made on your RFC patchset, about the Rsvd2CPU feature: Hi Vivien I mean + /* Switch Software Reset */ + int (*g1_reset)(struct mv88e6xxx_chip *chip); + We have a collection of function pointers with port_ prefix, another collection with stats_, and a third with ppu_, etc. And then we have some which do not fit a specific category. Those i have prefixed with g1_ or g2_. I think we should have some prefix, and that is my suggestion. Andrew
Re: [PATCH net 0/7] net: stmmac: fix probe error handling and phydev leaks
From: Johan Hovold Date: Wed, 30 Nov 2016 15:29:48 +0100 > This series fixes a number of issues with the stmmac-driver probe error > handling, which for example left clocks enabled after probe failures. > > The final patch fixes a failure to deregister and free any fixed-link > PHYs that were registered during probe on probe errors and on driver > unbind. It also fixes a related of-node leak on late probe errors. > > This series depends on the of_phy_deregister_fixed_link() helper that > was just merged to net. > > As mentioned earlier, one staging driver also suffers from a similar > leak and can be fixed up once the above mentioned helper hits mainline. > > Note that these patches have only been compile tested. Series applied, thanks.
Re: [PATCH net-next V2 0/7] Mellanox 100G mlx5 updates 2016-11-29
From: Saeed Mahameed Date: Wed, 30 Nov 2016 17:59:36 +0200 > The following series from Tariq and Roi, provides some critical fixes > and updates for the mlx5e driver. > > From Tariq: > - Fix driver coherent memory huge allocation issues by fragmenting >completion queues, in a way that is transparent to the netdev driver by >providing a new buffer type "mlx5_frag_buf" with the same access API. > - Create UMR MKey per RQ to have better scalability. > > From Roi: > - Some fixes for the encap-decap support and tc flower added lately to the >mlx5e driver. > > v1->v2: > - Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by > Eric. > > This series was generated against commit: > 31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.") Series applied, thanks.
[PATCH net] geneve: avoid use-after-free of skb->data
geneve{,6}_build_skb can end up doing a pskb_expand_head(), which makes the ip_hdr(skb) reference we stashed earlier stale. Since it's only needed as an argument to ip_tunnel_ecn_encap(), move this directly in the function call. Fixes: 08399efc6319 ("geneve: ensure ECN info is handled properly in all tx/rx paths") Signed-off-by: Sabrina Dubroca --- drivers/net/geneve.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 532695c8209b..bab65647b80f 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -859,7 +859,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, struct geneve_dev *geneve = netdev_priv(dev); struct geneve_sock *gs4; struct rtable *rt = NULL; - const struct iphdr *iip; /* interior IP header */ int err = -EINVAL; struct flowi4 fl4; __u8 tos, ttl; @@ -890,8 +889,6 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); skb_reset_mac_header(skb); - iip = ip_hdr(skb); - if (info) { const struct ip_tunnel_key *key = &info->key; u8 *opts = NULL; @@ -911,7 +908,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, if (unlikely(err)) goto tx_error; - tos = ip_tunnel_ecn_encap(key->tos, iip, skb); + tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb); ttl = key->ttl; df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0; } else { @@ -920,7 +917,7 @@ static netdev_tx_t geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev, if (unlikely(err)) goto tx_error; - tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, iip, skb); + tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb); ttl = geneve->ttl; if (!ttl && IN_MULTICAST(ntohl(fl4.daddr))) ttl = 1; @@ -952,7 +949,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, { struct geneve_dev *geneve = netdev_priv(dev); struct dst_entry *dst = NULL; - const struct iphdr *iip; /* interior IP header */ struct geneve_sock *gs6; int err = -EINVAL; struct flowi6 fl6; @@ -982,8 +978,6 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); skb_reset_mac_header(skb); - iip = ip_hdr(skb); - if (info) { const struct ip_tunnel_key *key = &info->key; u8 *opts = NULL; @@ -1004,7 +998,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, if (unlikely(err)) goto tx_error; - prio = ip_tunnel_ecn_encap(key->tos, iip, skb); + prio = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb); ttl = key->ttl; label = info->key.label; } else { @@ -1014,7 +1008,7 @@ static netdev_tx_t geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev, goto tx_error; prio = ip_tunnel_ecn_encap(ip6_tclass(fl6.flowlabel), - iip, skb); + ip_hdr(skb), skb); ttl = geneve->ttl; if (!ttl && ipv6_addr_is_multicast(&fl6.daddr)) ttl = 1; -- 2.10.2
[PATCH net-next] sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead
Logically, EFX_BUG_ON_PARANOID can never be correct. For, BUG_ON should only be used if it is not possible to continue without potential harm; and since the non-DEBUG driver will continue regardless (as the BUG_ON is compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver. So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID or the newly defined EFX_WARN_ON_ONCE_PARANOID. Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/ef10.c | 2 +- drivers/net/ethernet/sfc/efx.c| 2 +- drivers/net/ethernet/sfc/ethtool.c| 4 ++-- drivers/net/ethernet/sfc/farch.c | 6 ++--- drivers/net/ethernet/sfc/mcdi.h | 4 ++-- drivers/net/ethernet/sfc/mcdi_mon.c | 4 ++-- drivers/net/ethernet/sfc/mcdi_port.c | 2 +- drivers/net/ethernet/sfc/net_driver.h | 22 +-- drivers/net/ethernet/sfc/ptp.c| 2 +- drivers/net/ethernet/sfc/rx.c | 8 +++ drivers/net/ethernet/sfc/siena.c | 2 +- drivers/net/ethernet/sfc/tx.c | 12 +- drivers/net/ethernet/sfc/tx_tso.c | 41 +-- 13 files changed, 55 insertions(+), 56 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index 0f58ea8..de2947c 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -2100,7 +2100,7 @@ static int efx_ef10_tx_tso_desc(struct efx_tx_queue *tx_queue, u32 seqnum; u32 mss; - EFX_BUG_ON_PARANOID(tx_queue->tso_version != 2); + EFX_WARN_ON_ONCE_PARANOID(tx_queue->tso_version != 2); mss = skb_shinfo(skb)->gso_size; diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c index da7028d..5a5dcad 100644 --- a/drivers/net/ethernet/sfc/efx.c +++ b/drivers/net/ethernet/sfc/efx.c @@ -355,7 +355,7 @@ static int efx_probe_eventq(struct efx_channel *channel) /* Build an event queue with room for one event per tx and rx buffer, * plus some extra for link state events and MCDI completions. */ entries = roundup_pow_of_two(efx->rxq_entries + efx->txq_entries + 128); - EFX_BUG_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE); + EFX_WARN_ON_PARANOID(entries > EFX_MAX_EVQ_SIZE); channel->eventq_mask = max(entries, EFX_MIN_EVQ_SIZE) - 1; return efx_nic_probe_eventq(channel); diff --git a/drivers/net/ethernet/sfc/ethtool.c b/drivers/net/ethernet/sfc/ethtool.c index ca29d3d..f644216 100644 --- a/drivers/net/ethernet/sfc/ethtool.c +++ b/drivers/net/ethernet/sfc/ethtool.c @@ -333,12 +333,12 @@ static int efx_ethtool_fill_self_tests(struct efx_nic *efx, "core", 0, "registers", NULL); if (efx->phy_op->run_tests != NULL) { - EFX_BUG_ON_PARANOID(efx->phy_op->test_name == NULL); + EFX_WARN_ON_PARANOID(efx->phy_op->test_name == NULL); for (i = 0; true; ++i) { const char *name; - EFX_BUG_ON_PARANOID(i >= EFX_MAX_PHY_TESTS); + EFX_WARN_ON_PARANOID(i >= EFX_MAX_PHY_TESTS); name = efx->phy_op->test_name(efx, i); if (name == NULL) break; diff --git a/drivers/net/ethernet/sfc/farch.c b/drivers/net/ethernet/sfc/farch.c index 91aa3ec..e4ca216 100644 --- a/drivers/net/ethernet/sfc/farch.c +++ b/drivers/net/ethernet/sfc/farch.c @@ -177,7 +177,7 @@ efx_init_special_buffer(struct efx_nic *efx, struct efx_special_buffer *buffer) dma_addr_t dma_addr; int i; - EFX_BUG_ON_PARANOID(!buffer->buf.addr); + EFX_WARN_ON_PARANOID(!buffer->buf.addr); /* Write buffer descriptors to NIC */ for (i = 0; i < buffer->entries; i++) { @@ -332,7 +332,7 @@ void efx_farch_tx_write(struct efx_tx_queue *tx_queue) txd = efx_tx_desc(tx_queue, write_ptr); ++tx_queue->write_count; - EFX_BUG_ON_PARANOID(buffer->flags & EFX_TX_BUF_OPTION); + EFX_WARN_ON_ONCE_PARANOID(buffer->flags & EFX_TX_BUF_OPTION); /* Create TX descriptor ring entry */ BUILD_BUG_ON(EFX_TX_BUF_CONT != 1); @@ -2041,7 +2041,7 @@ efx_farch_filter_from_gen_spec(struct efx_farch_filter_spec *spec, __be32 rhost, host1, host2; __be16 rport, port1, port2; - EFX_BUG_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX)); + EFX_WARN_ON_PARANOID(!(gen_spec->flags & EFX_FILTER_FLAG_RX)); if (gen_spec->ether_type != htons(ETH_P_IP)) return -EPROTONOSUPPORT; diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h index c9aeb07..4472107 100644 --- a/drivers/net/ethernet/sfc/mcdi.h +++ b/drivers/net/ethernet/sfc/mcdi.h @@ -129,14 +129,14 @@ struct efx_mcdi_data { static inline struct efx_mcdi_iface *efx_mcdi(struct efx_nic *efx) { - EFX_B
Re: arp_filter and IPv6 ND
On 2 December 2016 at 16:08, Hannes Frederic Sowa wrote: Hey, > May I ask why you want to turn it off? Certainly. I don't want device to answer with link address for L3 address it does not have on the link. In my case it triggers this bug https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop In this particular case, for one reason or another my Cisco device would have ND entry for Linux loopback pointing to an interface with completely different network. Which itself would be just weird, but combined with weird behaviour of Cisco it actually causes the loopback route advertised by BGP not to be installed. If the ND entry didn't exist, the BGP route would be installed. I don't really even know why the ND entry exists, all I can think of is that Linux must have sent gratuitous reply, because I don't se why Cisco would have tried to discover it. Expected behaviour is that the loopback/128 BGP route resolves to on-link next-hop, and on-link next hop is then ND'd. Observed behaviour is that loopback/128 BGP route also appears in ND cache. > In IPv6 this depends on the scope. In IPv4 this concept doesn't really > exist. > > Please notice that in IPv4 arp_filter does not necessarily mean that the > system is operating in strong end system mode but you end up in an > hybrid clone where arp is acting strong but routing not and thus you > also have to add fib rules to simulate that. It's just very peculiar behaviour to have ARP or ND entries on a interface where given subnet does not exist, it rudimentarily causes difficult to troubleshoot problems and is surprising/unexpected behaviour. Of course well behaving device wouldn't accept such replies, because it itself could be attack vector (imagine me telling you 8.8.8.8 is on the link, or worse, your bank). I'm curious, why does this behaviour exist? When is this desirable? I've never seen any other device than Linux behave like this, and when ever I've heard about the problem, I've only seen surprised faces that it does behave like this. -- ++ytti
RE: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
> -Original Message- > From: Robert Shearman [mailto:rshea...@brocade.com] > Sent: Friday, December 2, 2016 5:54 AM > To: Alexander Duyck > Cc: David Miller ; Netdev ; > Duyck, Alexander H > Subject: Re: [PATCH net] fib_trie: Avoid expensive update of suffix length > when > not required > > On 01/12/16 18:29, Alexander Duyck wrote: > > On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman > wrote: > >> On 29/11/16 23:14, Alexander Duyck wrote: > >>> On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman > >>> > > Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix > length") > Signed-off-by: Robert Shearman > >>> > >>> > >>> So I am not entirely convinced this is a regression, I was wondering > >>> what your numbers looked like before the patch you are saying this > >>> fixes? I recall I fixed a number of issues leading up to this so I > >>> am just curious. > >> > >> > >> At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: > Remove > >> checks for index >= tnode_child_length from tnode_get_child") which > >> is the parent of 5405afd1a306: > >> > >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File > >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists > >> RTNETLINK answers: File exists > >> > >> real0m3.451s > >> user0m0.184s > >> sys 0m2.004s > >> > >> > >> At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add > >> tracking value for suffix length"): > >> > >> $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File > >> exists RTNETLINK answers: File exists RTNETLINK answers: File exists > >> RTNETLINK answers: File exists > >> > >> real0m48.624s > >> user0m0.360s > >> sys 0m46.960s > >> > >> So does that warrant a fixes tag for this patch? > > > > Yes, please include it in the patch description next time. > > > > I've been giving it thought and the fact is the patch series has > > merit. I just think it was being a bit too aggressive in terms of > > some of the changes. So placing any changes in put_child seem to be > > the one piece in this that make this a bit ugly. > > Does that imply the current updating of the parent's slen value should be > removed from put_child then? No, that is where it should be. We want to leave the put_child code as is. That way all the code for halving and inflating nodes works correctly. The only reason for having the update_suffix code update the parent was because before it was propagating the suffix length for increases as well. Since we aren't using it for that anymore there isn't much point in having update_suffix touching the parent in the code below. > I started out without doing the changes in put_child, but that meant the > changes > to push the slen up through the trie were spread all over the place. This > seemed > like the cleaner solution. It actually makes it much messier. The put_child call is used all over the place in many situations where it doesn't make sense to go through updating the entire trie. The big issue is you can't walk up the trie if you are working on assembling a node off on the side that you plan to insert into the trie later. The only places we need to worry about updating the suffix are when we have added a new leaf/suffix, or when we have deleted a leaf/suffix. That is why in the patches I submitted I only update in those two places. > + } > +} > + > /* Add a child at position i overwriting the old value. > * Update the value of full_children and empty_children. > + * > + * The suffix length of the parent node and the rest of the tree > + is > + * updated (if required) when adding/replacing a node, but is > + caller's > + * responsibility on removal. > */ > static void put_child(struct key_vector *tn, unsigned long i, > struct key_vector *n) @@ -447,8 +461,8 @@ > static void put_child(struct key_vector *tn, unsigned long i, > else if (!wasfull && isfull) > tn_info(tn)->full_children++; > > - if (n && (tn->slen < n->slen)) > - tn->slen = n->slen; > + if (n) > + node_push_suffix(tn, n); > >>> > >>> > >>> This is just creating redundant work if we have to perform a resize. > >> > >> > >> The only real redundant work is the assignment of slen in tn, since > >> the propagation up the trie has to happen regardless and if a > >> subsequent resize results in changes to the trie then the propagation > >> will stop at tn's parent since the suffix length of the tn's parent > >> will not be less than tn's suffix length. > > > > > > This isn't going to work. We want to avoid trying to push the suffix > > when we place a child. There are scenarios where we are placing > > children in nodes that don't have parents yet because we are doing > > rebalances and such. I suspect this could be pr
Re: [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue
From: Soheil Hassas Yeganeh Date: Wed, 30 Nov 2016 14:01:08 -0500 > From: Soheil Hassas Yeganeh > > Only when ICMP packets are enqueued onto the error queue, > sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err > in sock_dequeue_err_skb), a subsequent error queue read > would set sk_err to the next error on the queue, or 0 if empty. > As no error types other than ICMP set this field, sk_err should > not be modified upon dequeuing them. > > Only for ICMP errors, reset the (racy) sk_err. Some applications, > like traceroute, rely on it and go into a futile busy POLLERR > loop otherwise. > > In principle, sk_err has to be set while an ICMP error is queued. > Testing is_icmp_err_skb(skb_next) approximates this without > requiring a full queue walk. Applications that receive both ICMP > and other errors cannot rely on this legacy behavior, as other > errors do not set sk_err in the first place. > > Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb) > Signed-off-by: Soheil Hassas Yeganeh > Signed-off-by: Willem de Bruijn Applied, thanks.
Re: stmmac: turn coalescing / NAPI off in stmmac
Hi Pavel On 12/2/2016 11:42 AM, Pavel Machek wrote: Hi! Anyway... since you asked. I belive I have way to disable NAPI / tx coalescing in the driver. Unfortunately, locking is missing on the rx path, and needs to be extended to _irqsave variant on tx path. I have just replied to a previous thread about that... Yeah, please reply to David's mail where he describes why it can't work. let me to re-check the mails :-) I can try to provide you more details about what I experimented So patch currently looks like this (hand edited, can't be applied, got it working few hours ago). Does it look acceptable? I'd prefer this to go after the patch that pulls common code to single place, so that single place needs to be patched. Plus I guess I should add ifdefs, so that more advanced NAPI / tx coalescing code can be reactivated when it is fixed. Trivial fixes can go on top. Does that sound like a plan? Hmm, what I find strange is that, just this code is running since a long time on several platforms and Chip versions. No raise condition have been found or lock protection problems (also proving look mechanisms). Well, it works better for me when I disable CONFIG_SMP. It is normal that locking problems are hard to reproduce :-(. can you share me the test, maybe I can try to reproduce on ARM box. Are you using 3.x or 4.x GMAC? Pavel, I ask you sorry if I missed some problems so, if you can (as D. Miller asked) to send us a cover letter + all patches I will try to reply soon. I can do also some tests if you ask me that! I could run on 3.x and 4.x but I cannot promise you benchmarks. Actually... I have questions here. David normally pulls from you (can I have a address of your git tree?). No I send the patches to the mailing list. Could you apply these to your git? [PATCH] stmmac ethernet: unify locking [PATCH] stmmac: simplify flag assignment [PATCH] stmmac: cleanup documenation, make it match reality They are rather trivial and independend, I'm not sure what cover letter would say, besides "simple fixes". Then I can re-do the reset on top of that... Which tree do you want patches against? https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ? I think that bug fixing should be on top of net.git but I let Miller to decide. Hmm. It is "only" a performance problem (40msec delays).. I guess -next is better target. ok, maybe if you resend these with a cover-letter I can try to contribute on reviewing (in case of I have missed some detail). Best Regards Peppe Best regards, Pavel
Re: stmmac ethernet in kernel 4.9-rc6: coalescing related pauses.
Hi Alex On 12/2/2016 3:26 PM, Alexandre Torgue wrote: 4.4 has no GMAC4 support. Alex, do you remember any patches to fix that? No sorry Peppe. Pavel, Sorry but I'm a little bit confused. I'm dropped in some mails without historic. I see cleanup, coalescence issue and TSO question. What is your main issue? Are you working on gmac4 or 3.x ? Can you refresh a little bit the story please ? let me try to do a sum, please Pavel feel free to correct me. There are some open points about the tx mitigation schema that we are trying to detail and eventually tune or change (but keeping the same performance on other user-case). In particular, the test case that is raising problem is an unicast tx bench. I suggested Pavel to tune coalesce (IC bit settings) via ethtool and monitor stats but he is getting problems (maybe due to lock). IIUC problems are mainly on new kernel and not on 4.4 where the gmac4 is missing. Please Pavel, could you confirm? Then, there are some open points about lock protections for xstat and Pavel is getting some problem on SMP. I do think that we need to review that. This also could improve the code in critical parts. Also there are some other discussion about the lock protection on NAPI still under discussion. I have not clear if in this case Pavel is getting strange behavior. Regards Peppe
Re: [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
From: Daniel Borkmann Date: Wed, 30 Nov 2016 22:16:06 +0100 > After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"), > the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers > need to hold rcu_read_lock() already to make sure BPF program doesn't > get released in the background. > > Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading. > Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping > in XDP supported drivers and to keep the typecheck on the context intact. > For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can > just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock() > out of the helper. When the driver gets atomic replace support, this will > move to call-sites eventually. > > mlx5 needs actual fixing as it has the same issue as described already in > 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"), > that is, we're under RCU bh at this time, BPF programs are released via > call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark > read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue > reset. > > Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") > Signed-off-by: Daniel Borkmann > Acked-by: Alexei Starovoitov > --- > ( Also here net-next is just fine, imho. ) Applied, thanks.
[PATCH] pull request for net: batman-adv 2016-12-02
Hi David, here is another bugfix which we would like to see integrated into net, if this is possible now. Please pull or let me know of any problem! Thank you, Simon The following changes since commit e13258f38e927b61cdb5f4ad25309450d3b127d1: batman-adv: Detect missing primaryif during tp_send as error (2016-11-04 12:27:39 +0100) are available in the git repository at: git://git.open-mesh.org/linux-merge.git tags/batadv-net-for-davem-20161202 for you to fetch changes up to c2d0f48a13e53b4747704c9e692f5e765e52041a: batman-adv: Check for alloc errors when preparing TT local data (2016-12-02 10:46:59 +0100) Here is another batman-adv bugfix: - fix checking for failed allocation of TVLV blocks in TT local data, by Sven Eckelmann Sven Eckelmann (1): batman-adv: Check for alloc errors when preparing TT local data net/batman-adv/translation-table.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
[PATCH net-next] net_sched: gen_estimator: account for timer drifts
From: Eric Dumazet Under heavy stress, timer used in estimators tend to slowly be delayed by a few jiffies, leading to inaccuracies. Lets remember what was the last scheduled jiffies so that we get more precise estimations, without having to add a multiply/divide in the loop to account for the drifts. Signed-off-by: Eric Dumazet --- net/core/gen_estimator.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c index cad8e791f28e..0993844faeea 100644 --- a/net/core/gen_estimator.c +++ b/net/core/gen_estimator.c @@ -78,8 +78,7 @@ #define EST_MAX_INTERVAL 5 -struct gen_estimator -{ +struct gen_estimator { struct list_headlist; struct gnet_stats_basic_packed *bstats; struct gnet_stats_rate_est64*rate_est; @@ -96,8 +95,8 @@ struct gen_estimator struct rcu_head head; }; -struct gen_estimator_head -{ +struct gen_estimator_head { + unsigned long next_jiffies; struct timer_list timer; struct list_headlist; }; @@ -146,8 +145,15 @@ static void est_timer(unsigned long arg) spin_unlock(e->stats_lock); } - if (!list_empty(&elist[idx].list)) - mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx)); + if (!list_empty(&elist[idx].list)) { + elist[idx].next_jiffies += ((HZ/4) << idx); + + if (unlikely(time_after_eq(jiffies, elist[idx].next_jiffies))) { + /* Ouch... timer was delayed. */ + elist[idx].next_jiffies = jiffies + 1; + } + mod_timer(&elist[idx].timer, elist[idx].next_jiffies); + } rcu_read_unlock(); } @@ -251,9 +257,10 @@ int gen_new_estimator(struct gnet_stats_basic_packed *bstats, setup_timer(&elist[idx].timer, est_timer, idx); } - if (list_empty(&elist[idx].list)) - mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx)); - + if (list_empty(&elist[idx].list)) { + elist[idx].next_jiffies = jiffies + ((HZ/4) << idx); + mod_timer(&elist[idx].timer, elist[idx].next_jiffies); + } list_add_rcu(&est->list, &elist[idx].list); gen_add_node(est); spin_unlock_bh(&est_tree_lock);
[PATCH] batman-adv: Check for alloc errors when preparing TT local data
From: Sven Eckelmann batadv_tt_prepare_tvlv_local_data can fail to allocate the memory for the new TVLV block. The caller is informed about this problem with the returned length of 0. Not checking this value results in an invalid memory access when either tt_data or tt_change is accessed. Reported-by: Dan Carpenter Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann Signed-off-by: Simon Wunderlich --- net/batman-adv/translation-table.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 7f66309..0dc85eb 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -3282,7 +3282,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv *bat_priv, &tvlv_tt_data, &tt_change, &tt_len); - if (!tt_len) + if (!tt_len || !tvlv_len) goto unlock; /* Copy the last orig_node's OGM buffer */ @@ -3300,7 +3300,7 @@ static bool batadv_send_my_tt_response(struct batadv_priv *bat_priv, &tvlv_tt_data, &tt_change, &tt_len); - if (!tt_len) + if (!tt_len || !tvlv_len) goto out; /* fill the rest of the tvlv with the real TT entries */ -- 2.10.2
Re: [PATCH 1/1] ax25: Fix segfault when receiving an iframe with net2kiss loaded
From: Basil Gunn Date: Wed, 30 Nov 2016 11:15:25 -0800 > AX.25 uses sock_queue_rcv_skb() to queue an iframe received packet. > This routine writes NULL to the socket buffer device structure > pointer. The socket buffer is subsequently serviced by > __netif_receiv_skb_core() which dereferences the device structure > pointer & segfaults. > > The fix puts the ax25 device structure pointer back in the socket > buffer struct after sock_queue_rcv_skb() is called. You cannot do this. We have no reference count held on the device object, therefore once access to this SKB leaves the receive packet processing path and thus the RCU protected region, we must set skb->dev to NULL. Queueing the SKB to the socket causes the SKB to leave the receive processing path. You will have to find another way to propagate this device pointer to the code that needs it.
[PATCH net-next] udp: be less conservative with sock rmem accounting
Before commit 850cbaddb52d ("udp: use it's own memory accounting schema"), the udp protocol allowed sk_rmem_alloc to grow beyond the rcvbuf by the whole current packet's truesize. After said commit we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue is empty. As reported by Jesper this cause a performance regression for some (small) values of rcvbuf. This commit is intended to fix the regression restoring the old handling of the rcvbuf limit. Reported-by: Jesper Dangaard Brouer Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Signed-off-by: Paolo Abeni --- net/ipv4/udp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index e1d0bf8..16d88ba 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1205,14 +1205,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) * queue is full; always allow at least a packet */ rmem = atomic_read(&sk->sk_rmem_alloc); - if (rmem && (rmem + size > sk->sk_rcvbuf)) + if (rmem > sk->sk_rcvbuf) goto drop; /* we drop only if the receive buf is full and the receive * queue contains some other skb */ rmem = atomic_add_return(size, &sk->sk_rmem_alloc); - if ((rmem > sk->sk_rcvbuf) && (rmem > size)) + if (rmem > (size + sk->sk_rcvbuf)) goto uncharge_drop; spin_lock(&list->lock); -- 1.8.3.1
Re: arp_filter and IPv6 ND
Hello, On 02.12.2016 16:42, Saku Ytti wrote: > On 2 December 2016 at 16:08, Hannes Frederic Sowa > wrote: > > Hey, > >> May I ask why you want to turn it off? > > Certainly. I don't want device to answer with link address for L3 > address it does not have on the link. In my case it triggers this bug > https://supportforums.cisco.com/document/12098096/cscse46790-cef-prefers-arp-adjacency-over-rib-next-hop Okay, that should not happen. Redirects and neighbor advertisements are the only way how you can announce prefixes on-link. Unfortunately historically we automatically add device routes for prefixes, too. We can't change this anymore but this is wrong. > In this particular case, for one reason or another my Cisco device > would have ND entry for Linux loopback pointing to an interface with > completely different network. Which itself would be just weird, but > combined with weird behaviour of Cisco it actually causes the loopback > route advertised by BGP not to be installed. If the ND entry didn't > exist, the BGP route would be installed. H... Loopback route advertised by BGP? Do you use filter to get rid of that on your AS-border? So you probably don't use an IGP? Do you use next-hop-self attribute on your neighbor in that direction? BGP in general doesn't lead to ND entry installs, protocols like IS-IS afair can short circuit here. Hmm, I would keep the Loopback announcements out of the BGP. > I don't really even know why the ND entry exists, all I can think of > is that Linux must have sent gratuitous reply, because I don't se why > Cisco would have tried to discover it. > > Expected behaviour is that the loopback/128 BGP route resolves to > on-link next-hop, and on-link next hop is then ND'd. Observed > behaviour is that loopback/128 BGP route also appears in ND cache. Yep, exactly. >> In IPv6 this depends on the scope. In IPv4 this concept doesn't really >> exist. >> >> Please notice that in IPv4 arp_filter does not necessarily mean that the >> system is operating in strong end system mode but you end up in an >> hybrid clone where arp is acting strong but routing not and thus you >> also have to add fib rules to simulate that. > > It's just very peculiar behaviour to have ARP or ND entries on a > interface where given subnet does not exist, it rudimentarily causes > difficult to troubleshoot problems and is surprising/unexpected > behaviour. For enterprise and cloud stuff it is certainly very surprising, as some isolations don't work as expected. OTOH it is really easy to build up home networks and things are more plug and play. > Of course well behaving device wouldn't accept such replies, because > it itself could be attack vector (imagine me telling you 8.8.8.8 is on > the link, or worse, your bank). > > I'm curious, why does this behaviour exist? When is this desirable? > I've never seen any other device than Linux behave like this, and when > ever I've heard about the problem, I've only seen surprised faces that > it does behave like this. I don't feel comfortable to answer that, just some thoughts... Some RFCs require that for some router implementations (CPE), on the other hand weak end model in Linux was probably inherited by IPv4. The addition of duplicate address detection (which of course only makes sense in strong end systems) to IPv6, basically shows that IPv6 is more or less designed to be a strong end system model. Anyway, a patch to suppress ndisc requests on those interfaces will probably be accepted. For unicast reverse filtering e.g. there is actually no sysctl available anymore, instead you are supposed to install a netfilter rule to handle this, which automatically takes care of this. Bye, Hannes
Re: [PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
On 12/02/2016 05:03 AM, Ivan Khoronzhuk wrote: > On Thu, Dec 01, 2016 at 05:34:27PM -0600, Grygorii Strashko wrote: >> The currently processing cpdma descriptor with EOQ flag set may >> contain two values in Next Descriptor Pointer field: >> - valid pointer: means CPDMA missed addition of new desc in queue; > It shouldn't happen in normal circumstances, right? it might happen, because desc push compete with desc pop. You can check stats values: chan->stats.misqueued chan->stats.requeue under different types of net-loads. TRM: " If the pNext pointer is initially NULL, and more packets need to be queued for transmit, the software application may alter this pointer to point to a newly appended descriptor. The EMAC will use the new pointer value and proceed to the next descriptor unless the pNext value has already been read. In this latter case, the transmitter will halt on the transmit channel in question, and the software application may restart it at that time. The software can detect this case by checking for an end of queue (EOQ) condition flag on the updated packet descriptor when it is returned by the EMAC. " > So, why it happens only for egress channels? And Does that mean > there is some resynchronization between submit and process function, > or this is h/w issue? no hw issues. this patch just removes one unnecessary I/O access > >> - null: no more descriptors in queue. >> In the later case, it's not required to write to HDP register, but now >> CPDMA does it. >> >> Hence, add additional check for Next Descriptor Pointer != null in >> cpdma_chan_process() function before writing in HDP register. >> >> Signed-off-by: Grygorii Strashko >> --- >> drivers/net/ethernet/ti/davinci_cpdma.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c >> b/drivers/net/ethernet/ti/davinci_cpdma.c >> index 0924014..379314f 100644 >> --- a/drivers/net/ethernet/ti/davinci_cpdma.c >> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c >> @@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan >> *chan) >> chan->count--; >> chan->stats.good_dequeue++; >> >> -if (status & CPDMA_DESC_EOQ) { >> +if ((status & CPDMA_DESC_EOQ) && chan->head) { >> chan->stats.requeue++; >> chan_write(chan, hdp, desc_phys(pool, chan->head)); >> } >> -- >> 2.10.1 >> -- regards, -grygorii
Re: [PATCH net-next v4 0/4] bpf: BPF for lightweight tunnel encapsulation
From: Thomas Graf Date: Wed, 30 Nov 2016 17:10:07 +0100 > This series implements BPF program invocation from dst entries via the > lightweight tunnels infrastructure. Nice work, applied, thanks Thomas.
Re: [PATCH net-next] net/mlx5e: skip loopback selftest with !CONFIG_INET
From: Arnd Bergmann Date: Wed, 30 Nov 2016 22:05:39 +0100 > When CONFIG_INET is disabled, the new selftest results in a link > error: > > drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function > `mlx5e_test_loopback': > en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to > `ip_send_check' > en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to > `udp4_hwcsum' > > This hides the specific test in that configuration. > > Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest") > Signed-off-by: Arnd Bergmann Applied, thanks Arnd.
Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error
On Fri, 2016-12-02 at 10:55 -0200, Guilherme G. Piccoli wrote: > On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote: > > Whenever the igb driver detects the result of a read operation > > returns > > a value composed only by F's (like 0x), it will detach the > > net_device, clear the hw_addr pointer and warn to the user that > > adapter's > > link is lost - those steps happen on igb_rd32(). > > > > In case a PCI error happens on Power architecture, there's a > > recovery > > mechanism called EEH, that will reset the PCI slot and call > > driver's > > handlers to reset the adapter and network functionality as well. > > > > We observed that once hw_addr is NULL after the error is detected > > on > > igb_rd32(), it's never assigned back, so in the process of > > resetting > > the network functionality we got a NULL pointer dereference in both > > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to > > avoid > > such bug, this patch re-assigns the hw_addr value in the slot_reset > > handler. > > > > Reported-by: Anthony H. Thai > > Reported-by: Harsha Thyagaraja > > Signed-off-by: Guilherme G. Piccoli > > --- > > drivers/net/ethernet/intel/igb/igb_main.c | 5 + > > 1 file changed, 5 insertions(+) > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > > b/drivers/net/ethernet/intel/igb/igb_main.c > > index edc9a6a..136ee9e 100644 > > --- a/drivers/net/ethernet/intel/igb/igb_main.c > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > > @@ -7878,6 +7878,11 @@ static pci_ers_result_t > > igb_io_slot_reset(struct pci_dev *pdev) > > pci_enable_wake(pdev, PCI_D3hot, 0); > > pci_enable_wake(pdev, PCI_D3cold, 0); > > > > + /* In case of PCI error, adapter lose its HW > > address > > + * so we should re-assign it here. > > + */ > > + hw->hw_addr = adapter->io_addr; > > + > > igb_reset(adapter); > > wr32(E1000_WUS, ~0); > > result = PCI_ERS_RESULT_RECOVERED; > > > > Ping? > > Sorry to annoy, any news on this? > Thanks in advance! > > Cheers, > > > Guilherme > This seems reasonable. It's similar to what fm10k driver does under this circumstance. Thanks, Jake
Re: [flamebait] xdp, well meaning but pointless
On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa wrote: > On 02.12.2016 11:24, Jesper Dangaard Brouer wrote: >> On Thu, 1 Dec 2016 13:51:32 -0800 >> Tom Herbert wrote: >> > The technical plenary at last IETF on Seoul a couple of weeks ago was > exclusively focussed on DDOS in light of the recent attack against > Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare > presentation by Nick Sullivan > (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) > alluded to some implementation of DDOS mitigation. In particular, on > slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" >> >> slide 14 >> > numbers he gave we're based in iptables+BPF and that was a whole > 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic > and that's also when I introduced XDP to whole IETF :-) ). If that's > the best we can do the Internet is in a world hurt. DDOS mitigation > alone is probably a sufficient motivation to look at XDP. We need > something that drops bad packets as quickly as possible when under > attack, we need this to be integrated into the stack, we need it to be > programmable to deal with the increasing savvy of attackers, and we > don't want to be forced to be dependent on HW solutions. This is why > we created XDP! >> >> The 1.2Mpps number is a bit low, but we are unfortunately in that >> ballpark. >> I totally understand that. But in my reply to David in this thread I mentioned DNS apex processing as being problematic which is actually being referred in your linked slide deck on page 9 ("What do floods look like") and the problematic of parsing DNS packets in XDP due to string processing and looping inside eBPF. >> >> That is a weak argument. You do realize CloudFlare actually use eBPF to >> do this exact filtering, and (so-far) eBPF for parsing DNS have been >> sufficient for them. > > You are talking about this code on the following slides (I actually > transcribed it for you here and disassembled): > > l0: ld #0x14 > l1: ldxb 4*([0]&0xf) > l2: add x > l3: tax > l4: ld [x+0] > l5: jeq #0x7657861, l6, l13 > l6: ld [x+4] > l7: jeq #0x6d706c65, l8, l13 > l8: ld [x+8] > l9: jeq #0x3636f6d, l10, l13 > l10:ldb [x+12] > l11:jeq #0, l12, l13 > l12:ret #0x1 > l13:ret #0 > > You can offload this to u32 in hardware if that is what you want. > > The reason this works is because of netfilter, which allows them to > dynamically generate BPF programs and insert and delete them from > chains, do intersection or unions of them. > > If you have a freestanding program like in XDP the complexity space is a > different one and not comparable to this at all. > I don't understand this comment about complexity especially in regards to the idea of offloading u32 to hardware. Relying on hardware to do anything always leads to more complexity than an equivalent SW implementation for the same functionality. The only reason we ever use a hardware mechanisms is if it gives *significantly* better performance. If the performance difference isn't there then doing things in SW is going to be the better path (as we see in XDP). Tom > Bye, > Hannes >
Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
Hello Sasha, Were you able to reproduce this issue? Do you have a patch fixing the close function inconsistencies that you mentioned which I could try out? Thanks, Tyler On 11/21/2016 1:40 PM, Baicar, Tyler wrote: On 11/17/2016 6:31 AM, Neftin, Sasha wrote: On 11/13/2016 10:34 AM, Neftin, Sasha wrote: On 11/11/2016 12:35 AM, Baicar, Tyler wrote: Hello Sasha, On 11/9/2016 11:19 PM, Neftin, Sasha wrote: On 11/9/2016 11:41 PM, Tyler Baicar wrote: Move IRQ free code so that it will happen regardless of the __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ if the __E1000_DOWN bit is cleared. This is not sufficient because it is possible for __E1000_DOWN to be set without releasing the IRQ. In such a situation, we will hit a kernel bug later in e1000_remove because the IRQ still has action since it was never freed. A secondary bus reset can cause this case to happen. Signed-off-by: Tyler Baicar --- drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 7017281..36cfcb0 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev) if (!test_bit(__E1000_DOWN, &adapter->state)) { e1000e_down(adapter, true); -e1000_free_irq(adapter); /* Link status message must follow this format */ pr_info("%s NIC Link is Down\n", adapter->netdev->name); } +e1000_free_irq(adapter); + napi_disable(&adapter->napi); e1000e_free_tx_resources(adapter->tx_ring); I would like not recommend insert this change. This change related driver state machine, we afraid from lot of synchronization problem and issues. We need keep e1000_free_irq in loop and check for 'test_bit' ready. What do you mean here? There is no loop. If __E1000_DOWN is set then we will never free the IRQ. Another point, does before execute secondary bus reset your SW back up pcie configuration space as properly? After a secondary bus reset, the link needs to recover and go back to a working state after 1 second. From the callstack, the issue is happening while removing the endpoint from the system, before applying the secondary bus reset. The order of events is 1. remove the drivers 2. cause a secondary bus reset 3. wait 1 second Actually, this is too much, usually link up in less than 100ms.You can check Data Link Layer indication. 4. recover the link callstack: free_msi_irqs+0x6c/0x1a8 pci_disable_msi+0xb0/0x148 e1000e_reset_interrupt_capability+0x60/0x78 e1000_remove+0xc8/0x180 pci_device_remove+0x48/0x118 __device_release_driver+0x80/0x108 device_release_driver+0x2c/0x40 pci_stop_bus_device+0xa0/0xb0 pci_stop_bus_device+0x3c/0xb0 pci_stop_root_bus+0x54/0x80 acpi_pci_root_remove+0x28/0x64 acpi_bus_trim+0x6c/0xa4 acpi_device_hotplug+0x19c/0x3f4 acpi_hotplug_work_fn+0x28/0x3c process_one_work+0x150/0x460 worker_thread+0x50/0x4b8 kthread+0xd4/0xe8 ret_from_fork+0x10/0x50 Thanks, Tyler Hello Tyler, Okay, we need consult more about this suggestion. May I ask what is setup you run? Is there NIC or on board LAN? I would like try reproduce this issue in our lab's too. Also, is same issue observed with same scenario and others NIC's too? Sasha ___ Intel-wired-lan mailing list intel-wired-...@lists.osuosl.org http://lists.osuosl.org/mailman/listinfo/intel-wired-lan Hello Tyler, I see some in consistent implementation of __*_close methods in our drivers. Do you have any igb NIC to check if same problem persist there? Thanks, Sasha Hello Sasha, I couldn't find an igb NIC to test with, but I did find another e1000e card that does not cause the same issue. That card is: 0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection Subsystem: Intel Corporation Gigabit CT Desktop Adapter Physical Slot: 5-1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 299 Region 0: Memory at c01001c (32-bit, non-prefetchable) [size=128K] Region 1: Memory at c010010 (32-bit, non-prefetchable) [size=512K] Region 2: I/O ports at 1000 [size=32] Region 3: Memory at c01001e (32-bit, non-prefetchable) [size=16K] Expansion ROM at c010018 [disabled] [size=256K] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit
Re: [PATCH] net: atarilance: use %8ph for printing hex string
From: Rasmus Villemoes Date: Wed, 30 Nov 2016 23:02:54 +0100 > This is already using the %pM printf extension; might as well also use > %ph to make the code smaller. > > Signed-off-by: Rasmus Villemoes Applied, thank you.
Re: net/can: warning in raw_setsockopt/__alloc_pages_slowpath
On 12/02/2016 04:42 PM, Marc Kleine-Budde wrote: On 12/02/2016 04:11 PM, Oliver Hartkopp wrote: On 12/02/2016 02:24 PM, Marc Kleine-Budde wrote: On 12/02/2016 01:43 PM, Andrey Konovalov wrote: [] raw_setsockopt+0x1be/0x9f0 net/can/raw.c:506 We should add a check for a sensible optlen static int raw_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { struct sock *sk = sock->sk; struct raw_sock *ro = raw_sk(sk); struct can_filter *filter = NULL; /* dyn. alloc'ed filters */ struct can_filter sfilter; /* single filter */ struct net_device *dev = NULL; can_err_mask_t err_mask = 0; int count = 0; int err = 0; if (level != SOL_CAN_RAW) return -EINVAL; switch (optname) { case CAN_RAW_FILTER: if (optlen % sizeof(struct can_filter) != 0) return -EINVAL; here... if (optlen > 64 * sizeof(struct can_filter)) return -EINVAL; Agreed. But what is sensible here? 64 filters is way to small IMO. When thinking about picking a bunch of single CAN IDs I would tend to something like 512 filters. Ok - 64 was just an arbitrary chosen value for demonstration purposes :) :-) Would you like to send a patch? Regards, Oliver
Re: [PATCH/RFC iproute2/net-next 2/3] tc: flower: introduce enum flower_endpoint
Fri, Dec 02, 2016 at 10:59:44AM CET, simon.hor...@netronome.com wrote: >Introduce enum flower_endpoint and use it instead of a bool >as the type for paramatising source and destination. > >This is intended to improve read-ability and provide some type >checking of endpoint parameters. > >Signed-off-by: Simon Horman >--- > tc/f_flower.c | 22 ++ > 1 file changed, 14 insertions(+), 8 deletions(-) > >diff --git a/tc/f_flower.c b/tc/f_flower.c >index 615e8f27bed2..42253067b43d 100644 >--- a/tc/f_flower.c >+++ b/tc/f_flower.c >@@ -23,6 +23,11 @@ > #include "tc_util.h" > #include "rt_names.h" > >+enum flower_endpoint { >+ flower_src, >+ flower_dst FLOWER_ENDPOINT_SRC, FLOWER_ENDPOINT_DST, >+}; >+ > static void explain(void) > { > fprintf(stderr, >@@ -160,29 +165,30 @@ static int flower_parse_ip_addr(char *str, __be16 >eth_type, > return 0; > } > >-static int flower_port_attr_type(__u8 ip_proto, bool is_src) >+static int flower_port_attr_type(__u8 ip_proto, enum flower_endpoint endpoint) > { > if (ip_proto == IPPROTO_TCP) >- return is_src ? TCA_FLOWER_KEY_TCP_SRC : >+ return endpoint == flower_src ? TCA_FLOWER_KEY_TCP_SRC : > TCA_FLOWER_KEY_TCP_DST; > else if (ip_proto == IPPROTO_UDP) >- return is_src ? TCA_FLOWER_KEY_UDP_SRC : >+ return endpoint == flower_src ? TCA_FLOWER_KEY_UDP_SRC : > TCA_FLOWER_KEY_UDP_DST; > else if (ip_proto == IPPROTO_SCTP) >- return is_src ? TCA_FLOWER_KEY_SCTP_SRC : >+ return endpoint == flower_src ? TCA_FLOWER_KEY_SCTP_SRC : > TCA_FLOWER_KEY_SCTP_DST; > else > return -1; > } > >-static int flower_parse_port(char *str, __u8 ip_proto, bool is_src, >+static int flower_parse_port(char *str, __u8 ip_proto, >+ enum flower_endpoint endpoint, >struct nlmsghdr *n) > { > int ret; > int type; > __be16 port; > >- type = flower_port_attr_type(ip_proto, is_src); >+ type = flower_port_attr_type(ip_proto, endpoint); > if (type < 0) > return -1; > >@@ -340,14 +346,14 @@ static int flower_parse_opt(struct filter_util *qu, char >*handle, > } > } else if (matches(*argv, "dst_port") == 0) { > NEXT_ARG(); >- ret = flower_parse_port(*argv, ip_proto, false, n); >+ ret = flower_parse_port(*argv, ip_proto, flower_dst, n); > if (ret < 0) { > fprintf(stderr, "Illegal \"dst_port\"\n"); > return -1; > } > } else if (matches(*argv, "src_port") == 0) { > NEXT_ARG(); >- ret = flower_parse_port(*argv, ip_proto, true, n); >+ ret = flower_parse_port(*argv, ip_proto, flower_src, n); > if (ret < 0) { > fprintf(stderr, "Illegal \"src_port\"\n"); > return -1; >-- >2.7.0.rc3.207.g0ac5344 >
Re: [PATCH/RFC iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code
Fri, Dec 02, 2016 at 10:59:45AM CET, simon.hor...@netronome.com wrote: >Support matching on ICMP type and code. > >Example usage: > >tc qdisc add dev eth0 ingress > >tc filter add dev eth0 protocol ip parent : flower \ > indev eth0 ip_proto icmp type 8 code 0 action drop > >tc filter add dev eth0 protocol ipv6 parent : flower \ > indev eth0 ip_proto icmpv6 type 128 code 0 action drop > >Signed-off-by: Simon Horman >--- > man/man8/tc-flower.8 | 20 --- > tc/f_flower.c| 96 > 2 files changed, 105 insertions(+), 11 deletions(-) > >diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 >index a401293fed50..c01ace6249dd 100644 >--- a/man/man8/tc-flower.8 >+++ b/man/man8/tc-flower.8 >@@ -29,7 +29,7 @@ flower \- flow based traffic control filter > .IR PRIORITY " | " > .BR vlan_eth_type " { " ipv4 " | " ipv6 " | " > .IR ETH_TYPE " } | " >-.BR ip_proto " { " tcp " | " udp " | " sctp " | " >+.BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | " > .IR IP_PROTO " } | { " > .BR dst_ip " | " src_ip " } { " > .IR ipv4_address " | " ipv6_address " } | { " >@@ -94,7 +94,7 @@ or an unsigned 16bit value in hexadecimal format. > Match on layer four protocol. > .I IP_PROTO > may be >-.BR tcp ", " udp ", " sctp >+.BR tcp ", " udp ", " sctp ", " icmp ", " icmpv6 > or an unsigned 8bit value in hexadecimal format. > .TP > .BI dst_ip " ADDRESS" >@@ -112,6 +112,13 @@ option of tc filter. > Match on layer 4 protocol source or destination port number. Only available > for > .BR ip_proto " values " udp ", " tcp " and " sctp > which have to be specified in beforehand. >+.TP >+.BI type " NUMBER" >+.TQ >+.BI code " NUMBER" >+Match on ICMP type or code. Only available for >+.BR ip_proto " values " icmp " and " icmpv6 >+which have to be specified in beforehand. > .SH NOTES > As stated above where applicable, matches of a certain layer implicitly depend > on the matches of the next lower layer. Precisely, layer one and two matches >@@ -120,13 +127,16 @@ have no dependency, layer three matches > (\fBip_proto\fR, \fBdst_ip\fR and \fBsrc_ip\fR) > depend on the > .B protocol >-option of tc filter >-and finally layer four matches >+option of tc filter, layer four port matches > (\fBdst_port\fR and \fBsrc_port\fR) > depend on > .B ip_proto > being set to >-.BR tcp ", " udp " or " sctp. >+.BR tcp ", " udp " or " sctp, >+and finally ICMP matches (\fBcode\fR and \fBtype\fR) depend on >+.B ip_proto >+being set to >+.BR icmp " or " icmpv6. > .P > There can be only used one mask per one prio. If user needs to specify > different > mask, he has to use different prio. >diff --git a/tc/f_flower.c b/tc/f_flower.c >index 42253067b43d..59f6f1ea26e6 100644 >--- a/tc/f_flower.c >+++ b/tc/f_flower.c >@@ -28,6 +28,11 @@ enum flower_endpoint { > flower_dst > }; > >+enum flower_icmp_field { >+ flower_icmp_type, >+ flower_icmp_code FLOWER_ICMP_FIELD_TYPE, FLOWER_ICMP_FIELD_CODE, >+}; >+ > static void explain(void) > { > fprintf(stderr, >@@ -42,11 +47,13 @@ static void explain(void) > " vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] > |\n" > " dst_mac MAC-ADDR |\n" > " src_mac MAC-ADDR |\n" >- " ip_proto [tcp | udp | sctp | IP-PROTO ] >|\n" >+ " ip_proto [tcp | udp | sctp | icmp | >icmpv6 | IP-PROTO ] |\n" > " dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n" > " src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n" > " dst_port PORT-NUMBER |\n" >- " src_port PORT-NUMBER }\n" >+ " src_port PORT-NUMBER |\n" >+ " type ICMP-TYPE |\n" >+ " code ICMP-CODE }\n" > " FILTERID := X:Y:Z\n" > " ACTION-SPEC := ... look at individual actions\n" > "\n" >@@ -95,16 +102,23 @@ static int flower_parse_ip_proto(char *str, __be16 >eth_type, int type, > int ret; > __u8 ip_proto; > >- if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6)) { >- fprintf(stderr, "Illegal \"eth_type\" for ip proto\n"); >- return -1; >- } >+ if (eth_type != htons(ETH_P_IP) && eth_type != htons(ETH_P_IPV6)) >+ goto err; >+ > if (matches(str, "tcp") == 0) { > ip_proto = IPPROTO_TCP; > } else if (matches(str, "udp") == 0) { > ip_proto = IPPROTO_UDP; > } else if (matches(str, "sctp") == 0) { > ip_proto = IPPROTO_SCTP; >+ } else if (matches(str, "icmp") == 0) { >+ if (eth_type != htons(ETH_P_IP)) >+ goto err; >+ ip_proto = IPPROTO_ICMP;
Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP
Fri, Dec 02, 2016 at 10:59:42AM CET, simon.hor...@netronome.com wrote: >Add support for matching on ICMP type and code to flower. This is modeled >on existing support for matching on L4 ports. > >The second patch provided a minor cleanup which is in keeping with >they style used in the last patch. > >This is marked as an RFC to match the same designation given to the >corresponding kernel patches. Looks nice, I only have those 2 enum nitpicks. Thanks. > >Based on iproute2/net-next with the following applied: >* [[PATCH iproute2/net-next v2] 0/4] tc: flower: SCTP and other port fixes > >Simon Horman (3): > tc: flower: update headers for TCA_FLOWER_KEY_ICMP* > tc: flower: introduce enum flower_endpoint > tc: flower: support matching on ICMP type and code > > include/linux/pkt_cls.h | 10 > man/man8/tc-flower.8| 20 ++-- > tc/f_flower.c | 118 ++-- > 3 files changed, 129 insertions(+), 19 deletions(-) > >-- >2.7.0.rc3.207.g0ac5344 >
Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP
Fri, Dec 02, 2016 at 10:52:30AM CET, simon.hor...@netronome.com wrote: >Hi, > >this series add supports for matching on ICMP type and code to cls_flower. >This is modeled on existing support for matching on L4 ports. The updates >to the dissector are intended to allow for code and storage re-use. Looks fine to me. Thanks! > >Simon Horman (2): > flow dissector: ICMP support > net/sched: cls_flower: Support matching on ICMP type and code > > drivers/net/bonding/bond_main.c | 6 +++-- > include/linux/skbuff.h | 5 + > include/net/flow_dissector.h| 50 ++--- > include/uapi/linux/pkt_cls.h| 10 + > net/core/flow_dissector.c | 34 +--- > net/sched/cls_flow.c| 4 ++-- > net/sched/cls_flower.c | 42 ++ > 7 files changed, 141 insertions(+), 10 deletions(-) > >-- >2.7.0.rc3.207.g0ac5344 >
Re: [PATCH 2/2] net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
From: Lino Sanfilippo Date: Wed, 30 Nov 2016 23:48:32 +0100 > The driver already uses its private lock for synchronization between xmit > and xmit completion handler making the additional use of the xmit_lock > unnecessary. > Furthermore the driver does not set NETIF_F_LLTX resulting in xmit to be > called with the xmit_lock held and then taking the private lock while xmit > completion handler does the reverse, first take the private lock, then the > xmit_lock. > Fix these issues by not taking the xmit_lock in the tx completion handler. > > Signed-off-by: Lino Sanfilippo Yeah that could be a nasty deadlock, in fact. Applied, thanks.
Re: Initial thoughts on TXDP
On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree wrote: > On 01/12/16 23:46, Tom Herbert wrote: >> The only time we >> _really_ to allocate an skbuf is when we need to put the packet onto a >> queue. All the other use cases are really just to pass a structure >> containing a packet from function to function. For that purpose we >> should be able to just pass a much smaller structure in a stack >> argument and only allocate an skbuff when we need to enqueue. In cases >> where we don't ever queue a packet we might never need to allocate any >> skbuff > Now this intrigues me, because one of the objections to bundling (vs GRO) > was the memory usage of all those SKBs. IIRC we already do a 'GRO-like' > coalescing when packets reach a TCP socket anyway (or at least in some > cases, not sure if all the different ways we can enqueue a TCP packet for > RX do it), but if we could carry the packets from NIC to socket without > SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra > memory (the packet-pages are all already allocated on the NIC RX ring). > Possibly combine the two, so that rather than having potentially four > versions of each function (skb, skbundle, void*, void* bundle) you just > have the two 'ends'. > Yep, seems like a good idea to incorporate bundling into TXDP from the get-go. Tom > -Ed
Re: [PATCH net] packet: fix race condition in packet_set_ring
From: Eric Dumazet Date: Wed, 30 Nov 2016 14:55:36 -0800 > From: Philip Pettersson > > When packet_set_ring creates a ring buffer it will initialize a > struct timer_list if the packet version is TPACKET_V3. This value > can then be raced by a different thread calling setsockopt to > set the version to TPACKET_V1 before packet_set_ring has finished. > > This leads to a use-after-free on a function pointer in the > struct timer_list when the socket is closed as the previously > initialized timer will not be deleted. > > The bug is fixed by taking lock_sock(sk) in packet_setsockopt when > changing the packet version while also taking the lock at the start > of packet_set_ring. > > Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") > Signed-off-by: Philip Pettersson > Signed-off-by: Eric Dumazet Applied and queued up for -stable, thanks.
Re: [PATCH 1/2] net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
From: Lino Sanfilippo Date: Wed, 30 Nov 2016 23:48:31 +0100 > An explicit dma sync for device directly after mapping as well as an > explicit dma sync for cpu directly before unmapping is unnecessary and > costly on the hotpath. So remove these calls. > > Signed-off-by: Lino Sanfilippo Applied.
Re: [flamebait] xdp, well meaning but pointless
On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal wrote: > In light of DPDKs existence it make a lot more sense to me to provide > a). a faster mmap based interface (possibly AF_PACKET based) that allows > to map nic directly into userspace, detaching tx/rx queue from kernel. > > John Fastabend sent something like this last year as a proof of > concept, iirc it was rejected because register space got exposed directly > to userspace. I think we should re-consider merging netmap > (or something conceptually close to its design). I'm actually working in this direction, of zero-copy RX mapping packets into userspace. This work is mostly related to page_pool, and I only plan to use XDP as a filter for selecting packets going to userspace, as this choice need to be taken very early. My design is here: https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html This is mostly about changing the memory model in the drivers, to allow for safely mapping pages to userspace. (An efficient queue mechanism is not covered). People often overlook that netmap's efficiency *also* comes from introducing pre-mapping memory/pages to userspace. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot Date: Wed, 30 Nov 2016 17:59:27 -0500 > @@ -765,6 +765,9 @@ struct mv88e6xxx_ops { > int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg, >u16 val); > > + /* Switch Software Reset */ > + int (*reset)(struct mv88e6xxx_chip *chip); > + I think Andrew's request to name this method "g1_reset" is reasonable, please respin with that change. Thanks.
Re: [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
On 12/02/2016 05:28 AM, Ivan Khoronzhuk wrote: > On Thu, Dec 01, 2016 at 05:34:30PM -0600, Grygorii Strashko wrote: >> Add optional property "descs_pool_size" to specify buffer descriptor's >> pool size. The "descs_pool_size" should define total number of CPDMA >> CPPI descriptors to be used for both ingress/egress packets >> processing. If not specified - the default value 256 will be used >> which will allow to place descriptor's pool into the internal CPPI >> RAM on most of TI SoC. >> >> Signed-off-by: Grygorii Strashko >> --- >> Documentation/devicetree/bindings/net/cpsw.txt | 5 + >> 1 file changed, 5 insertions(+) >> >> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt >> b/Documentation/devicetree/bindings/net/cpsw.txt >> index 5ad439f..b99d196 100644 >> --- a/Documentation/devicetree/bindings/net/cpsw.txt >> +++ b/Documentation/devicetree/bindings/net/cpsw.txt >> @@ -35,6 +35,11 @@ Optional properties: >>For example in dra72x-evm, pcf gpio has to be >>driven low so that cpsw slave 0 and phy data >>lines are connected via mux. >> +- descs_pool_size : total number of CPDMA CPPI descriptors to be used for >> + both ingress/egress packets processing. if not >> + specified the default value 256 will be used which >> + will allow to place descriptors pool into the >> + internal CPPI RAM. > Does it describe h/w? Why now module parameter? or even smth like ethtool num > ring entries? > It can be module parameter too. in general this is expected to be one-time boot setting only. - OR So, do you propose to use ethtool -g ethX ethtool -G ethX [rx N] [tx N] ? Now cpdma has one pool for all RX/TX channels, so changing this settings by ethtool will require: pause interfaces, reallocate cpdma pool, re-arrange buffers between channels, resume interface. Correct? How do you think - we can move forward with one pool or better to have two (Rx and Tx)? Wouldn't it be reasonable to still have DT (or module) parameter to avoid cpdma reconfiguration on system startup (pause/resume interfaces) (faster boot)? How about cpdma re-allocation policy (with expectation that is shouldn't happen too often)? - increasing of Rx, Tx will grow total number of physically allocated buffers (total_desc_num) - decreasing of Rx, Tx will just change number of available buffers (no memory re-allocation) - OR Can we move forward with current patch (total number of CPDMA CPPI descriptors defined in DT) and add ethtool -G ethX [rx N] [tx N] which will allow to re-split descs between RX and TX? -- regards, -grygorii
Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Erik Nordmark Date: Wed, 30 Nov 2016 15:39:19 -0800 > @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb) > have_ifp: > if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) { > if (dad) { > + if (nonce != 0 && ifp->dad_nonce == nonce) { > + u8 *np = (u8 *)&nonce; > + /* Matching nonce if looped back */ > + ND_PRINTK(2, notice, > + "%s: IPv6 DAD loopback for > address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n", > + ifp->idev->dev->name, > + &ifp->addr, > + np[0], np[1], np[2], np[3], > + np[4], np[5]); I know you said you'd leave this, but I'd actually like to ask that you use %pM here to save some kernel size. Thank you.
Re: [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output
From: Eli Cooper Date: Thu, 1 Dec 2016 10:05:10 +0800 > When xfrm is applied to TSO/GSO packets, it follows this path: > > xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() > > where skb_gso_segment() relies on skb->protocol to function properly. > > This patch sets skb->protocol to ETH_P_IP before dst_output() is called, > fixing a bug where GSO packets sent through a sit tunnel are dropped > when xfrm is involved. > > Cc: sta...@vger.kernel.org > Signed-off-by: Eli Cooper Applied.
Re: [PATCH net v2 2/3] ipv6: Set skb->protocol properly for local output
From: Eli Cooper Date: Thu, 1 Dec 2016 10:05:11 +0800 > When xfrm is applied to TSO/GSO packets, it follows this path: > > xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() > > where skb_gso_segment() relies on skb->protocol to function properly. > > This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called, > fixing a bug where GSO packets sent through an ipip6 tunnel are dropped > when xfrm is involved. > > Cc: sta...@vger.kernel.org > Signed-off-by: Eli Cooper Applied.
Re: [PATCH net v2 3/3] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
From: Eli Cooper Date: Thu, 1 Dec 2016 10:05:12 +0800 > This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d > ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"). > > skb->protocol is now set in __ip_local_out() and __ip6_local_out() before > dst_output() is called. It is no longer necessary to do it for each tunnel. > > Cc: sta...@vger.kernel.org > Signed-off-by: Eli Cooper Applied.
Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
Hi Andrew, Andrew Lunn writes: > + /* Switch Software Reset */ > + int (*g1_reset)(struct mv88e6xxx_chip *chip); > + > > We have a collection of function pointers with port_ prefix, another > collection with stats_, and a third with ppu_, etc. And then we have > some which do not fit a specific category. Those i have prefixed with > g1_ or g2_. I think we should have some prefix, and that is my > suggestion. I disagree. There's only one entry point to issue a switch software reset, so .reset is enough. I use this opportunity to give a bit of details about mv88e6xxx/ so that things get written down at least once somewhere: global1.c implements accessors to "Global 1 Registers" features and are prefixed with mv88e6xxx_g1_; port.c implements accessors to "Port Registers" features and are prefixed with mv88e6xxx_port_, and so on. (where xxx can be a model if there's conflict due to a redefinition of the same register) If a feature is not present or if there's more than one way to access it, these accessors are bundled in the per-chip mv88e6xxx_ops structure for disambiguation. chip.c implements support for a single chip by aggregating and nicely wrapping these operations. It provides a generic API for Marvell switches, used to implement net/dsa routines. Here's a couple of example. Setting a switch MAC can be done in Global 1, or Global 2 depending on the model. Thus .set_switch_mac can be mv88e6xxx_g1_set_switch_mac or mv88e6xxx_g2_set_switch_mac. Setting the port's speed is always in the same Port register, but its layout varies with the model. Thus .port_set_speed can be mv88e6185_port_set_speed or mv88e6352_port_set_speed. Thanks, Vivien
Re: [PATCH v2] cpsw: ethtool: add support for getting/setting EEE registers
On 12/02/2016 01:11 AM, Giuseppe CAVALLARO wrote: > Hi Florian > sorry for my delay. > > On 11/24/2016 7:23 PM, Florian Fainelli wrote: >> +Peppe, >> >> Le 24/11/2016 à 07:38, Andrew Lunn a écrit : As for enabling advertising and correct working of cpsw do you mean it would be better to disable EEE in any PHY on cpsw initialization as long as cpsw doesn't provide support for EEE? We observe some strange behavior with our gigabit PHYs and a link partner in a EEE-capable unmanaged NetGear switch. Disabling advertising seems to help. Though we're still investigating the issue. >>> >>> Hi Florian >>> >>> Am i right in saying, a PHY should not advertise EEE until the MAC >>> driver calls phy_init_eee(), indicating the MAC supports EEE? >> >> You would think so, but I don't see how this could possibly work if that >> was not the case already, see below. >> >>> >>> If so, it looks like we need to change a few of the PHY drivers, in >>> particular, the bcm-*.c. >> >> The first part that bcm-phy-lib.c does is make sure that EEE is enabled >> such that this gets reflected in MDIO_PCS_EEE_ABLE, without this, we >> won't be able to pass the first test in phy_init_eee(). The second part >> is to advertise EEE such that this gets reflected in MDIO_AN_EEE_ADV, >> also to make sure that we can pass the second check in phy_init_eee(). >> >> Now, looking at phy_init_eee(), and what stmmac does (and bcmgenet, >> copied after stmmac), we need to somehow, have EEE advertised for >> phy_init_eee() to succeed, prepare the MAC to support EEE, and finally >> conclude with a call to phy_ethtool_set_eee(), which writes to the >> MDIO_AN_EEE_ADV register, and concludes the EEE auto-negotiated process. >> Since we already have EEE advertised, we are essentially just checking >> that the EEE advertised settings and the LP advertised settings actually >> do match, so it sounds like the final call to phy_ethtool_set_eee() is >> potentially useless if the resolved advertised and link partner >> advertised settings already match... >> >> So it sounds like at least, the first time you try to initialize EEE, we >> should start with EEE not advertised, and then, if we have EEE enabled >> at some point, and we re-negotiate the link parameters, somehow >> phy_init_eee() does a right job for that. >> >> Peppe, any thoughts on this? > > I share what you say. > > In sum, the EEE management inside the stmmac is: > > - the driver looks at own HW cap register if EEE is supported > > (indeed the user could keep disable EEE if bugged on some HW > + Alex, Fabrice: we had some patches for this to propose where we > called the phy_ethtool_set_eee to disable feature at phy > level > > - then the stmmac asks PHY layer to understand if transceiver and > partners are EEE capable. > > - If all matches the EEE is actually initialized. > > the logic above should be respected when use ethtool, hmm, I will > check the stmmac_ethtool_op_set_eee asap. > > Hoping this is useful This is definitively useful, the only part that I am struggling to understand in phy_init_eee() is this: eee_adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV, MDIO_MMD_AN); if (eee_adv <= 0) goto eee_exit_err; if we are not already advertising EEE in the PHY's MMIO_MMD_AN page, by the time we call phy_init_eee(), then we cannot complete the EEE configuration at the PHY level, and presumably we should abort the EEE configuration at the MAC level. While this condition makes sense if e.g: you are re-negotiating the link with your partner for instance and if EEE was already advertised, the very first time this function is called, it seems to be like we should skip the check, because phy_init_eee() should actually tell us if, as a result of a successful check, we should be setting EEE as something we advertise? Do you remember what was the logic behind this check when you added it? Thanks! -- Florian
Re: [PATCH net-next v3 1/2] tcp: randomize tcp timestamp offsets for each connection
From: Florian Westphal Date: Thu, 1 Dec 2016 11:32:06 +0100 > jiffies based timestamps allow for easy inference of number of devices > behind NAT translators and also makes tracking of hosts simpler. > > commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset") > added the main infrastructure that is needed for per-connection ts > randomization, in particular writing/reading the on-wire tcp header > format takes the offset into account so rest of stack can use normal > tcp_time_stamp (jiffies). > > So only two items are left: > - add a tsoffset for request sockets > - extend the tcp isn generator to also return another 32bit number >in addition to the ISN. > > Re-use of ISN generator also means timestamps are still monotonically > increasing for same connection quadruple, i.e. PAWS will still work. > > Includes fixes from Eric Dumazet. > > Signed-off-by: Florian Westphal > Acked-by: Eric Dumazet > Acked-by: Yuchung Cheng Applied.
Re: [PATCH net-next v3 2/2] tcp: allow to turn tcp timestamp randomization off
From: Florian Westphal Date: Thu, 1 Dec 2016 11:32:07 +0100 > Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple > flows, we can deduct the relative qdisc delays (think of fq pacing). > This should work even if we have one flow per remote peer." > > Having random per flow (or host) offsets doesn't allow that anymore so add > a way to turn this off. > > Suggested-by: Eric Dumazet > Signed-off-by: Florian Westphal > Acked-by: Yuchung Cheng Applied.
Re: arp_filter and IPv6 ND
On 2 December 2016 at 18:45, Hannes Frederic Sowa wrote: > next-hop-self attribute on your neighbor in that direction? BGP in > general doesn't lead to ND entry installs, protocols like IS-IS afair > can short circuit here. That's the whole problem, Linux does not think of ND or ARP as interface specific thing, but as global thing. ND and ARP will happily answer to query from any interface if any other interface has said IP. I'm not sure why the Loopback ended up in Cisco ND Cache, answer is either Cisco queried for it or Linux did gratuitous answer. I believe gratuitous. > Hmm, I would keep the Loopback announcements out of the BGP. It's extremely common way to do anycast, but not interesting for the topic at hand. > For enterprise and cloud stuff it is certainly very surprising, as some > isolations don't work as expected. OTOH it is really easy to build up > home networks and things are more plug and play. Can you give me practical example when the behaviour is desirable, my imagination is failing me. I'm not arguing, I just want to understand it, as I've never had the need myself. I've never ran into setup which needs it, but cursory googling shows several people having broken networks because of the behaviour. If it is needed, I'm sure it's esoteric setup and perhaps saner default would that extra sysctl config is needed to get this interface agnostic ARP/ND behaviour. > Some RFCs require that for some router implementations (CPE), on the > other hand weak end model in Linux was probably inherited by IPv4. The > addition of duplicate address detection (which of course only makes > sense in strong end systems) to IPv6, basically shows that IPv6 is more > or less designed to be a strong end system model. > > Anyway, a patch to suppress ndisc requests on those interfaces will > probably be accepted. Grand, not that I feel comfortable writing it. I'd rather see the whole suppression functionality moved to neighbour.c from being AFI specific. -- ++ytti
Re: [PATCH/RFC net-next 0/2] net/sched: cls_flower: Support matching on ICMP
On Fri, Dec 02, 2016 at 06:10:48PM +0100, Jiri Pirko wrote: > Fri, Dec 02, 2016 at 10:52:30AM CET, simon.hor...@netronome.com wrote: > >Hi, > > > >this series add supports for matching on ICMP type and code to cls_flower. > >This is modeled on existing support for matching on L4 ports. The updates > >to the dissector are intended to allow for code and storage re-use. > > Looks fine to me. Thanks! Thanks, I'll drop the RFC designation and repost.
Re: [PATCH/RFC iproute2/net-next 0/3] tc: flower: Support matching on ICMP
On Fri, Dec 02, 2016 at 06:10:20PM +0100, Jiri Pirko wrote: > Fri, Dec 02, 2016 at 10:59:42AM CET, simon.hor...@netronome.com wrote: > >Add support for matching on ICMP type and code to flower. This is modeled > >on existing support for matching on L4 ports. > > > >The second patch provided a minor cleanup which is in keeping with > >they style used in the last patch. > > > >This is marked as an RFC to match the same designation given to the > >corresponding kernel patches. > > Looks nice, I only have those 2 enum nitpicks. Thanks, I'll fix those and repost.
Re: [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.
From: "Rangankar, Manish" Date: Fri, 2 Dec 2016 07:00:39 + > Please consider applying the qed patches 1 & 2 to net-next. Ok, done.
Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support
Hi Richard, On 12/02/2016 03:58 AM, Richard Cochran wrote: > On Wed, Nov 30, 2016 at 11:17:38PM +0100, Richard Cochran wrote: >> On Wed, Nov 30, 2016 at 02:43:57PM -0600, Grygorii Strashko wrote: >>> Sry, but this is questionable - code for pps comes from TI internal >>> branches (SDK releases) where it survived for a pretty long time. > > Actually, there is a way to get an accurate PPS from the am335x. See > this recent thread: > > > https://www.mail-archive.com/linuxptp-devel@lists.sourceforge.net/msg01726.html > > That is the way to go, and so, please drop this present patch. > thanks for the links - it sounds very interesting. As I understood, people trying to enable PPS on am335 device with the goal to have PPS signal generated on some SoC pin and therefore they use DMtimer. Also, as i understood, the Timer Load Register (TLDR) is corrected once a second at each HW_TS_PUSH - as result, if freq was corrected during current sec there will be some HW_TS_PUSH generation jitter any way. Above solution is a bit complex for keystone 2 SoCs, as CPTS itself on these SoCs has output pin (ts_comp) which can be used for PPS signal generation. So, I think, similar results can be achieved by removing PPS correction code from cpts_ptp_adjfreq() and updating CPTS_TS_LOAD_VAL once a sec in cpts_overflow_check(). or I missed smth? -- regards, -grygorii
[PATCHv2 net-next 0/4] MV88E6390 batch two
This is the second batch of patches adding support for the MV88e6390. They are not sufficient to make it work properly. The mv88e6390 has a much expanded set of priority maps. Refactor the existing code, and implement basic support for the new device. Similarly, the monitor control register has been reworked. The mv88e6390 has something odd in its EDSA tagging implementation, which means it is not possible to use it. So we need to use DSA tagging. This is the first device with EDSA support where we need to use DSA, and the code does not support this. So two patches refactor the existing code. The two different register definitions are separated out, and using DSA on an EDSA capable device is added. v2: Add port prefix Add helper function for 6390 Add _IEEE_ into #defines Split monitor_ctrl into a number of separate ops. Remove 6390 code which is management, used in a later patch s/EGREES/EGRESS/. Broke up setup_port_dsa() and set_port_dsa() into a number of ops Andrew Lunn (4): net: dsa: mv88e6xxx: Implement mv88e6390 tag remap net: dsa: mv88e6xxx: Monitor and Management tables net: dsa: mv88e6xxx: Move the tagging protocol into info net: dsa: mv88e6xxx: Refactor CPU and DSA port setup drivers/net/dsa/mv88e6xxx/chip.c | 339 ++ drivers/net/dsa/mv88e6xxx/global1.c | 69 +++ drivers/net/dsa/mv88e6xxx/global1.h | 4 + drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 62 +-- drivers/net/dsa/mv88e6xxx/port.c | 181 ++ drivers/net/dsa/mv88e6xxx/port.h | 15 ++ 6 files changed, 583 insertions(+), 87 deletions(-) -- 2.10.2
[PATCHv2 net-next 2/4] net: dsa: mv88e6xxx: Monitor and Management tables
The mv88e6390 changes the monitor control register into the Monitor and Management control, which is an indirection register to various registers. Add ops to set the CPU port and the ingress/egress port for both register layouts, to global1 Signed-off-by: Andrew Lunn --- drivers/net/dsa/mv88e6xxx/chip.c | 68 +- drivers/net/dsa/mv88e6xxx/global1.c | 69 +++ drivers/net/dsa/mv88e6xxx/global1.h | 4 ++ drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 13 +++ 4 files changed, 145 insertions(+), 9 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index ff4bd2f74357..6e981bedd028 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -2747,15 +2747,17 @@ static int mv88e6xxx_g1_setup(struct mv88e6xxx_chip *chip) if (err) return err; - /* Configure the upstream port, and configure it as the port to which -* ingress and egress and ARP monitor frames are to be sent. -*/ - reg = upstream_port << GLOBAL_MONITOR_CONTROL_INGRESS_SHIFT | - upstream_port << GLOBAL_MONITOR_CONTROL_EGRESS_SHIFT | - upstream_port << GLOBAL_MONITOR_CONTROL_ARP_SHIFT; - err = mv88e6xxx_g1_write(chip, GLOBAL_MONITOR_CONTROL, reg); - if (err) - return err; + if (chip->info->ops->g1_set_cpu_port) { + err = chip->info->ops->g1_set_cpu_port(chip, upstream_port); + if (err) + return err; + } + + if (chip->info->ops->g1_set_egress_port) { + err = chip->info->ops->g1_set_egress_port(chip, upstream_port); + if (err) + return err; + } /* Disable remote management, and set the switch's DSA device number. */ err = mv88e6xxx_g1_write(chip, GLOBAL_CONTROL_2, @@ -3184,6 +3186,8 @@ static const struct mv88e6xxx_ops mv88e6085_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6095_ops = { @@ -3213,6 +3217,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6123_ops = { @@ -3227,6 +3233,8 @@ static const struct mv88e6xxx_ops mv88e6123_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6131_ops = { @@ -3242,6 +3250,8 @@ static const struct mv88e6xxx_ops mv88e6131_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6161_ops = { @@ -3257,6 +3267,8 @@ static const struct mv88e6xxx_ops mv88e6161_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6165_ops = { @@ -3271,6 +3283,8 @@ static const struct mv88e6xxx_ops mv88e6165_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6171_ops = { @@ -3287,6 +3301,8 @@ static const struct mv88e6xxx_ops mv88e6171_ops = { .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, .stats_get_stats = mv88e6095_stats_get_stats, + .g1_set_cpu_port = mv88e6095_g1_set_cpu_port, + .g1_set_egress_port = mv88e6095_g1_set_egress_port, }; static const struct mv88e6xxx_ops mv88e6172_ops = { @@ -3305,6 +3321,8 @@ static const struct mv88e6xxx_ops mv88e6172_ops = { .stats_get_sset_cou
[PATCHv2 net-next 3/4] net: dsa: mv88e6xxx: Move the tagging protocol into info
Older chips support a single tagging protocol, DSA. New chips support both DSA and EDSA, an enhanced version. Having both as an option changes the register layouts. Up until now, it has been assumed that if EDSA is supported, it will be used. Hence the register layout has been determined by which protocol should be used. However, mv88e6390 has a different implementation of EDSA, which requires we need to use the DSA tagging. Hence separate the selection of the protocol from the register layout. Signed-off-by: Andrew Lunn --- drivers/net/dsa/mv88e6xxx/chip.c | 33 +++-- drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 17 - 2 files changed, 31 insertions(+), 19 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 6e981bedd028..80efee6f5e16 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -2482,7 +2482,7 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port) PORT_CONTROL_USE_TAG | PORT_CONTROL_USE_IP | PORT_CONTROL_STATE_FORWARDING; if (dsa_is_cpu_port(ds, port)) { - if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_EDSA)) + if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA) reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA | PORT_CONTROL_FORWARD_UNKNOWN_MC; else @@ -2611,7 +2611,7 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port) /* Port Ethertype: use the Ethertype DSA Ethertype * value. */ - if (mv88e6xxx_has(chip, MV88E6XXX_FLAG_EDSA)) { + if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA) { err = mv88e6xxx_port_write(chip, port, PORT_ETH_TYPE, ETH_P_EDSA); if (err) @@ -3637,6 +3637,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 8, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6097, .ops = &mv88e6085_ops, }, @@ -3651,6 +3652,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 8, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6095, .ops = &mv88e6095_ops, }, @@ -3679,6 +3681,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6165, .ops = &mv88e6123_ops, }, @@ -3693,6 +3696,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6185, .ops = &mv88e6131_ops, }, @@ -3707,6 +3711,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6165, .ops = &mv88e6161_ops, }, @@ -3721,6 +3726,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_DSA, .flags = MV88E6XXX_FLAGS_FAMILY_6165, .ops = &mv88e6165_ops, }, @@ -3735,6 +3741,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_EDSA, .flags = MV88E6XXX_FLAGS_FAMILY_6351, .ops = &mv88e6171_ops, }, @@ -3749,6 +3756,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_EDSA, .flags = MV88E6XXX_FLAGS_FAMILY_6352, .ops = &mv88e6172_ops, }, @@ -3763,6 +3771,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = { .global1_addr = 0x1b, .age_time_coeff = 15000, .g1_irqs = 9, + .tag_protocol = DSA_TAG_PROTO_EDSA, .flags = MV88E6XXX_FLAGS_FAMIL
[PATCHv2 net-next 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
Older chips only support DSA tagging. Newer chips have both DSA and EDSA tagging. Refactor the code by adding port functions for setting the frame mode, egress mode, and if to forward unknown frames. This results in the helper mv88e6xxx_6065_family() becoming unused, so remove it. Signed-off-by: Andrew Lunn --- drivers/net/dsa/mv88e6xxx/chip.c | 205 ++ drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 20 drivers/net/dsa/mv88e6xxx/port.c | 118 +++ drivers/net/dsa/mv88e6xxx/port.h | 13 +++ 4 files changed, 308 insertions(+), 48 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 80efee6f5e16..d1e6a0760a75 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -677,11 +677,6 @@ static int mv88e6xxx_phy_ppu_write(struct mv88e6xxx_chip *chip, int addr, return err; } -static bool mv88e6xxx_6065_family(struct mv88e6xxx_chip *chip) -{ - return chip->info->family == MV88E6XXX_FAMILY_6065; -} - static bool mv88e6xxx_6095_family(struct mv88e6xxx_chip *chip) { return chip->info->family == MV88E6XXX_FAMILY_6095; @@ -2438,6 +2433,80 @@ static int mv88e6xxx_serdes_power_on(struct mv88e6xxx_chip *chip) return err; } +static int mv88e6xxx_setup_port_dsa(struct mv88e6xxx_chip *chip, int port, + int upstream_port) +{ + int err; + + err = chip->info->ops->port_set_frame_mode( + chip, port, MV88E6XXX_FRAME_MODE_DSA); + if (err) + return err; + + err = chip->info->ops->port_set_egress_unknowns( + chip, port, port == upstream_port); + if (err) + return err; + + if (chip->info->ops->port_set_ether_type) + return chip->info->ops->port_set_ether_type( + chip, port, ETH_P_EDSA); + + return 0; +} + +static int mv88e6xxx_setup_port_cpu(struct mv88e6xxx_chip *chip, int port) +{ + int err; + + switch (chip->info->tag_protocol) { + case DSA_TAG_PROTO_EDSA: + err = chip->info->ops->port_set_frame_mode( + chip, port, MV88E6XXX_FRAME_MODE_ETHERTYPE); + if (err) + return err; + + err = mv88e6xxx_port_set_egress_mode( + chip, port, PORT_CONTROL_EGRESS_ADD_TAG); + if (err) + return err; + + if (chip->info->ops->port_set_ether_type) + err = chip->info->ops->port_set_ether_type( + chip, port, ETH_P_EDSA); + break; + + case DSA_TAG_PROTO_DSA: + err = chip->info->ops->port_set_frame_mode( + chip, port, MV88E6XXX_FRAME_MODE_DSA); + if (err) + return err; + + err = mv88e6xxx_port_set_egress_mode( + chip, port, PORT_CONTROL_EGRESS_UNMODIFIED); + break; + default: + err = -EINVAL; + } + + if (err) + return err; + + return chip->info->ops->port_set_egress_unknowns(chip, port, true); +} + +static int mv88e6xxx_setup_port_normal(struct mv88e6xxx_chip *chip, int port) +{ + int err; + + err = chip->info->ops->port_set_frame_mode( + chip, port, MV88E6XXX_FRAME_MODE_NORMAL); + if (err) + return err; + + return chip->info->ops->port_set_egress_unknowns(chip, port, false); +} + static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port) { struct dsa_switch *ds = chip->ds; @@ -2473,44 +2542,25 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port) * If this is the upstream port for this switch, enable * forwarding of unknown unicasts and multicasts. */ - reg = 0; - if (mv88e6xxx_6352_family(chip) || mv88e6xxx_6351_family(chip) || - mv88e6xxx_6165_family(chip) || mv88e6xxx_6097_family(chip) || - mv88e6xxx_6095_family(chip) || mv88e6xxx_6065_family(chip) || - mv88e6xxx_6185_family(chip) || mv88e6xxx_6320_family(chip)) - reg = PORT_CONTROL_IGMP_MLD_SNOOP | + reg = PORT_CONTROL_IGMP_MLD_SNOOP | PORT_CONTROL_USE_TAG | PORT_CONTROL_USE_IP | PORT_CONTROL_STATE_FORWARDING; + err = mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg); + if (err) + return err; + if (dsa_is_cpu_port(ds, port)) { - if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA) - reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA | - PORT_CONTROL_FORWARD_UNKNOWN_MC; - else - reg |= PORT_CONTROL_DSA_TAG; - reg |= PORT_CONTROL_EGRESS_ADD_TAG | - PORT_CONTROL_FORWARD_UNKN
[PATCHv2 net-next 1/4] net: dsa: mv88e6xxx: Implement mv88e6390 tag remap
The mv88e6390 does not have the two registers to set the frame priority map. Instead it has an indirection registers for setting a number of different priority maps. Refactor the old code into an function, implement the mv88e6390 version, and use an op to call the right one. Signed-off-by: Andrew Lunn --- v2: Add port prefix Add helper function for 6390 Add _IEEE_ into #defines --- drivers/net/dsa/mv88e6xxx/chip.c | 37 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 12 +++ drivers/net/dsa/mv88e6xxx/port.c | 63 +++ drivers/net/dsa/mv88e6xxx/port.h | 2 ++ 4 files changed, 101 insertions(+), 13 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index ce2f7ff8066e..ff4bd2f74357 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -2617,20 +2617,10 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port) if (err) return err; } + } - /* Tag Remap: use an identity 802.1p prio -> switch -* prio mapping. -*/ - err = mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_0123, - 0x3210); - if (err) - return err; - - /* Tag Remap 2: use an identity 802.1p prio -> switch -* prio mapping. -*/ - err = mv88e6xxx_port_write(chip, port, PORT_TAG_REGMAP_4567, - 0x7654); + if (chip->info->ops->port_tag_remap) { + err = chip->info->ops->port_tag_remap(chip, port); if (err) return err; } @@ -3189,6 +3179,7 @@ static const struct mv88e6xxx_ops mv88e6085_ops = { .port_set_link = mv88e6xxx_port_set_link, .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6xxx_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3217,6 +3208,7 @@ static const struct mv88e6xxx_ops mv88e6097_ops = { .port_set_link = mv88e6xxx_port_set_link, .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6xxx_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3245,6 +3237,7 @@ static const struct mv88e6xxx_ops mv88e6131_ops = { .port_set_link = mv88e6xxx_port_set_link, .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6xxx_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3259,6 +3252,7 @@ static const struct mv88e6xxx_ops mv88e6161_ops = { .port_set_link = mv88e6xxx_port_set_link, .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6xxx_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3288,6 +3282,7 @@ static const struct mv88e6xxx_ops mv88e6171_ops = { .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6320_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3305,6 +3300,7 @@ static const struct mv88e6xxx_ops mv88e6172_ops = { .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay, .port_set_speed = mv88e6352_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6320_g1_stats_snapshot, .stats_get_sset_count = mv88e6095_stats_get_sset_count, .stats_get_strings = mv88e6095_stats_get_strings, @@ -3320,6 +3316,7 @@ static const struct mv88e6xxx_ops mv88e6175_ops = { .port_set_duplex = mv88e6xxx_port_set_duplex, .port_set_rgmii_delay = mv88e6352_port_set_rgmii_delay, .port_set_speed = mv88e6185_port_set_speed, + .port_tag_remap = mv88e6095_port_tag_remap, .stats_snapshot = mv88e6320_g1_stats_snapshot, .stats_get_sset_count
[PATCH net-next 0/2] net/sched: cls_flower: Support matching on ICMP
Hi, this series add supports for matching on ICMP type and code to cls_flower. This is modeled on existing support for matching on L4 ports. The updates to the dissector are intended to allow for code and storage re-use. Change since RFC: * Drop RFC designation after positive review from Jiri Simon Horman (2): flow dissector: ICMP support net/sched: cls_flower: Support matching on ICMP type and code drivers/net/bonding/bond_main.c | 6 +++-- include/linux/skbuff.h | 5 + include/net/flow_dissector.h| 50 ++--- include/uapi/linux/pkt_cls.h| 10 + net/core/flow_dissector.c | 34 +--- net/sched/cls_flow.c| 4 ++-- net/sched/cls_flower.c | 42 ++ 7 files changed, 141 insertions(+), 10 deletions(-) -- 2.7.0.rc3.207.g0ac5344
[PATCH net-next 1/2] flow dissector: ICMP support
Allow dissection of ICMP(V6) type and code. This re-uses transport layer port dissection code as although ICMP is not a transport protocol and their type and code are not ports this allows sharing of both code and storage. Signed-off-by: Simon Horman --- drivers/net/bonding/bond_main.c | 6 -- include/linux/skbuff.h | 5 + include/net/flow_dissector.h| 30 +++--- net/core/flow_dissector.c | 34 +++--- net/sched/cls_flow.c| 4 ++-- 5 files changed, 69 insertions(+), 10 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 8029dd4912b6..a6f75cfb2bf7 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -3181,7 +3181,8 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb, } else { return false; } - if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0) + if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && + proto >= 0 && !skb_flow_is_icmp_any(skb, proto)) fk->ports.ports = skb_flow_get_ports(skb, noff, proto); return true; @@ -3209,7 +3210,8 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb) return bond_eth_hash(skb); if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 || - bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23) + bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 || + flow_keys_are_icmp_any(&flow)) hash = bond_eth_hash(skb); else hash = (__force u32)flow.ports.ports; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 9c535fbccf2c..44a8f69a9198 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1094,6 +1094,11 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data, __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto, void *data, int hlen_proto); +static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto) +{ + return flow_protos_are_icmp_any(skb->protocol, ip_proto); +} + static inline __be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto) { diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index c4f31666afd2..8880025914e3 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -2,6 +2,7 @@ #define _NET_FLOW_DISSECTOR_H #include +#include #include #include @@ -89,10 +90,15 @@ struct flow_dissector_key_addrs { }; /** - * flow_dissector_key_tp_ports: - * @ports: port numbers of Transport header + * flow_dissector_key_ports: + * @ports: port numbers of Transport header or + * type and code of ICMP header + * ports: source (high) and destination (low) port numbers * src: source port number * dst: destination port number + * icmp: ICMP type (high) and code (low) + * type: ICMP type + * type: ICMP code */ struct flow_dissector_key_ports { union { @@ -101,6 +107,11 @@ struct flow_dissector_key_ports { __be16 src; __be16 dst; }; + __be16 icmp; + struct { + u8 type; + u8 code; + }; }; }; @@ -188,9 +199,22 @@ struct flow_keys_digest { void make_flow_keys_digest(struct flow_keys_digest *digest, const struct flow_keys *flow); +static inline bool flow_protos_are_icmp_any(__be16 n_proto, u8 ip_proto) +{ + return (n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP) || + (n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6); +} + +static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys) +{ + return flow_protos_are_icmp_any(keys->basic.n_proto, + keys->basic.ip_proto); +} + static inline bool flow_keys_have_l4(const struct flow_keys *keys) { - return (keys->ports.ports || keys->tags.flow_label); + return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) || + keys->tags.flow_label; } u32 flow_hash_from_keys(struct flow_keys *keys); diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 1eb6f949e5b2..0584b4bb4390 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -58,6 +58,28 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector, EXPORT_SYMBOL(skb_flow_dissector_init); /** + * skb_flow_get_be16 - extract be16 entity + * @skb: sk_buff to extract from + * @poff: offset to extract at + * @data: raw buffer pointer to the packet + * @hlen: packet header length + * + * The function will
[PATCH net-next 2/2] net/sched: cls_flower: Support matching on ICMP type and code
Support matching on ICMP type and code. Example usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ip parent : flower \ indev eth0 ip_proto icmp type 8 code 0 action drop tc filter add dev eth0 protocol ipv6 parent : flower \ indev eth0 ip_proto icmpv6 type 128 code 0 action drop Signed-off-by: Simon Horman --- include/net/flow_dissector.h | 24 ++-- include/uapi/linux/pkt_cls.h | 10 ++ net/sched/cls_flower.c | 42 ++ 3 files changed, 74 insertions(+), 2 deletions(-) diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index 8880025914e3..5540dfa18872 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -199,10 +199,30 @@ struct flow_keys_digest { void make_flow_keys_digest(struct flow_keys_digest *digest, const struct flow_keys *flow); +static inline bool flow_protos_are_icmpv4(__be16 n_proto, u8 ip_proto) +{ + return n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP; +} + +static inline bool flow_protos_are_icmpv6(__be16 n_proto, u8 ip_proto) +{ + return n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6; +} + static inline bool flow_protos_are_icmp_any(__be16 n_proto, u8 ip_proto) { - return (n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP) || - (n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6); + return flow_protos_are_icmpv4(n_proto, ip_proto) || + flow_protos_are_icmpv6(n_proto, ip_proto); +} + +static inline bool flow_basic_key_is_icmpv4(const struct flow_dissector_key_basic *basic) +{ + return flow_protos_are_icmpv4(basic->n_proto, basic->ip_proto); +} + +static inline bool flow_basic_key_is_icmpv6(const struct flow_dissector_key_basic *basic) +{ + return flow_protos_are_icmpv6(basic->n_proto, basic->ip_proto); } static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys) diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 86786d45ee66..58160fe80b80 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -457,6 +457,16 @@ enum { TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK, /* be16 */ TCA_FLOWER_KEY_ENC_UDP_DST_PORT,/* be16 */ TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK, /* be16 */ + + TCA_FLOWER_KEY_ICMPV4_CODE, /* u8 */ + TCA_FLOWER_KEY_ICMPV4_CODE_MASK,/* u8 */ + TCA_FLOWER_KEY_ICMPV4_TYPE, /* u8 */ + TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,/* u8 */ + TCA_FLOWER_KEY_ICMPV6_CODE, /* u8 */ + TCA_FLOWER_KEY_ICMPV6_CODE_MASK,/* u8 */ + TCA_FLOWER_KEY_ICMPV6_TYPE, /* u8 */ + TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,/* u8 */ + __TCA_FLOWER_MAX, }; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index e8dd09af0d0c..412efa7de226 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -355,6 +355,14 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = { [TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK] = { .type = NLA_U16 }, [TCA_FLOWER_KEY_ENC_UDP_DST_PORT] = { .type = NLA_U16 }, [TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK] = { .type = NLA_U16 }, + [TCA_FLOWER_KEY_ICMPV4_TYPE]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV4_TYPE_MASK] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV4_CODE]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV4_CODE_MASK] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV6_TYPE]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV6_TYPE_MASK] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV6_CODE]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_ICMPV6_CODE_MASK] = { .type = NLA_U8 }, }; static void fl_set_key_val(struct nlattr **tb, @@ -471,6 +479,20 @@ static int fl_set_key(struct net *net, struct nlattr **tb, fl_set_key_val(tb, &key->tp.dst, TCA_FLOWER_KEY_SCTP_DST, &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK, sizeof(key->tp.dst)); + } else if (flow_basic_key_is_icmpv4(&key->basic)) { + fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE, + &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK, + sizeof(key->tp.type)); + fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE, + &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK, + sizeof(key->tp.code)); + } else if (flow_basic_key_is_icmpv6(&key->basic)) { + fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE, + &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK, + sizeof(key->tp.type)); + fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE, +
Re: [flamebait] xdp, well meaning but pointless
On 02.12.2016 17:59, Tom Herbert wrote: > On Fri, Dec 2, 2016 at 3:54 AM, Hannes Frederic Sowa > wrote: >> On 02.12.2016 11:24, Jesper Dangaard Brouer wrote: >>> On Thu, 1 Dec 2016 13:51:32 -0800 >>> Tom Herbert wrote: >>> >> The technical plenary at last IETF on Seoul a couple of weeks ago was >> exclusively focussed on DDOS in light of the recent attack against >> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare >> presentation by Nick Sullivan >> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf) >> alluded to some implementation of DDOS mitigation. In particular, on >> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel" >>> >>> slide 14 >>> >> numbers he gave we're based in iptables+BPF and that was a whole >> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic >> and that's also when I introduced XDP to whole IETF :-) ). If that's >> the best we can do the Internet is in a world hurt. DDOS mitigation >> alone is probably a sufficient motivation to look at XDP. We need >> something that drops bad packets as quickly as possible when under >> attack, we need this to be integrated into the stack, we need it to be >> programmable to deal with the increasing savvy of attackers, and we >> don't want to be forced to be dependent on HW solutions. This is why >> we created XDP! >>> >>> The 1.2Mpps number is a bit low, but we are unfortunately in that >>> ballpark. >>> > I totally understand that. But in my reply to David in this thread I > mentioned DNS apex processing as being problematic which is actually > being referred in your linked slide deck on page 9 ("What do floods look > like") and the problematic of parsing DNS packets in XDP due to string > processing and looping inside eBPF. >>> >>> That is a weak argument. You do realize CloudFlare actually use eBPF to >>> do this exact filtering, and (so-far) eBPF for parsing DNS have been >>> sufficient for them. >> >> You are talking about this code on the following slides (I actually >> transcribed it for you here and disassembled): >> >> l0: ld #0x14 >> l1: ldxb 4*([0]&0xf) >> l2: add x >> l3: tax >> l4: ld [x+0] >> l5: jeq #0x7657861, l6, l13 >> l6: ld [x+4] >> l7: jeq #0x6d706c65, l8, l13 >> l8: ld [x+8] >> l9: jeq #0x3636f6d, l10, l13 >> l10:ldb [x+12] >> l11:jeq #0, l12, l13 >> l12:ret #0x1 >> l13:ret #0 >> >> You can offload this to u32 in hardware if that is what you want. >> >> The reason this works is because of netfilter, which allows them to >> dynamically generate BPF programs and insert and delete them from >> chains, do intersection or unions of them. >> >> If you have a freestanding program like in XDP the complexity space is a >> different one and not comparable to this at all. >> > I don't understand this comment about complexity especially in regards > to the idea of offloading u32 to hardware. Relying on hardware to do > anything always leads to more complexity than an equivalent SW > implementation for the same functionality. The only reason we ever use > a hardware mechanisms is if it gives *significantly* better > performance. If the performance difference isn't there then doing > things in SW is going to be the better path (as we see in XDP). I am just wondering why the u32 filter wasn't mentioned in their slide deck. If all what Cloudflare needs are those kind of matches, they are in fact actually easier to generate than an cBPF program. It is not a good example of how a real world DoS filter in XDP would look like. If you argue XDP as a C function hook that can call arbitrary code in the driver before submitting that to the networking stack, yep, that is not complex at all. Depending on how those modules will be maintained, they either end up in the kernel and will be updated on major changes or are 3rd party and people have to update them and also depend on the driver features. But this opens up a whole new can of worms also. I haven't really thought this through completely, but last time the patches were nack'ed with lots of strong opinions and I tended to agree with them. I am revisiting this position. Certainly you can build real-world DoS protection with this function pointer hook and C code in the driver. In this case a user space solution still has advantages because of maintainability, as e.g. with netmap or dpdk you are again decoupled from the in-kernel API/ABI and don't need to test, recompile etc. on each kernel upgrade. If the module ends up in the kernel, those problems might also disappear. For XDP+eBPF to provide a full DoS mitigation (protocol parsing, sampling and dropping) solution seems to be too complex for me because of the arguments I stated in my previous mail. Bye, Hannes
[PROBLEM]: IPv6 ICMP Ancillary Data IPV6_PKTINFO sendmsg() invalid argument
Hi All, First post here; any pointers or improvements for bug reporting are welcome. [1.] One line summary of the problem: sendmsg() with ancillary data of type IPV6_PKTINFO fails with invalid argument and further possible memory corruption. - [2.] Full description of the problem/report: As part of my project I am attempting to perform IPv6 neighbour solicitations to discover the MAC address of my local neighbours. In the included test code link-local src and dst addresses are used to send neighbour solicitations along with specifying the network interface in IPV6_PKTINFO ancillary data. In my test environment I have a Ubuntu 16.04 LTS Server/Router and a Ubuntu 16.04 LTS Desktop. These are both hosted on a Windows 10 Host as Virtual Guests. Router/Server Interfaces: 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:c5:0e:76 brd ff:ff:ff:ff:ff:ff inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::a00:27ff:fec5:e76/64 scope link valid_lft forever preferred_lft forever 3: eth1: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:0a:c1:80 brd ff:ff:ff:ff:ff:ff inet6 2001:db8:1000:baba::1/64 scope global valid_lft forever preferred_lft forever inet6 fe80::a00:27ff:fe0a:c180/64 scope link valid_lft forever preferred_lft forever Eth0 is set up for NAT to a physical IPv4 network but is unused in this test setup. Eth1 is IPv6 and is the radvd interface attatched to an internal virtual box network interface. In my client environment I have three network interfaces: 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: enp0s3: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:b9:42:47 brd ff:ff:ff:ff:ff:ff inet 192.168.100.121/24 brd 192.168.100.255 scope global dynamic enp0s3 valid_lft 79320sec preferred_lft 79320sec inet6 fe80::5523:e7df:d12c:681/64 scope link valid_lft forever preferred_lft forever 3: enp0s8: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:1a:a7:22 brd ff:ff:ff:ff:ff:ff 4: enp0s9: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:35:13:22 brd ff:ff:ff:ff:ff:ff inet6 2001:db8:1000:baba:b30a:ed85:8568:cc66/64 scope global noprefixroute dynamic valid_lft 86389sec preferred_lft 14389sec inet6 fe80::cffe:e7ef:a03a:34cb/64 scope link valid_lft forever preferred_lft forever enp0s3 is a bridged physical wired connection on the host to an ipv4 network with router / gateway, ipv6 is active but not used. enp0s8 is a bridged physical wifi connection on the host to another ipv4 subnet within the same network on the same router / gateway ipv6 is active but not used . enp0d9 is an internal virtual box network connected directly to another Ubuntu guest as an IPv6 stateless router (the neighbour I am solicitating) this interfaces is ipv4 enabled but not used. The expected behaviour is to use the sendmsg() system call and specify the outgoing interface for the neighbour sonication message using the IPV6_PKTINFO ancillary data as detailed in the RFC3542 links below. 6.1. Specifying/Receiving the Interface https://tools.ietf.org/html/rfc3542.html#section-6.1 6.7 Summary of Outgoing Interface Selection https://tools.ietf.org/html/rfc3542.html#section-6.7 The actual behaviour is for the sendmsg() system call to fail with errno set to 22, Invalid Argument. A note of interest is that if a perror() call is made to report this problem a seg fault in mallo.c is encountered. I have not performed further investigation. You can see this has been commented after the sendmsg() system call in the demo code. Warning: In my testing I have removed the IPV6_PKTINFO ancillary data and have found that if the IPv6 router is not available at boot by default enp0s3 is used to send the packet. I will send another e-mail further describing the behaviour with the subject: [PROBLEM]: IPv6 Link-Local Routing through wrong interface - [3.] Keywords (i.e., modules, networking, kernel): networking, ipv6, link-local, ancillary - [4.] Kernel version (from /proc/version): Linux version 4.4.0-51-generic (buildd@lcy01-08) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #72-Ubuntu SMP Thu Nov 24 18:29:54
Re: [WIP] net+mlx4: auto doorbell
On Wed, 2016-11-30 at 18:50 -0800, Eric Dumazet wrote: > On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote: > > > I simply suggest we try to queue the qdisc for further servicing as we > > do today, from net_tx_action(), but we might use a different bit, so > > that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED > > before we grab it from net_tx_action(), maybe 100 usec later (time to > > flush all skbs queued in napi_consume_skb() and maybe RX processing, > > since most NIC handle TX completion before doing RX processing from thei > > napi poll handler. > > > > Should be doable with few changes in __netif_schedule() and > > net_tx_action(), plus some control paths that will need to take care of > > the new bit at dismantle time, right ? > > Hmm... this is silly. Code already implements a different bit. > > qdisc_run() seems to run more often from net_tx_action(), I have to find > out why. After more analysis I believe TSQ was one of the bottlenecks. I prepared a patch series that helped my use cases.
Re: [PATCH net-next 1/2] samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers
On Fri, Dec 02, 2016 at 02:42:18AM -0800, Sargun Dhillon wrote: > This patch modifies test_current_task_under_cgroup_user. The test has > several helpers around creating a temporary environment for cgroup > testing, and moving the current task around cgroups. This set of > helpers can then be used in other tests. > > Signed-off-by: Sargun Dhillon lgtm Acked-by: Alexei Starovoitov