Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Tue, Dec 13, 2016 at 1:52 AM, Doug Ledford wrote: > On 12/9/2016 1:47 AM, Selvin Xavier wrote: >> This series introduces the RoCE driver for the Broadcom >> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. >> This driver is dependent on the bnxt_en NIC driver and is >> based on the bnxt_re branch in Doug's repository. bnxt_en changes >> required for this patch series is already available in this branch. >> >> I am preparing a git repository with these changes as per Jason's >> comment and will share the details later today. >> >> v1-> v2: >> * The license text in each file updated to reflect Dual license. >> * Makefile and Kconfig changes are pushed to the last patch >> * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder >> * Remove duplicate structure definitions from bnxt_re_hsi.h as >> it is available in the corresponding bnxt_en header file (bnxt_hsi.h) >> * Removed some unused code reported during code review. >> * Fixed few sparse warnings >> >> Doug, >> Please review and consider applying this to linux-rdma repository. > > There are outstanding review comments to be addressed still yet, and the > v2 patchset doesn't compile for me in 0day testing. I'm going to bounce > this one to 4.11. I made some quick on-the-surface static checkers etc rub on the new driver (Doug, I used the bits in your github bnxt_re branch), there are bunch (tons...) of smatch [1] and sparse [2] complaints along with few checkpatch [3] things too. So... addressing them before this goes upstream? BTW net-next's drivers/net/ethernet/broadcom/bnxt (where you put the pre patches for the RDMA driver) is pretty much clean of that Or. [1] smatch CHECK drivers/infiniband/hw/bnxt_re/bnxt_re_main.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_re_ib_verbs.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_re_debugfs.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_qplib_sp.c CHECK drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c drivers/infiniband/hw/bnxt_re/bnxt_qplib_sp.c:185 bnxt_qplib_del_sgid() warn: impossible condition '(*sgid_tbl->hw_id + index == -1) => (0-65535 == (-1))' drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:107 bnxt_qplib_rcfw_send_message() warn: test_bit() takes a bit number drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:116 bnxt_qplib_rcfw_send_message() warn: test_bit() takes a bit number drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:148 bnxt_qplib_rcfw_send_message() error: double lock 'irqsave:flags' drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:204 bnxt_qplib_rcfw_send_message() error: double unlock 'irqsave:flags' drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:121 bnxt_re_unregister_netdev() warn: variable dereferenced before check 'rdev' (see line 118) drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:140 bnxt_re_register_netdev() warn: variable dereferenced before check 'rdev' (see line 137) drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:155 bnxt_re_free_msix() warn: variable dereferenced before check 'rdev' (see line 152) drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:173 bnxt_re_request_msix() warn: variable dereferenced before check 'rdev' (see line 171) drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:172 bnxt_qplib_alloc_init_hwq() warn: unsigned '*elements' is never less than zero. drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:387 bnxt_qplib_deinit_rcfw() warn: test_bit() takes a bit number drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:488 bnxt_qplib_alloc_sgid_tbl() warn: double check that we're allocating correct size: 2 vs 4 drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:688 bnxt_qplib_alloc_dpi_tbl() warn: consider using resource_size() here drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:496 bnxt_qplib_init_rcfw() warn: test_bit() takes a bit number ./include/linux/mm.h:1385 stack_guard_page_end() warn: bitwise AND condition is false here ./include/linux/mm.h:1379 vma_growsup() warn: bitwise AND condition is false here drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:522 show_rev() info: ignoring unreachable code. drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:835 bnxt_re_dev_stop() warn: was && intended here instead of ||? drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1048 bnxt_re_ib_reg() warn: curly braces intended? drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1050 bnxt_re_ib_reg() warn: inconsistent indenting drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1190 bnxt_re_task() warn: curly braces intended? drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1639 __flush_sq() warn: variable dereferenced before check 'budget' (see line 1621) drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1852 bnxt_qplib_cq_process_res_ud() warn: shift has higher precedence than mask drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1861 bnxt_qplib_cq_process_res_ud() warn: inconsistent indenting drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:2015 bnxt_qplib_cq_process_terminal() warn: variable dereferen
Re: netlink: GPF in sock_sndtimeo
On 2016-12-09 23:40, Cong Wang wrote: > On Fri, Dec 9, 2016 at 8:13 PM, Cong Wang wrote: > > On Fri, Dec 9, 2016 at 3:01 AM, Richard Guy Briggs wrote: > >> On 2016-12-08 22:57, Cong Wang wrote: > >>> On Thu, Dec 8, 2016 at 10:02 PM, Richard Guy Briggs > >>> wrote: > >>> > I also tried to extend Cong Wang's idea to attempt to proactively > >>> > respond to a > >>> > NETLINK_URELEASE on the audit_sock and reset it, but ran into a locking > >>> > error > >>> > stack dump using mutex_lock(&audit_cmd_mutex) in the notifier callback. > >>> > Eliminating the lock since the sock is dead anways eliminates the error. > >>> > > >>> > Is it safe? I'll resubmit if this looks remotely sane. Meanwhile I'll > >>> > try to > >>> > get the test case to compile. > >>> > >>> It doesn't look safe, because 'audit_sock', 'audit_nlk_portid' and > >>> 'audit_pid' > >>> are updated as a whole and race between audit_receive_msg() and > >>> NETLINK_URELEASE. > >> > >> This is what I expected and why I originally added the mutex lock in the > >> callback... The dumps I got were bare with no wrapper identifying the > >> process context or specific error, so I'm at a bit of a loss how to > >> solve this (without thinking more about it) other than instinctively > >> removing the mutex. > > > > Netlink notifier can safely be converted to blocking one, I will send > > a patch. > > > > But I seriously doubt you really need NETLINK_URELEASE here, > > it adds nothing but overhead, b/c the netlink notifier is called on > > every netlink socket in the system, but for net exit path, that is > > relatively a slow path. > > > > Also, kauditd_send_skb() needs audit_cmd_mutex too. > > Please let me know what you think about the attached patch? > > Thanks! > commit a12b43ee814625933ff155c20dc863c59cfcf240 > Author: Cong Wang > Date: Fri Dec 9 17:56:42 2016 -0800 > > audit: close a race condition on audit_sock > > Signed-off-by: Cong Wang > > diff --git a/kernel/audit.c b/kernel/audit.c > index f1ca116..ab947d8 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -423,6 +423,8 @@ static void kauditd_send_skb(struct sk_buff *skb) > snprintf(s, sizeof(s), "audit_pid=%d reset", > audit_pid); > audit_log_lost(s); > audit_pid = 0; > + audit_nlk_portid = 0; > + sock_put(audit_sock); > audit_sock = NULL; > } else { > pr_warn("re-scheduling(#%d) write to > audit_pid=%d\n", > @@ -899,6 +901,9 @@ static int audit_receive_msg(struct sk_buff *skb, struct > nlmsghdr *nlh) > audit_log_config_change("audit_pid", new_pid, > audit_pid, 1); > audit_pid = new_pid; > audit_nlk_portid = NETLINK_CB(skb).portid; > + sock_hold(skb->sk); > + if (audit_sock) > + sock_put(audit_sock); > audit_sock = skb->sk; > } > if (s.mask & AUDIT_STATUS_RATE_LIMIT) { > @@ -1167,10 +1172,6 @@ static void __net_exit audit_net_exit(struct net *net) > { > struct audit_net *aunet = net_generic(net, audit_net_id); > struct sock *sock = aunet->nlsk; > - if (sock == audit_sock) { > - audit_pid = 0; > - audit_sock = NULL; > - } So how does this not leak memory leaving the sock refcount incremented by the registered audit daemon when that daemon shuts down normally? - RGB -- Richard Guy Briggs Kernel Security Engineering, Base Operating Systems, Red Hat Remote, Ottawa, Canada Voice: +1.647.777.2635, Internal: (81) 32635
Re: Synopsys Ethernet QoS
On 12/12/2016 5:25 PM, Niklas Cassel wrote: On 12/12/2016 11:19 AM, Joao Pinto wrote: Hi, Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu: Le 12/09/16 à 16:16, Andy Shevchenko a écrit : On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli wrote: It's kind of sad that customers of that IP (stmmac, amd-xgbe, sxgbe) did actually pioneer the upstreaming effort, but it is good to see people from Synopsys willing to fix that in the future. Wait, you would like to tell that we have more than 2 drivers for the same (okay, same vendor) IP?! It's better to unify them earlier, than have n+ copies. Unfortunately that is the case, see this email: https://www.mail-archive.com/netdev@vger.kernel.org/msg142796.html dwc_eth_qos and stmmac have some overlap. There seems to be work underway to unify these two to begin with. P.S. Though, I don't see how sxgbe got in the list. First glance on the code doesn't show similarities. Well samsung/sxgbe looks potentially similar to amd/xgbe, but that's just my cursory look at the code, it may very well be something entirely different. The descriptor formats just look suspiciously similar. Thank you for your inputs! Renaming seems to be a hotspot. I agree that maybe instead of renaming (breaking retro-compatibility as David and Florian mentioned), the best is to move stmmac to synopsys/ after merging *qos* and removing it. As Florian mentioned, git is capable of detecting folder restructured. @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the synopsys/*qos* driver would it be possible for you to make an initial analysis of what has to be merged into Stmmac? This way the development would speed-up. I can answer that question. I've sent out 12 patches to the stmmac driver (all patches are included in the current net-next tree), ok I have seen these patches applied, I had just a minor concern about the failure when DMA configuration is missing. In these years, I have noticed that, for this kind of HW, default DMA configuration is usually good to have a driver working. AHB, AXI parameters can be provided to have a best tuning or to fix know issues on some platforms. So IMO, we should relax the check with a warning. Please, consider that, the stmmac also supports very old MAC10/100 versions where the DMA configuration was often never passed. with these patches the stmmac driver works properly on Axis hardware (we use Synopsys GMAC 4.10a synthesized with multiple TX queues). perfect and thanks a lot for this effort. stmmac's DT binding has also been extended with properties that existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl. Since we have no problem updating the DTB together with the kernel, we will simply move to using the start using the stmmac driver, with stmmac's DT binding. However, I've noticed that NVIDIA has extended the DWC EQoS DT binding, I don't how easy it would be for them to switch to stmmac's DT binding. (Adding Stephen Warren to CC.) ok The reset sequence that Lars Persson was worried about is not an issue with the stmmac driver. thx for this check. There are some performance problems with the stmmac driver though: When running iperf3 with 3 streams: iperf3 -c 192.168.0.90 -P 3 -t 30 iperf3 -c 192.168.0.90 -P 3 -t 30 -R I get really bad fairness between the streams. Can you confirm you are using the 4.xxa version? This doesn't match with Alex's experiments on ARM platforms. This appears to be an issue with how TX IRQ coalescing is implemented in stmmac. Disabling TX IRQ coalescing in the stmmac driver makes the problem go away. this doesn't match with what we had seen but I am happy and open to review and accept new strategy. We have a local patch that implements TX IRQ coalescing in the dwceqos driver, and we don't see the same problem. please, if you have new patch add me on CC and we will review all together. Also netperf TCP_RR and UDP_RR gives really bad results compared to the dwceqos driver (without IRQ coalescing). 2000 transactions/sec vs 9000 transactions/sec. Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac gives the same performance. I guess it's a trade off, low CPU usage vs low latency, so I don't know how important TCP_RR/UDP_RR really is. The best thing would be to get a good working TX IRQ coalesce implementation with HR timers in stmmac. as said, welcome patches. Basically, the default tuning of coalescence parameters comes from ST platform experiences. I mean, we tuned to driver to have good performance and saving CPU on SH4 (UP) and ARM (SMP) systems. In these years, these default was accepted but, if today we need to change something welcome effort. On my side, I can try to perform some bench to see if I have regressions on not. Perhaps it should also be investigated if the RX interrupt watchdog timeout should have a lower default value. Do not expect to get many improvements to play with the HW watchdog due to the poor granulari
Re: stmmac DT property snps,axi_all
Hello Niklas, Alex, my fault and a step behind... Current code is OK when manage the AAL that, although it is passed from the axi structure, it is always used to program, for all the chip versions, the writable bit inside the DMA_BUS_MODE register. So I guess no extra patch is needed. Regards Peppe On 12/12/2016 3:18 PM, Giuseppe CAVALLARO wrote: Please Niklas when you send the patch, add my Acked-by: Giuseppe Cavallaro On 12/9/2016 10:53 AM, Niklas Cassel wrote: On 12/09/2016 10:20 AM, Niklas Cassel wrote: On 12/08/2016 02:36 PM, Alexandre Torgue wrote: Hi Niklas, On 12/05/2016 05:18 PM, Niklas Cassel wrote: Hello Giuseppe I'm trying to figure out what snps,axi_all is supposed to represent. It appears that the value is saved, but never used in the code. Looking at the register specification, I'm guessing that it represents Address-Aligned Beats, but there is already the property snps,aal for that. IMO, it is not useful. Indeed AXI_AAL is a read only bit (in AXI bus mode register) and reflects the aal bit in DMA bus register. As you know we use "snps,aal" to set aal bit in DMA bus register. So "snps,axi_all" entry seems useless. Let's see with Peppe. Ok, I see. GMAC and GMAC4 is different here. For GMAC4 AAL only exists in DMA_SYS_BUS_MODE. It's not reflected anywhere else. The code is correct in the driver. If snps,axi_all is just created for a read-only register, and it is currently never used in the code, while we have snps,aal, which is correct and works, I guess it should be ok to remove snps,axi_all. I can cook up a patch. Here we go :) I will send it as a real patch once net-next reopens. From defc01cb7c22611b89d9cf1fcae72544092bd62c Mon Sep 17 00:00:00 2001 From: Niklas Cassel Date: Fri, 9 Dec 2016 10:27:00 +0100 Subject: [PATCH net-next] net: stmmac: remove unused duplicate property snps,axi_all For core revision 3.x Address-Aligned Beats is available in two registers. The DT property snps,aal was created for AAL in the DMA bus register, which is a read/write bit. The DT property snps,axi_all was created for AXI_AAL in the AXI bus mode register, which is a read only bit that reflects the value of AAL in the DMA bus register. Since the value of snps,axi_all is never used in the driver, and since the property was created for a bit that is read only, it should be safe to remove the property. Signed-off-by: Niklas Cassel --- Documentation/devicetree/bindings/net/stmmac.txt | 1 - drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 1 - include/linux/stmmac.h| 1 - 3 files changed, 3 deletions(-) diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt index 128da752fec9..c3d2fd480a1b 100644 --- a/Documentation/devicetree/bindings/net/stmmac.txt +++ b/Documentation/devicetree/bindings/net/stmmac.txt @@ -65,7 +65,6 @@ Optional properties: - snps,wr_osr_lmt: max write outstanding req. limit - snps,rd_osr_lmt: max read outstanding req. limit - snps,kbbe: do not cross 1KiB boundary. -- snps,axi_all: align address - snps,blen: this is a vector of supported burst length. - snps,fb: fixed-burst - snps,mb: mixed-burst diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index 082cd48db6a7..60ba8993c650 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -121,7 +121,6 @@ static struct stmmac_axi *stmmac_axi_setup(struct platform_device *pdev) axi->axi_lpi_en = of_property_read_bool(np, "snps,lpi_en"); axi->axi_xit_frm = of_property_read_bool(np, "snps,xit_frm"); axi->axi_kbbe = of_property_read_bool(np, "snps,axi_kbbe"); -axi->axi_axi_all = of_property_read_bool(np, "snps,axi_all"); axi->axi_fb = of_property_read_bool(np, "snps,axi_fb"); axi->axi_mb = of_property_read_bool(np, "snps,axi_mb"); axi->axi_rb = of_property_read_bool(np, "snps,axi_rb"); diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h index 266dab9ad782..889e0e9a3f1c 100644 --- a/include/linux/stmmac.h +++ b/include/linux/stmmac.h @@ -103,7 +103,6 @@ struct stmmac_axi { u32 axi_wr_osr_lmt; u32 axi_rd_osr_lmt; bool axi_kbbe; -bool axi_axi_all; u32 axi_blen[AXI_BLEN]; bool axi_fb; bool axi_mb;
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Mon, Dec 12, 2016 at 8:52 PM, Selvin Xavier wrote: >> CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c >> CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c >> CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c >> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context >> imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit >> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context >> imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock > The above two are false positives, since locking and unlocking are > handled in two separate functions. This is a wrapper to lock/unlock > both SQ and RQ CQ locks. Functionally it is ok since > bnxt_qplib_unlock_cqs is called just after the critical section and > both locks are freed in order. I think we can ignore this warning. > > You can use __releases() and __acquires() macros to denote these cases for sparse.
Re: [PATCH V2 18/22] bnxt_re: Support for DCB
On Sat, Dec 10, 2016 at 7:20 PM, Or Gerlitz wrote: > On Fri, Dec 9, 2016 at 8:48 AM, Selvin Xavier > wrote: >> This patch queries the configured RoCE APP Priority on the host >> using the dcbnl API and programs the RoCE FW with the corresponding >> Traffic Class(es) for the priority. > >> +#define BNXT_RE_ROCE_V1_ETH_TYPE 0x8915 >> +#define BNXT_RE_ROCE_V2_PORT_NO4791 > > I believe these two are defined already, try # git grep on each under include Thanks Or for your comments. V2 port number is defined in ib_verbs.h. i will include this in the next patch set. v1 eth_type is not defined. All vendor drivers have their own definition. Thanks, Selvin
[PATCH net] virtio-net: correctly enable multiqueue
Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable multiqueue by default") blindly set the affinity instead of queues during probe which can cause a mismatch of #queues between guest and host. This patch fixes it by setting queues. Reported-by: Theodore Ts'o Tested-by: Theodore Ts'o Cc: Neil Horman Cc: Michael S. Tsirkin Fixes: 49000102901 ("virtio-net: enable multiqueue by default") Signed-off-by: Jason Wang --- drivers/net/virtio_net.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b425fa1..fe9f772 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev) goto free_unregister_netdev; } - virtnet_set_affinity(vi); + rtnl_lock(); + virtnet_set_queues(vi, vi->curr_queue_pairs); + rtnl_unlock(); /* Assume link up if device can't report link status, otherwise get link status from config. */ -- 2.7.4
Re: [PATCH V2 13/22] bnxt_re: Support QP verbs
On Mon, Dec 12, 2016 at 11:57 PM, Leon Romanovsky wrote: > It can help to review if you break this function into smaller pieces and > get rid of switch->switch->if construction. Thanks Leon. I will address this and your previous comments in v3 patch set.
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Mon, Dec 12, 2016 at 10:37 PM, Jason Gunthorpe wrote: > On Sat, Dec 10, 2016 at 11:06:58AM +0530, Selvin Xavier wrote: >> On Fri, Dec 9, 2016 at 12:17 PM, Selvin Xavier >> wrote: >> > I am preparing a git repository with these changes as per Jason's >> > comment and will share the details later today. >> >> Please use bnxt_re branch in this git repository. >> >> https://github.com/Broadcom/linux-rdma-nxt.git > > Why are you using __packed in bnxt_re_uverbs_abi.h ? that doesn't seem > necessary. It is a good idea to make sure all those structures are a > multiple of 64 bits (add explicit reserved fields), and make sure you > test 32 bit verbs as well. Will take care in v3. > > Why are you using debugfs just to export counters? Isn't the core code > counter framework good enough? I agree that some of the counters exported by this patch set, tx and rx bytes/pkts etc, can be exported through the core counters. i will try adding this support in v3, if not, will post as a separate patch. debugfs was introduced more for the future, in case any HW specific data needs to be displayed. As of now, it tracks only the count of resources( CQ/MR/QPs) active at any given point. So its ok to skip this patch from this series. > > Please try and avoid writing functions as defines (eg rdev_to_dev, > to_bnxt_re, SQE_PG, RCFW_CMDQ_COOKIE, PTR_PG etc) > Sure, will take care in v3. > There is something wrong with the tabs and spaces (see > https://github.com/Broadcom/linux-rdma-nxt/blob/03e23b087f7e86ea28656273994e065827210ce5/drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h) > > FWIW, I really dislike the column alignment style, it is so hard to > maintain.. This file contains the Macro defines for the FW/HW structures and are auto-generated. Some of these auto-generated defines are very long which makes the lines greater than 80 characters. I will fix whatever possible and include in v3 set. > > Jason Thanks, Selvin
[PATCH iproute2 1/2] tc: flower: Fix typo in the flower man page
Replace vlan_eth_type with vlan_ethtype. Fixes: 745d91726006 ("tc: flower: Introduce vlan support") Signed-off-by: Roi Dayan Reviewed-by: Hadar Hen Zion --- man/man8/tc-flower.8 | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 index 90fdfba..1cea54d 100644 --- a/man/man8/tc-flower.8 +++ b/man/man8/tc-flower.8 @@ -27,7 +27,7 @@ flower \- flow based traffic control filter .IR VID " | " .B vlan_prio .IR PRIORITY " | " -.BR vlan_eth_type " { " ipv4 " | " ipv6 " | " +.BR vlan_ethtype " { " ipv4 " | " ipv6 " | " .IR ETH_TYPE " } | " .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | " .IR IP_PROTO " } | { " @@ -87,7 +87,7 @@ Match on vlan tag priority. .I PRIORITY is an unsigned 3bit value in decimal format. .TP -.BI vlan_eth_type " VLAN_ETH_TYPE" +.BI vlan_ethtype " VLAN_ETH_TYPE" Match on layer three protocol. .I ETH_TYPE may be either -- 2.7.4
[PATCH iproute2 2/2] tc: tunnel_key: Add tc-tunnel_key man page to Makefile
To be installed with the other man pages. Fixes: d57639a475a9 ("tc/act_tunnel: Introduce ip tunnel action") Signed-off-by: Roi Dayan Reviewed-by: Amir Vadai --- man/man8/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/man/man8/Makefile b/man/man8/Makefile index de6f249..d4cb01a 100644 --- a/man/man8/Makefile +++ b/man/man8/Makefile @@ -17,6 +17,7 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 rtmon.8 rtpr.8 ss. tc-tcindex.8 tc-u32.8 tc-matchall.8 \ tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \ tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8 tc-ife.8 tc-skbmod.8 \ + tc-tunnel_key.8 \ devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8 all: $(TARGETS) -- 2.7.4
[PATCH iproute2 0/2] Man page fixes
Hi, The 2 patches are man page related only. First fixes a typo and second adding missing man page to the Makefile. Thanks Roi Dayan (2): tc: flower: Fix typo in the flower man page tc: tunnel_key: Add tc-tunnel_key man page to Makefile man/man8/Makefile| 1 + man/man8/tc-flower.8 | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) -- 2.7.4
Re: [PATCH net-next 00/27] Remove VLAN CFI bit abuse
On Tue, Dec 13, 2016 at 01:12:32AM +0100, Michał Mirosław wrote: > Dear NetDevs > > This series removes an abuse of VLAN CFI bit in Linux networking stack. > Currently Linux always clears the bit on outgoing traffic and presents > it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated). [...] I just noticed net-next got closed few days ago. I'll resend after it opens again. Nevertheless, review is appreciated. Best Regards, Michał Mirosław
Re: [PATCH v2] audit: use proper refcount locking on audit_sock
On 2016-12-12 15:18, Paul Moore wrote: > On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs wrote: > > Resetting audit_sock appears to be racy. > > > > audit_sock was being copied and dereferenced without using a refcount on > > the source sock. > > > > Bump the refcount on the underlying sock when we store a refrence in > > audit_sock and release it when we reset audit_sock. audit_sock > > modification needs the audit_cmd_mutex. > > > > See: https://lkml.org/lkml/2016/11/26/232 > > > > Thanks to Eric Dumazet and Cong Wang > > on ideas how to fix it. > > > > Signed-off-by: Richard Guy Briggs > > --- > > There has been a lot of change in the audit code that is about to go > > upstream to address audit queue issues. This patch is based on the > > source tree: git://git.infradead.org/users/pcmoore/audit#next > > --- > > kernel/audit.c | 34 -- > > 1 files changed, 28 insertions(+), 6 deletions(-) > > diff --git a/kernel/audit.c b/kernel/audit.c > > index f20eee0..439f7f3 100644 > > --- a/kernel/audit.c > > +++ b/kernel/audit.c > > @@ -452,7 +452,9 @@ static void auditd_reset(void) > > struct sk_buff *skb; > > > > /* break the connection */ > > + sock_put(audit_sock); > > audit_pid = 0; > > + audit_nlk_portid = 0; > > audit_sock = NULL; > > > > /* flush all of the retry queue to the hold queue */ > > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff > > *skb) > > if (rc >= 0) { > > consume_skb(skb); > > rc = 0; > > + } else { > > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) { > > I dislike the way you wrote this because instead of simply looking at > this to see if it correct I need to sort out all the bits and find out > if there are other error codes that could run afoul of this check ... > make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...). > Actually, since EPERM is 1, -EPERM (-1 in two's compliment is > 0x) is going to cause this to be true for pretty much any > value of rc, yes? Yes, you are correct. We need there a logical or on the results of each comparison to the return code rather than bit-wise or-ing the result codes together first to save a step. > > + mutex_lock(&audit_cmd_mutex); > > + auditd_reset(); > > + mutex_unlock(&audit_cmd_mutex); > > + } > > The code in audit#next handles netlink_unicast() errors in > kauditd_thread() and you are adding error handling code here in > kauditd_send_unicast_skb() ... that's messy. I don't care too much > where the auditd_reset() call is made, but let's only do it in one > function; FWIW, I originally put the error handling code in > kauditd_thread() because there was other error handling code that > needed to done in that scope so it resulted in cleaner code. Hmmm, I seem to remember it not returning the return code and I thought I had changed it to do so, but I see now that it was already there. Agreed, I needlessly duplicated that error handling. > Related, I see you are now considering ENOMEM to be a fatal condition, > that differs from the AUDITD_BAD macro in kauditd_thread(); this > difference needs to be reconciled. Also correct about -EPERM now that I check back to the intent of commit 32a1dbaece7e ("audit: try harder to send to auditd upon netlink failure") > Finally, you should update the comment header block for auditd_reset() > that it needs to be called with the audit_cmd_mutex held. Yup. > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, > > struct nlmsghdr *nlh) > > return -EACCES; > > } > > if (audit_pid && new_pid && > > - audit_replace(requesting_pid) != -ECONNREFUSED) > > { > > + (audit_replace(requesting_pid) & > > (-ECONNREFUSED|-EPERM|-ENOMEM))) { > > Do we simply want to treat any error here as fatal, and not just > ECONN/EPERM/ENOMEM? If not, let's come up with a single macro to > handle the fatal netlink_unicast() return codes so we have some chance > to keep things consistent in the future. I'll work through this before I post another patch... > paul moore - RGB -- Richard Guy Briggs Kernel Security Engineering, Base Operating Systems, Red Hat Remote, Ottawa, Canada Voice: +1.647.777.2635, Internal: (81) 32635
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Mon, Dec 12, 2016 at 10:24 PM, Jonathan Toppins wrote: > CHECK drivers/infiniband/hw/bnxtre/bnxt_re_debugfs.c > CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c > drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c:729:6: warning: symbol > 'bnxt_qplib_cleanup_pkey_tbl' was not declared. Should it be static? I will remove this warning in v3 patch set. > CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c > CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c > CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context > imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context > imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock The above two are false positives, since locking and unlocking are handled in two separate functions. This is a wrapper to lock/unlock both SQ and RQ CQ locks. Functionally it is ok since bnxt_qplib_unlock_cqs is called just after the critical section and both locks are freed in order. I think we can ignore this warning. > MODPOST 2 modules
Re: [PATCH v2] audit: use proper refcount locking on audit_sock
On 2016-12-12 12:10, Paul Moore wrote: > On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs wrote: > > Resetting audit_sock appears to be racy. > > > > audit_sock was being copied and dereferenced without using a refcount on > > the source sock. > > > > Bump the refcount on the underlying sock when we store a refrence in > > audit_sock and release it when we reset audit_sock. audit_sock > > modification needs the audit_cmd_mutex. > > > > See: https://lkml.org/lkml/2016/11/26/232 > > > > Thanks to Eric Dumazet and Cong Wang > > on ideas how to fix it. > > > > Signed-off-by: Richard Guy Briggs > > --- > > There has been a lot of change in the audit code that is about to go > > upstream to address audit queue issues. This patch is based on the > > source tree: git://git.infradead.org/users/pcmoore/audit#next > > --- > > kernel/audit.c | 34 -- > > 1 files changed, 28 insertions(+), 6 deletions(-) > > This is coming in pretty late for the v4.10 merge window, much later > than I would usually take things, but this is arguably important, and > (at first glance) relatively low risk - what testing have you done on > this? At this point, compile and boot, and I'm able to compile and run the supplied test code without any issues. > > diff --git a/kernel/audit.c b/kernel/audit.c > > index f20eee0..439f7f3 100644 > > --- a/kernel/audit.c > > +++ b/kernel/audit.c > > @@ -452,7 +452,9 @@ static void auditd_reset(void) > > struct sk_buff *skb; > > > > /* break the connection */ > > + sock_put(audit_sock); > > audit_pid = 0; > > + audit_nlk_portid = 0; > > audit_sock = NULL; > > > > /* flush all of the retry queue to the hold queue */ > > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff > > *skb) > > if (rc >= 0) { > > consume_skb(skb); > > rc = 0; > > + } else { > > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) { > > + mutex_lock(&audit_cmd_mutex); > > + auditd_reset(); > > + mutex_unlock(&audit_cmd_mutex); > > + } > > } > > > > return rc; > > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy) > > > > auditd = 0; > > if (AUDITD_BAD(rc, reschedule)) { > > + mutex_lock(&audit_cmd_mutex); > > auditd_reset(); > > + mutex_unlock(&audit_cmd_mutex); > > reschedule = 0; > > } > > } else > > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy) > > auditd = 0; > > if (AUDITD_BAD(rc, reschedule)) { > > kauditd_hold_skb(skb); > > + mutex_lock(&audit_cmd_mutex); > > auditd_reset(); > > + mutex_unlock(&audit_cmd_mutex); > > reschedule = 0; > > } else > > /* temporary problem (we hope), > > queue > > @@ -623,7 +635,9 @@ quick_loop: > > if (rc) { > > auditd = 0; > > if (AUDITD_BAD(rc, reschedule)) { > > + > > mutex_lock(&audit_cmd_mutex); > > auditd_reset(); > > + > > mutex_unlock(&audit_cmd_mutex); > > reschedule = 0; > > } > > > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, > > struct nlmsghdr *nlh) > > return -EACCES; > > } > > if (audit_pid && new_pid && > > - audit_replace(requesting_pid) != -ECONNREFUSED) > > { > > + (audit_replace(requesting_pid) & > > (-ECONNREFUSED|-EPERM|-ENOMEM))) { > > audit_log_config_change("audit_pid", > > new_pid, audit_pid, 0); > > return -EEXIST; > > } > > if (audit_enabled != AUDIT_OFF) > > audit_log_config_change("audit_pid", > > new_pid, audit_pid, 1); > > - audit_pid = new_pid; > > - audit_nlk_portid = NETLINK_CB(skb).portid; > > - audit_sock = skb->sk; > > - if (!new_pid) > > +
Re: [PATCH V2 for-next 00/11] Code improvements & fixes for HNS RoCE driver
On 11/15/2016 1:10 PM, Salil Mehta wrote: > This patchset introduces some code improvements and fixes > for the identified problems in the HNS RoCE driver. > > Lijun Ou (4): > IB/hns: Add the interface for querying QP1 > IB/hns: add self loopback for CM > IB/hns: Modify the condition of notifying hardware loopback > IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp() > > Salil Mehta (1): > IB/hns: Fix for Checkpatch.pl comment style errors > > Shaobo Xu (1): > IB/hns: Implement the add_gid/del_gid and optimize the GIDs > management > > Wei Hu (Xavier) (5): > IB/hns: Add code for refreshing CQ CI using TPTR > IB/hns: Optimize the logic of allocating memory using APIs > IB/hns: Modify the macro for the timeout when cmd process > IB/hns: Modify query info named port_num when querying RC QP > IB/hns: Change qpn allocation to round-robin mode. > > drivers/infiniband/hw/hns/hns_roce_alloc.c | 11 +- > drivers/infiniband/hw/hns/hns_roce_cmd.c|8 +- > drivers/infiniband/hw/hns/hns_roce_cmd.h|7 +- > drivers/infiniband/hw/hns/hns_roce_common.h |2 - > drivers/infiniband/hw/hns/hns_roce_cq.c | 17 +- > drivers/infiniband/hw/hns/hns_roce_device.h | 45 ++-- > drivers/infiniband/hw/hns/hns_roce_eq.c |6 +- > drivers/infiniband/hw/hns/hns_roce_hem.c|6 +- > drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 267 +-- > drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 17 +- > drivers/infiniband/hw/hns/hns_roce_main.c | 311 > +++ > drivers/infiniband/hw/hns/hns_roce_mr.c | 21 +- > drivers/infiniband/hw/hns/hns_roce_pd.c |5 +- > drivers/infiniband/hw/hns/hns_roce_qp.c |2 +- > 14 files changed, 363 insertions(+), 362 deletions(-) > Series applied, thanks. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On Tue, Dec 13, 2016 at 11:43:00AM +0800, Jason Wang wrote: > Thanks for reporting this issue. Looks like I blindly set the affinity > instead of queues during probe. Could you please try the following patch to > see if it works? This fixed things, thanks!! - Ted > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index b425fa1..fe9f772 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev) > goto free_unregister_netdev; > } > > - virtnet_set_affinity(vi); > + rtnl_lock(); > + virtnet_set_queues(vi, vi->curr_queue_pairs); > + rtnl_unlock(); > > /* Assume link up if device can't report link status, >otherwise get link status from config. */ > >
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Tue, Dec 13, 2016 at 5:22 AM, Doug Ledford wrote: > > There are outstanding review comments to be addressed still yet, and the > v2 patchset doesn't compile for me in 0day testing. I'm going to bounce > this one to 4.11. I will address all review comments and fix the 0day compilation error and post a v3 soon. Thanks, Selvin Xavier
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On 2016年12月13日 11:12, Theodore Ts'o wrote: On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote: That's unfortunate, of course. It could be a hypervisor or a guest kernel bug. ideas: - does host have mq capability? how many queues? - how about # of msix vectors? - after you send something on tx queues, are interrupts arriving on rx queues? - is problem rx or tx? set ip and arp manually and send a packet to known MAC, does it get there? Sorry, I don't know how to debug virtio-net. Given that it's in a cloud environment, I also can't set ip addresses manually, since ip addresses are set manually. If you can send me a patch, I'm happy to apply it and send you back results. I can say that I've had _zero_ problems using pretty much any kernel from 3.10 to 4.9 using Google Compute Engine. The commit I referenced caused things to stop working. So in terms of regression, this is definitely a regression, and it's definitely caused by commit 449000102901. Even if it is a hypervisor "bug", I'm pretty sure I know what Linus will say if I ask him to revert it. Linux kernels are expected to work around hardware bugs, and breaking users just because hardware is "broken" by some definition is generally not considered friendly, especially when has been working for years and years before some commit "fixed" things. I would very much like to work with you to fix it, but I will need your help, since virtio-net doesn't seem to print any informational during the boot sequence, and I don't know how the best way to debug it. Cheers, - Ted Thanks for reporting this issue. Looks like I blindly set the affinity instead of queues during probe. Could you please try the following patch to see if it works? diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index b425fa1..fe9f772 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev) goto free_unregister_netdev; } - virtnet_set_affinity(vi); + rtnl_lock(); + virtnet_set_queues(vi, vi->curr_queue_pairs); + rtnl_unlock(); /* Assume link up if device can't report link status, otherwise get link status from config. */
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On Mon, Dec 12, 2016 at 10:12:43PM -0500, Theodore Ts'o wrote: > On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote: > > > > That's unfortunate, of course. It could be a hypervisor or > > a guest kernel bug. ideas: > > - does host have mq capability? how many queues? > > - how about # of msix vectors? > > - after you send something on tx queues, > > are interrupts arriving on rx queues? > > - is problem rx or tx? > > set ip and arp manually and send a packet to known MAC, > > does it get there? > > Sorry, I don't know how to debug virtio-net. Given that it's in a > cloud environment, I also can't set ip addresses manually, since ip > addresses are set manually. OK, but you can send raw ethernet frames preseumably? > If you can send me a patch, I'm happy to apply it and send you back > results. Let's start with collecting stats from sysfs for this device. pls get features bitmap from there, pls get /proc/interrupts mappings, and pls use lspci to dump pci config. > I can say that I've had _zero_ problems using pretty much any kernel > from 3.10 to 4.9 using Google Compute Engine. The commit I referenced > caused things to stop working. So in terms of regression, this is > definitely a regression, and it's definitely caused by commit > 449000102901. Even if it is a hypervisor "bug", I'm pretty sure I > know what Linus will say if I ask him to revert it. Linux kernels are > expected to work around hardware bugs, and breaking users just because > hardware is "broken" by some definition is generally not considered > friendly, especially when has been working for years and years before > some commit "fixed" things. I'm open to limiting new features to virtio 1 mode just to avoid the hassle of dealing with legacy hypervisors. But let's not argue about it until we know the root cause. > > I would very much like to work with you to fix it, but I will need > your help, since virtio-net doesn't seem to print any informational > during the boot sequence, and I don't know how the best way to debug > it. > > Cheers, > > - Ted Let's start with debugging it like any PCI NIC. -- MST
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote: > > That's unfortunate, of course. It could be a hypervisor or > a guest kernel bug. ideas: > - does host have mq capability? how many queues? > - how about # of msix vectors? > - after you send something on tx queues, > are interrupts arriving on rx queues? > - is problem rx or tx? > set ip and arp manually and send a packet to known MAC, > does it get there? Sorry, I don't know how to debug virtio-net. Given that it's in a cloud environment, I also can't set ip addresses manually, since ip addresses are set manually. If you can send me a patch, I'm happy to apply it and send you back results. I can say that I've had _zero_ problems using pretty much any kernel from 3.10 to 4.9 using Google Compute Engine. The commit I referenced caused things to stop working. So in terms of regression, this is definitely a regression, and it's definitely caused by commit 449000102901. Even if it is a hypervisor "bug", I'm pretty sure I know what Linus will say if I ask him to revert it. Linux kernels are expected to work around hardware bugs, and breaking users just because hardware is "broken" by some definition is generally not considered friendly, especially when has been working for years and years before some commit "fixed" things. I would very much like to work with you to fix it, but I will need your help, since virtio-net doesn't seem to print any informational during the boot sequence, and I don't know how the best way to debug it. Cheers, - Ted
Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
On Mon, Dec 12, 2016 at 06:33:43PM -0500, Theodore Ts'o wrote: > Hi, > > I was doing a last minute regression test of the ext4 tree before > sending a pull request to Linus, which I do using gce-xfstests[1], and > I found that using networking was broken on GCE on linux-next. I was > using next-20161209, and after bisecting things, I narrowed down the > commit which causing things to break to commit 449000102901: > "virtio-net: enable multiqueue by default". Reverting this commit on > top of next-20161209 fixed the problem. > > [1] http://thunk.org/gce-xfstests > > You can reproduce the problem for building the kernel for Google > Compute Engine --- I use a config such as this [2], and then try to > boot a kernel on a VM. The way I do this involves booting a test > appliance and then kexec'ing into the kernel to be tested[3], using a > 2cpu configuration. (GCE machine type: n1-standard-2) > > [2] > https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9 > [3] > https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md > > You can then take a look at serial console using a command such as > "gcloud compute instances get-serial-port-output ", and > you will get something like this (see attached). The important bit is > that the dhclient command is completely failing to be able to get a > response from the network, from which I deduce that apparently that > either networking send or receive or both seem to be badly affected by > the commit in question. > > Please let me know if there's anything I can do to help you debug this > further. > > Cheers, > > - Ted That's unfortunate, of course. It could be a hypervisor or a guest kernel bug. ideas: - does host have mq capability? how many queues? - how about # of msix vectors? - after you send something on tx queues, are interrupts arriving on rx queues? - is problem rx or tx? set ip and arp manually and send a packet to known MAC, does it get there? > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version > 4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian > 4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016 > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: > root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0 > fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 > fstestapi=1.3 > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: > Supporting XSAVE feature 0x001: 'x87 floating point registers' > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: > Supporting XSAVE feature 0x002: 'SSE registers' > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: > Supporting XSAVE feature 0x004: 'AVX registers' > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: > xstate_offset[2]: 576, xstate_sizes[2]: 256 > Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled > xstate features 0x7, context size is 832 bytes, using 'standard' format. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel > Variables... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File > System... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File > System... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File > System. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File > System. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel > Variables. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static > Device Nodes in /dev. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device > Manager... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device > Manager. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all > Devices. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for > Complete Device Initialization... > Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: > clean, 56268/655360 files, 357439/2620928 blocks > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check > on Root Device. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and > Kernel File Systems... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and > Kernel File Systems. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to > make systemd work better on Debian. > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random > Seed... > Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems > (Pre). >
Re: [PATCH 0/6] USB support for Broadcom NSP SoC
On 11/09/2016 01:33 AM, Yendapally Reddy Dhananjaya Reddy wrote: > This patch set contains the usb support for Broadcom NSP SoC. > The usb phy is connected through mdio interface. The mdio interface > can be used to access either internal phys or external phys using a > multiplexer. > > The first patch provides the documentation details for mdio-mux and > second patch provides the documentation details for usb3 phy. The third > patch contains the mdio-mux support and fourth patch contains the > changes to the mdio bus driver. > > The fifth patch provides the phy driver and sixth patch provides the > enable method for usb. > > This patch series has been tested on NSP bcm958625HR board. > This patch series is based on v4.9.0-rc1 and is available from github- > repo: https://github.com/Broadcom/cygnus-linux.git > branch:nsp-usb-v1 Can you resubmit this patch series with the feedback from Andrew, Rob and Scott addressed? Thanks! -- Florian
Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string
On 2016/12/12 22:21, Rob Herring wrote: > On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li wrote: >> Hi Rob, >> >> On 2016/12/10 6:35, Rob Herring wrote: >>> On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote: The "hix5hd2" is SoC name, add the generic ethernet driver name. The "hisi-gemac-v1" is the basic version and "hisi-gemac-v2" adds the SG/TXCSUM/TSO/UFO features. Signed-off-by: Dongpo Li --- .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt| 9 +++-- drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 15 +++ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt index 75d398b..75920f0 100644 --- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt +++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt @@ -1,7 +1,12 @@ Hisilicon hix5hd2 gmac controller Required properties: -- compatible: should be "hisilicon,hix5hd2-gmac". +- compatible: should contain one of the following SoC strings: +* "hisilicon,hix5hd2-gemac" +* "hisilicon,hi3798cv200-gemac" +and one of the following version string: +* "hisilicon,hisi-gemac-v1" +* "hisilicon,hisi-gemac-v2" >>> >>> What combinations are valid? I assume both chips don't have both v1 and >>> v2. 2 SoCs and 2 versions so far, I don't think there is much point to >>> have the v1 and v2 compatible strings. >>> >> The v1 and v2 are generic MAC compatible strings, many HiSilicon SoCs may >> use the same MAC version. For example, >> hix5hd2, hi3716cv200 SoCs use the v1 MAC version, >> hi3798cv200, hi3516a SoCs use the v2 MAC version, >> and there may be more SoCs added in future. >> So I think the generic compatible strings are okay here. >> Should I add the hi3716cv200, hi3516a SoCs compatible here? > > Yes. > >> Do you have any good advice? >> - reg: specifies base physical address(s) and size of the device registers. The first region is the MAC register base and size. The second region is external interface control register. @@ -20,7 +25,7 @@ Required properties: Example: gmac0: ethernet@f984 { -compatible = "hisilicon,hix5hd2-gmac"; +compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1"; >>> >>> You can't just change compatible strings. >>> >> Okay, maybe I should name all the compatible string with the suffix "-gmac" >> instead of >> "-gemac". This can keep the compatible strings with the same suffix. Is this >> okay? >> Can I just add the generic compatible string without changing the SoCs >> compatible string? >> Like following: >> gmac0: ethernet@f984 { >> - compatible = "hisilicon,hix5hd2-gmac"; >> + compatible = "hisilicon,hix5hd2-gmac", >> "hisilicon,hisi-gmac-v1"; > > Yes, this is fine. > Many thanks for your advice. As the patch series have been applied to net-next branch, in which way should I commit this compatible fix? Should I send a new patch with "Fixes: "? Regards, Dongpo .
RE: [PATCH net-next 09/27] cxgb4: use __vlan_hwaccel helpers
> This also initializes vlan_proto field. > > Signed-off-by: Michał Mirosław Acked-by: Steve Wise
Re: [PATCH net-next 2/5] liquidio VF vxlan
Or Gerlitz wrote on Sat [2016-Dec-10 05:46:13 -0800]: > On Fri, Dec 9, 2016 at 12:42 AM, Vatsavayi, Raghu > wrote: > >> From: Or Gerlitz [mailto:gerlitz...@gmail.com] > >> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi > >> wrote: > > >>> Adds VF vxlan offload support. > > >> What's the use case for that? a VM running a VTEP, isn't that part needs to > >> run @ the host? > > > Our HW can support offloads for VF which is required if we load it on > > Hypervisor. > > > + nctrl.ncmd.u64 = 0; > + nctrl.ncmd.s.cmd = command; > + nctrl.ncmd.s.more = vxlan_cmd_bit; > + nctrl.ncmd.s.param1 = vxlan_port; > + nctrl.iq_no = lio->linfo.txpciq[0].s.q_no; > + nctrl.wait_time = 100; > + nctrl.netpndev = (u64)netdev; > + nctrl.cb_fn = liquidio_link_ctrl_cmd_completion; > + > + ret = octnet_send_nic_ctrl_pkt(lio->oct_dev, &nctrl); > > 1. What happens if > 1 one VF runs this code, each with different > port? who wins? is the result well defined? There is neither race nor contention, but all VFs "win" (meaning they get what they ask for) because the VxLAN UDP port can be set on a per VF basis. So the result of the above case is: for VFs running in the host (not in VMs), each VF interface is a VTEP with a distinct UDP port for VxLAN. > 2. does octnet_send_nic_ctrl_pkt() goes to sleep? this is disallowed here No, it does not go to sleep. Felix
Re: [PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI
I assume you want to merge this together with the rest of you series, so Acked-by: Ralf Baechle Cheers, Ralf On Tue, Dec 13, 2016 at 01:12:39AM +0100, Michał Mirosław wrote: > Date: Tue, 13 Dec 2016 01:12:39 +0100 (CET) > From: Michał Mirosław > To: netdev@vger.kernel.org > Cc: Ralf Baechle , "open list:MIPS" > > Subject: [PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit > handling from VLAN_TCI > Content-Type: text/plain; charset=UTF-8 > > Signed-off-by: Michał Mirosław > --- > arch/mips/net/bpf_jit.c | 20 > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c > index 49a2e22..4b12b5d 100644 > --- a/arch/mips/net/bpf_jit.c > +++ b/arch/mips/net/bpf_jit.c > @@ -1138,19 +1138,23 @@ static int build_body(struct jit_ctx *ctx) > emit_load(r_A, r_skb, off, ctx); > break; > case BPF_ANC | SKF_AD_VLAN_TAG: > - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: > ctx->flags |= SEEN_SKB | SEEN_A; > BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, > vlan_tci) != 2); > off = offsetof(struct sk_buff, vlan_tci); > emit_half_load(r_s0, r_skb, off, ctx); > - if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) { > - emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, > ctx); > - } else { > - emit_andi(r_A, r_s0, VLAN_TAG_PRESENT, ctx); > - /* return 1 if present */ > - emit_sltu(r_A, r_zero, r_A, ctx); > - } > +#ifdef VLAN_TAG_PRESENT > + emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx); > +#endif > + break; > + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: > + ctx->flags |= SEEN_SKB | SEEN_A; > + emit_load_byte(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET(), > ctx); > + if (PKT_VLAN_PRESENT_BIT) > + emit_srl(r_A, r_A, PKT_VLAN_PRESENT_BIT, ctx); > + emit_andi(r_A, r_s0, 1, ctx); > + /* return 1 if present */ > + emit_sltu(r_A, r_zero, r_A, ctx); > break; > case BPF_ANC | SKF_AD_PKTTYPE: > ctx->flags |= SEEN_SKB; > -- > 2.10.2
Soft lockup in tc_classify
Hi All, sorry for the spam, the first time was sent with html part and was rejected. We observed an issue where a classifier instance next member is pointing back to itself, causing a CPU soft lockup. We found it by running traffic on many udp connections and then adding a new flower rule using tc. We added a quick workaround to verify it: In tc_classify: for (; tp; tp = rcu_dereference_bh(tp->next)) { int err; + if (tp == tp->next) + RCU_INIT_POINTER(tp->next, NULL); We also had a print here showing tp->next is pointing to tp. With this workaround we are not hitting the issue anymore. We are not sure we fully understand the mechanism here - with the rtnl and rcu locks. We'll appreciate your help solving this issue. Thanks, Shahar
[PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers
This removes assumption than vlan_tci != 0 when tag is present. Signed-off-by: Michał Mirosław --- net/bridge/br_netfilter_hooks.c | 14 -- net/bridge/br_private.h | 2 +- net/bridge/br_vlan.c| 6 +++--- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index b12501a..2cc0747 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -682,10 +682,8 @@ static int br_nf_push_frag_xmit(struct net *net, struct sock *sk, struct sk_buff return 0; } - if (data->vlan_tci) { - skb->vlan_tci = data->vlan_tci; - skb->vlan_proto = data->vlan_proto; - } + if (data->vlan_proto) + __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci); skb_copy_to_linear_data_offset(skb, -data->size, data->mac, data->size); __skb_push(skb, data->encap_size); @@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff data = this_cpu_ptr(&brnf_frag_data_storage); - data->vlan_tci = skb->vlan_tci; - data->vlan_proto = skb->vlan_proto; + if (skb_vlan_tag_present(skb)) { + data->vlan_tci = skb->vlan_tci; + data->vlan_proto = skb->vlan_proto; + } else + data->vlan_proto = 0; + data->encap_size = nf_bridge_encap_header_len(skb); data->size = ETH_HLEN + data->encap_size; diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 8ce621e..2efbdaf 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -819,7 +819,7 @@ static inline int br_vlan_get_tag(const struct sk_buff *skb, u16 *vid) int err = 0; if (skb_vlan_tag_present(skb)) { - *vid = skb_vlan_tag_get(skb) & VLAN_VID_MASK; + *vid = skb_vlan_tag_get_id(skb); } else { *vid = 0; err = -EINVAL; diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index b6de4f4..ef94664 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -377,7 +377,7 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br, } if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED) - skb->vlan_tci = 0; + __vlan_hwaccel_clear_tag(skb); out: return skb; } @@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge *br, __vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid); else /* Priority-tagged Frame. -* At this point, We know that skb->vlan_tci had -* VLAN_TAG_PRESENT bit and its VID field was 0x000. +* At this point, We know that skb->vlan_tci VID +* field was 0x000. * We update only VID field and preserve PCP field. */ skb->vlan_tci |= pvid; -- 2.10.2
[PATCH net-next 09/27] cxgb4: use __vlan_hwaccel helpers
This also initializes vlan_proto field. Signed-off-by: Michał Mirosław --- drivers/infiniband/hw/cxgb4/cm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c index f1510cc..66a3d39 100644 --- a/drivers/infiniband/hw/cxgb4/cm.c +++ b/drivers/infiniband/hw/cxgb4/cm.c @@ -3899,7 +3899,7 @@ static int rx_pkt(struct c4iw_dev *dev, struct sk_buff *skb) } else { vlan_eh = (struct vlan_ethhdr *)(req + 1); iph = (struct iphdr *)(vlan_eh + 1); - skb->vlan_tci = ntohs(cpl->vlan); + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), ntohs(cpl->vlan)); } if (iph->version != 0x4) -- 2.10.2
[PATCH net-next 11/27] sky2: use __vlan_hwaccel helpers
Signed-off-by: Michał Mirosław --- drivers/net/ethernet/marvell/sky2.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index b60ad0e..bcd20e0 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2485,13 +2485,11 @@ static struct sk_buff *receive_copy(struct sky2_port *sky2, skb->ip_summed = re->skb->ip_summed; skb->csum = re->skb->csum; skb_copy_hash(skb, re->skb); - skb->vlan_proto = re->skb->vlan_proto; - skb->vlan_tci = re->skb->vlan_tci; + __vlan_hwaccel_copy_tag(skb, re->skb); pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr, length, PCI_DMA_FROMDEVICE); - re->skb->vlan_proto = 0; - re->skb->vlan_tci = 0; + __vlan_hwaccel_clear_tag(re->skb); skb_clear_hash(re->skb); re->skb->ip_summed = CHECKSUM_NONE; skb_put(skb, length); -- 2.10.2
[PATCH net-next 21/27] net/bpf_jit: PPC: split VLAN_PRESENT bit handling from VLAN_TCI
Signed-off-by: Michał Mirosław --- arch/powerpc/net/bpf_jit_comp.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c index 7e706f3..22ae63f 100644 --- a/arch/powerpc/net/bpf_jit_comp.c +++ b/arch/powerpc/net/bpf_jit_comp.c @@ -377,18 +377,19 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, hash)); break; case BPF_ANC | SKF_AD_VLAN_TAG: - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2); - BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000); PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff, vlan_tci)); - if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) { - PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT); - } else { - PPC_ANDI(r_A, r_A, VLAN_TAG_PRESENT); - PPC_SRWI(r_A, r_A, 12); - } +#ifdef VLAN_TAG_PRESENT + PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT); +#endif + break; + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: + PPC_LBZ_OFFS(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET()); + if (PKT_VLAN_PRESENT_BIT) + PPC_SRWI(r_A, r_A, PKT_VLAN_PRESENT_BIT); + PPC_ANDI(r_A, r_A, 1); break; case BPF_ANC | SKF_AD_QUEUE: BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, -- 2.10.2
[PATCH net-next 24/27] bpf_test: prepare for VLAN_TAG_PRESENT removal
Signed-off-by: Michał Mirosław --- lib/test_bpf.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 0362da0..00d3450 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -691,8 +691,13 @@ static struct bpf_test tests[] = { CLASSIC, { }, { +#ifdef VLAN_TAG_PRESENT { 1, SKB_VLAN_TCI & ~VLAN_TAG_PRESENT }, { 10, SKB_VLAN_TCI & ~VLAN_TAG_PRESENT } +#else + { 1, SKB_VLAN_TCI }, + { 10, SKB_VLAN_TCI } +#endif }, }, { @@ -705,8 +710,13 @@ static struct bpf_test tests[] = { CLASSIC, { }, { +#ifdef VLAN_TAG_PRESENT { 1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }, { 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) } +#else + { 1, SKB_VLAN_PRESENT }, + { 10, SKB_VLAN_PRESENT } +#endif }, }, { @@ -4773,8 +4783,13 @@ static struct bpf_test tests[] = { CLASSIC, { }, { +#ifdef VLAN_TAG_PRESENT { 1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }, { 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) } +#else + { 1, SKB_VLAN_PRESENT }, + { 10, SKB_VLAN_PRESENT } +#endif }, .fill_helper = bpf_fill_maxinsns6, }, -- 2.10.2
[PATCH net-next 15/27] ipv4/tunnel: use __vlan_hwaccel helpers
Signed-off-by: Michał Mirosław --- net/ipv4/ip_tunnel_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index fed3d29..0004a54 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -120,7 +120,7 @@ int __iptunnel_pull_header(struct sk_buff *skb, int hdr_len, } skb_clear_hash_if_not_l4(skb); - skb->vlan_tci = 0; + __vlan_hwaccel_clear_tag(skb); skb_set_queue_mapping(skb, 0); skb_scrub_packet(skb, xnet); -- 2.10.2
[PATCH net-next 00/27] Remove VLAN CFI bit abuse
Dear NetDevs This series removes an abuse of VLAN CFI bit in Linux networking stack. Currently Linux always clears the bit on outgoing traffic and presents it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated). This uses a new vlan_present bit in struct skbuff, and removes an assumption that vlan_proto != 0 when VLAN tag is present. As I can't test most of the driver changes, please look at them carefully. The series is supposed to be bisect-friendly and that requires temporary insertion of #define VLAN_TAG_PRESENT in BPF code to be able to split JIT changes per architecture. Best Regards, Michał Mirosław --- Michał Mirosław (27): net/vlan: introduce __vlan_hwaccel_clear_tag() helper net/vlan: introduce __vlan_hwaccel_copy_tag() helper ibmvnic: fix accelerated VLAN handling qlcnic: remove assumption that vlan_tci != 0 i40iw: remove use of VLAN_TAG_PRESENT cnic: remove use of VLAN_TAG_PRESENT gianfar: remove use of VLAN_TAG_PRESENT net/hyperv: remove use of VLAN_TAG_PRESENT cxgb4: use __vlan_hwaccel helpers benet: use __vlan_hwaccel helpers sky2: use __vlan_hwaccel helpers net/core: use __vlan_hwaccel helpers bridge: use __vlan_hwaccel helpers 8021q: use __vlan_hwaccel helpers ipv4/tunnel: use __vlan_hwaccel helpers nfnetlink/queue: use __vlan_hwaccel helpers OVS: remove assumptions about VLAN_TAG_PRESENT bit net/skbuff: add macros for VLAN_PRESENT bit net/bpf_jit: ARM: split VLAN_PRESENT bit handling from VLAN_TCI net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI net/bpf_jit: PPC: split VLAN_PRESENT bit handling from VLAN_TCI net/bpf_jit: SPARC: split VLAN_PRESENT bit handling from VLAN_TCI net/bpf: split VLAN_PRESENT bit handling from VLAN_TCI bpf_test: prepare for VLAN_TAG_PRESENT removal net: remove VLAN_TAG_PRESENT net/hyperv: enable passing of VLAN.CFI bit net/vlan: remove unused #define HAVE_VLAN_GET_TAG Documentation/networking/openvswitch.txt | 14 -- arch/arm/net/bpf_jit_32.c| 17 --- arch/mips/net/bpf_jit.c | 17 +++ arch/powerpc/net/bpf_jit_comp.c | 14 +++--- arch/sparc/net/bpf_jit_comp.c| 14 +++--- drivers/infiniband/hw/cxgb4/cm.c | 2 +- drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 ++-- drivers/net/ethernet/broadcom/cnic.c | 2 +- drivers/net/ethernet/emulex/benet/be_main.c | 4 +- drivers/net/ethernet/freescale/gianfar_ethtool.c | 8 ++-- drivers/net/ethernet/ibm/ibmvnic.c | 5 +- drivers/net/ethernet/marvell/sky2.c | 6 +-- drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c | 8 ++-- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc_drv.c | 14 +++--- drivers/net/hyperv/rndis_filter.c| 5 +- include/linux/if_vlan.h | 37 +++--- include/linux/skbuff.h | 10 +++- lib/test_bpf.c | 14 +++--- net/8021q/vlan_core.c| 2 +- net/bridge/br_netfilter_hooks.c | 14 +++--- net/bridge/br_private.h | 2 +- net/bridge/br_vlan.c | 6 +-- net/core/dev.c | 8 ++-- net/core/filter.c| 17 +++ net/core/skbuff.c| 2 +- net/ipv4/ip_tunnel_core.c| 2 +- net/netfilter/nfnetlink_queue.c | 5 +- net/openvswitch/actions.c| 13 ++--- net/openvswitch/flow.c | 4 +- net/openvswitch/flow.h | 4 +- net/openvswitch/flow_netlink.c | 61 net/sched/act_vlan.c | 2 +- 33 files changed, 170 insertions(+), 173 deletions(-) -- 2.10.2
[PATCH net-next 16/27] nfnetlink/queue: use __vlan_hwaccel helpers
Signed-off-by: Michał Mirosław --- net/netfilter/nfnetlink_queue.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c index be7627b..f268bb9 100644 --- a/net/netfilter/nfnetlink_queue.c +++ b/net/netfilter/nfnetlink_queue.c @@ -,8 +,9 @@ static int nfqa_parse_bridge(struct nf_queue_entry *entry, if (!tb[NFQA_VLAN_TCI] || !tb[NFQA_VLAN_PROTO]) return -EINVAL; - entry->skb->vlan_tci = ntohs(nla_get_be16(tb[NFQA_VLAN_TCI])); - entry->skb->vlan_proto = nla_get_be16(tb[NFQA_VLAN_PROTO]); + __vlan_hwaccel_put_tag(entry->skb, + nla_get_be16(tb[NFQA_VLAN_PROTO]), + ntohs(nla_get_be16(tb[NFQA_VLAN_TCI]))); } if (nfqa[NFQA_L2HDR]) { -- 2.10.2
[PATCH net-next 14/27] 8021q: use __vlan_hwaccel helpers
Signed-off-by: Michał Mirosław --- net/8021q/vlan_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c index e2ed698..604a67a 100644 --- a/net/8021q/vlan_core.c +++ b/net/8021q/vlan_core.c @@ -50,7 +50,7 @@ bool vlan_do_receive(struct sk_buff **skbp) } skb->priority = vlan_get_ingress_priority(vlan_dev, skb->vlan_tci); - skb->vlan_tci = 0; + __vlan_hwaccel_clear_tag(skb); rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats); -- 2.10.2
[PATCH net-next 22/27] net/bpf_jit: SPARC: split VLAN_PRESENT bit handling from VLAN_TCI
Signed-off-by: Michał Mirosław --- arch/sparc/net/bpf_jit_comp.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c index a6d9204..61cc15d 100644 --- a/arch/sparc/net/bpf_jit_comp.c +++ b/arch/sparc/net/bpf_jit_comp.c @@ -601,15 +601,17 @@ void bpf_jit_compile(struct bpf_prog *fp) emit_skb_load32(hash, r_A); break; case BPF_ANC | SKF_AD_VLAN_TAG: - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: emit_skb_load16(vlan_tci, r_A); - if (code != (BPF_ANC | SKF_AD_VLAN_TAG)) { - emit_alu_K(SRL, 12); - emit_andi(r_A, 1, r_A); - } else { - emit_loadimm(~VLAN_TAG_PRESENT, r_TMP); - emit_and(r_A, r_TMP, r_A); - } +#ifdef VLAN_TAG_PRESENT + emit_loadimm(~VLAN_TAG_PRESENT, r_TMP); + emit_and(r_A, r_TMP, r_A); +#endif + break; + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: + __emit_skb_load8(__pkt_vlan_present_offset, r_A); + if (PKT_VLAN_PRESENT_BIT) + emit_alu_K(SRL, PKT_VLAN_PRESENT_BIT); + emit_andi(r_A, 1, r_A); break; case BPF_LD | BPF_W | BPF_LEN: emit_skb_load32(len, r_A); -- 2.10.2
[PATCH net-next 02/27] net/vlan: introduce __vlan_hwaccel_copy_tag() helper
Signed-off-by: Michał Mirosław --- include/linux/if_vlan.h | 13 + 1 file changed, 13 insertions(+) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index 38be904..75e839b 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -393,6 +393,19 @@ static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) skb->vlan_tci = 0; } +/** + * __vlan_hwaccel_copy_tag - copy hardware accelerated VLAN info from another skb + * @dst: skbuff to copy to + * @src: skbuff to copy from + * + * Copies VLAN information from @src to @dst (for branchless code) + */ +static inline void __vlan_hwaccel_copy_tag(struct sk_buff *dst, const struct sk_buff *src) +{ + dst->vlan_proto = src->vlan_proto; + dst->vlan_tci = src->vlan_tci; +} + /* * __vlan_hwaccel_push_inside - pushes vlan tag to the payload * @skb: skbuff to tag -- 2.10.2
[PATCH net-next 26/27] net/hyperv: enable passing of VLAN.CFI bit
Signed-off-by: Michał Mirosław --- drivers/net/hyperv/netvsc_drv.c | 1 + drivers/net/hyperv/rndis_filter.c | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 6597d79..4e20f4c 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -441,6 +441,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net) vlan = (struct ndis_pkt_8021q_info *)((void *)ppi + ppi->ppi_offset); vlan->vlanid = skb->vlan_tci & VLAN_VID_MASK; + vlan->cfi = !!(skb->vlan_tci & VLAN_CFI_MASK); vlan->pri = (skb->vlan_tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT; } diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 7f7b410..9759d73 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -382,6 +382,7 @@ static int rndis_filter_receive_data(struct rndis_device *dev, vlan = rndis_get_ppi(rndis_pkt, IEEE_8021Q_INFO); if (vlan) { vlan_tci = vlan->vlanid | + (vlan->cfi ? VLAN_CFI_MASK : 0) | (vlan->pri << VLAN_PRIO_SHIFT); } -- 2.10.2
[PATCH net-next 19/27] net/bpf_jit: ARM: split VLAN_PRESENT bit handling from VLAN_TCI
Signed-off-by: Michał Mirosław --- arch/arm/net/bpf_jit_32.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c index 93d0b6d..6dbc602 100644 --- a/arch/arm/net/bpf_jit_32.c +++ b/arch/arm/net/bpf_jit_32.c @@ -915,17 +915,20 @@ static int build_body(struct jit_ctx *ctx) emit(ARM_LDR_I(r_A, r_skb, off), ctx); break; case BPF_ANC | SKF_AD_VLAN_TAG: - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: ctx->seen |= SEEN_SKB; BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2); off = offsetof(struct sk_buff, vlan_tci); emit(ARM_LDRH_I(r_A, r_skb, off), ctx); - if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) - OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx); - else { - OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx); - OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx); - } +#ifdef VLAN_TAG_PRESENT + OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx); +#endif + break; + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: + off = PKT_VLAN_PRESENT_OFFSET(); + emit(ARM_LDRB_I(r_A, r_skb, off), ctx); + if (PKT_VLAN_PRESENT_BIT) + OP_IMM3(ARM_LSR, r_A, r_A, PKT_VLAN_PRESENT_BIT, ctx); + OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx); break; case BPF_ANC | SKF_AD_PKTTYPE: ctx->seen |= SEEN_SKB; -- 2.10.2
[PATCH net-next 27/27] net/vlan: remove unused #define HAVE_VLAN_GET_TAG
Signed-off-by: Michał Mirosław --- include/linux/if_vlan.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index 8ff2f0e..f0b9356 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -477,8 +477,6 @@ static inline int __vlan_hwaccel_get_tag(const struct sk_buff *skb, } } -#define HAVE_VLAN_GET_TAG - /** * vlan_get_tag - get the VLAN ID from the skb * @skb: skbuff to query -- 2.10.2
[PATCH net-next 05/27] i40iw: remove use of VLAN_TAG_PRESENT
Signed-off-by: Michał Mirosław --- drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c index 8563769..25cf689 100644 --- a/drivers/infiniband/hw/i40iw/i40iw_cm.c +++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c @@ -414,7 +414,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node, pd_len += MPA_ZERO_PAD_LEN; } - if (cm_node->vlan_id < VLAN_TAG_PRESENT) + if (cm_node->vlan_id <= VLAN_VID_MASK) eth_hlen += 4; if (cm_node->ipv4) @@ -443,7 +443,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node, ether_addr_copy(ethh->h_dest, cm_node->rem_mac); ether_addr_copy(ethh->h_source, cm_node->loc_mac); - if (cm_node->vlan_id < VLAN_TAG_PRESENT) { + if (cm_node->vlan_id <= VLAN_VID_MASK) { ((struct vlan_ethhdr *)ethh)->h_vlan_proto = htons(ETH_P_8021Q); ((struct vlan_ethhdr *)ethh)->h_vlan_TCI = htons(cm_node->vlan_id); @@ -472,7 +472,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node, ether_addr_copy(ethh->h_dest, cm_node->rem_mac); ether_addr_copy(ethh->h_source, cm_node->loc_mac); - if (cm_node->vlan_id < VLAN_TAG_PRESENT) { + if (cm_node->vlan_id <= VLAN_VID_MASK) { ((struct vlan_ethhdr *)ethh)->h_vlan_proto = htons(ETH_P_8021Q); ((struct vlan_ethhdr *)ethh)->h_vlan_TCI = htons(cm_node->vlan_id); ((struct vlan_ethhdr *)ethh)->h_vlan_encapsulated_proto = htons(ETH_P_IPV6); @@ -3235,7 +3235,7 @@ static void i40iw_init_tcp_ctx(struct i40iw_cm_node *cm_node, tcp_info->flow_label = 0; tcp_info->snd_mss = cpu_to_le32(((u32)cm_node->tcp_cntxt.mss)); - if (cm_node->vlan_id < VLAN_TAG_PRESENT) { + if (cm_node->vlan_id <= VLAN_VID_MASK) { tcp_info->insert_vlan_tag = true; tcp_info->vlan_tag = cpu_to_le16(cm_node->vlan_id); } -- 2.10.2
[PATCH net-next 23/27] net/bpf: split VLAN_PRESENT bit handling from VLAN_TCI
Signed-off-by: Michał Mirosław --- net/core/filter.c | 20 +--- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index b146170..c3321f1 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -188,22 +188,20 @@ static u32 convert_skb_access(int skb_field, int dst_reg, int src_reg, break; case SKF_AD_VLAN_TAG: - case SKF_AD_VLAN_TAG_PRESENT: BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2); - BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000); /* dst_reg = *(u16 *) (src_reg + offsetof(vlan_tci)) */ *insn++ = BPF_LDX_MEM(BPF_H, dst_reg, src_reg, offsetof(struct sk_buff, vlan_tci)); - if (skb_field == SKF_AD_VLAN_TAG) { - *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, - ~VLAN_TAG_PRESENT); - } else { - /* dst_reg >>= 12 */ - *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 12); - /* dst_reg &= 1 */ - *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, 1); - } +#ifdef VLAN_TAG_PRESENT + *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, ~VLAN_TAG_PRESENT); +#endif + break; + case SKF_AD_VLAN_TAG_PRESENT: + *insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg, PKT_VLAN_PRESENT_OFFSET()); + if (PKT_VLAN_PRESENT_BIT) + *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, PKT_VLAN_PRESENT_BIT); + *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, 1); break; } -- 2.10.2
[PATCH net-next 17/27] OVS: remove assumptions about VLAN_TAG_PRESENT bit
This leaves CFI bit toggled in API, because userspace might depend this is set for normal ethernet traffic with tag present. Signed-off-by: Michał Mirosław --- Documentation/networking/openvswitch.txt | 14 net/openvswitch/actions.c| 13 +++ net/openvswitch/flow.c | 4 +-- net/openvswitch/flow.h | 4 +-- net/openvswitch/flow_netlink.c | 61 ++-- 5 files changed, 30 insertions(+), 66 deletions(-) diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt index b3b9ac6..e7ca27d 100644 --- a/Documentation/networking/openvswitch.txt +++ b/Documentation/networking/openvswitch.txt @@ -219,20 +219,6 @@ this: eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) -As another example, consider a packet with an Ethernet type of 0x8100, -indicating that a VLAN TCI should follow, but which is truncated just -after the Ethernet type. The flow key for this packet would include -an all-zero-bits vlan and an empty encap attribute, like this: - -eth(...), eth_type(0x8100), vlan(0), encap() - -Unlike a TCP packet with source and destination ports 0, an -all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka -VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan -attribute expressly to allow this situation to be distinguished. -Thus, the flow key in this second example unambiguously indicates a -missing or malformed VLAN TCI. - Other rules --- diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index 514f7bc..6015bc9 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -277,8 +277,7 @@ static int push_vlan(struct sk_buff *skb, struct sw_flow_key *key, key->eth.vlan.tci = vlan->vlan_tci; key->eth.vlan.tpid = vlan->vlan_tpid; } - return skb_vlan_push(skb, vlan->vlan_tpid, -ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT); + return skb_vlan_push(skb, vlan->vlan_tpid, ntohs(vlan->vlan_tci ^ VLAN_CFI_MASK)); } /* 'src' is already properly masked. */ @@ -704,8 +703,10 @@ static int ovs_vport_output(struct net *net, struct sock *sk, struct sk_buff *sk __skb_dst_copy(skb, data->dst); *OVS_CB(skb) = data->cb; skb->inner_protocol = data->inner_protocol; - skb->vlan_tci = data->vlan_tci; - skb->vlan_proto = data->vlan_proto; + if (data->vlan_proto) + __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci ^ VLAN_CFI_MASK); + else + __vlan_hwaccel_clear_tag(skb); /* Reconstruct the MAC header. */ skb_push(skb, data->l2_len); @@ -749,8 +750,8 @@ static void prepare_frag(struct vport *vport, struct sk_buff *skb, data->cb = *OVS_CB(skb); data->inner_protocol = skb->inner_protocol; data->network_offset = orig_network_offset; - data->vlan_tci = skb->vlan_tci; - data->vlan_proto = skb->vlan_proto; + data->vlan_tci = skb_vlan_tag_present(skb) ? skb->vlan_tci ^ VLAN_CFI_MASK : 0; + data->vlan_proto = skb_vlan_tag_present(skb) ? skb->vlan_proto : 0; data->mac_proto = mac_proto; data->l2_len = hlen; memcpy(&data->l2_data, skb->data, hlen); diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 08aa926..df58cfd 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -327,7 +327,7 @@ static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh) return -ENOMEM; vh = (struct vlan_head *)skb->data; - key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT); + key_vh->tci = vh->tci ^ htons(VLAN_CFI_MASK); key_vh->tpid = vh->tpid; __skb_pull(skb, sizeof(struct vlan_head)); @@ -347,7 +347,7 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) int res; if (skb_vlan_tag_present(skb)) { - key->eth.vlan.tci = htons(skb->vlan_tci); + key->eth.vlan.tci = htons(skb->vlan_tci) ^ htons(VLAN_CFI_MASK); key->eth.vlan.tpid = skb->vlan_proto; } else { /* Parse outer vlan tag in the non-accelerated case. */ diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h index f61cae7..f5115ed 100644 --- a/net/openvswitch/flow.h +++ b/net/openvswitch/flow.h @@ -57,8 +57,8 @@ struct ovs_tunnel_info { }; struct vlan_head { - __be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad.*/ - __be16 tci; /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */ + __be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad. 0 if no VLAN*/ + __be16 tci; }; #define OVS_SW_FLOW_KEY_METADATA_SIZE \ diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index d19044f..6ae5218 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -835
[PATCH net-next 01/27] net/vlan: introduce __vlan_hwaccel_clear_tag() helper
Signed-off-by: Michał Mirosław --- include/linux/if_vlan.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index 8d5fcd6..38be904 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -382,6 +382,17 @@ static inline struct sk_buff *vlan_insert_tag_set_proto(struct sk_buff *skb, return skb; } +/** + * __vlan_hwaccel_clear_tag - clear hardware accelerated VLAN info + * @skb: skbuff to clear + * + * Clears the VLAN information from @skb + */ +static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) +{ + skb->vlan_tci = 0; +} + /* * __vlan_hwaccel_push_inside - pushes vlan tag to the payload * @skb: skbuff to tag @@ -396,7 +407,7 @@ static inline struct sk_buff *__vlan_hwaccel_push_inside(struct sk_buff *skb) skb = vlan_insert_tag_set_proto(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (likely(skb)) - skb->vlan_tci = 0; + __vlan_hwaccel_clear_tag(skb); return skb; } -- 2.10.2
[PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI
Signed-off-by: Michał Mirosław --- arch/mips/net/bpf_jit.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c index 49a2e22..4b12b5d 100644 --- a/arch/mips/net/bpf_jit.c +++ b/arch/mips/net/bpf_jit.c @@ -1138,19 +1138,23 @@ static int build_body(struct jit_ctx *ctx) emit_load(r_A, r_skb, off, ctx); break; case BPF_ANC | SKF_AD_VLAN_TAG: - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: ctx->flags |= SEEN_SKB | SEEN_A; BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2); off = offsetof(struct sk_buff, vlan_tci); emit_half_load(r_s0, r_skb, off, ctx); - if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) { - emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx); - } else { - emit_andi(r_A, r_s0, VLAN_TAG_PRESENT, ctx); - /* return 1 if present */ - emit_sltu(r_A, r_zero, r_A, ctx); - } +#ifdef VLAN_TAG_PRESENT + emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx); +#endif + break; + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: + ctx->flags |= SEEN_SKB | SEEN_A; + emit_load_byte(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET(), ctx); + if (PKT_VLAN_PRESENT_BIT) + emit_srl(r_A, r_A, PKT_VLAN_PRESENT_BIT, ctx); + emit_andi(r_A, r_s0, 1, ctx); + /* return 1 if present */ + emit_sltu(r_A, r_zero, r_A, ctx); break; case BPF_ANC | SKF_AD_PKTTYPE: ctx->flags |= SEEN_SKB; -- 2.10.2
[PATCH net-next 18/27] net/skbuff: add macros for VLAN_PRESENT bit
Signed-off-by: Michał Mirosław --- include/linux/skbuff.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 332e767..4a85a1f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -768,6 +768,12 @@ struct sk_buff { __u32 priority; int skb_iif; __u32 hash; +#define PKT_VLAN_PRESENT_BIT 4 // CFI (12-th bit) in TCI +#ifdef __BIG_ENDIAN +#define PKT_VLAN_PRESENT_OFFSET() offsetof(struct sk_buff, vlan_tci) +#else +#define PKT_VLAN_PRESENT_OFFSET() (offsetof(struct sk_buff, vlan_tci) + 1) +#endif __be16 vlan_proto; __u16 vlan_tci; #if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS) -- 2.10.2
[PATCH net-next 03/27] ibmvnic: fix accelerated VLAN handling
Signed-off-by: Michał Mirosław --- drivers/net/ethernet/ibm/ibmvnic.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index c125966..c7664db 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -765,7 +765,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) tx_crq.v1.sge_len = cpu_to_be32(skb->len); tx_crq.v1.ioba = cpu_to_be64(data_dma_addr); - if (adapter->vlan_header_insertion) { + if (adapter->vlan_header_insertion && skb_vlan_tag_present(skb)) { tx_crq.v1.flags2 |= IBMVNIC_TX_VLAN_INSERT; tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci); } @@ -964,7 +964,8 @@ static int ibmvnic_poll(struct napi_struct *napi, int budget) skb = rx_buff->skb; skb_copy_to_linear_data(skb, rx_buff->data + offset, length); - skb->vlan_tci = be16_to_cpu(next->rx_comp.vlan_tci); + if (flags & IBMVNIC_VLAN_STRIPPED) + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), be16_to_cpu(next->rx_comp.vlan_tci)); /* free the entry */ next->rx_comp.first = 0; remove_buff_from_pool(adapter, rx_buff); -- 2.10.2
[PATCH net-next 25/27] net: remove VLAN_TAG_PRESENT
Signed-off-by: Michał Mirosław --- arch/mips/net/bpf_jit.c | 3 --- arch/powerpc/net/bpf_jit_comp.c | 3 --- arch/sparc/net/bpf_jit_comp.c | 4 include/linux/if_vlan.h | 11 ++- include/linux/skbuff.h | 16 +--- lib/test_bpf.c | 17 ++--- net/core/filter.c | 3 --- 7 files changed, 17 insertions(+), 40 deletions(-) diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c index 4b12b5d..fb6d234 100644 --- a/arch/mips/net/bpf_jit.c +++ b/arch/mips/net/bpf_jit.c @@ -1143,9 +1143,6 @@ static int build_body(struct jit_ctx *ctx) vlan_tci) != 2); off = offsetof(struct sk_buff, vlan_tci); emit_half_load(r_s0, r_skb, off, ctx); -#ifdef VLAN_TAG_PRESENT - emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx); -#endif break; case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: ctx->flags |= SEEN_SKB | SEEN_A; diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c index 22ae63f..fb38927 100644 --- a/arch/powerpc/net/bpf_jit_comp.c +++ b/arch/powerpc/net/bpf_jit_comp.c @@ -381,9 +381,6 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff, vlan_tci)); -#ifdef VLAN_TAG_PRESENT - PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT); -#endif break; case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: PPC_LBZ_OFFS(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET()); diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c index 61cc15d..d499b39 100644 --- a/arch/sparc/net/bpf_jit_comp.c +++ b/arch/sparc/net/bpf_jit_comp.c @@ -602,10 +602,6 @@ void bpf_jit_compile(struct bpf_prog *fp) break; case BPF_ANC | SKF_AD_VLAN_TAG: emit_skb_load16(vlan_tci, r_A); -#ifdef VLAN_TAG_PRESENT - emit_loadimm(~VLAN_TAG_PRESENT, r_TMP); - emit_and(r_A, r_TMP, r_A); -#endif break; case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT: __emit_skb_load8(__pkt_vlan_present_offset, r_A); diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index 75e839b..8ff2f0e 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -66,7 +66,6 @@ static inline struct vlan_ethhdr *vlan_eth_hdr(const struct sk_buff *skb) #define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */ #define VLAN_PRIO_SHIFT13 #define VLAN_CFI_MASK 0x1000 /* Canonical Format Indicator */ -#define VLAN_TAG_PRESENT VLAN_CFI_MASK #define VLAN_VID_MASK 0x0fff /* VLAN Identifier */ #define VLAN_N_VID 4096 @@ -78,8 +77,8 @@ static inline bool is_vlan_dev(const struct net_device *dev) return dev->priv_flags & IFF_802_1Q_VLAN; } -#define skb_vlan_tag_present(__skb)((__skb)->vlan_tci & VLAN_TAG_PRESENT) -#define skb_vlan_tag_get(__skb)((__skb)->vlan_tci & ~VLAN_TAG_PRESENT) +#define skb_vlan_tag_present(__skb)((__skb)->vlan_present) +#define skb_vlan_tag_get(__skb)((__skb)->vlan_tci) #define skb_vlan_tag_get_id(__skb) ((__skb)->vlan_tci & VLAN_VID_MASK) #define skb_vlan_tag_get_prio(__skb) ((__skb)->vlan_tci & VLAN_PRIO_MASK) @@ -390,7 +389,7 @@ static inline struct sk_buff *vlan_insert_tag_set_proto(struct sk_buff *skb, */ static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) { - skb->vlan_tci = 0; + skb->vlan_present = 0; } /** @@ -402,6 +401,7 @@ static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) */ static inline void __vlan_hwaccel_copy_tag(struct sk_buff *dst, const struct sk_buff *src) { + dst->vlan_present = src->vlan_present; dst->vlan_proto = src->vlan_proto; dst->vlan_tci = src->vlan_tci; } @@ -436,7 +436,8 @@ static inline void __vlan_hwaccel_put_tag(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci) { skb->vlan_proto = vlan_proto; - skb->vlan_tci = VLAN_TAG_PRESENT | vlan_tci; + skb->vlan_tci = vlan_tci; + skb->vlan_present = 1; } /** diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4a85a1f..3577cca 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -740,6 +740,14 @@ struct sk_buff { __u8csum_level:2; __u8csum_bad:1; +#ifdef __BIG_ENDIAN_BITFIELD +#define PKT_VLAN_PRESENT_BIT 7 +#else +#define PKT_VLAN_PRESENT_BIT 0 +#endif +#define PKT_VLAN_PRESENT_OFFSET() offsetof(
[PATCH net-next 07/27] gianfar: remove use of VLAN_TAG_PRESENT
Signed-off-by: Michał Mirosław --- drivers/net/ethernet/freescale/gianfar_ethtool.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c index 56588f2..95fa647 100644 --- a/drivers/net/ethernet/freescale/gianfar_ethtool.c +++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c @@ -1155,11 +1155,9 @@ static int gfar_convert_to_filer(struct ethtool_rx_flow_spec *rule, prio = vlan_tci_prio(rule); prio_mask = vlan_tci_priom(rule); - if (cfi == VLAN_TAG_PRESENT && cfi_mask == VLAN_TAG_PRESENT) { - vlan |= RQFPR_CFI; - vlan_mask |= RQFPR_CFI; - } else if (cfi != VLAN_TAG_PRESENT && - cfi_mask == VLAN_TAG_PRESENT) { + if (cfi_mask) { + if (cfi) + vlan |= RQFPR_CFI; vlan_mask |= RQFPR_CFI; } } -- 2.10.2
[PATCH net-next 04/27] qlcnic: remove assumption that vlan_tci != 0
Signed-off-by: Michał Mirosław --- drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c index fedd736..c3cc707 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c @@ -459,7 +459,7 @@ static int qlcnic_tx_pkt(struct qlcnic_adapter *adapter, struct cmd_desc_type0 *first_desc, struct sk_buff *skb, struct qlcnic_host_tx_ring *tx_ring) { - u8 l4proto, opcode = 0, hdr_len = 0; + u8 l4proto, opcode = 0, hdr_len = 0, tag_vlan = 0; u16 flags = 0, vlan_tci = 0; int copied, offset, copy_len, size; struct cmd_desc_type0 *hwdesc; @@ -472,14 +472,16 @@ static int qlcnic_tx_pkt(struct qlcnic_adapter *adapter, flags = QLCNIC_FLAGS_VLAN_TAGGED; vlan_tci = ntohs(vh->h_vlan_TCI); protocol = ntohs(vh->h_vlan_encapsulated_proto); + tag_vlan = 1; } else if (skb_vlan_tag_present(skb)) { flags = QLCNIC_FLAGS_VLAN_OOB; vlan_tci = skb_vlan_tag_get(skb); + tag_vlan = 1; } if (unlikely(adapter->tx_pvid)) { - if (vlan_tci && !(adapter->flags & QLCNIC_TAGGING_ENABLED)) + if (tag_vlan && !(adapter->flags & QLCNIC_TAGGING_ENABLED)) return -EIO; - if (vlan_tci && (adapter->flags & QLCNIC_TAGGING_ENABLED)) + if (tag_vlan && (adapter->flags & QLCNIC_TAGGING_ENABLED)) goto set_flags; flags = QLCNIC_FLAGS_VLAN_OOB; -- 2.10.2
[PATCH net-next 06/27] cnic: remove use of VLAN_TAG_PRESENT
Signed-off-by: Michał Mirosław --- drivers/net/ethernet/broadcom/cnic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c index b1d2ac8..6e3c610 100644 --- a/drivers/net/ethernet/broadcom/cnic.c +++ b/drivers/net/ethernet/broadcom/cnic.c @@ -5734,7 +5734,7 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event, if (realdev) { dev = cnic_from_netdev(realdev); if (dev) { - vid |= VLAN_TAG_PRESENT; + vid |= VLAN_CFI_MASK; /* make non-zero */ cnic_rcv_netevent(dev->cnic_priv, event, vid); cnic_put(dev); } -- 2.10.2
[PATCH net-next 08/27] net/hyperv: remove use of VLAN_TAG_PRESENT
Signed-off-by: Michał Mirosław --- drivers/net/hyperv/hyperv_net.h | 2 +- drivers/net/hyperv/netvsc_drv.c | 13 ++--- drivers/net/hyperv/rndis_filter.c | 4 ++-- 3 files changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 3958ada..b53729e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -186,7 +186,7 @@ int netvsc_recv_callback(struct hv_device *device_obj, void **data, struct ndis_tcp_ip_checksum_info *csum_info, struct vmbus_channel *channel, - u16 vlan_tci); + u16 vlan_tci, bool vlan_present); void netvsc_channel_cb(void *context); int rndis_filter_open(struct netvsc_device *nvdev); int rndis_filter_close(struct netvsc_device *nvdev); diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index c9414c0..6597d79 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -595,7 +595,7 @@ void netvsc_linkstatus_callback(struct hv_device *device_obj, static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net, struct hv_netvsc_packet *packet, struct ndis_tcp_ip_checksum_info *csum_info, - void *data, u16 vlan_tci) + void *data) { struct sk_buff *skb; @@ -625,10 +625,6 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net, skb->ip_summed = CHECKSUM_UNNECESSARY; } - if (vlan_tci & VLAN_TAG_PRESENT) - __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), - vlan_tci); - return skb; } @@ -641,7 +637,7 @@ int netvsc_recv_callback(struct hv_device *device_obj, void **data, struct ndis_tcp_ip_checksum_info *csum_info, struct vmbus_channel *channel, - u16 vlan_tci) + u16 vlan_tci, bool vlan_present) { struct net_device *net = hv_get_drvdata(device_obj); struct net_device_context *net_device_ctx = netdev_priv(net); @@ -664,12 +660,15 @@ int netvsc_recv_callback(struct hv_device *device_obj, net = vf_netdev; /* Allocate a skb - TODO direct I/O to pages? */ - skb = netvsc_alloc_recv_skb(net, packet, csum_info, *data, vlan_tci); + skb = netvsc_alloc_recv_skb(net, packet, csum_info, *data); if (unlikely(!skb)) { ++net->stats.rx_dropped; return NVSP_STAT_FAIL; } + if (vlan_present) + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci); + if (net != vf_netdev) skb_record_rx_queue(skb, channel->offermsg.offer.sub_channel_index); diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c index 8d90904..7f7b410 100644 --- a/drivers/net/hyperv/rndis_filter.c +++ b/drivers/net/hyperv/rndis_filter.c @@ -381,13 +381,13 @@ static int rndis_filter_receive_data(struct rndis_device *dev, vlan = rndis_get_ppi(rndis_pkt, IEEE_8021Q_INFO); if (vlan) { - vlan_tci = VLAN_TAG_PRESENT | vlan->vlanid | + vlan_tci = vlan->vlanid | (vlan->pri << VLAN_PRIO_SHIFT); } csum_info = rndis_get_ppi(rndis_pkt, TCPIP_CHKSUM_PKTINFO); return netvsc_recv_callback(net_device_ctx->device_ctx, pkt, data, - csum_info, channel, vlan_tci); + csum_info, channel, vlan_tci, vlan); } int rndis_filter_receive(struct hv_device *dev, -- 2.10.2
Re: netlink: GPF in sock_sndtimeo
On Mon, Dec 12, 2016 at 2:02 AM, Richard Guy Briggs wrote: > On 2016-12-09 20:13, Cong Wang wrote: >> Netlink notifier can safely be converted to blocking one, I will send >> a patch. > > I had a quick look at how that might happen. The netlink notifier chain > is atomic. Would the registered callback funciton need to spawn a > one-time thread to avoid blocking? It is already non-atomic now: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=efa172f42836477bf1ac3c9a3053140df764699c > I had a look at your patch. It looks attractively simple. The audit > next tree has patches queued that add an audit_reset function that will > require more work. I still see some potential gaps. > > - If the process messes up (or the sock lookup messes up) it is reset > in the kauditd thread under the audit_cmd_mutex. > > - If the process exits normally or is replaced due to an audit_replace > error, it is reset from audit_receive_skb under the audit_cmd_mutex. > > - If the process dies before the kauditd thread notices, either reap it > via notifier callback or it needs a check on net exit to reset. This > last one appears necessary to decrement the sock refcount so the sock > can be released in netlink_kernel_release(). > > If we want to be proactive and use the netlink notifier, we assume the > overhead of adding to the netlink notifier chain and eliminate all the > other reset calls under the kauditd thread. If we are ok being > reactionary, then we'll at least need the net exit check on audit_sock. > I don't see why we need to check it in net exit if we use refcnt, because we have two different users of audit_sock: kauditd and netns, if both take care of refcnt properly, we don't need to worry about who is the last, no matter what failures occur in what order.
Re: [PATCH v2] audit: use proper refcount locking on audit_sock
On Mon, Dec 12, 2016 at 2:03 AM, Richard Guy Briggs wrote: > Resetting audit_sock appears to be racy. > > audit_sock was being copied and dereferenced without using a refcount on > the source sock. > > Bump the refcount on the underlying sock when we store a refrence in > audit_sock and release it when we reset audit_sock. audit_sock > modification needs the audit_cmd_mutex. > > See: https://lkml.org/lkml/2016/11/26/232 > > Thanks to Eric Dumazet and Cong Wang > on ideas how to fix it. > > Signed-off-by: Richard Guy Briggs > --- > There has been a lot of change in the audit code that is about to go > upstream to address audit queue issues. This patch is based on the > source tree: git://git.infradead.org/users/pcmoore/audit#next > --- > kernel/audit.c | 34 -- > 1 files changed, 28 insertions(+), 6 deletions(-) > > diff --git a/kernel/audit.c b/kernel/audit.c > index f20eee0..439f7f3 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -452,7 +452,9 @@ static void auditd_reset(void) > struct sk_buff *skb; > > /* break the connection */ > + sock_put(audit_sock); Why audit_sock can't be NULL here? > audit_pid = 0; > + audit_nlk_portid = 0; > audit_sock = NULL; > > /* flush all of the retry queue to the hold queue */ > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb) > if (rc >= 0) { > consume_skb(skb); > rc = 0; > + } else { > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) { Are these errno's bits?? > + mutex_lock(&audit_cmd_mutex); > + auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > + } > } > > return rc; > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy) > > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } > } else > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy) > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > kauditd_hold_skb(skb); > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } else > /* temporary problem (we hope), queue > @@ -623,7 +635,9 @@ quick_loop: > if (rc) { > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + > mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, > struct nlmsghdr *nlh) > return -EACCES; > } > if (audit_pid && new_pid && > - audit_replace(requesting_pid) != -ECONNREFUSED) { > + (audit_replace(requesting_pid) & > (-ECONNREFUSED|-EPERM|-ENOMEM))) { > audit_log_config_change("audit_pid", new_pid, > audit_pid, 0); > return -EEXIST; > } > if (audit_enabled != AUDIT_OFF) > audit_log_config_change("audit_pid", new_pid, > audit_pid, 1); > - audit_pid = new_pid; > - audit_nlk_portid = NETLINK_CB(skb).portid; > - audit_sock = skb->sk; > - if (!new_pid) > + if (new_pid) { > + if (audit_sock) > + sock_put(audit_sock); > + audit_pid = new_pid; > + audit_nlk_portid = NETLINK_CB(skb).portid; > + sock_hold(skb->sk); Why refcnt is still needed here? I need it because I removed the code in net exit code path. > + audit_sock = skb->sk; > + } else { >
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On 12/9/2016 1:47 AM, Selvin Xavier wrote: > This series introduces the RoCE driver for the Broadcom > NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. > This driver is dependent on the bnxt_en NIC driver and is > based on the bnxt_re branch in Doug's repository. bnxt_en changes > required for this patch series is already available in this branch. > > I am preparing a git repository with these changes as per Jason's > comment and will share the details later today. > > v1-> v2: > * The license text in each file updated to reflect Dual license. > * Makefile and Kconfig changes are pushed to the last patch > * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder > * Remove duplicate structure definitions from bnxt_re_hsi.h as > it is available in the corresponding bnxt_en header file (bnxt_hsi.h) > * Removed some unused code reported during code review. > * Fixed few sparse warnings > > Doug, > Please review and consider applying this to linux-rdma repository. There are outstanding review comments to be addressed still yet, and the v2 patchset doesn't compile for me in 0day testing. I'm going to bounce this one to 4.11. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [iproute2 v2 net-next 0/8] Add support for vrf helper
On Sat, 10 Dec 2016 12:32:06 -0800 David Ahern wrote: > This series adds support to iproute2 to run a command against a specific > VRF. The user semnatics are similar to 'ip netns'. > > The 'ip vrf' subcommand supports 3 usages: > > 1. Run a command against a given vrf: >ip vrf exec NAME CMD > >Uses the recently committed cgroup/sock BPF option. vrf directory >is added to cgroup2 mount. Individual vrfs are created under it. BPF >filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the >device index of the VRF. From there the current process (ip's pid) is >addded to the cgroups.proc file and the given command is exected. In >doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically >bound to the VRF domain. > >The association is inherited parent to child allowing the command to >be a shell from which other commands are run relative to the VRF. > > 2. Show the VRF a process is bound to: >ip vrf id >This command essentially looks at /proc/pid/cgroup for a "::/vrf/" >entry. > > 3. Show process ids bound to a VRF >ip vrf pids NAME >This command dumps the file MNT/vrf/NAME/cgroup.procs since that file >shows the process ids in the particular vrf cgroup. > > v2 > - updated suject of patch 3 to avoid spam filters on vger > > David Ahern (8): > lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH > bpf: export bpf_prog_load > Add libbpf.h header with BPF_ macros > move cmd_exec to lib utils > Add filesystem APIs to lib > change name_is_vrf to return index > libnetlink: Add variant of rtnl_talk that does not display RTNETLINK > answers error > Introduce ip vrf command > > include/bpf_util.h | 6 ++ > include/libbpf.h | 184 > include/libnetlink.h | 3 + > include/utils.h | 4 + > ip/Makefile | 3 +- > ip/ip.c | 4 +- > ip/ip_common.h | 4 +- > ip/iplink_vrf.c | 29 -- > ip/ipnetns.c | 34 -- > ip/ipvrf.c | 289 > +++ > lib/Makefile | 2 +- > lib/bpf.c| 71 - > lib/exec.c | 41 > lib/fs.c | 143 + > lib/libnetlink.c | 20 +++- > man/man8/ip-vrf.8| 88 > 16 files changed, 850 insertions(+), 75 deletions(-) > create mode 100644 include/libbpf.h > create mode 100644 ip/ipvrf.c > create mode 100644 lib/exec.c > create mode 100644 lib/fs.c > create mode 100644 man/man8/ip-vrf.8 > Please use tooling that puts v2 on all the updated patches. It makes it easier to spot them in patchwork
"virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
Hi, I was doing a last minute regression test of the ext4 tree before sending a pull request to Linus, which I do using gce-xfstests[1], and I found that using networking was broken on GCE on linux-next. I was using next-20161209, and after bisecting things, I narrowed down the commit which causing things to break to commit 449000102901: "virtio-net: enable multiqueue by default". Reverting this commit on top of next-20161209 fixed the problem. [1] http://thunk.org/gce-xfstests You can reproduce the problem for building the kernel for Google Compute Engine --- I use a config such as this [2], and then try to boot a kernel on a VM. The way I do this involves booting a test appliance and then kexec'ing into the kernel to be tested[3], using a 2cpu configuration. (GCE machine type: n1-standard-2) [2] https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9 [3] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md You can then take a look at serial console using a command such as "gcloud compute instances get-serial-port-output ", and you will get something like this (see attached). The important bit is that the dhclient command is completely failing to be able to get a response from the network, from which I deduce that apparently that either networking send or receive or both seem to be badly affected by the commit in question. Please let me know if there's anything I can do to help you debug this further. Cheers, - Ted Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version 4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016 Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0 fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 fstestapi=1.3 Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel Variables... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File System... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File System... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File System. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File System. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel Variables. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static Device Nodes in /dev. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device Manager... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device Manager. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all Devices. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for Complete Device Initialization... Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: clean, 56268/655360 files, 357439/2620928 blocks Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check on Root Device. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and Kernel File Systems... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and Kernel File Systems. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to make systemd work better on Debian. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random Seed... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems (Pre). Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File Systems (Pre). Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load/Save Random Seed. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Wait for Complete Device Initialization. Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Activation of LVM2 logical volumes... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Copy rules generated while the root was ro... Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS0. Dec 11 23:53:20 xfstests-201612120451
[ANNOUNCE] iproute2 4.9
Release of iproute2 for Linux 4.9, just in time for your holiday giving. Update to iproute2 utility to support new features in Linux 4.9. Mostly this is refinements to add new flags to tipc, l2tp, ss and macsec support. There are also a couple of performance enhancments for handling lots of interfaces and namespaces. Source: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.9.0.tar.gz Repository: git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git Report problems (or enhancements) to the netdev@vger.kernel.org mailing list. --- Alexei Starovoitov (1): iptnl: add support for collect_md flag in IPv4 and IPv6 tunnels Anton Aksola (1): iproute2: build nsid-name cache only for commands that need it Asbjørn Sloth Tønnesen (9): man: ip-l2tp.8: fix l2spec_type documentation man: ip-l2tp.8: remove non-existent tunnel parameter name l2tp: fix integers with too few significant bits l2tp: fix L2TP_ATTR_{RECV,SEND}_SEQ handling l2tp: fix L2TP_ATTR_UDP_CSUM handling l2tp: read IPv6 UDP checksum attributes from kernel l2tp: support sequence numbering l2tp: show tunnel: expose UDP checksum state man: ip-l2tp.8: document UDP checksum options Craig Dillabaugh (1): action gact: list pipe as a valid action Daniel Borkmann (1): tc, ipt: don't enforce iproute2 dependency on iptables-devel Daniel Hopf (1): macsec: Nr. of packets and octets for macsec tx stats were swapped Eric Dumazet (1): tc: fq: display unthrottle latency Hadar Hen Zion (2): tc: flower: Introduce vlan support tc: m_vlan: Add priority option to push vlan action Hangbin Liu (4): misc/ss: tcp cwnd should be unsigned ip rule: merge ip rule flush and list, save together ip rule: add selector support devlink: Convert conditional in dl_argv_handle_port() to switch() Isaac Boukris (1): iproute2: ss: escape all null bytes in abstract unix domain socket Jakub Kicinski (1): tc: cls_bpf: handle skip_sw and skip_hw flags Jamal Hadi Salim (4): actions ife: Introduce encoding and decoding of tcindex metadata actions: add skbmod action man pages: Add tc-ife to Makefile tc filters: add support to get individual filters by handle Lorenzo Colitti (1): ss: Support displaying and filtering on socket marks. Lucas Bates (2): man pages: update ife action to include tcindex man pages: add man page for skbmod action Mahesh Bandewar (1): ip: (ipvlan) introduce L3s mode Mike Frysinger (1): ifstat/nstat: fix help output alignment Moshe Shemesh (1): ip link: Add support to configure SR-IOV VF to vlan protocol 802.1ad (VST QinQ) Neal Cardwell (1): ss: output TCP BBR diag information Nikolay Aleksandrov (4): bridge: vlan: add support to display per-vlan statistics ipmroute: add support for age dumping bridge: vlan: remove wrong stats help bridge: add support for the multicast flood flag Parthasarathy Bhuvaragan (7): tipc: remove dead code tipc: add link monitor set threshold tipc: add link monitor get threshold tipc: add link monitor summary tipc: refractor bearer to facilitate link monitor tipc: add link monitor list tipc: update man page for link monitor Paul Blakey (1): tc: flower: Fix usage message Phil Sutter (6): iproute: fix documentation for ip rule scan order include: Add linux/sctp.h ss: Add support for SCTP protocol ipaddress: Simplify vf_info parsing ipaddress: Print IFLA_VF_QUERY_RSS_EN setting man: ip-route.8: Add notes about dropped IPv4 route cache Richard Alpe (3): tipc: add peer remove functionality tipc: introduce bearer add for remoteip tipc: add the ability to get UDP bearer options Roi Dayan (2): devlink: Add usage help for eswitch subcommand devlink: Add option to set and show eswitch inline mode Roman Mashak (7): ife action: allow specifying index in hex ife: print prio, mark and hash as unsigned ife: improve help text tc: updated man page to reflect GET command to retrieve a single filter. tc: improved usage help for fw classifier. tc: print raw qdisc handle. tc: distinguish Add/Replace filter operations Shmulik Ladkani (1): tc: m_vlan: Add vlan modify action Simon Horman (1): ss: initialise variables outside of for loop Stephen Hemminger (24): update headers to 4.8-rc2 net-next update TIPC headers tipc: cleanup style issues update kernel headers from net-next update bpf.h update headers from pre 4.9 (net-next) iplink: cleanup style errors ip: iprule style cleanup tc: skbmod style cleanup tc_filter: style cleanup ip: macvlan style cleanup Revert "iproute2: macvlan: add "source" mode" cleanup debris from revert ss: break really lo
Re: Soft lockup in tc_classify
On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz wrote: > On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann wrote: > >> Note that there's still the RCU fix missing for the deletion race that >> Cong will still send out, but you say that the only thing you do is to >> add a single rule, but no other operation in involved during that test? > > What's missing to have the deletion race fixed? making a patch or > testing to a patch which was sent? If you think it would help for this problem, here is my patch rebased on the latest net-next. Again, I don't see how it could help this case yet, especially I don't see how we could have a loop in this singly linked list. commit f6becda1e12fd8ef74e901fe39adb4558ce6c8f9 Author: Cong Wang Date: Wed Nov 23 14:58:01 2016 -0800 net_sched: move the empty tp check from ->destroy() to ->delete() Roi reported we could have a race condition where in ->classify() path we dereference tp->root and meanwhile a parallel ->destroy() makes it a NULL. This is possible because ->destroy() could be called when deleting a filter to check if we are the last one in tp, this tp is still linked and visible at that time. The root cause of this problem is the semantic of ->destroy(), it does two things (for non-force case): 1) check if tp is empty 2) if tp is empty we could really destroy it and its caller, if cares, needs to check its return value to see if it is really destroyed. Therefore we can't unlink tp unless we know it is empty. As suggested by Daniel, we could actually move the test logic to ->delete() so that we can safely unlink tp after ->delete() tells us the last one is just deleted and before ->destroy(). What's more, even we unlink it before ->destroy(), it could still have readers since we don't wait for a grace period here, we should not modify tp->root in ->destroy() either. Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") Reported-by: Roi Dayan Cc: Daniel Borkmann Cc: John Fastabend Signed-off-by: Cong Wang diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 498f81b..b5eda3f 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -203,14 +203,14 @@ struct tcf_proto_ops { const struct tcf_proto *, struct tcf_result *); int (*init)(struct tcf_proto*); - bool(*destroy)(struct tcf_proto*, bool); + void(*destroy)(struct tcf_proto*); unsigned long (*get)(struct tcf_proto*, u32 handle); int (*change)(struct net *net, struct sk_buff *, struct tcf_proto*, unsigned long, u32 handle, struct nlattr **, unsigned long *, bool); - int (*delete)(struct tcf_proto*, unsigned long); + int (*delete)(struct tcf_proto*, unsigned long, bool*); void(*walk)(struct tcf_proto*, struct tcf_walker *arg); /* rtnetlink specific */ @@ -405,7 +405,7 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue, const struct Qdisc_ops *ops, u32 parentid); void __qdisc_calculate_pkt_len(struct sk_buff *skb, const struct qdisc_size_table *stab); -bool tcf_destroy(struct tcf_proto *tp, bool force); +void tcf_destroy(struct tcf_proto *tp); void tcf_destroy_chain(struct tcf_proto __rcu **fl); int skb_do_redirect(struct sk_buff *); diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 3fbba79..f9179e0 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -321,7 +321,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n) tfilter_notify(net, skb, n, tp, fh, RTM_DELTFILTER, false); - tcf_destroy(tp, true); + tcf_destroy(tp); err = 0; goto errout; } @@ -331,25 +331,29 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n) !(n->nlmsg_flags & NLM_F_CREATE)) goto errout; } else { + bool last; + switch (n->nlmsg_type) { case RTM_NEWTFILTER: err = -EEXIST; if (n->nlmsg_flags & NLM_F_EXCL) { if (tp_created) - tcf_destroy(tp, true); + tcf_destroy(tp); goto errout; } break;
[PATCH] net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated. We move this driver to new api {get|set}_link_ksettings. Signed-off-by: Philippe Reynes --- drivers/net/ethernet/cirrus/ep93xx_eth.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/cirrus/ep93xx_eth.c b/drivers/net/ethernet/cirrus/ep93xx_eth.c index a1de0d1..396c886 100644 --- a/drivers/net/ethernet/cirrus/ep93xx_eth.c +++ b/drivers/net/ethernet/cirrus/ep93xx_eth.c @@ -715,16 +715,18 @@ static void ep93xx_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *i strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version)); } -static int ep93xx_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +static int ep93xx_get_link_ksettings(struct net_device *dev, +struct ethtool_link_ksettings *cmd) { struct ep93xx_priv *ep = netdev_priv(dev); - return mii_ethtool_gset(&ep->mii, cmd); + return mii_ethtool_get_link_ksettings(&ep->mii, cmd); } -static int ep93xx_set_settings(struct net_device *dev, struct ethtool_cmd *cmd) +static int ep93xx_set_link_ksettings(struct net_device *dev, +const struct ethtool_link_ksettings *cmd) { struct ep93xx_priv *ep = netdev_priv(dev); - return mii_ethtool_sset(&ep->mii, cmd); + return mii_ethtool_set_link_ksettings(&ep->mii, cmd); } static int ep93xx_nway_reset(struct net_device *dev) @@ -741,10 +743,10 @@ static u32 ep93xx_get_link(struct net_device *dev) static const struct ethtool_ops ep93xx_ethtool_ops = { .get_drvinfo= ep93xx_get_drvinfo, - .get_settings = ep93xx_get_settings, - .set_settings = ep93xx_set_settings, .nway_reset = ep93xx_nway_reset, .get_link = ep93xx_get_link, + .get_link_ksettings = ep93xx_get_link_ksettings, + .set_link_ksettings = ep93xx_set_link_ksettings, }; static const struct net_device_ops ep93xx_netdev_ops = { -- 1.7.4.4
Re: Soft lockup in inet_put_port on 4.6
On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa wrote: On 12.12.2016 19:05, Josef Bacik wrote: On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet wrote: On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: Hmm... Is your ephemeral port range includes the port your load balancing app is using ? I suspect that you might have processes doing bind( port = 0) that are trapped into the bind_conflict() scan ? With 100,000 + timewaits there, this possibly hurts. Can you try the following loop breaker ? It doesn't appear that the app is doing bind(port = 0) during normal operation. I tested this patch and it made no difference. I'm going to test simply restarting the app without changing to the SO_REUSEPORT option. Thanks, Would it be possible to trace the time the function uses with trace? If we don't see the number growing considerably over time we probably can rule out that we loop somewhere in there (I would instrument inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). __inet_hash_connect -> __inet_check_established also takes a lock (inet_ehash_lockp) which can be locked from inet_diag code path during socket diag info dumping. Unfortunately we couldn't reproduce it so far. :/ So I had a bcc script running to time how long we spent in inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of course I'm an idiot and didn't actually separate out the stats so I could tell _which_ one was taking forever. But anyway here's a normal distribution on the box Some shit : count distribution 0 -> 1 : 0| | 2 -> 3 : 0| | 4 -> 7 : 0| | 8 -> 15 : 0| | 16 -> 31 : 0| | 32 -> 63 : 0| | 64 -> 127: 0| | 128 -> 255: 0| | 256 -> 511: 0| | 512 -> 1023 : 0| | 1024 -> 2047 : 74 | | 2048 -> 4095 : 10537 || 4096 -> 8191 : 8497 | | 8192 -> 16383 : 3745 |** | 16384 -> 32767 : 300 |* | 32768 -> 65535 : 250 | | 65536 -> 131071 : 180 | | 131072 -> 262143 : 71 | | 262144 -> 524287 : 18 | | 524288 -> 1048575: 5| | With the times in nanoseconds, and here's the distribution during the problem Some shit : count distribution 0 -> 1 : 0| | 2 -> 3 : 0| | 4 -> 7 : 0| | 8 -> 15 : 0| | 16 -> 31 : 0| | 32 -> 63 : 0| | 64 -> 127: 0| | 128 -> 255: 0| | 256 -> 511: 0| | 512 -> 1023 : 0| | 1024 -> 2047 : 21 | | 2048 -> 4095 : 21820 || 4096 -> 8191 : 11598|* | 8192 -> 16383 : 4337 |*** | 16384 -> 32767 : 290 | | 32768 -> 65535 : 59 | | 65536 -> 131071 : 23 | | 131072 -> 262143 : 12 | | 262144 -> 524287 : 6| | 524288 -> 1048575: 19 | | 1048576 -> 2097151: 1079 |* | 2097152 -> 4194303: 0|
Re: [PATCH for-next 0/6] IB/hns: Bug Fixes for HNS RoCE Driver
On 11/29/2016 6:10 PM, Salil Mehta wrote: > This patch-set contains bug fixes for the HNS RoCE driver. > > Lijun Ou (1): > IB/hns: Fix the IB device name > > Shaobo Xu (2): > IB/hns: Fix the bug when free mr > IB/hns: Fix the bug when free cq > > Wei Hu (Xavier) (3): > IB/hns: Fix the bug when destroy qp > IB/hns: Fix the bug of setting port mtu > IB/hns: Delete the redundant memset operation > > drivers/infiniband/hw/hns/hns_roce_cmd.h|5 - > drivers/infiniband/hw/hns/hns_roce_common.h | 42 ++ > drivers/infiniband/hw/hns/hns_roce_cq.c | 27 +- > drivers/infiniband/hw/hns/hns_roce_device.h | 18 + > drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 967 > --- > drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 57 ++ > drivers/infiniband/hw/hns/hns_roce_main.c | 26 +- > drivers/infiniband/hw/hns/hns_roce_mr.c | 21 +- > 8 files changed, 1026 insertions(+), 137 deletions(-) > Series applied, thanks. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH V3 for-next 00/11] Code improvements & fixes for HNS RoCE driver
On 11/23/2016 2:40 PM, Salil Mehta wrote: > This patchset introduces some code improvements and fixes > for the identified problems in the HNS RoCE driver. > > Lijun Ou (4): > IB/hns: Add the interface for querying QP1 > IB/hns: add self loopback for CM > IB/hns: Modify the condition of notifying hardware loopback > IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp() > > Salil Mehta (1): > IB/hns: Fix for Checkpatch.pl comment style errors > > Shaobo Xu (1): > IB/hns: Implement the add_gid/del_gid and optimize the GIDs > management > > Wei Hu (Xavier) (5): > IB/hns: Add code for refreshing CQ CI using TPTR > IB/hns: Optimize the logic of allocating memory using APIs > IB/hns: Modify the macro for the timeout when cmd process > IB/hns: Modify query info named port_num when querying RC QP > IB/hns: Change qpn allocation to round-robin mode. > > drivers/infiniband/hw/hns/hns_roce_alloc.c | 11 +- > drivers/infiniband/hw/hns/hns_roce_cmd.c|8 +- > drivers/infiniband/hw/hns/hns_roce_cmd.h|7 +- > drivers/infiniband/hw/hns/hns_roce_common.h |2 - > drivers/infiniband/hw/hns/hns_roce_cq.c | 17 +- > drivers/infiniband/hw/hns/hns_roce_device.h | 45 ++-- > drivers/infiniband/hw/hns/hns_roce_eq.c |6 +- > drivers/infiniband/hw/hns/hns_roce_hem.c|6 +- > drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 267 +-- > drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 17 +- > drivers/infiniband/hw/hns/hns_roce_main.c | 311 > +++ > drivers/infiniband/hw/hns/hns_roce_mr.c | 22 +- > drivers/infiniband/hw/hns/hns_roce_pd.c |5 +- > drivers/infiniband/hw/hns/hns_roce_qp.c |2 +- > 14 files changed, 364 insertions(+), 362 deletions(-) > Series applied, thanks. -- Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: Soft lockup in inet_put_port on 4.6
On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa wrote: On 12.12.2016 19:05, Josef Bacik wrote: On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet wrote: On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: Hmm... Is your ephemeral port range includes the port your load balancing app is using ? I suspect that you might have processes doing bind( port = 0) that are trapped into the bind_conflict() scan ? With 100,000 + timewaits there, this possibly hurts. Can you try the following loop breaker ? It doesn't appear that the app is doing bind(port = 0) during normal operation. I tested this patch and it made no difference. I'm going to test simply restarting the app without changing to the SO_REUSEPORT option. Thanks, Would it be possible to trace the time the function uses with trace? If we don't see the number growing considerably over time we probably can rule out that we loop somewhere in there (I would instrument inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). __inet_hash_connect -> __inet_check_established also takes a lock (inet_ehash_lockp) which can be locked from inet_diag code path during socket diag info dumping. Unfortunately we couldn't reproduce it so far. :/ Working on getting the timing info, will probably be tomorrow due to meetings. I did test simply restarting the app without changing to the config that enabled the use of SO_REUSEPORT and the problem didn't occur, so it definitely has something to do with SO_REUSEPORT. Thanks, Josef
Re: Soft lockup in tc_classify
On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann wrote: > Note that there's still the RCU fix missing for the deletion race that > Cong will still send out, but you say that the only thing you do is to > add a single rule, but no other operation in involved during that test? What's missing to have the deletion race fixed? making a patch or testing to a patch which was sent?
Re: [RFC PATCH net-next v3 1/2] macb: Add 1588 support in Cadence GEM.
On Mon, Dec 12, 2016 at 10:22:43AM +, andrei.pistir...@microchip.com wrote: > Richard, are you agree with this? Yes, but please trim your replies next time. Scrolling through pages of quoted headers and stale content in order to read one line is very annoying. Thanks, Richard
Re: [PATCH V2 10/22] bnxt_re: Support for CQ verbs
On 12/09/2016 01:48 AM, Selvin Xavier wrote: > This patch implements support for create_cq, destroy_cq and req_notify_cq > verbs. > > Signed-off-by: Eddie Wai > Signed-off-by: Devesh Sharma > Signed-off-by: Somnath Kotur > Signed-off-by: Sriharsha Basavapatna > Signed-off-by: Selvin Xavier > --- > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c| 183 > > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h| 47 ++ > drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 154 > drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h | 19 +++ > drivers/infiniband/hw/bnxtre/bnxt_re_main.c | 4 + > include/uapi/rdma/bnxt_re_uverbs_abi.h | 11 ++ > 6 files changed, 418 insertions(+) Something I just realized is this patch series does not modify the MAINTAINERS file. Whom from Broadcom will be maintaining this driver? Probably want to include this info in the v3 series [...] > diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c > b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c > index 3417829..f316598 100644 > --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c > +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c > @@ -60,6 +60,16 @@ > #include "bnxt_re_ib_verbs.h" > #include > > +static int bnxt_re_copy_to_udata(struct bnxt_re_dev *rdev, void *data, int > len, > + struct ib_udata *udata) > +{ > + int rc; > + > + rc = ib_copy_to_udata(udata, data, len); > + > + return rc ? -EFAULT : 0; > +} This function seems to provide no value by wrapping ib_copy_to_udata, any reason to keep it? From the two call sites for this function it appears it can be replaced with a direct call to ib_copy_to_udata.
Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
12.12.2016, 20:18, "Leon Romanovsky" : > On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote: >> Dear Romanovsky; > > Please avoid top-posting in your replies. > Thanks Dear Leon; thanks for the information., I will pay attention. >> I'm trying to learn english and I apologize for my mistake words and >> phrases. So, I think the code when call to "sg_set_buf" and next time set >> memory and buffer. For example, isn't to call "WARN_ON" function, get a >> error to implicit declaration, right? >> >> Because, you will use to "BUG_ON" get a error implicit declaration of >> functions. > > I'm not sure that I followed you. mem->offset is set by sg_set_buf from > buf variable returned by dma_alloc_coherent(). HW needs to get very > precise size of this buf, in multiple of pages and aligned to pages > boundaries. I have studied the following your coding and I guess that's the right patchs. You are the very expert in this matter, thank you for the correct for me. I learn to your style as an example. Regards, Ozgur Karatas > See the patch inline which removes this BUG_ON in proper and safe way. > > From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001 > From: Leon Romanovsky > Date: Mon, 12 Dec 2016 20:02:45 +0200 > Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine > > This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent() > by checking DMA address aligment in advance and performing proper > folding in case of error. > > Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory") > Reported-by: Ozgur Karatas > Signed-off-by: Leon Romanovsky > --- > drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c > b/drivers/net/ethernet/mellanox/mlx4/icm.c > index 2a9dd46..e1f9e7c 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/icm.c > +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c > @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, > struct scatterlist *mem, > if (!buf) > return -ENOMEM; > > + if (offset_in_page(buf)) { > + dma_free_coherent(dev, PAGE_SIZE << order, > + buf, sg_dma_address(mem)); > + return -ENOMEM; > + } > + > sg_set_buf(mem, buf, PAGE_SIZE << order); > - BUG_ON(mem->offset); > sg_dma_len(mem) = PAGE_SIZE << order; > return 0; > } > -- > 2.10.2
Re: [PATCH v2] audit: use proper refcount locking on audit_sock
On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs wrote: > Resetting audit_sock appears to be racy. > > audit_sock was being copied and dereferenced without using a refcount on > the source sock. > > Bump the refcount on the underlying sock when we store a refrence in > audit_sock and release it when we reset audit_sock. audit_sock > modification needs the audit_cmd_mutex. > > See: https://lkml.org/lkml/2016/11/26/232 > > Thanks to Eric Dumazet and Cong Wang > on ideas how to fix it. > > Signed-off-by: Richard Guy Briggs > --- > There has been a lot of change in the audit code that is about to go > upstream to address audit queue issues. This patch is based on the > source tree: git://git.infradead.org/users/pcmoore/audit#next > --- > kernel/audit.c | 34 -- > 1 files changed, 28 insertions(+), 6 deletions(-) My previous question about testing still stands, but I took a closer look and have some additional comments, see below ... > diff --git a/kernel/audit.c b/kernel/audit.c > index f20eee0..439f7f3 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -452,7 +452,9 @@ static void auditd_reset(void) > struct sk_buff *skb; > > /* break the connection */ > + sock_put(audit_sock); > audit_pid = 0; > + audit_nlk_portid = 0; > audit_sock = NULL; > > /* flush all of the retry queue to the hold queue */ > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb) > if (rc >= 0) { > consume_skb(skb); > rc = 0; > + } else { > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) { I dislike the way you wrote this because instead of simply looking at this to see if it correct I need to sort out all the bits and find out if there are other error codes that could run afoul of this check ... make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...). Actually, since EPERM is 1, -EPERM (-1 in two's compliment is 0x) is going to cause this to be true for pretty much any value of rc, yes? > + mutex_lock(&audit_cmd_mutex); > + auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > + } The code in audit#next handles netlink_unicast() errors in kauditd_thread() and you are adding error handling code here in kauditd_send_unicast_skb() ... that's messy. I don't care too much where the auditd_reset() call is made, but let's only do it in one function; FWIW, I originally put the error handling code in kauditd_thread() because there was other error handling code that needed to done in that scope so it resulted in cleaner code. Related, I see you are now considering ENOMEM to be a fatal condition, that differs from the AUDITD_BAD macro in kauditd_thread(); this difference needs to be reconciled. Finally, you should update the comment header block for auditd_reset() that it needs to be called with the audit_cmd_mutex held. > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, > struct nlmsghdr *nlh) > return -EACCES; > } > if (audit_pid && new_pid && > - audit_replace(requesting_pid) != -ECONNREFUSED) { > + (audit_replace(requesting_pid) & > (-ECONNREFUSED|-EPERM|-ENOMEM))) { Do we simply want to treat any error here as fatal, and not just ECONN/EPERM/ENOMEM? If not, let's come up with a single macro to handle the fatal netlink_unicast() return codes so we have some chance to keep things consistent in the future. -- paul moore www.paul-moore.com
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
Hi Andrew, Andrew Lunn writes: > Humm, it looks like we are doing the atu_get wrong. We are looking for > a specific MAC address. Yet we seem to be walking the whole table to > find it, rather than getting the hardware to do the search. We are not doing it wrong, the hardware does the search. A classic dump of an ATU database consists of starting from the broadcast address ff:ff:ff:ff:ff:ff and issuing GetNext operation until we reach back the broadcast address. Only addresses in used are returned by GetNext, thus dumping an empty database is completed in a single operation. I implemented atu_get intentionally this way because it provides simpler code, rather than doing arithmetic on MAC addresses (Unless I am unaware of simple increment/decrement code.) > The current code is: > > static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid, > const u8 *addr, struct mv88e6xxx_atu_entry > *entry) > { > struct mv88e6xxx_atu_entry next; > int err; > > eth_broadcast_addr(next.mac); > > err = _mv88e6xxx_atu_mac_write(chip, next.mac); > > We should be setting next.mac to one less than the address we are > looking for. > > Volodymyr, please could you try that, and see how much of a speed up > you get. > > There is another optimization which can be made. We only say there is > no such entry once we have reached the end of the table. But it will > return the entries in ascending order. So if the entry it returned is > bigger than what we are looking for, we can immediately abort the > search and say it does not exist. However your two suggestions to optimize the lookup are correct. It'd be interesting to see if that makes a significant difference or not. Thanks, Vivien
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
On Mon, Dec 12, 2016 at 08:37:50AM -0800, Florian Fainelli wrote: > On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote: > > Hi, > > > > I apologise for incorrectly formatted patch, I will fix and resend it. > > The problem with the ATU right now is that it is too slow when inserting > > entries. > > When the OS boots up, it might insert some multicast entries into the > > atu (if > > they are preconfigured by user). I run a test with 10 mc entries being > > configured for > > each port (13 ports), and it took 15 seconds, which made system quite > > slow on responding to > > other commands, as it has been inserting mc entries. The implementation > > with hashtable > > made insert command for 13 ports and 10 entries per port about 700 msec > > long. Humm, it looks like we are doing the atu_get wrong. We are looking for a specific MAC address. Yet we seem to be walking the whole table to find it, rather than getting the hardware to do the search. The current code is: static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid, const u8 *addr, struct mv88e6xxx_atu_entry *entry) { struct mv88e6xxx_atu_entry next; int err; eth_broadcast_addr(next.mac); err = _mv88e6xxx_atu_mac_write(chip, next.mac); We should be setting next.mac to one less than the address we are looking for. Volodymyr, please could you try that, and see how much of a speed up you get. There is another optimization which can be made. We only say there is no such entry once we have reached the end of the table. But it will return the entries in ascending order. So if the entry it returned is bigger than what we are looking for, we can immediately abort the search and say it does not exist. Andrew
Re: Soft lockup in tc_classify
On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein wrote: > > > On 12/12/2016 3:28 PM, Daniel Borkmann wrote: >> >> Hi Shahar, >> >> On 12/12/2016 10:43 AM, Shahar Klein wrote: >>> >>> Hi All, >>> >>> sorry for the spam, the first time was sent with html part and was >>> rejected. >>> >>> We observed an issue where a classifier instance next member is >>> pointing back to itself, causing a CPU soft lockup. >>> We found it by running traffic on many udp connections and then adding >>> a new flower rule using tc. >>> >>> We added a quick workaround to verify it: >>> >>> In tc_classify: >>> >>> for (; tp; tp = rcu_dereference_bh(tp->next)) { >>> int err; >>> + if (tp == tp->next) >>> + RCU_INIT_POINTER(tp->next, NULL); >>> >>> >>> We also had a print here showing tp->next is pointing to tp. With this >>> workaround we are not hitting the issue anymore. >>> We are not sure we fully understand the mechanism here - with the rtnl >>> and rcu locks. >>> We'll appreciate your help solving this issue. >> >> >> Note that there's still the RCU fix missing for the deletion race that >> Cong will still send out, but you say that the only thing you do is to >> add a single rule, but no other operation in involved during that test? Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no? If so, that could be the cause since we play with the next pointer and there is only one filter in this case, but I don't see why we could have a loop here. >> >> Do you have a script and kernel .config for reproducing this? > > > I'm using a user space socket app(https://github.com/shahar-klein/noodle)on > a vm to push udp packets from ~2000 different udp src ports ramping up at > ~100 per second towards another vm on the same Hypervisor. Once the traffic > starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred, > odd->drop) on the relevant representor in the Hypervisor. Do you mind to share your `tc filter show dev...` output? Also, since you mentioned you only add one flower filter, just want to make sure you never delete any filter before/when the bug happens? How reproducible is this? Thanks!
Re: Soft lockup in inet_put_port on 4.6
On 12.12.2016 19:05, Josef Bacik wrote: > On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet > wrote: >> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: >> >>> >>> Hmm... Is your ephemeral port range includes the port your load >>> balancing app is using ? >> >> I suspect that you might have processes doing bind( port = 0) that are >> trapped into the bind_conflict() scan ? >> >> With 100,000 + timewaits there, this possibly hurts. >> >> Can you try the following loop breaker ? > > It doesn't appear that the app is doing bind(port = 0) during normal > operation. I tested this patch and it made no difference. I'm going to > test simply restarting the app without changing to the SO_REUSEPORT > option. Thanks, Would it be possible to trace the time the function uses with trace? If we don't see the number growing considerably over time we probably can rule out that we loop somewhere in there (I would instrument inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port). __inet_hash_connect -> __inet_check_established also takes a lock (inet_ehash_lockp) which can be locked from inet_diag code path during socket diag info dumping. Unfortunately we couldn't reproduce it so far. :/ Thanks, Hannes
Re: [PATCH V2 13/22] bnxt_re: Support QP verbs
On Thu, Dec 08, 2016 at 10:48:07PM -0800, Selvin Xavier wrote: > This patch implements create_qp, destroy_qp, query_qp and modify_qp verbs. > > v2: Fixed sparse warnings > > Signed-off-by: Eddie Wai > Signed-off-by: Devesh Sharma > Signed-off-by: Somnath Kotur > Signed-off-by: Sriharsha Basavapatna > Signed-off-by: Selvin Xavier > --- > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c| 873 > > drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h| 250 +++ > drivers/infiniband/hw/bnxtre/bnxt_re.h | 14 + > drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 762 + > drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h | 21 + > drivers/infiniband/hw/bnxtre/bnxt_re_main.c | 6 + > include/uapi/rdma/bnxt_re_uverbs_abi.h | 10 + > 7 files changed, 1936 insertions(+) > > diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c > b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c > index 636306f..edc9411 100644 > --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c > +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c > @@ -50,6 +50,69 @@ > #include "bnxt_qplib_fp.h" > > static void bnxt_qplib_arm_cq_enable(struct bnxt_qplib_cq *cq); > + > +static void bnxt_qplib_free_qp_hdr_buf(struct bnxt_qplib_res *res, > +struct bnxt_qplib_qp *qp) > +{ > + struct bnxt_qplib_q *rq = &qp->rq; > + struct bnxt_qplib_q *sq = &qp->sq; > + > + if (qp->rq_hdr_buf) > + dma_free_coherent(&res->pdev->dev, > + rq->hwq.max_elements * qp->rq_hdr_buf_size, > + qp->rq_hdr_buf, qp->rq_hdr_buf_map); > + if (qp->sq_hdr_buf) > + dma_free_coherent(&res->pdev->dev, > + sq->hwq.max_elements * qp->sq_hdr_buf_size, > + qp->sq_hdr_buf, qp->sq_hdr_buf_map); > + qp->rq_hdr_buf = NULL; > + qp->sq_hdr_buf = NULL; > + qp->rq_hdr_buf_map = 0; > + qp->sq_hdr_buf_map = 0; > + qp->sq_hdr_buf_size = 0; > + qp->rq_hdr_buf_size = 0; > +} > + > +static int bnxt_qplib_alloc_qp_hdr_buf(struct bnxt_qplib_res *res, > +struct bnxt_qplib_qp *qp) > +{ > + struct bnxt_qplib_q *rq = &qp->rq; > + struct bnxt_qplib_q *sq = &qp->rq; > + int rc = 0; > + > + if (qp->sq_hdr_buf_size && sq->hwq.max_elements) { > + qp->sq_hdr_buf = dma_alloc_coherent(&res->pdev->dev, > + sq->hwq.max_elements * > + qp->sq_hdr_buf_size, > + &qp->sq_hdr_buf_map, GFP_KERNEL); > + if (!qp->sq_hdr_buf) { > + rc = -ENOMEM; > + dev_err(&res->pdev->dev, > + "QPLIB: Failed to create sq_hdr_buf"); > + goto fail; > + } > + } > + > + if (qp->rq_hdr_buf_size && rq->hwq.max_elements) { > + qp->rq_hdr_buf = dma_alloc_coherent(&res->pdev->dev, > + rq->hwq.max_elements * > + qp->rq_hdr_buf_size, > + &qp->rq_hdr_buf_map, > + GFP_KERNEL); > + if (!qp->rq_hdr_buf) { > + rc = -ENOMEM; > + dev_err(&res->pdev->dev, > + "QPLIB: Failed to create rq_hdr_buf"); > + goto fail; > + } > + } > + return 0; > + > +fail: > + bnxt_qplib_free_qp_hdr_buf(res, qp); > + return rc; > +} > + > static void bnxt_qplib_service_nq(unsigned long data) > { > struct bnxt_qplib_nq *nq = (struct bnxt_qplib_nq *)data; > @@ -215,6 +278,816 @@ int bnxt_qplib_alloc_nq(struct pci_dev *pdev, struct > bnxt_qplib_nq *nq) > return 0; > } > > +/* QP */ > +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp > *qp) > +{ > + struct bnxt_qplib_rcfw *rcfw = res->rcfw; > + struct cmdq_create_qp1 req; > + struct creq_create_qp1_resp *resp; > + struct bnxt_qplib_pbl *pbl; > + struct bnxt_qplib_q *sq = &qp->sq; > + struct bnxt_qplib_q *rq = &qp->rq; > + int rc; > + u16 cmd_flags = 0; > + u32 qp_flags = 0; > + > + RCFW_CMD_PREP(req, CREATE_QP1, cmd_flags); > + > + /* General */ > + req.type = qp->type; > + req.dpi = cpu_to_le32(qp->dpi->dpi); > + req.qp_handle = cpu_to_le64(qp->qp_handle); > + > + /* SQ */ > + sq->hwq.max_elements = sq->max_wqe; > + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, NULL, 0, > +&sq->hwq.max_elements, > +BNXT_QPLIB_MAX_SQE_ENTRY_SIZE, 0, > +PAGE_SIZE, HWQ_TYPE_QUEUE); > + if (rc) > + goto exit; > + > + sq
Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote: > Dear Romanovsky; Please avoid top-posting in your replies. Thanks > > I'm trying to learn english and I apologize for my mistake words and phrases. > So, I think the code when call to "sg_set_buf" and next time set memory and > buffer. For example, isn't to call "WARN_ON" function, get a error to > implicit declaration, right? > > Because, you will use to "BUG_ON" get a error implicit declaration of > functions. I'm not sure that I followed you. mem->offset is set by sg_set_buf from buf variable returned by dma_alloc_coherent(). HW needs to get very precise size of this buf, in multiple of pages and aligned to pages boundaries. > > sg_set_buf(mem, buf, PAGE_SIZE << order); > WARN_ON(mem->offset); See the patch inline which removes this BUG_ON in proper and safe way. From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001 From: Leon Romanovsky Date: Mon, 12 Dec 2016 20:02:45 +0200 Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent() by checking DMA address aligment in advance and performing proper folding in case of error. Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory") Reported-by: Ozgur Karatas Signed-off-by: Leon Romanovsky --- drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index 2a9dd46..e1f9e7c 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem, if (!buf) return -ENOMEM; + if (offset_in_page(buf)) { + dma_free_coherent(dev, PAGE_SIZE << order, + buf, sg_dma_address(mem)); + return -ENOMEM; + } + sg_set_buf(mem, buf, PAGE_SIZE << order); - BUG_ON(mem->offset); sg_dma_len(mem) = PAGE_SIZE << order; return 0; } -- 2.10.2 signature.asc Description: PGP signature
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote: > Hmmm. If you can rely on hardware setup to give you steering and > dedicated access to the RX rings. In those cases, I guess, the "push" > model could be a more direct API approach. If the hardware does not support steering then one should be able to provide those services in software. > I was shooting for a model that worked without hardware support. And > then transparently benefit from HW support by configuring a HW filter > into a specific RX queue and attaching/using to that queue. The discussion here is a bit amusing since these issues have been resolved a long time ago with the design of the RDMA subsystem. Zero copy is already in wide use. Memory registration is used to pin down memory areas. Work requests can be filed with the RDMA subsystem that then send and receive packets from the registered memory regions. This is not strictly remote memory access but this is a basic mode of operations supported by the RDMA subsystem. The mlx5 driver quoted here supports all of that. What is bad about RDMA is that it is a separate kernel subsystem. What I would like to see is a deeper integration with the network stack so that memory regions can be registred with a network socket and work requests then can be submitted and processed that directly read and write in these regions. The network stack should provide the services that the hardware of the NIC does not suppport as usual. The RX/TX ring in user space should be an additional mode of operation of the socket layer. Once that is in place the "Remote memory acces" can be trivially implemented on top of that and the ugly RDMA sidecar subsystem can go away.
Re: Soft lockup in inet_put_port on 4.6
On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet wrote: On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote: Hmm... Is your ephemeral port range includes the port your load balancing app is using ? I suspect that you might have processes doing bind( port = 0) that are trapped into the bind_conflict() scan ? With 100,000 + timewaits there, this possibly hurts. Can you try the following loop breaker ? It doesn't appear that the app is doing bind(port = 0) during normal operation. I tested this patch and it made no difference. I'm going to test simply restarting the app without changing to the SO_REUSEPORT option. Thanks, Josef
Re: Soft lockup in tc_classify
On 12/12/2016 3:28 PM, Daniel Borkmann wrote: Hi Shahar, On 12/12/2016 10:43 AM, Shahar Klein wrote: Hi All, sorry for the spam, the first time was sent with html part and was rejected. We observed an issue where a classifier instance next member is pointing back to itself, causing a CPU soft lockup. We found it by running traffic on many udp connections and then adding a new flower rule using tc. We added a quick workaround to verify it: In tc_classify: for (; tp; tp = rcu_dereference_bh(tp->next)) { int err; + if (tp == tp->next) + RCU_INIT_POINTER(tp->next, NULL); We also had a print here showing tp->next is pointing to tp. With this workaround we are not hitting the issue anymore. We are not sure we fully understand the mechanism here - with the rtnl and rcu locks. We'll appreciate your help solving this issue. Note that there's still the RCU fix missing for the deletion race that Cong will still send out, but you say that the only thing you do is to add a single rule, but no other operation in involved during that test? Do you have a script and kernel .config for reproducing this? I'm using a user space socket app(https://github.com/shahar-klein/noodle)on a vm to push udp packets from ~2000 different udp src ports ramping up at ~100 per second towards another vm on the same Hypervisor. Once the traffic starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred, odd->drop) on the relevant representor in the Hypervisor. Thanks, Daniel
Re: [PATCH] sh_eth: add wake-on-lan support via magic packet
On 12/12/2016 06:49 PM, Niklas Söderlund wrote: Thanks for your feedback. Not at all, it's my duty now. :-) I should probably have warned you not to post the new version to netdev -- DaveM has closed his net-next.git tree (ahead of the usual time, which would have been 4.9 release), so you posting would only upset him... [...] You only enable the WOL support fo the R-Car gen2 chips but never say that explicitly, neither in the subject nor here. Signed-off-by: Niklas Söderlund --- drivers/net/ethernet/renesas/sh_eth.c | 120 +++--- drivers/net/ethernet/renesas/sh_eth.h | 4 ++ 2 files changed, 116 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 05b0dc5..3974046 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c [...] + /* Handle MagicPacket interrupt */ + if (sh_eth_read(ndev, ECSR) & ECSR_MPD) What if it wasn't enabled ATM? Sorry I don't understand this comment. I'm trying to handle only the enabled interrupts but this hasn't been consistently done yet (only for EESR, not ECSR), so nevermind. :-) [...] @@ -3150,15 +3193,71 @@ static int sh_eth_drv_remove(struct platform_device *pdev) [...] This is how it's done in other parts of the driver when disabling interrupts. Not in all parts of the driver that disable EESIPR interrupts... I must confess that I never liked that 'mdp->irq_enabled' flag and still suspect we can get things done without it... I need to look at this code again, sigh... Well, we can't most probably but I have a patch almost ready that turns the boolean flag into u32 field holding the EESIPR value to be written next. Would that help you? This is also why I only check for MagicPacket interrupts if irq_enabled is false. I would have preferred that this was done with the other EMAC interrupts, in sh_eth_error(). I removed the check for Magic Packet in sh_eth_interrupt() and running without setting mdp->irq_enabled = false. sh_eth_error() will then clear any ECI interrupt so no need to add Magic Packet detection to it since all that is needed on Magic Packet is to clear the interrupt which already is done. This works and I can do multiple suspend/resume cycles, will be in v2 thanks for the suggestion. OK, let's see what you have when I have some more time. We have a lot of time for ironing things out till net-next is opened again -- which will happen after -rc1)... [...] + + /* Enable MagicPacket */ + sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE); + + /* Increased clock usage so device won't be suspended */ + clk_enable(mdp->clk); Hum, intermixiggn runtime PM with clock API doesn't look good... I agree it looks weird but I need a way to increment the usage count for the clock otherwise the PM code will disable the module clock and WoL will not work. How will it do it if you don't call sh_eth_close() in this case? Note that this call will not enable the clock just increase the usage count so it won't be disabled when the PM code decrease it after the sh_eth suspend function is run. You mean that the PM code calls RPM or clk API on its own? That's strange... Yes it calls clk API. Hum, will have to look into it as well... [...] MBR, Sergei
Re: [PATCH v2] audit: use proper refcount locking on audit_sock
On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs wrote: > Resetting audit_sock appears to be racy. > > audit_sock was being copied and dereferenced without using a refcount on > the source sock. > > Bump the refcount on the underlying sock when we store a refrence in > audit_sock and release it when we reset audit_sock. audit_sock > modification needs the audit_cmd_mutex. > > See: https://lkml.org/lkml/2016/11/26/232 > > Thanks to Eric Dumazet and Cong Wang > on ideas how to fix it. > > Signed-off-by: Richard Guy Briggs > --- > There has been a lot of change in the audit code that is about to go > upstream to address audit queue issues. This patch is based on the > source tree: git://git.infradead.org/users/pcmoore/audit#next > --- > kernel/audit.c | 34 -- > 1 files changed, 28 insertions(+), 6 deletions(-) This is coming in pretty late for the v4.10 merge window, much later than I would usually take things, but this is arguably important, and (at first glance) relatively low risk - what testing have you done on this? > diff --git a/kernel/audit.c b/kernel/audit.c > index f20eee0..439f7f3 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -452,7 +452,9 @@ static void auditd_reset(void) > struct sk_buff *skb; > > /* break the connection */ > + sock_put(audit_sock); > audit_pid = 0; > + audit_nlk_portid = 0; > audit_sock = NULL; > > /* flush all of the retry queue to the hold queue */ > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb) > if (rc >= 0) { > consume_skb(skb); > rc = 0; > + } else { > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) { > + mutex_lock(&audit_cmd_mutex); > + auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > + } > } > > return rc; > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy) > > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } > } else > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy) > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > kauditd_hold_skb(skb); > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } else > /* temporary problem (we hope), queue > @@ -623,7 +635,9 @@ quick_loop: > if (rc) { > auditd = 0; > if (AUDITD_BAD(rc, reschedule)) { > + mutex_lock(&audit_cmd_mutex); > auditd_reset(); > + > mutex_unlock(&audit_cmd_mutex); > reschedule = 0; > } > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, > struct nlmsghdr *nlh) > return -EACCES; > } > if (audit_pid && new_pid && > - audit_replace(requesting_pid) != -ECONNREFUSED) { > + (audit_replace(requesting_pid) & > (-ECONNREFUSED|-EPERM|-ENOMEM))) { > audit_log_config_change("audit_pid", new_pid, > audit_pid, 0); > return -EEXIST; > } > if (audit_enabled != AUDIT_OFF) > audit_log_config_change("audit_pid", new_pid, > audit_pid, 1); > - audit_pid = new_pid; > - audit_nlk_portid = NETLINK_CB(skb).portid; > - audit_sock = skb->sk; > - if (!new_pid) > + if (new_pid) { > + if (audit_sock) > + sock_put(audit_sock); > + audit_pid = new_pid; > + audit_nlk_portid = NETLINK_CB(skb).portid; > + sock_hold(skb->sk); > + audit_sock = skb->sk;
Re: Designing a safe RX-zero-copy Memory Model for Networking
On Mon, 12 Dec 2016 06:49:03 -0800 John Fastabend wrote: > On 16-12-12 06:14 AM, Mike Rapoport wrote: > > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote: > >> > >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport > >> wrote: > >> > >>> Hello Jesper, > >>> > >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote: > Hi all, > > This is my design for how to safely handle RX zero-copy in the network > stack, by using page_pool[1] and modifying NIC drivers. Safely means > not leaking kernel info in pages mapped to userspace and resilience > so a malicious userspace app cannot crash the kernel. > > Design target > = > > Allow the NIC to function as a normal Linux NIC and be shared in a > safe manor, between the kernel network stack and an accelerated > userspace application using RX zero-copy delivery. > > Target is to provide the basis for building RX zero-copy solutions in > a memory safe manor. An efficient communication channel for userspace > delivery is out of scope for this document, but OOM considerations are > discussed below (`Userspace delivery and OOM`_). > >>> > >>> Sorry, if this reply is a bit off-topic. > >> > >> It is very much on topic IMHO :-) > >> > >>> I'm working on implementation of RX zero-copy for virtio and I've > >>> dedicated > >>> some thought about making guest memory available for physical NIC DMAs. > >>> I believe this is quite related to your page_pool proposal, at least from > >>> the NIC driver perspective, so I'd like to share some thoughts here. > >> > >> Seems quite related. I'm very interested in cooperating with you! I'm > >> not very familiar with virtio, and how packets/pages gets channeled > >> into virtio. > > > > They are copied :-) > > Presuming we are dealing only with vhost backend, the received skb > > eventually gets converted to IOVs, which in turn are copied to the guest > > memory. The IOVs point to the guest memory that is allocated by virtio-net > > running in the guest. > > > > Great I'm also doing something similar. > > My plan was to embed the zero copy as an AF_PACKET mode and then push > a AF_PACKET backend into vhost. I'll post a patch later this week. > > >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g. > >>> using macvtap, and then propagate guest RX memory allocations to the NIC > >>> using something like new .ndo_set_rx_buffers method. > >> > >> I believe the page_pool API/design aligns with this idea/use-case. > >> > >>> What is your view about interface between the page_pool and the NIC > >>> drivers? > >> > >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register > >> a page_pool per RX queue. This is done for two reasons (1) performance > >> and (2) for supporting use-cases where only one single RX-ring queue is > >> (re)configured to support RX-zero-copy. There are some associated > >> extra cost of enabling this mode, thus it makes sense to only enable it > >> when needed. > >> > >> I've not decided how this gets enabled, maybe some new driver NDO. It > >> could also happen when a XDP program gets loaded, which request this > >> feature. > >> > >> The macvtap solution is nice and we should support it, but it requires > >> VM to have their MAC-addr registered on the physical switch. This > >> design is about adding flexibility. Registering an XDP eBPF filter > >> provides the maximum flexibility for matching the destination VM. > > > > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate > > what needs to be done in BPF program to do proper conversion of skb to the > > virtio descriptors. > > I don't think XDP has much to do with this code and they should be done > separately. XDP runs eBPF code on received packets after the DMA engine > has already placed the packet in memory so its too late in the process. It does not have to be connected to XDP. My idea should support RX zero-copy into normal sockets, without XDP. My idea was to pre-VMA map the RX ring, when zero-copy is requested, thus it is not too late in the process. When frame travel the normal network stack, then require the SKB-read-only-page mode (skb-frags). If the SKB reach a socket that support zero-copy, then we can do RX zero-copy on normal sockets. > The other piece here is enabling XDP in vhost but that is again separate > IMO. > > Notice that ixgbe supports pushing packets into a macvlan via 'tc' > traffic steering commands so even though macvlan gets an L2 address it > doesn't mean it can't use other criteria to steer traffic to it. This sounds interesting. As this allow much more flexibility macvlan matching, which I like, but still depending on HW support. > > We were not considered using XDP yet, so we've decided to limit the initial > > implementation to macvtap because we can ensure correspondence betw
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
Hi all, Florian Fainelli writes: > Seeing such a change makes me wonder if we should not try to push some > of this hashtable abstraction (provided that we agree we want it) at a > higher layer, like net/dsa/slave.c? That is the major reason why I am reluctant to cache stuffs in drivers. In most cases, we want the DSA drivers to be "stupid", as much stateless as possible, simply implementing the supported DSA switch operations. The DSA core then handles the generic logic of how switch fabrics should behave, and thus all DSA drivers are consistent and benefit from this. Thanks, Vivien
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On Sat, Dec 10, 2016 at 11:06:58AM +0530, Selvin Xavier wrote: > On Fri, Dec 9, 2016 at 12:17 PM, Selvin Xavier > wrote: > > I am preparing a git repository with these changes as per Jason's > > comment and will share the details later today. > > Please use bnxt_re branch in this git repository. > > https://github.com/Broadcom/linux-rdma-nxt.git Why are you using __packed in bnxt_re_uverbs_abi.h ? that doesn't seem necessary. It is a good idea to make sure all those structures are a multiple of 64 bits (add explicit reserved fields), and make sure you test 32 bit verbs as well. Why are you using debugfs just to export counters? Isn't the core code counter framework good enough? Please try and avoid writing functions as defines (eg rdev_to_dev, to_bnxt_re, SQE_PG, RCFW_CMDQ_COOKIE, PTR_PG etc) There is something wrong with the tabs and spaces (see https://github.com/Broadcom/linux-rdma-nxt/blob/03e23b087f7e86ea28656273994e065827210ce5/drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h) FWIW, I really dislike the column alignment style, it is so hard to maintain.. Jason
Re: [PATCH net-next 0/2] Add ethtool set regs support
On 12/11/2016 07:22 AM, Andrew Lunn wrote: > On Sun, Dec 11, 2016 at 02:18:00PM +0200, Saeed Mahameed wrote: >> On Wed, Dec 7, 2016 at 4:41 AM, Andrew Lunn wrote: >>> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote: Hi Dave, This series adds the support for setting device registers from user space ethtool. >>> >>> Is this not the start of allowing binary only drivers in user space? >>> >> >> It is not, we want to do same as set_eeprom already do, >> Just set some HW registers, for analysis/debug/tweak/configure HW >> dependent register for the NIC netdev sake. > > Mellanox has a good reputation of open drivers. However, this API > sounds like it would be the first step towards user space > drivers. This is an API which can peek and poke registers, so it > probably could be mis-used to put part of a driver in user > space. Hence we should avoid this sort of API to start with. I don't necessarily share your concerns here on the proprietary vs. open source driver, because this interface is limited to the register space, not the data path, there is only a handful of things you can do here, but getting a NIC to work, is not probably one of them. My concern is more with the support/debugging aspect, if someone starts tweaking a gazillion of registers through that interface, and I have no way to tell, how am I going to support that? Of course, the first step is: have you tried to turn it on and off again, and see if this is reproducible, but what if I was asked/told to tweak this or that value first, etc... it can be hard to collect the exact state in which a NIC is at the time of the problem. NB: on the proprietary driver side, you can already mmap() the PCI device's space and write an entire user-space based driver (DPDK) and bypass the kernel for most things, ethtool -D is not much worse here since it just offers a subset (and a small one) of that. -- Florian
Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
On 12/09/2016 01:47 AM, Selvin Xavier wrote: > This series introduces the RoCE driver for the Broadcom > NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. > This driver is dependent on the bnxt_en NIC driver and is > based on the bnxt_re branch in Doug's repository. bnxt_en changes > required for this patch series is already available in this branch. > > I am preparing a git repository with these changes as per Jason's > comment and will share the details later today. > > v1-> v2: > * The license text in each file updated to reflect Dual license. > * Makefile and Kconfig changes are pushed to the last patch > * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder > * Remove duplicate structure definitions from bnxt_re_hsi.h as > it is available in the corresponding bnxt_en header file (bnxt_hsi.h) > * Removed some unused code reported during code review. > * Fixed few sparse warnings > I get the following sparse errors (filtered for only bnxt_re ones), please let me know if they are false positives: $ make C=2 drivers/net/ethernet/broadcom/bnxt/bnxt_en.ko drivers/infiniband/hw/bnxtre/bnxt_re.ko CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHECK arch/x86/purgatory/purgatory.c [...] CHECK arch/x86/purgatory/sha256.c CHECK arch/x86/purgatory/string.c [...] CHK include/generated/bounds.h CHK include/generated/timeconst.h CHK include/generated/asm-offsets.h CALLscripts/checksyscalls.sh CHECK scripts/mod/empty.c CHECK drivers/net/ethernet/broadcom/bnxt/bnxt.c CHECK drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c CHECK drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c CHECK drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c CHECK drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c MODPOST 2 modules CHECK drivers/infiniband/hw/bnxtre/bnxt_re_main.c CHECK drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c [...] CHECK drivers/infiniband/hw/bnxtre/bnxt_re_debugfs.c CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c:729:6: warning: symbol 'bnxt_qplib_cleanup_pkey_tbl' was not declared. Should it be static? CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c CHECK drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock MODPOST 2 modules -Jon
Re: fib_frontend: Add network specific broadcasts, when it takes a sense
On 12.12.2016 15:44, Brandon Philips wrote: > On Mon, Dec 12, 2016 at 9:30 AM, Jiri Benc wrote: >> On Fri, 9 Dec 2016 15:41:52 -0800, Brandon Philips wrote: >>> The issue we have: when creating the VXLAN interface and assigning it >>> an address we see a broadcast route being added by the Kernel. For >>> example if we have 10.4.0.0/16 a broadcast route to 10.4.0.0 is >>> created. This route is unwanted because we assign 10.4.0.0 to one of >>> our VXLAN interfaces. >> >> Are you saying you're trying to assign the IP address 10.4.0.0/16 as a >> unicast address to an interface? Then you'll run into way more problems >> than the one you're describing. You can't have host part of the IP >> address consisting of all zeros (or all ones). Just don't do it. Choose >> a valid IP address instead. > > Yes, this is what we are doing; it is because of an upstream, to us, > address assignment so I will figure it out upstream. > > Regardless, it is hard to find an RFC that says "simply don't do this > because _". The closest I could find was RFC 922 after sending > this which says: > > "There is probably no reason for such addresses to appear anywhere but > as the source address of an ICMP Information". Alternatively you can renumber the network to use /32 and add the unicast routes for your /16 yourself. Bye, Hannes
RE: Re: Synopsys Ethernet QoS
Niklas Cassel wrote at Monday, December 12, 2016 9:25 AM: ... > However, I've noticed that NVIDIA has extended the DWC EQoS DT binding, > I don't how easy it would be for them to switch to stmmac's DT binding. > (Adding Stephen Warren to CC.) I don't believe there's any issue switching drivers, so long as the new one works on our HW. However, we can't switch DT binding since that's an ABI. So, if we switch drivers, the "new" driver needs to support all existing DT bindings. FWIW, I'd recommend that we don't name the "new" driver stmmac or anything like that. Rather, name it after the IP block so that any new user of the same IP block will find the name they expect in the source tree, and just use it. Renaming the "new" driver to dwc_eqos or similar might be one way to achieve that. -- nvpublic
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote: > Hi, > > I apologise for incorrectly formatted patch, I will fix and resend it. > The problem with the ATU right now is that it is too slow when inserting > entries. > When the OS boots up, it might insert some multicast entries into the > atu (if > they are preconfigured by user). I run a test with 10 mc entries being > configured for > each port (13 ports), and it took 15 seconds, which made system quite > slow on responding to > other commands, as it has been inserting mc entries. The implementation > with hashtable > made insert command for 13 ports and 10 entries per port about 700 msec > long. Just wondering how do you achieve such speed up? What part of using a hashtable allows you not to write down all 10 MC entries across the ports? I would assume that the number of MDIO (is that the bus you are using here?) operations would be identical. Seeing such a change makes me wonder if we should not try to push some of this hashtable abstraction (provided that we agree we want it) at a higher layer, like net/dsa/slave.c? Thanks! -- Florian
Re: [PATCH] net:phy fix driver reference count error when attach and detach phy device
On 12/12/2016 12:49 AM, maowenan wrote: > > > On 2016/12/5 16:47, maowenan wrote: >> >> >> On 2016/12/2 17:45, David Laight wrote: >>> From: Mao Wenan Sent: 30 November 2016 10:23 The nic in my board use the phy dev from marvell, and the system will load the marvell phy driver automatically, but when I remove the phy drivers, the system immediately panic: Call trace: [ 2582.834493] [] phy_state_machine+0x3c/0x438 [ 2582.851754] [] process_one_work+0x150/0x428 [ 2582.868188] [] worker_thread+0x144/0x4b0 [ 2582.883882] [] kthread+0xfc/0x110 there should be proper reference counting in place to avoid that. I found that phy_attach_direct() forgets to add phy device driver reference count, and phy_detach() forgets to subtract reference count. This patch is to fix this bug, after that panic is disappeared when remove marvell.ko Signed-off-by: Mao Wenan --- drivers/net/phy/phy_device.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 1a4bf8a..a7ec7c2 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -866,6 +866,11 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, return -EIO; } + if (!try_module_get(d->driver->owner)) { + dev_err(&dev->dev, "failed to get the device driver module\n"); + return -EIO; + } >>> >>> If this is the phy code, what stops the phy driver being unloaded >>> before the try_module_get() obtains a reference. >>> If it isn't the phy driver then there ought to be a reference count obtained >>> when the phy driver is located (by whatever decides which phy driver to >>> use). >>> Even if that code later releases its reference (it probably shouldn't on >>> success) >>> then you can't fail to get an extra reference here. >> >> [Mao Wenan]Yes, this is phy code, in function phy_attach_direct(), >> drivers/net/phy/phy_device.c. >> when one NIC driver to do probe behavior, it will attach one matched phy >> driver. phy_attach_direct() >> is to obtain phy driver reference and bind phy driver, if try_module_get() >> execute on success, the reference >> count is added; if failed, the driver can't be attached to this NIC, and it >> can't added the phy driver >> reference count. So before try_module_get obtains a reference, phy driver >> can't can't be bound to this NIC. >> when the phy driver is attached to NIC, the reference count is added, if >> someone remove phy driver directly, >> it will be failed because reference count is not equal to 0. >> >> An example of call trace when there is NIC driver to attch one phy driver: >> hns_nic_dev_probe->hns_nic_try_get_ae->hns_nic_init_phy->of_phy_connect->phy_connect_direct->phy_attach_direct >> >> Consider the steps of phy driver(marvell.ko) added and removed, and NIC >> driver(hns_enet_drv.ko) added and removed: >> 1)insmod marvell ref=0 >> 2)insmod hns_enet_drv ref=1 >> 3)rmmod marvell(should not on success, ref=1) >> 4)rmmod hns_enet_drv ref=0 >> 5)rmmod marvell(should on success, because ref=0) >> >> if we don't add the reference count in phy_attach_direct(the second step >> ref=0), so the third step rmmod marvell will >> be panic, because there is one user remain use marvell driver and >> phy_stat_machine use the NULL drv pointer. >> >>> + get_device(d); /* Assume that if there is no driver, that it doesn't @@ -921,6 +926,7 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev, error: put_device(d); + module_put(d->driver->owner); >>> >>> Are those two in the wrong order ? >>> module_put(bus->owner); return err; } @@ -998,6 +1004,7 @@ void phy_detach(struct phy_device *phydev) bus = phydev->mdio.bus; put_device(&phydev->mdio.dev); + module_put(phydev->mdio.dev.driver->owner); module_put(bus->owner); >>> >>> Where is this code called from? >>> You can't call it from the phy driver because the driver can be unloaded >>> as soon as the last reference is removed. >>> At that point the code memory is freed. >> >> [Mao Wenan] it is called by NIC when it is removed, which aims to disconnect >> one bound phy driver. If this phy driver >> is not used for this NIC, reference count should be subtracted, and phy >> driver can be removed if there is no user. >> hns_nic_dev_remove->phy_disconnect->phy_detach >> >> >> >>> } EXPORT_SYMBOL(phy_detach); -- 2.7.0 >>> >>> >>> . >>> > > @Florian Fainelli, what's your comments about this patch? I am trying to reproduce what you are seeing, but at first glance is looks like an appropriate solution to me. Do you mind giving me a couple more days? Thanks! -- Florian
Re: Misalignment, MIPS, and ip_hdr(skb)->version
David Laight writes: > From: Måns Rullgård >> Sent: 10 December 2016 13:25 > ... >> I solved this problem in an Ethernet driver by copying the initial part >> of the packet to an aligned skb and appending the remainder using >> skb_add_rx_frag(). The kernel network stack only cares about the >> headers, so the alignment of the packet payload doesn't matter. > > That rather depends on where the packet payload ends up. > It is likely that it will be copied to userspace (or maybe > into some aligned kernel buffer). > In which case you get an expensive misaligned copy. There's not much to be done about that. The only other option is to bypass DMA entirely, and that's sure to be even worse. > What do the hardware engineers think the ethernet interface will > be used for! Ticking boxes in marketing material. -- Måns Rullgård
Re: [PATCH 0/2] net: ethernet: hisilicon: set dev->dev.parent before PHY connect
On 12/12/2016 04:03 AM, Dongpo Li wrote: > This patch series builds atop: > ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461 ("phy: Don't increment MDIO bus > refcount unless it's a different owner") > > I have checked all the hisilicon ethernet driver and found only two drivers > need to be fixed to make sure set dev->dev.parent before PHY connect. Thanks for doing this, these two drivers did not show up on my list because I did not see them calling SET_NETDEV_DEV() so late. -- Florian
Re: [PATCH 1/2] net: ethernet: hisi_femac: Call SET_NETDEV_DEV()
On 12/12/2016 04:03 AM, Dongpo Li wrote: > The hisi_femac driver calls into PHYLIB which now checks for > net_device->dev.parent, so make sure we do set it before calling into > any MDIO/PHYLIB related function. > > Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a > different owner") > Signed-off-by: Dongpo Li Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH 2/2] net: ethernet: hip04: Call SET_NETDEV_DEV()
On 12/12/2016 04:03 AM, Dongpo Li wrote: > The hip04 driver calls into PHYLIB which now checks for > net_device->dev.parent, so make sure we do set it before calling into > any MDIO/PHYLIB related function. > > Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a > different owner") > Signed-off-by: Dongpo Li Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
Hi Volodymyr, Volodymyr Bendiuga writes: > +struct pvec_tbl_entry { > +struct hlist_node entry; > +u32 key_crc32; /* key */ > +u16 pvec; > +struct pvec_tbl_key { > +u8 addr[ETH_ALEN]; > +u16 fid; > +} key; > +}; > + > struct mv88e6xxx_atu_entry { > u16 fid; > u8 state; Also, if we were to cache some values in mv88e6xxx, I'd use the existing structures which match the hardware definition. Easier to understand. mv88e6xxx_atu_entry already represents an ATU entry with its port vector, FID, MAC address and more. Do you think it can be used here? Thanks, Vivien
Re: Re: Synopsys Ethernet QoS
On 12/12/2016 11:19 AM, Joao Pinto wrote: > Hi, > > Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu: >> Le 12/09/16 à 16:16, Andy Shevchenko a écrit : >>> On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli >>> wrote: >>> It's kind of sad that customers of that IP (stmmac, amd-xgbe, sxgbe) did actually pioneer the upstreaming effort, but it is good to see people from Synopsys willing to fix that in the future. >>> Wait, you would like to tell that we have more than 2 drivers for the >>> same (okay, same vendor) IP?! >>> It's better to unify them earlier, than have n+ copies. >> Unfortunately that is the case, see this email: >> >> https://www.mail-archive.com/netdev@vger.kernel.org/msg142796.html >> >> dwc_eth_qos and stmmac have some overlap. There seems to be work >> underway to unify these two to begin with. >> >>> P.S. Though, I don't see how sxgbe got in the list. First glance on >>> the code doesn't show similarities. >> Well samsung/sxgbe looks potentially similar to amd/xgbe, but that's >> just my cursory look at the code, it may very well be something entirely >> different. The descriptor formats just look suspiciously similar. >> > Thank you for your inputs! Renaming seems to be a hotspot. I agree that maybe > instead of renaming (breaking retro-compatibility as David and Florian > mentioned), the best is to move stmmac to synopsys/ after merging *qos* and > removing it. As Florian mentioned, git is capable of detecting folder > restructured. > > @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the synopsys/*qos* > driver would it be possible for you to make an initial analysis of what has to > be merged into Stmmac? This way the development would speed-up. I can answer that question. I've sent out 12 patches to the stmmac driver (all patches are included in the current net-next tree), with these patches the stmmac driver works properly on Axis hardware (we use Synopsys GMAC 4.10a synthesized with multiple TX queues). stmmac's DT binding has also been extended with properties that existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl. Since we have no problem updating the DTB together with the kernel, we will simply move to using the start using the stmmac driver, with stmmac's DT binding. However, I've noticed that NVIDIA has extended the DWC EQoS DT binding, I don't how easy it would be for them to switch to stmmac's DT binding. (Adding Stephen Warren to CC.) The reset sequence that Lars Persson was worried about is not an issue with the stmmac driver. There are some performance problems with the stmmac driver though: When running iperf3 with 3 streams: iperf3 -c 192.168.0.90 -P 3 -t 30 iperf3 -c 192.168.0.90 -P 3 -t 30 -R I get really bad fairness between the streams. This appears to be an issue with how TX IRQ coalescing is implemented in stmmac. Disabling TX IRQ coalescing in the stmmac driver makes the problem go away. We have a local patch that implements TX IRQ coalescing in the dwceqos driver, and we don't see the same problem. Also netperf TCP_RR and UDP_RR gives really bad results compared to the dwceqos driver (without IRQ coalescing). 2000 transactions/sec vs 9000 transactions/sec. Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac gives the same performance. I guess it's a trade off, low CPU usage vs low latency, so I don't know how important TCP_RR/UDP_RR really is. The best thing would be to get a good working TX IRQ coalesce implementation with HR timers in stmmac. Perhaps it should also be investigated if the RX interrupt watchdog timeout should have a lower default value. > > Thanks to all. > > Joao
RE: Misalignment, MIPS, and ip_hdr(skb)->version
From: Måns Rullgård > Sent: 10 December 2016 13:25 ... > I solved this problem in an Ethernet driver by copying the initial part > of the packet to an aligned skb and appending the remainder using > skb_add_rx_frag(). The kernel network stack only cares about the > headers, so the alignment of the packet payload doesn't matter. That rather depends on where the packet payload ends up. It is likely that it will be copied to userspace (or maybe into some aligned kernel buffer). In which case you get an expensive misaligned copy. Encapsulation protocols not using headers that are multiples of 4 bytes is as stupid as ethernet hardware that can't place the mac address on a 4n+2 boundary. The latter is particularly stupid when it happens on embedded silicon with a processor that can only do aligned memory accesses. What do the hardware engineers think the ethernet interface will be used for! David
[PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet
Add generic functionality to support Wake-on-Lan using MagicPacket which are supported by at least a few versions of sh_eth. Only add functionality for WoL, no specific sh_eth version are marked to support WoL yet. WoL is enabled in the suspend callback by setting MagicPacket detection and disabling all interrupts expect MagicPacket. In the resume path the driver needs to reset the hardware to rearm the WoL logic, this prevents the driver from simply restoring the registers and to take advantage of that sh_eth was not suspended to reduce resume time. To reset the hardware the driver close and reopens the device just like it would do in a normal suspend/resume scenario without WoL enabled, but it both close and open the device in the resume callback since the device needs to be open for WoL to work. One quirk needed for WoL is that the module clock needs to be prevented from being switched off by Runtime PM. To keep the clock alive the suspend callback need to call clk_enable() directly to increase the usage count of the clock. Then when Runtime PM decreases the clock usage count it won't reach 0 and be switched off. Signed-off-by: Niklas Söderlund --- drivers/net/ethernet/renesas/sh_eth.c | 112 +++--- drivers/net/ethernet/renesas/sh_eth.h | 3 + 2 files changed, 107 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 05b0dc5..87640b9 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c @@ -2199,6 +2199,33 @@ static int sh_eth_set_ringparam(struct net_device *ndev, return 0; } +static void sh_eth_get_wol(struct net_device *ndev, struct ethtool_wolinfo *wol) +{ + struct sh_eth_private *mdp = netdev_priv(ndev); + + wol->supported = 0; + wol->wolopts = 0; + + if (mdp->cd->magic && mdp->clk) { + wol->supported = WAKE_MAGIC; + wol->wolopts = mdp->wol_enabled ? WAKE_MAGIC : 0; + } +} + +static int sh_eth_set_wol(struct net_device *ndev, struct ethtool_wolinfo *wol) +{ + struct sh_eth_private *mdp = netdev_priv(ndev); + + if (!mdp->cd->magic || !mdp->clk || wol->wolopts & ~WAKE_MAGIC) + return -EOPNOTSUPP; + + mdp->wol_enabled = !!(wol->wolopts & WAKE_MAGIC); + + device_set_wakeup_enable(&mdp->pdev->dev, mdp->wol_enabled); + + return 0; +} + static const struct ethtool_ops sh_eth_ethtool_ops = { .get_regs_len = sh_eth_get_regs_len, .get_regs = sh_eth_get_regs, @@ -2213,6 +2240,8 @@ static const struct ethtool_ops sh_eth_ethtool_ops = { .set_ringparam = sh_eth_set_ringparam, .get_link_ksettings = sh_eth_get_link_ksettings, .set_link_ksettings = sh_eth_set_link_ksettings, + .get_wol= sh_eth_get_wol, + .set_wol= sh_eth_set_wol, }; /* network device open function */ @@ -3017,6 +3046,11 @@ static int sh_eth_drv_probe(struct platform_device *pdev) goto out_release; } + /* Get clock, if not found that's OK but Wake-On-Lan is unavailable */ + mdp->clk = devm_clk_get(&pdev->dev, NULL); + if (IS_ERR(mdp->clk)) + mdp->clk = NULL; + ndev->base_addr = res->start; spin_lock_init(&mdp->lock); @@ -3111,6 +3145,9 @@ static int sh_eth_drv_probe(struct platform_device *pdev) if (ret) goto out_napi_del; + if (mdp->cd->magic && mdp->clk) + device_set_wakeup_capable(&pdev->dev, 1); + /* print device information */ netdev_info(ndev, "Base address at 0x%x, %pM, IRQ %d.\n", (u32)ndev->base_addr, ndev->dev_addr, ndev->irq); @@ -3150,15 +3187,67 @@ static int sh_eth_drv_remove(struct platform_device *pdev) #ifdef CONFIG_PM #ifdef CONFIG_PM_SLEEP +static int sh_eth_wol_setup(struct net_device *ndev) +{ + struct sh_eth_private *mdp = netdev_priv(ndev); + + /* Only allow ECI interrupts */ + synchronize_irq(ndev->irq); + napi_disable(&mdp->napi); + sh_eth_write(ndev, DMAC_M_ECI, EESIPR); + + /* Enable MagicPacket */ + sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE); + + /* Increased clock usage so device won't be suspended */ + clk_enable(mdp->clk); + + return enable_irq_wake(ndev->irq); +} + +static int sh_eth_wol_restore(struct net_device *ndev) +{ + struct sh_eth_private *mdp = netdev_priv(ndev); + int ret; + + napi_enable(&mdp->napi); + + /* Disable MagicPacket */ + sh_eth_modify(ndev, ECMR, ECMR_PMDE, 0); + + /* The device need to be reset to restore MagicPacket logic +* for next wakeup. If we close and open the device it will +* both be reset and all registers restored. This is what +* happens during suspend and resume without WoL enabled. +*/ + ret = sh_eth_close(ndev); + if (ret < 0) + ret