[Patch net] net_sched: fix a memory leak of filter chain
tcf_chain_destroy() is called by tcf_block_put() and tcf_chain_put().
tcf_chain_put() is refcounted and paired with tcf_chain_get(), but
tcf_block_put() is not; it should be paired with tcf_block_get(), and
we still need to decrease the refcnt. However, tcf_block_put() is
special: it stores the chains too, so we have to detach them if it is
not the last user.

What's more, chain index 0 is not special at all; it should be treated
like any other chain. This also makes the code more readable.

Fixes: 744a4cf63e52 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
Reported-by: Jakub Kicinski
Cc: Jiri Pirko
Signed-off-by: Cong Wang
---
 net/sched/cls_api.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 6c5ea84d2682..c6d25b29bcd4 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -213,17 +213,17 @@ static void tcf_chain_flush(struct tcf_chain *chain)
 	}
 }
 
-static void tcf_chain_destroy(struct tcf_chain *chain)
+static void tcf_chain_detach(struct tcf_chain *chain)
 {
 	/* May be already removed from the list by the previous call. */
 	if (!list_empty(&chain->list))
 		list_del_init(&chain->list);
+}
 
-	/* There might still be a reference held when we got here from
-	 * tcf_block_put. Wait for the user to drop reference before free.
-	 */
-	if (!chain->refcnt)
-		kfree(chain);
+static void tcf_chain_destroy(struct tcf_chain *chain)
+{
+	tcf_chain_detach(chain);
+	kfree(chain);
 }
 
 struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
@@ -246,10 +246,7 @@ EXPORT_SYMBOL(tcf_chain_get);
 
 void tcf_chain_put(struct tcf_chain *chain)
 {
-	/* Destroy unused chain, with exception of chain 0, which is the
-	 * default one and has to be always present.
-	 */
-	if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
+	if (--chain->refcnt == 0)
 		tcf_chain_destroy(chain);
 }
 EXPORT_SYMBOL(tcf_chain_put);
@@ -296,8 +293,11 @@ void tcf_block_put(struct tcf_block *block)
 
 	list_for_each_entry_safe(chain, tmp, &block->chain_list, list) {
 		tcf_chain_flush(chain);
-		tcf_chain_destroy(chain);
+		tcf_chain_put(chain);
 	}
+	/* If tc actions still hold the chain, just detach it. */
+	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+		tcf_chain_detach(chain);
 	kfree(block);
 }
 EXPORT_SYMBOL(tcf_block_put);
-- 
2.13.0
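The lifetime rule the patch establishes (destroy always means detach then free; tcf_chain_put() frees on the last reference with no special case for chain 0; tcf_block_put() drops the block's reference and merely detaches chains that tc actions still hold) can be sketched in plain C outside the kernel. This is a simplified simulation, not kernel code — the struct, the `linked` flag, and the `freed` out-parameter are all illustrative stand-ins:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for struct tcf_chain: one refcount, one "linked" flag. */
struct chain {
	int refcnt;
	int linked;   /* still on the block's chain list? */
	int *freed;   /* out-param so a caller can observe the free */
};

static void chain_detach(struct chain *c)
{
	c->linked = 0;             /* list_del_init() in the real code */
}

static void chain_destroy(struct chain *c)
{
	chain_detach(c);
	if (c->freed)
		*c->freed = 1;     /* kfree() in the real code */
}

/* tcf_chain_put(): index 0 is no longer special; the last ref frees. */
static void chain_put(struct chain *c)
{
	if (--c->refcnt == 0)
		chain_destroy(c);
}

/* tcf_block_put(): drop the block's ref; if users remain, only detach. */
static void block_put(struct chain *c)
{
	chain_put(c);
	if (c->refcnt > 0)         /* tc actions still hold the chain */
		chain_detach(c);
}
```

With only the block holding the chain, block_put() frees it; with an extra reference outstanding, block_put() merely detaches it and the later chain_put() performs the free — the leak-free ordering the patch is after.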
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On Tue, Sep 5, 2017 at 3:25 PM, Jamal Hadi Salim wrote:
> On 17-09-05 06:01 PM, Roopa Prabhu wrote:
>>
>> yes, like Nikolay says we have been discussing this as well. Nikolay's
>> patch is a clever and, most importantly, non-invasive way today given
>> the anchor point for tc rules is a netdev. we have also considered a
>> separate implicit tc anchor device. lo seemed like a good fit for all
>> global rules given it is already available. And it is not uncommon to
>> hang global network config off the loopback interface.
>
> IMO, Jiri has done all the necessary work already with the concept of
> blocks. We dont really need the netdev to be the attachment point.
>
> You can add to a block in many locations in the kernel by
> constructing the proper "coordinates" in the tcmsg, i.e. this:
>
> struct tcmsg {
> 	unsigned char  tcm_family;
> 	unsigned char  tcm__pad1;
> 	unsigned short tcm__pad2;
> 	int            tcm_ifindex;
> 	__u32          tcm_handle;
> 	__u32          tcm_parent;
> };
>
> If you were to set tcm_ifindex to -1 (since that is not a legit
> ifindex) then all we need to do is define a parent for a
> different location. Current locations tied to netdevs are:
>
> #define TC_H_ROOT	(0xFFFFFFFFU)
> #define TC_H_INGRESS	(0xFFFFFFF1U)
> #define TC_H_CLSACT	TC_H_INGRESS
>
> #define TC_H_MIN_INGRESS	0xFFF2U
> #define TC_H_MIN_EGRESS	0xFFF3U
>
> You should be able to, say, add a location which maps to pre-routing
> or post-routing etc; and this would work as well...

ok, interesting. Jiri's examples still showed netdev as attachment
point. we will explore some more. thanks.
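Jamal's suggestion can be made concrete with a small sketch. The struct layout and the TC_H_* constants are the ones quoted above (matching the uapi headers); `TC_H_MIN_PREROUTING` is a purely hypothetical value standing in for the kind of new netdev-less location he describes:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t __u32;

struct tcmsg {
	unsigned char  tcm_family;
	unsigned char  tcm__pad1;
	unsigned short tcm__pad2;
	int            tcm_ifindex;
	__u32          tcm_handle;
	__u32          tcm_parent;
};

#define TC_H_MAJ_MASK        (0xFFFF0000U)
#define TC_H_MIN_MASK        (0x0000FFFFU)
#define TC_H_MAKE(maj, min)  (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

#define TC_H_CLSACT          (0xFFFFFFF1U)
#define TC_H_MIN_INGRESS     0xFFF2U
#define TC_H_MIN_EGRESS      0xFFF3U
/* Hypothetical new location, not defined in any kernel: */
#define TC_H_MIN_PREROUTING  0xFFF4U

/* Build the "coordinates" for a filter attached to a netdev-less block. */
static struct tcmsg make_global_coords(__u32 location_min)
{
	struct tcmsg t = {0};

	t.tcm_family  = 0;    /* AF_UNSPEC */
	t.tcm_ifindex = -1;   /* not a legit ifindex */
	t.tcm_parent  = TC_H_MAKE(TC_H_CLSACT, location_min);
	return t;
}
```

The point of the sketch is only the encoding: the major half of tcm_parent selects clsact, the minor half selects the hook, and tcm_ifindex = -1 signals "no netdev".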
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On Tue, Sep 5, 2017 at 3:45 PM, Daniel Borkmann wrote:
> On 09/06/2017 12:01 AM, Roopa Prabhu wrote:
>>
>> On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang wrote:
>>>
>>> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote:
>>>> Hi all,
>>>> This RFC adds a new mode for clsact which designates a device's
>>>> egress classifier as global per netns. The packets that are not
>>>> classified for a particular device will be classified using the
>>>> global classifier. We have needed a global classifier for some time
>>>> now for various purposes and setting the single bridge or
>>>> loopback/vrf device as the
>
> Can you elaborate a bit more on the ... "we have needed a global
> classifier for some time now for various purposes".

Most of our ACLs are global or use a wildcard. e.g. iptables supports
global rules without a dev. We do end up having hundreds of netdevs.
Another use case for the future is use of tc for policy based routing,
which requires global rules.
[PATCH net] netlink: access nlk groups safely in netlink bind and getname
Now there is no lock protecting nlk ngroups/groups' accessing in
netlink bind and getname. It's safe from nlk groups' setting in
netlink_release, but not from netlink_realloc_groups called by
netlink_setsockopt.

netlink_lock_table is needed in both netlink bind and getname when
accessing nlk groups.

Acked-by: Florian Westphal
Signed-off-by: Xin Long
---
 net/netlink/af_netlink.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94a61e6..3278077 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -955,7 +955,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	struct net *net = sock_net(sk);
 	struct netlink_sock *nlk = nlk_sk(sk);
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
-	int err;
+	int err = 0;
 	long unsigned int groups = nladdr->nl_groups;
 	bool bound;
 
@@ -983,6 +983,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return -EINVAL;
 	}
 
+	netlink_lock_table();
 	if (nlk->netlink_bind && groups) {
 		int group;
 
@@ -993,7 +994,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			if (!err)
 				continue;
 			netlink_undo_bind(group, groups, sk);
-			return err;
+			goto unlock;
 		}
 	}
 
@@ -1006,12 +1007,13 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			netlink_autobind(sock);
 		if (err) {
 			netlink_undo_bind(nlk->ngroups, groups, sk);
-			return err;
+			goto unlock;
 		}
 	}
 
 	if (!groups && (nlk->groups == NULL || !(u32)nlk->groups[0]))
-		return 0;
+		goto unlock;
+	netlink_unlock_table();
 
 	netlink_table_grab();
 	netlink_update_subscriptions(sk, nlk->subscriptions +
@@ -1022,6 +1024,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	netlink_table_ungrab();
 
 	return 0;
+
+unlock:
+	netlink_unlock_table();
+	return err;
 }
 
 static int netlink_connect(struct socket *sock, struct sockaddr *addr,
@@ -1079,7 +1085,9 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
 		nladdr->nl_groups = netlink_group_mask(nlk->dst_group);
 	} else {
 		nladdr->nl_pid = nlk->portid;
+		netlink_lock_table();
 		nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
+		netlink_unlock_table();
 	}
 	return 0;
 }
-- 
2.1.0
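The structure of the fix — take the table lock once, and funnel every failure after that point through a single `unlock:` label — is a standard kernel error-handling pattern. A compilable stand-in (a toy counter in place of netlink_lock_table(); the failure points are illustrative) shows why the pattern keeps the lock balanced on every path:

```c
#include <assert.h>

static int lock_depth;                 /* stand-in for the netlink table lock */
static void table_lock(void)   { lock_depth++; }
static void table_unlock(void) { lock_depth--; }

/* Mirrors the reworked netlink_bind() control flow: every failure after
 * the lock is taken exits through the unlock: label. */
static int bind_like(int fail_at)
{
	int err = 0;

	table_lock();
	if (fail_at == 1) { err = -22; goto unlock; }  /* per-group bind fails */
	if (fail_at == 2) { err = -13; goto unlock; }  /* insert/autobind fails */
	table_unlock();
	/* ... netlink_table_grab()/update subscriptions would go here ... */
	return 0;

unlock:
	table_unlock();
	return err;
}
```

Before the patch, the two failure points returned directly; once the lock is introduced, a direct return would leak it — hence the shared exit label.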
[PATCH net] netlink: fix an use-after-free issue for nlk groups
ChunYu found a netlink use-after-free issue by syzkaller:

[28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr 8807185e2378
[28448.969918] Call Trace:
[...]
[28449.117207] __nla_put+0x37/0x40
[28449.132027] nla_put+0xf5/0x130
[28449.146261] sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
[28449.176608] __netlink_diag_dump+0x25a/0x700 [netlink_diag]
[28449.202215] netlink_diag_dump+0x176/0x240 [netlink_diag]
[28449.226834] netlink_dump+0x488/0xbb0
[28449.298014] __netlink_dump_start+0x4e8/0x760
[28449.317924] netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
[28449.413414] sock_diag_rcv_msg+0x207/0x390
[28449.432409] netlink_rcv_skb+0x149/0x380
[28449.467647] sock_diag_rcv+0x2d/0x40
[28449.484362] netlink_unicast+0x562/0x7b0
[28449.564790] netlink_sendmsg+0xaa8/0xe60
[28449.661510] sock_sendmsg+0xcf/0x110
[28449.865631] __sys_sendmsg+0xf3/0x240
[28450.000964] SyS_sendmsg+0x32/0x50
[28450.016969] do_syscall_64+0x25c/0x6c0
[28450.154439] entry_SYSCALL64_slow_path+0x25/0x25

It was caused by no protection between nlk groups' free in
netlink_release and nlk groups' accessing in sk_diag_dump_groups. The
similar issue also exists in netlink_seq_show().

This patch is to defer nlk groups' free in deferred_put_nlk_sk.

Reported-by: ChunYu Wang
Acked-by: Florian Westphal
Signed-off-by: Xin Long
---
 net/netlink/af_netlink.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 5acee49..94a61e6 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -691,6 +691,9 @@ static void deferred_put_nlk_sk(struct rcu_head *head)
 	struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
 	struct sock *sk = &nlk->sk;
 
+	kfree(nlk->groups);
+	nlk->groups = NULL;
+
 	if (!refcount_dec_and_test(&sk->sk_refcnt))
 		return;
 
@@ -769,9 +772,6 @@ static int netlink_release(struct socket *sock)
 		netlink_table_ungrab();
 	}
 
-	kfree(nlk->groups);
-	nlk->groups = NULL;
-
 	local_bh_disable();
 	sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
 	local_bh_enable();
-- 
2.1.0
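The essence of the fix is an ordering change: the groups array must not be freed until the grace period that also retires the socket, because lockless readers (the diag dump, netlink_seq_show) may still dereference it. A minimal userspace simulation, with a stored callback standing in for call_rcu(), illustrates the before/after difference; everything here is illustrative, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

struct nlk {
	int *groups;
	void (*deferred)(struct nlk *);   /* queued callback, "call_rcu" stand-in */
};

static void deferred_put(struct nlk *n)   /* deferred_put_nlk_sk() after the fix */
{
	free(n->groups);
	n->groups = NULL;
}

/* After the fix: release only queues the callback; groups stays valid
 * until the grace period ends, so concurrent readers are safe. */
static void release(struct nlk *n)
{
	n->deferred = deferred_put;       /* call_rcu(&nlk->rcu, ...) */
}

static int reader_sees_valid_groups(const struct nlk *n)
{
	return n->groups != NULL;         /* sk_diag_dump_groups() access */
}

static void grace_period_end(struct nlk *n)
{
	n->deferred(n);
}
```

In the buggy ordering, release() itself would free groups, and a reader racing between release() and the grace period would dereference freed memory — exactly what KASAN flagged above.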
Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
On Tue, Sep 05, 2017 at 04:09:19PM +0200, Willem de Bruijn wrote:
> On Mon, Sep 4, 2017 at 5:03 AM, Jason Wang wrote:
> >
> > On 2017年09月02日 00:17, Willem de Bruijn wrote:
> >
> > This is not a 50/50 split, which implies that some packets from the
> > large packet flow are still converted to copying. Without the change
> > the rate without queue was 80k zerocopy vs 80k copy, so this choice of
> > (vq->num >> 2) appears too conservative.
> >
> > However, testing with (vq->num >> 1) was not as effective at mitigating
> > stalls. I did not save that data, unfortunately. Can run more tests on
> > fine tuning this variable, if the idea sounds good.
>
> Looks like there are still two cases left:
>
> >>> To be clear, this patch is not intended to fix all issues. It is a small
> >>> improvement to avoid HoL blocking due to queued zerocopy skbs.
> >
> > Right, just want to see if there's anything left.
> >
> >>> The trade-off is that reverting to copying in these cases increases
> >>> cycle cost. I think that that is a trade-off worth making compared to
> >>> the alternative drop in throughput. It probably would be good to be
> >>> able to measure this without kernel instrumentation: export
> >>> counters similar to net->tx_zcopy_err and net->tx_packets (though
> >>> without reset to zero, as in vhost_net_tx_packet).
> >
> > I think it's acceptable if extra cycles were spent if we detect HoL anyhow.
> >
> >>> > 1) sndbuf is not INT_MAX
> >>>
> >>> You mean the case where the device stalls, later zerocopy notifications
> >>> are queued, but these are never cleaned in free_old_xmit_skbs,
> >>> because it requires a start_xmit and by now the (only) socket is out of
> >>> descriptors?
> >>
> >> Typo, sorry. I meant out of sndbuf.
> >
> > I mean e.g. for tun. If its sndbuf is smaller than e.g. (vq->num >> 1) *
> > $pkt_size and if all packets were held by some modules, a limitation like
> > vq->num >> 1 won't work since we hit sndbuf before it.
>
> Good point. I agree, however anyone using SNDBUF < infinity already hits
> HoL blocking in some scenarios.
>
> >>> A watchdog would help somewhat. With tx-napi, this case cannot occur,
> >>> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
> >>>
> >>> > 2) tx napi is used for virtio-net
> >>>
> >>> I am not aware of any issue specific to the use of tx-napi?
> >
> > Might not be clear here, I mean e.g. virtio_net (tx-napi) in guest +
> > vhost_net (zerocopy) in host. In this case, even if we switch to
> > datacopy when ubuf counts exceed vq->num >> 1, we still complete tx
> > buffers in order, so a tx interrupt could be delayed for indefinite time.
>
> Copied buffers are completed immediately in handle_tx.
>
> Do you mean when a process sends fewer packets than vq->num >> 1,
> so that all are queued? Yes, then the latency is indeed that of the last
> element leaving the qdisc.
>
> >>> > 1) could be a corner case, and for 2) what you suggest here may not
> >>> > solve the issue since it still does in-order completion.
> >>>
> >>> Somewhat tangential, but it might also help to break the in-order
> >>> completion processing in vhost_zerocopy_signal_used. Complete
> >>> all descriptors between done_idx and upend_idx. done_idx should
> >>> then only be forward to the oldest still not-completed descriptor.
> >>>
> >>> In the test I ran, where the oldest descriptors are held in a queue and
> >>> all newer ones are tail-dropped,
> >
> > Do you mean the descriptors were tail-dropped by vhost?
>
> Tail-dropped by netem. The dropped items are completed out of
> order by vhost before the held items.
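The threshold being debated (`vq->num >> 2` vs `vq->num >> 1` outstanding zerocopy buffers before falling back to copying) reduces to a small predicate. A sketch with the fraction made a parameter so both choices can be compared side by side; this illustrates the heuristic, it is not vhost-net code:

```c
#include <assert.h>

/* Decide whether the next tx buffer may use zerocopy.  in_flight counts
 * not-yet-completed zerocopy buffers (upend_idx - done_idx in vhost). */
static int may_use_zerocopy(int vq_num, int in_flight, int shift)
{
	return in_flight < (vq_num >> shift);
}
```

The smaller cap (shift 2, a quarter of the ring) falls back to copying sooner, trading cycles for shorter head-of-line stalls; the larger cap (shift 1) keeps more packets on the zerocopy path but was reported above as less effective at mitigating stalls.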
Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples
Hi Alexei,

On Tue, Sep 5, 2017 at 8:24 PM, Alexei Starovoitov wrote:
> On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote:
>> These patches fix issues seen when cross-compiling eBPF samples on arm64.
>> Compared to [1], I dropped the controversial inline-asm patch pending
>> further discussion on the right way to do it. However these patches are
>> still a step in the right direction and I wanted them to get in before
>> the more controversial bit.
>
> All patches in this set look good to me.
> When you repost without changes after net-next reopens pls keep my:
> Acked-by: Alexei Starovoitov

Thanks a lot! will do.

Regards,
Joel
Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples
On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote:
> These patches fix issues seen when cross-compiling eBPF samples on arm64.
> Compared to [1], I dropped the controversial inline-asm patch pending
> further discussion on the right way to do it. However these patches are
> still a step in the right direction and I wanted them to get in before
> the more controversial bit.

All patches in this set look good to me.
When you repost without changes after net-next reopens pls keep my:
Acked-by: Alexei Starovoitov
Re: [PATCH v3 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode
On 9/4/17 1:36 AM, Haishuang Yan wrote:
> In collect_md mode, if the tun dev is down, it still can call
> __ip6_tnl_rcv to receive packets, and the rx statistics increase
> improperly.
>
> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> Cc: Alexei Starovoitov
> Signed-off-by: Haishuang Yan
> ---
> Change since v3:
>   * Increment rx_dropped if tunnel device is not up, suggested by
>     Pravin B Shelar
>   * Fix wrong recipient address
> ---
>  net/ipv6/ip6_tunnel.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 10a693a..e91d3b6 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -171,8 +171,11 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
>  	}
>
>  	t = rcu_dereference(ip6n->collect_md_tun);
> -	if (t)
> -		return t;
> +	if (t) {
> +		if (t->dev->flags & IFF_UP)
> +			return t;
> +		t->dev->stats.rx_dropped++;
> +	}

Why increment the stats only for this drop case? There are plenty of
other conditions where a packet will be dropped in ip6 tunnel. I think
it's important to present consistent behavior to the users, so I'd
increment drop stats either for all drop cases or for none. And today
it's none.

The !IFF_UP case should probably be a `return NULL` too.
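The behavior under review can be captured in a few lines. The helper mirrors the quoted hunk, with the suggested `return NULL` for the !IFF_UP case included; the types are simplified stand-ins for the kernel structs, not the real ip6_tnl definitions:

```c
#include <assert.h>
#include <stddef.h>

#define IFF_UP 0x1

struct toy_dev { unsigned flags; unsigned long rx_dropped; };
struct toy_tnl { struct toy_dev *dev; };

/* collect_md lookup: only hand back the tunnel if its device is up;
 * otherwise count the drop and (per the review comment) return NULL. */
static struct toy_tnl *lookup_collect_md(struct toy_tnl *t)
{
	if (t) {
		if (t->dev->flags & IFF_UP)
			return t;
		t->dev->rx_dropped++;   /* t->dev->stats.rx_dropped++ in the patch */
	}
	return NULL;
}
```

The reviewer's broader point stands apart from this sketch: whether rx_dropped should be bumped here at all depends on whether the driver counts any of its other drop paths, which today it does not.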
Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert wrote:
>> The situation with encapsulation is even more complicated:
>>
>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>> constellation. If we do the fragmentation inside the vxlan tunnel and
>> carry over the skb hash to all resulting UDP/vxlan packets' source
>> ports, we are fine and reordering on the receiver NIC won't happen in
>> this case. If the fragmentation happens on the outer UDP header, this
>> will result in reordering of the inner L2 flow. Unfortunately this
>> depends on how the vxlan tunnel was set up, how other devices do that
>> and (I believe so) on the kernel version.
>
> This really isn't that complicated. The assumption that an IP network
> always delivers packets in order is simply wrong. The inventors of
> VXLAN must have known full well that when you use IP, packets can and
> eventually will be delivered out of order. This isn't just because of
> fragmentation; there are many other reasons that packets can be
> delivered OOO. This also must have been known when IP/GRE and any
> other protocol that carries L2 over IP was invented. If OOO is an
> issue for these protocols then they need to be fixed; this is not a
> concern with the IP protocol nor the stack.
>
> Tom

As a little background on the original patch, I believe the issue that
was fixed by it was a video streaming application that was
sending/receiving a mix of fragmented and non-fragmented packets.
Receiving them out of order due to the fragmentation was causing
stutters in the video, and so we ended up disabling UDP RSS by default
in the NICs listed. We decided to go that way as UDP RSS was viewed as a
performance optimization, while the out-of-order problems were viewed as
a functionality issue. The default for i40e is to have UDP RSS hashing
enabled if I recall correctly.

Basically, as we move away from enterprise to cloud, I suspect that is
going to be the case more and more, since all the UDP tunnels require
either port recognition or UDP RSS to be enabled. For now we carry the
ability to enable UDP RSS if desired in the legacy drivers, and I
believe we have some white papers somewhere that suggest enabling it if
you are planning to use UDP based tunnels.

- Alex
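The mitigation described above — carrying one flow hash into every resulting packet's outer UDP source port so the receiver's RSS keeps the inner flow on a single queue — boils down to mapping a hash into an ephemeral port range, in the spirit of the kernel's udp_flow_src_port(). A sketch under that assumption (the function name and the modulo mapping are illustrative, not the kernel's exact algorithm):

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit flow hash into [min, max) for the outer UDP source port.
 * If every fragment/segment of an inner flow reuses the same hash, they
 * all get the same source port and hash to the same receive queue. */
static uint16_t vxlan_like_src_port(uint32_t flow_hash,
				    uint16_t min, uint16_t max)
{
	return (uint16_t)(min + (flow_hash % (uint32_t)(max - min)));
}
```

The determinism is the whole point: two packets with the same inner-flow hash always pick the same outer source port, so RSS on the receiving NIC cannot split the inner L2 flow across queues.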
Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2017-09-05
From: Jeff Kirsher
Date: Tue, 5 Sep 2017 18:04:16 -0700

> This series contains fixes for i40e only.

Pulled, thanks Jeff.
Re: [PATCH net-next v6 3/3] openvswitch: enable NSH support
On Tue, Sep 05, 2017 at 09:12:09PM +0800, Hannes Frederic Sowa wrote:
> "Yang, Yi" writes:
>
> > We can change this later if we really find a better way to handle this
> > because it isn't defined in include/uapi/linux/openvswitch.h, so I
> > still have a backdoor to do this if needed :-)
>
> Sorry, I can't follow you.
>
> It doesn't matter if something is defined in uapi headers, the
> observable behavior matters. If you allow users to configure flows with
> specific fields, it should not stop working at a future point in time.

Anyway this can be changed if it is really needed. So far the current
way is the best one we can come up with; we would be happy to change it
if you can come up with a better proposal. We have explained repeatedly
that context headers must be matched and set, this is the bottom line.

> > For our sfc use case in OpenDaylight, we use context[0] for the tunnel
> > ID and context[1] for the destination IP for the reverse RSP; they are
> > used to match and set in the OpenFlow table. You can't limit users to
> > use them in such ways.
>
> So in your specific case you expect the masks to be completely stable
> because you defined a protocol on top of NSH, understood. And that is
> stable across all possible paths. Understood as well.
>
> > If you check the GENEVE implementation, tun_metadata* can be set or
> > matched as any other match field.
>
> Yes, I wrote that in my previous mail. I wonder why NSH context metadata
> is not in tun_metadata as well?

tun_metadata is tunnel metadata. GENEVE needs a tunnel port, but NSH
does not, so NSH can't directly use tun_metadata. For MD type 2, we
would need a lot of rework on tun_metadata to make it shared between
GENEVE and NSH; I don't think that can happen in the near term. So
tun_metadata isn't an option for this now.

> > Actually the most important information in NSH is just these context
> > headers; you can't limit the imagination of users by removing them
> > from the flow keys.
> >
> > My point is to bring miniflow into the kernel data path to fix your
> > concern, this will benefit your employer directly :-)
>
> Okay, interesting. It will probably not help if you still have a hash of
> a packet inside the flow table and use that for load balancing.
>
> [...]
>
> BTW I don't want to stop this patch, I am merely interested in how the
> bigger picture will look like in the end.
>
> Bye,
> Hannes
[net 2/2] i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
From: Jacob Keller

When introducing the functions to read the NVM through the AdminQ, we
did not correctly mark the wb_desc.

Fixes: 7073f46e443e ("i40e: Add AQ commands for NVM Update for X722", 2015-06-05)
Signed-off-by: Jacob Keller
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 154ba49c2c6f..26d7e9fe6220 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -230,6 +230,7 @@ static i40e_status i40e_read_nvm_aq(struct i40e_hw *hw, u8 module_pointer,
 	struct i40e_asq_cmd_details cmd_details;
 
 	memset(&cmd_details, 0, sizeof(cmd_details));
+	cmd_details.wb_desc = &hw->nvm_wb_desc;
 
 	/* Here we are checking the SR limit only for the flat memory model.
 	 * We cannot do it for the module-based model, as we did not acquire
-- 
2.14.1
[net 0/2][pull request] Intel Wired LAN Driver Updates 2017-09-05
This series contains fixes for i40e only.

These two patches fix an issue where our nvmupdate tool does not work on
RHEL 7.4 and newer kernels. In fact, the use of the nvmupdate tool on
newer kernels can cause the cards to be non-functional unless these
patches are applied.

Anjali reworks the locking around accessing the NVM so that NVM acquire
timeouts do not occur, which were causing the failed firmware updates.

Jake correctly updates the wb_desc when reading the NVM through the
AdminQ.

The following are changes since commit 6d9c153a0b84392406bc77600aa7d3ea365de041:
  net: dsa: loop: Do not unregister invalid fixed PHY

and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 40GbE

Anjali Singhai Jain (1):
  i40e: avoid NVM acquire deadlock during NVM update

Jacob Keller (1):
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq

 drivers/net/ethernet/intel/i40e/i40e_nvm.c       | 99 +++-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h |  2 -
 2 files changed, 61 insertions(+), 40 deletions(-)
-- 
2.14.1
[net 1/2] i40e: avoid NVM acquire deadlock during NVM update
From: Anjali Singhai Jain

X722 devices use the AdminQ to access the NVM, and this requires taking
the AdminQ lock. Because of this, we lock the AdminQ during
i40e_read_nvm(), which is also called in places where the lock is
already held, such as the firmware update path, which wants to lock once
and then unlock when finished after performing several tasks.

Although this should have only affected X722 devices, commit
96a39aed25e6 ("i40e: Acquire NVM lock before reads on all devices",
2016-12-02) added locking for all NVM reads, regardless of device
family. This resulted in us accidentally causing NVM acquire timeouts
on all devices, causing failed firmware updates which left the eeprom
in a corrupt state.

Create unsafe non-locked variants of i40e_read_nvm_word and
i40e_read_nvm_buffer, __i40e_read_nvm_word and __i40e_read_nvm_buffer
respectively. These variants will not take the NVM lock and are
expected to only be called in places where the NVM lock is already held
if needed. Since the only caller of i40e_read_nvm_buffer() was in such
a path, remove it entirely in favor of the unsafe version. If necessary
we can always add it back in the future.

Additionally, we now need to hold the NVM lock in i40e_validate_checksum
because the call to i40e_calc_nvm_checksum now assumes that the NVM lock
is held. We can further move the call to read I40E_SR_SW_CHECKSUM_WORD
up a bit so that we do not need to acquire the NVM lock twice.

This should resolve firmware updates and also fix a potential race that
could have caused the driver to report an invalid NVM checksum upon
driver load.

Reported-by: Stefan Assmann
Fixes: 96a39aed25e6 ("i40e: Acquire NVM lock before reads on all devices", 2016-12-02)
Signed-off-by: Anjali Singhai Jain
Signed-off-by: Jacob Keller
Tested-by: Andrew Bowers
Signed-off-by: Jeff Kirsher
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c       | 98 +++-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h |  2 -
 2 files changed, 60 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 800bd55d0159..154ba49c2c6f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -266,7 +266,7 @@ static i40e_status i40e_read_nvm_aq(struct i40e_hw *hw, u8 module_pointer,
  * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
  * @data: word read from the Shadow RAM
  *
- * Reads one 16 bit word from the Shadow RAM using the GLNVM_SRCTL register.
+ * Reads one 16 bit word from the Shadow RAM using the AdminQ
  **/
 static i40e_status i40e_read_nvm_word_aq(struct i40e_hw *hw, u16 offset,
 					 u16 *data)
@@ -280,27 +280,49 @@ static i40e_status i40e_read_nvm_word_aq(struct i40e_hw *hw, u16 offset,
 }
 
 /**
- * i40e_read_nvm_word - Reads Shadow RAM
+ * __i40e_read_nvm_word - Reads nvm word, assumes caller does the locking
  * @hw: pointer to the HW structure
  * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
  * @data: word read from the Shadow RAM
  *
- * Reads one 16 bit word from the Shadow RAM using the GLNVM_SRCTL register.
+ * Reads one 16 bit word from the Shadow RAM.
+ *
+ * Do not use this function except in cases where the nvm lock is already
+ * taken via i40e_acquire_nvm().
+ **/
+static i40e_status __i40e_read_nvm_word(struct i40e_hw *hw,
+					u16 offset, u16 *data)
+{
+	i40e_status ret_code = 0;
+
+	if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE)
+		ret_code = i40e_read_nvm_word_aq(hw, offset, data);
+	else
+		ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
+	return ret_code;
+}
+
+/**
+ * i40e_read_nvm_word - Reads nvm word and acquire lock if necessary
+ * @hw: pointer to the HW structure
+ * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
+ * @data: word read from the Shadow RAM
+ *
+ * Reads one 16 bit word from the Shadow RAM.
  **/
 i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset,
 			       u16 *data)
 {
-	enum i40e_status_code ret_code = 0;
+	i40e_status ret_code = 0;
 
 	ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
-	if (!ret_code) {
-		if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
-			ret_code = i40e_read_nvm_word_aq(hw, offset, data);
-		} else {
-			ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
-		}
-		i40e_release_nvm(hw);
-	}
+	if (ret_code)
+		return ret_code;
+
+	ret_code = __i40e_read_nvm_word(hw, offset, data);
+
+	i40e_release_nvm(hw);
+
 	return ret_code;
 }
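The refactor follows the kernel's double-underscore convention: `__i40e_read_nvm_word()` assumes the lock is already held, while the plain wrapper acquires and releases it around a single call. The shape of that pattern can be sketched with a toy lock; the names mirror the patch but the bodies are simplified stand-ins, not driver code:

```c
#include <assert.h>

static int nvm_lock_held;

static int acquire_nvm(void)
{
	if (nvm_lock_held)
		return -1;         /* would time out / deadlock in the driver */
	nvm_lock_held = 1;
	return 0;
}

static void release_nvm(void)
{
	nvm_lock_held = 0;
}

/* Unsafe variant: caller must already hold the lock. */
static int __read_nvm_word(int offset, int *data)
{
	if (!nvm_lock_held)
		return -2;         /* misuse: lock not held */
	*data = offset * 2;        /* stand-in for the actual SR/AQ read */
	return 0;
}

/* Safe wrapper: lock once, read, unlock.  Firmware-update paths that
 * already hold the lock call __read_nvm_word() directly instead, which
 * is what avoids the acquire timeouts described in the commit message. */
static int read_nvm_word(int offset, int *data)
{
	int ret = acquire_nvm();

	if (ret)
		return ret;
	ret = __read_nvm_word(offset, data);
	release_nvm();
	return ret;
}
```

The toy makes the bug mode visible: calling the locking wrapper while the lock is already held fails, exactly the situation the firmware update path hit before the split.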
Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic
> The third and last issue will be explained in a followup email.

Hi DSA hackers

So here is the third issue. It affects just DSA, but it possibly
affects all DSA drivers.

This patchset broke broadcast with the Marvell drivers. It could break
broadcast on other drivers as well.

What i found is that the Marvell chips don't flood broadcast frames
between bridged ports. What appears to happen is there is an fdb miss,
so the frame gets forwarded to the CPU port for the host to deal with.
The software bridge then floods it out all ports of the bridge. But the
set offload_fwd_mark patch changes this. The software bridge now
assumes the hardware has already flooded broadcast out all ports of the
switch as needed, so it does not do any flooding itself. As a result,
on Marvell devices, broadcast packets don't get flooded at all.

The issue can be fixed. I just need to add an mdb entry for the
broadcast address to each port of the bridge in the switch, and the CPU
port. But i don't know at what level to do this. Should this be done at
the DSA level, or at the driver level? Do any chips do broadcast
flooding in hardware already, and hence currently see broadcast
duplication? If i add a broadcast mdb at the DSA level, and the chip is
already hard wired to flood broadcast, is it going to double flood?

Andrew
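The proposed fix — installing an mdb entry for the broadcast address on every bridged port plus the CPU port — can be sketched as a loop over the ports. The structures and helper name here are invented for illustration; this is not the DSA or switchdev API:

```c
#include <assert.h>
#include <string.h>

#define MAX_PORTS 8

struct toy_switch {
	int nports;
	int cpu_port;
	int bcast_mdb[MAX_PORTS];   /* per-port: broadcast mdb entry installed? */
};

static const unsigned char bcast_addr[6] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

/* Add a broadcast mdb entry to each bridged port and the CPU port so the
 * hardware floods broadcast itself instead of punting to the host. */
static void install_bcast_mdb(struct toy_switch *sw,
			      const int *bridged, int nbridged)
{
	int i;

	for (i = 0; i < nbridged; i++)
		sw->bcast_mdb[bridged[i]] = 1;
	sw->bcast_mdb[sw->cpu_port] = 1;
	(void)bcast_addr;   /* the real mdb entry would carry this address */
}
```

Ports outside the bridge get no entry, which preserves the isolation the chip's no-flood default provides; whether a chip that already floods broadcast in hardware would then double-flood is exactly the open question raised above.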
Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map
On Tue, Sep 05, 2017 at 02:59:38PM -0700, Chenbo Feng wrote:
> On Thu, Aug 31, 2017 at 7:05 PM, Alexei Starovoitov wrote:
> > On Thu, Aug 31, 2017 at 01:56:34PM -0700, Chenbo Feng wrote:
> >> From: Chenbo Feng
> >>
> >> Introduce a pointer into struct bpf_map to hold the security information
> >> about the map. The actual security struct varies based on the security
> >> models implemented. Place the LSM hooks before each of the unrestricted
> >> eBPF operations; the map_update_elem and map_delete_elem operations are
> >> checked by security_map_modify. The map_lookup_elem and map_get_next_key
> >> operations are checked by security_map_read.
> >>
> >> Signed-off-by: Chenbo Feng
> >
> > ...
> >
> >> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
> >>  	if (IS_ERR(map))
> >>  		return PTR_ERR(map);
> >>
> >> +	err = security_map_read(map);
> >> +	if (err)
> >> +		return -EACCES;
> >> +
> >>  	key = memdup_user(ukey, map->key_size);
> >>  	if (IS_ERR(key)) {
> >>  		err = PTR_ERR(key);
> >> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
> >>  	if (IS_ERR(map))
> >>  		return PTR_ERR(map);
> >>
> >> +	err = security_map_modify(map);
> >
> > I don't feel these extra hooks are really thought through.
> > With such a hook you'll disallow map_update for a given map. That's it.
> > The keys/values etc won't be used in such a security decision.
> > In that case you don't need such hooks in update/lookup at all.
> > Only in map_creation and object_get calls where an FD can be received.
> > In other words I suggest to follow standard unix practices:
> > do permission checks in open() and allow read/write() if the FD is
> > valid. Same here. Do permission checks in prog_load/map_create/obj_pin/get
> > and that will be enough to jail the bpf subsystem.
> > bpf cmds that need to be fast (like lookup and update) should not
> > have security hooks.
>
> Thanks for pointing this out. I agree we should only do checks on
> creating the object and passing the object from one process to
> another. And if we can still distinguish the read/write operation when
> obtaining the file fd, that would be great. But that may require us to
> add a new mode field into the bpf_map struct and change the attribute
> struct when doing the bpf syscall. What do you think about this
> approach? Alternatively, we can do simple checks for bpf_obj_create
> and bpf_obj_use when creating the object and passing the object
> respectively, but this solution cannot distinguish map modify and
> read.

iirc the idea to have read-only maps was brought up in the past
(unfortunately no one took a stab at implementing it), but imo that's
better than relying on lsm to provide such a restriction, and more
secure, since even if you disable map_update via lsm, the user can
craft a program just to update the map from it.

For bpffs we already test for inode_permission(inode, MAY_WRITE)
during the BPF_OBJ_GET command and we can extend this facility further.

Also we can allow dropping 'write' permissions from the map (internally
implemented via a flag in struct bpf_map), so update/delete operations
won't be allowed. This flag will be checked by the syscall during
map_update/delete and by the verifier if such a read-only map is used
by the program being loaded, so map_update helpers won't be allowed in
such a program.

Would also be good to support read-only programs as well (progs that
are read-only from a kernel state point of view). They won't be able to
modify packets, maps, etc.
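Alexei's suggestion — a write-permission flag carried on the map itself and checked by the update/delete paths while reads stay unrestricted — can be sketched with a toy map. `MAP_F_RDONLY` here is hypothetical (mainline later grew BPF_F_RDONLY/BPF_F_WRONLY flags, but this sketch does not claim their exact semantics):

```c
#include <assert.h>
#include <string.h>

#define MAP_F_RDONLY 0x1   /* hypothetical "no writes" map flag */

struct toy_map {
	unsigned flags;
	int vals[16];
};

/* Write path: rejected outright when the read-only flag is set.  In the
 * kernel this check would sit in the syscall and in the verifier, so a
 * loaded program couldn't call update helpers on the map either. */
static int map_update_elem(struct toy_map *m, int key, int val)
{
	if (m->flags & MAP_F_RDONLY)
		return -1;         /* -EPERM in spirit */
	if (key < 0 || key >= 16)
		return -22;        /* -EINVAL */
	m->vals[key] = val;
	return 0;
}

/* Read path: stays allowed regardless of the write flag. */
static int map_lookup_elem(const struct toy_map *m, int key, int *val)
{
	if (key < 0 || key >= 16)
		return -22;
	*val = m->vals[key];
	return 0;
}
```

This captures why the flag approach beats a bare LSM hook on map_update: the property is enforced on every write path, including programs, not just on the syscall a hook happens to guard.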
Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic
On Wed, 6 Sep 2017 01:35:02 +0200 Andrew Lunn wrote: > [cover letter and diffstat snipped; the quoted text appears in full in the original posting of this series below] This looks much cleaner. I don't have DSA hardware or infrastructure to look deeper.
[PATCH v2 rfc 1/8] net: bridge: Rename mglist to host_joined
The boolean mglist indicates the host has joined a particular multicast group on the bridge interface. It is badly named, obscuring what it means. Rename it. Signed-off-by: Andrew Lunn
---
 net/bridge/br_input.c | 2 +- net/bridge/br_mdb.c | 2 +- net/bridge/br_multicast.c | 14 +++--- net/bridge/br_private.h | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index 7637f58c1226..0dd55a183391 100644 --- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -179,7 +179,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb mdst = br_mdb_get(br, skb, vid); if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) && br_multicast_querier_exists(br, eth_hdr(skb))) { - if ((mdst && mdst->mglist) || + if ((mdst && mdst->host_joined) || br_multicast_is_router(br)) { local_rcv = true; br->dev->stats.multicast++;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index ca01def49af0..71885e251988 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -654,7 +654,7 @@ static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry) call_rcu_bh(&p->rcu, br_multicast_free_pg); err = 0; - if (!mp->ports && !mp->mglist && + if (!mp->ports && !mp->host_joined && netif_running(br->dev)) mod_timer(&mp->timer, jiffies); break;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index 8dc5c8d69bcd..c6b2b8d419e7 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -249,7 +249,7 @@ static void br_multicast_group_expired(unsigned long data) if (!netif_running(br->dev) || timer_pending(&mp->timer)) goto out; - mp->mglist = false; + mp->host_joined = false; if (mp->ports) goto out; @@ -292,7 +292,7 @@ static void br_multicast_del_pg(struct net_bridge *br, p->flags); call_rcu_bh(&p->rcu, br_multicast_free_pg); - if (!mp->ports && !mp->mglist && + if (!mp->ports && !mp->host_joined && netif_running(br->dev)) mod_timer(&mp->timer, jiffies); @@ -775,7 +775,7 @@ static int br_multicast_add_group(struct net_bridge *br, goto err; if (!port) { - mp->mglist = true; + mp->host_joined = true; mod_timer(&mp->timer, now + br->multicast_membership_interval); goto out; } @@ -1451,7 +1451,7 @@ static int br_ip4_multicast_query(struct net_bridge *br, max_delay *= br->multicast_last_member_count; - if (mp->mglist && + if (mp->host_joined && (timer_pending(&mp->timer) ? time_after(mp->timer.expires, now + max_delay) : try_to_del_timer_sync(&mp->timer) >= 0)) @@ -1535,7 +1535,7 @@ static int br_ip6_multicast_query(struct net_bridge *br, goto out; max_delay *= br->multicast_last_member_count; - if (mp->mglist && + if (mp->host_joined && (timer_pending(&mp->timer) ? time_after(mp->timer.expires, now + max_delay) : try_to_del_timer_sync(&mp->timer) >= 0)) @@ -1596,7 +1596,7 @@ br_multicast_leave_group(struct net_bridge *br, br_mdb_notify(br->dev, port, group, RTM_DELMDB, p->flags); - if (!mp->ports && !mp->mglist && + if (!mp->ports && !mp->host_joined && netif_running(br->dev)) mod_timer(&mp->timer, jiffies); } @@ -1636,7 +1636,7 @@ br_multicast_leave_group(struct net_bridge *br, br->multicast_last_member_interval; if (!port) { - if (mp->mglist && + if (mp->host_joined && (timer_pending(&mp->timer) ? time_after(mp->timer.expires, time) : try_to_del_timer_sync(&mp->timer) >= 0)) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index fd9ee73e0a6d..e79e0611908d 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -202,7 +202,7 @@ struct net_bridge_mdb_entry struct rcu_head rcu; struct timer_list timer; struct br_ip addr; - bool mglist; + bool host_joined; }; struct net_bridge_mdb_htable -- 2.14.1
[PATCH v2 rfc 3/8] net: bridge: Add/del switchdev object on host join/leave
When the host joins or leaves a multicast group, use switchdev to add an object to the hardware to forward traffic for the group to the host. Signed-off-by: Andrew Lunn
---
 include/net/switchdev.h | 1 + net/bridge/br_mdb.c | 39 +++ net/switchdev/switchdev.c | 2 ++ 3 files changed, 42 insertions(+)
diff --git a/include/net/switchdev.h b/include/net/switchdev.h index d767b7991887..b298a6b4c11d 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -75,6 +75,7 @@ enum switchdev_obj_id { SWITCHDEV_OBJ_ID_UNDEFINED, SWITCHDEV_OBJ_ID_PORT_VLAN, SWITCHDEV_OBJ_ID_PORT_MDB, + SWITCHDEV_OBJ_ID_HOST_MDB, }; struct switchdev_obj {
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 80fa91ccc50c..a9c03f3a3b88 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -291,6 +291,42 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv) kfree(priv); } +static void br_mdb_switchdev_host_port(struct net_device *dev, + struct net_device *lower_dev, + struct br_mdb_entry *entry, int type) +{ + struct switchdev_obj_port_mdb mdb = { + .obj = { + .id = SWITCHDEV_OBJ_ID_HOST_MDB, + .flags = SWITCHDEV_F_DEFER, + }, + .vid = entry->vid, + }; + + if (entry->addr.proto == htons(ETH_P_IP)) + ip_eth_mc_map(entry->addr.u.ip4, mdb.addr); +#if IS_ENABLED(CONFIG_IPV6) + else + ipv6_eth_mc_map(&entry->addr.u.ip6, mdb.addr); +#endif + + mdb.obj.orig_dev = dev; + if (type == RTM_NEWMDB) + switchdev_port_obj_add(lower_dev, &mdb.obj); + if (type == RTM_DELMDB) + switchdev_port_obj_del(lower_dev, &mdb.obj); +} + +static void br_mdb_switchdev_host(struct net_device *dev, + struct br_mdb_entry *entry, int type) +{ + struct net_device *lower_dev; + struct list_head *iter; + + netdev_for_each_lower_dev(dev, lower_dev, iter) + br_mdb_switchdev_host_port(dev, lower_dev, entry, type); +} + static void __br_mdb_notify(struct net_device *dev, struct net_bridge_port *p, struct br_mdb_entry *entry, int type) { @@ -330,6 +366,9 @@ static void __br_mdb_notify(struct net_device *dev, struct net_bridge_port *p, switchdev_port_obj_del(port_dev, &mdb.obj); } + if (!p) + br_mdb_switchdev_host(dev, entry, type); + skb = nlmsg_new(rtnl_mdb_nlmsg_size(), GFP_ATOMIC); if (!skb) goto errout;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c index 0531b41d1f2d..74b9d916a58b 100644 --- a/net/switchdev/switchdev.c +++ b/net/switchdev/switchdev.c @@ -345,6 +345,8 @@ static size_t switchdev_obj_size(const struct switchdev_obj *obj) return sizeof(struct switchdev_obj_port_vlan); case SWITCHDEV_OBJ_ID_PORT_MDB: return sizeof(struct switchdev_obj_port_mdb); + case SWITCHDEV_OBJ_ID_HOST_MDB: + return sizeof(struct switchdev_obj_port_mdb); default: BUG(); } -- 2.14.1
[PATCH v2 rfc 6/8] net: dsa: switch: Don't add CPU port to an mdb by default
Now that the host indicates when a multicast group should be forwarded from the switch to the host, don't do it by default. Signed-off-by: Andrew Lunn--- net/dsa/switch.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/dsa/switch.c b/net/dsa/switch.c index 326680039c52..147295d8de94 100644 --- a/net/dsa/switch.c +++ b/net/dsa/switch.c @@ -145,7 +145,7 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds, if (ds->index == info->sw_index) set_bit(info->port, group); for (port = 0; port < ds->num_ports; port++) - if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) + if (dsa_is_dsa_port(ds, port)) set_bit(port, group); return dsa_switch_mdb_add_bitmap(ds, info, mdb, group); -- 2.14.1
[PATCH v2 rfc 0/8] IGMP snooping for local traffic
After the very useful feedback from Nikolay, I threw away what I had, and started again. To recap: The Linux bridge supports IGMP snooping. It will listen to IGMP reports on bridge ports and keep track of which groups have been joined on an interface. It will then forward multicast based on this group membership. When the bridge adds or removes groups from an interface, it uses switchdev to request the hardware add an mdb to a port, so the hardware can perform the selective forwarding between ports. What is not covered by the current bridge code is IGMP joins/leaves from the host on the brX interface. These are not reported via switchdev, so the hardware does not know the local host is interested in the multicast frames. Luckily, the bridge does track joins/leaves on the brX interface. The code is obfuscated, which is why I missed it with my first attempt. So the first patch tries to remove this obfuscation. Currently, no notifications are sent when the bridge interface joins a group. The second patch adds them. bridge monitor then shows joins/leaves in the same way as for other ports of the bridge. Then starts the work of passing down to the hardware that the host has joined/left a group. The existing switchdev mdb object cannot be used, since the semantics are different. The existing SWITCHDEV_OBJ_ID_PORT_MDB is used to indicate a specific multicast group should be forwarded out that port of the switch. However, here we require the exact opposite. We want multicast frames for the group received on the port to be forwarded to the host. Hence add a new object SWITCHDEV_OBJ_ID_HOST_MDB, a multicast database entry to forward to the host. This new object is then propagated through the DSA layers. No DSA driver changes should be needed; this should just work... Getting the frames to the bridge as requested turned up an issue or three. The offload_fwd_mark is not being set by DSA, so the bridge floods the received frames back to the switch ports, resulting in duplication since the hardware has already flooded the packet. Fixing that turned up an issue with the meaning of SWITCHDEV_ATTR_ID_PORT_PARENT_ID in DSA. A DSA fabric of three switches needs to appear to the software bridge as a single switch. Otherwise the offload_fwd_mark does not work, and we get duplication on the non-ingress switch. But each switch returned a different value. And they were not unique. The third and last issue will be explained in a followup email. Open questions: Is sending notifications going to break userspace? Is this new switchdev object O.K. for the few non-DSA switches that exist? Is the SWITCHDEV_ATTR_ID_PORT_PARENT_ID change acceptable? Andrew Andrew Lunn (8): net: bridge: Rename mglist to host_joined net: bridge: Send notification when host join/leaves a group net: bridge: Add/del switchdev object on host join/leave net: dsa: slave: Handle switchdev host mdb add/del net: dsa: switch: handle host mdb add/remove net: dsa: switch: Don't add CPU port to an mdb by default net: dsa: set offload_fwd_mark on received packets net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID include/net/switchdev.h | 1 + net/bridge/br_input.c | 2 +- net/bridge/br_mdb.c | 50 +--- net/bridge/br_multicast.c | 18 +++- net/bridge/br_private.h | 2 +- net/dsa/dsa.c | 1 + net/dsa/dsa_priv.h | 7 + net/dsa/port.c | 26 + net/dsa/slave.c | 16 --- net/dsa/switch.c | 72 +++ net/switchdev/switchdev.c | 2 ++ 11 files changed, 168 insertions(+), 29 deletions(-) -- 2.14.1
[PATCH v2 rfc 4/8] net: dsa: slave: Handle switchdev host mdb add/del
Add code to handle switchdev host mdb add/del. As with normal mdb add/del, send a notification to the switch layer. Signed-off-by: Andrew Lunn--- net/dsa/dsa_priv.h | 7 +++ net/dsa/port.c | 26 ++ net/dsa/slave.c| 6 ++ 3 files changed, 39 insertions(+) diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index 9c3eeb72462d..0ffe49f78d14 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -24,6 +24,8 @@ enum { DSA_NOTIFIER_FDB_DEL, DSA_NOTIFIER_MDB_ADD, DSA_NOTIFIER_MDB_DEL, + DSA_NOTIFIER_HOST_MDB_ADD, + DSA_NOTIFIER_HOST_MDB_DEL, DSA_NOTIFIER_VLAN_ADD, DSA_NOTIFIER_VLAN_DEL, }; @@ -131,6 +133,11 @@ int dsa_port_mdb_add(struct dsa_port *dp, struct switchdev_trans *trans); int dsa_port_mdb_del(struct dsa_port *dp, const struct switchdev_obj_port_mdb *mdb); +int dsa_host_mdb_add(struct dsa_port *dp, +const struct switchdev_obj_port_mdb *mdb, +struct switchdev_trans *trans); +int dsa_host_mdb_del(struct dsa_port *dp, +const struct switchdev_obj_port_mdb *mdb); int dsa_port_vlan_add(struct dsa_port *dp, const struct switchdev_obj_port_vlan *vlan, struct switchdev_trans *trans); diff --git a/net/dsa/port.c b/net/dsa/port.c index 659676ba3f8b..5b18b9fe2219 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -199,6 +199,32 @@ int dsa_port_mdb_del(struct dsa_port *dp, return dsa_port_notify(dp, DSA_NOTIFIER_MDB_DEL, ); } +int dsa_host_mdb_add(struct dsa_port *dp, +const struct switchdev_obj_port_mdb *mdb, +struct switchdev_trans *trans) +{ + struct dsa_notifier_mdb_info info = { + .sw_index = dp->ds->index, + .port = dp->index, + .trans = trans, + .mdb = mdb, + }; + + return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_ADD, ); +} + +int dsa_host_mdb_del(struct dsa_port *dp, +const struct switchdev_obj_port_mdb *mdb) +{ + struct dsa_notifier_mdb_info info = { + .sw_index = dp->ds->index, + .port = dp->index, + .mdb = mdb, + }; + + return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_DEL, ); +} + int dsa_port_vlan_add(struct dsa_port *dp, const struct switchdev_obj_port_vlan 
*vlan, struct switchdev_trans *trans) diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 78e78a6e6833..2e07be149415 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -330,6 +330,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev, case SWITCHDEV_OBJ_ID_PORT_MDB: err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans); break; + case SWITCHDEV_OBJ_ID_HOST_MDB: + err = dsa_host_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans); + break; case SWITCHDEV_OBJ_ID_PORT_VLAN: err = dsa_port_vlan_add(dp, SWITCHDEV_OBJ_PORT_VLAN(obj), trans); @@ -353,6 +356,9 @@ static int dsa_slave_port_obj_del(struct net_device *dev, case SWITCHDEV_OBJ_ID_PORT_MDB: err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj)); break; + case SWITCHDEV_OBJ_ID_HOST_MDB: + err = dsa_host_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj)); + break; case SWITCHDEV_OBJ_ID_PORT_VLAN: err = dsa_port_vlan_del(dp, SWITCHDEV_OBJ_PORT_VLAN(obj)); break; -- 2.14.1
[PATCH v2 rfc 8/8] net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID
SWITCHDEV_ATTR_ID_PORT_PARENT_ID is used by the software bridge when determining which ports to flood a packet out. If the packet originated from a switch, it assumes the switch has already flooded the packet out the switch's ports, so the bridge should not flood the packet itself out switch ports. Ports on the same switch are expected to return the same parent ID when SWITCHDEV_ATTR_ID_PORT_PARENT_ID is called. DSA gets this wrong with clusters of switches. As far as the software bridge is concerned, the cluster is all one switch. A packet from any switch in the cluster can be assumed to have been flooded as needed out all ports of the cluster, not just the switch it originated from. Hence all ports of a cluster should return the same parent. The old implementation did not; each switch in the cluster had its own ID. Also wrong was that the ID was not unique if multiple DSA instances are in operation. Use the MAC address of the master interface as the parent ID. This is the same for all switches in a cluster, and should be unique if there are multiple clusters. Signed-off-by: Andrew Lunn
---
 net/dsa/slave.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 2e07be149415..d2744b0dad6e 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -374,13 +374,15 @@ static int dsa_slave_port_attr_get(struct net_device *dev, struct switchdev_attr *attr) { struct dsa_slave_priv *p = netdev_priv(dev); - struct dsa_switch *ds = p->dp->ds; switch (attr->id) { - case SWITCHDEV_ATTR_ID_PORT_PARENT_ID: - attr->u.ppid.id_len = sizeof(ds->index); - memcpy(&attr->u.ppid.id, &ds->index, attr->u.ppid.id_len); + case SWITCHDEV_ATTR_ID_PORT_PARENT_ID: { + struct net_device *master = dsa_master_netdev(p); + + attr->u.ppid.id_len = ETH_ALEN; + ether_addr_copy(attr->u.ppid.id, master->dev_addr); break; + } case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT: attr->u.brport_flags_support = 0; break; -- 2.14.1
[PATCH v2 rfc 5/8] net: dsa: switch: handle host mdb add/remove
When receiving notifications of the host mdb add/remove, add/remove an mdb to the CPU port, so that traffic flows from a port to the CPU port and hence to the host. Signed-off-by: Andrew Lunn--- net/dsa/switch.c | 72 ++-- 1 file changed, 60 insertions(+), 12 deletions(-) diff --git a/net/dsa/switch.c b/net/dsa/switch.c index e6c06aa349a6..326680039c52 100644 --- a/net/dsa/switch.c +++ b/net/dsa/switch.c @@ -108,22 +108,14 @@ static int dsa_switch_fdb_del(struct dsa_switch *ds, info->vid); } -static int dsa_switch_mdb_add(struct dsa_switch *ds, - struct dsa_notifier_mdb_info *info) +static int dsa_switch_mdb_add_bitmap(struct dsa_switch *ds, +struct dsa_notifier_mdb_info *info, +const struct switchdev_obj_port_mdb *mdb, +unsigned long *group) { - const struct switchdev_obj_port_mdb *mdb = info->mdb; struct switchdev_trans *trans = info->trans; - DECLARE_BITMAP(group, ds->num_ports); int port, err; - /* Build a mask of Multicast group members */ - bitmap_zero(group, ds->num_ports); - if (ds->index == info->sw_index) - set_bit(info->port, group); - for (port = 0; port < ds->num_ports; port++) - if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) - set_bit(port, group); - if (switchdev_trans_ph_prepare(trans)) { if (!ds->ops->port_mdb_prepare || !ds->ops->port_mdb_add) return -EOPNOTSUPP; @@ -141,6 +133,40 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds, return 0; } +static int dsa_switch_mdb_add(struct dsa_switch *ds, + struct dsa_notifier_mdb_info *info) +{ + const struct switchdev_obj_port_mdb *mdb = info->mdb; + DECLARE_BITMAP(group, ds->num_ports); + int port; + + /* Build a mask of Multicast group members */ + bitmap_zero(group, ds->num_ports); + if (ds->index == info->sw_index) + set_bit(info->port, group); + for (port = 0; port < ds->num_ports; port++) + if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) + set_bit(port, group); + + return dsa_switch_mdb_add_bitmap(ds, info, mdb, group); +} + +static int dsa_switch_host_mdb_add(struct 
dsa_switch *ds, + struct dsa_notifier_mdb_info *info) +{ + const struct switchdev_obj_port_mdb *mdb = info->mdb; + DECLARE_BITMAP(group, ds->num_ports); + int port; + + /* Build a mask of Multicast group members */ + bitmap_zero(group, ds->num_ports); + for (port = 0; port < ds->num_ports; port++) + if (dsa_is_cpu_port(ds, port)) + set_bit(port, group); + + return dsa_switch_mdb_add_bitmap(ds, info, mdb, group); +} + static int dsa_switch_mdb_del(struct dsa_switch *ds, struct dsa_notifier_mdb_info *info) { @@ -155,6 +181,22 @@ static int dsa_switch_mdb_del(struct dsa_switch *ds, return 0; } +static int dsa_switch_host_mdb_del(struct dsa_switch *ds, + struct dsa_notifier_mdb_info *info) +{ + const struct switchdev_obj_port_mdb *mdb = info->mdb; + int port; + + if (!ds->ops->port_mdb_del) + return -EOPNOTSUPP; + + for (port = 0; port < ds->num_ports; port++) + if (dsa_is_cpu_port(ds, port)) + ds->ops->port_mdb_del(ds, port, mdb); + + return 0; +} + static int dsa_switch_vlan_add(struct dsa_switch *ds, struct dsa_notifier_vlan_info *info) { @@ -230,6 +272,12 @@ static int dsa_switch_event(struct notifier_block *nb, case DSA_NOTIFIER_MDB_DEL: err = dsa_switch_mdb_del(ds, info); break; + case DSA_NOTIFIER_HOST_MDB_ADD: + err = dsa_switch_host_mdb_add(ds, info); + break; + case DSA_NOTIFIER_HOST_MDB_DEL: + err = dsa_switch_host_mdb_del(ds, info); + break; case DSA_NOTIFIER_VLAN_ADD: err = dsa_switch_vlan_add(ds, info); break; -- 2.14.1
[PATCH v2 rfc 7/8] net: dsa: set offload_fwd_mark on received packets
The software bridge needs to know if a packet has already been bridged by hardware offload to ports in the same hardware offload domain, in order that it does not re-flood them, causing duplicates. This is particularly true for multicast traffic which the host has requested. By setting offload_fwd_mark in the skb, the bridge will only flood to ports in other offload domains and other netifs. Signed-off-by: Andrew Lunn
---
 net/dsa/dsa.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 03c58b0eb082..5732696ac71c 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -213,6 +213,7 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev, skb_push(skb, ETH_HLEN); skb->pkt_type = PACKET_HOST; skb->protocol = eth_type_trans(skb, skb->dev); + skb->offload_fwd_mark = 1; s = this_cpu_ptr(p->stats64); u64_stats_update_begin(&s->syncp); -- 2.14.1
[PATCH v2 rfc 2/8] net: bridge: Send notification when host join/leaves a group
The host can join or leave a multicast group on the brX interface, as indicated by IGMP snooping. This is tracked within the bridge multicast code. Send a notification when this happens, in the same way a notification is sent when a port of the bridge joins/leaves a group because of IGMP snooping. Signed-off-by: Andrew Lunn--- net/bridge/br_mdb.c | 9 ++--- net/bridge/br_multicast.c | 6 +- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 71885e251988..80fa91ccc50c 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -316,7 +316,7 @@ static void __br_mdb_notify(struct net_device *dev, struct net_bridge_port *p, #endif mdb.obj.orig_dev = port_dev; - if (port_dev && type == RTM_NEWMDB) { + if (p && port_dev && type == RTM_NEWMDB) { complete_info = kmalloc(sizeof(*complete_info), GFP_ATOMIC); if (complete_info) { complete_info->port = p; @@ -326,7 +326,7 @@ static void __br_mdb_notify(struct net_device *dev, struct net_bridge_port *p, if (switchdev_port_obj_add(port_dev, )) kfree(complete_info); } - } else if (port_dev && type == RTM_DELMDB) { + } else if (p && port_dev && type == RTM_DELMDB) { switchdev_port_obj_del(port_dev, ); } @@ -352,7 +352,10 @@ void br_mdb_notify(struct net_device *dev, struct net_bridge_port *port, struct br_mdb_entry entry; memset(, 0, sizeof(entry)); - entry.ifindex = port->dev->ifindex; + if (port) + entry.ifindex = port->dev->ifindex; + else + entry.ifindex = dev->ifindex; entry.addr.proto = group->proto; entry.addr.u.ip4 = group->u.ip4; #if IS_ENABLED(CONFIG_IPV6) diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index c6b2b8d419e7..955f340fe719 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -250,6 +250,7 @@ static void br_multicast_group_expired(unsigned long data) goto out; mp->host_joined = false; + br_mdb_notify(br->dev, NULL, >addr, RTM_DELMDB, 0); if (mp->ports) goto out; @@ -775,7 +776,10 @@ static int 
br_multicast_add_group(struct net_bridge *br, goto err; if (!port) { - mp->host_joined = true; + if (!mp->host_joined) { + mp->host_joined = true; + br_mdb_notify(br->dev, NULL, >addr, RTM_NEWMDB, 0); + } mod_timer(>timer, now + br->multicast_membership_interval); goto out; } -- 2.14.1
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On 09/06/2017 12:45 AM, Daniel Borkmann wrote: On 09/06/2017 12:01 AM, Roopa Prabhu wrote: On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang wrote: On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote: Hi all, This RFC adds a new mode for clsact which designates a device's egress classifier as global per netns. The packets that are not classified for a particular device will be classified using the global classifier. We have needed a global classifier for some time now for various purposes and setting the single bridge or loopback/vrf device as the Can you elaborate a bit more on the ... "we have needed a global classifier for some time now for various purposes". global classifier device is acceptable for us. Doing it this way avoids the act/cls device and queue dependencies. This is strictly an RFC patch just to show the intent; if we agree on the details the proposed patch will have support for both ingress and egress, and will be using a static key to avoid the fast path test when no global classifier has been configured. Example (need a modified tc that adds TCA_OPTIONS when using q_clsact): $ tc qdisc add dev lo clsact global $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop the last filter will be global for all devices that don't have a specific egress_cl_list (i.e. have clsact configured). Sorry, this is too ugly. +1 netdevice is still implied in your command line even if you treat it as global. It is essentially hard to bypass the netdevice layer since the netdevice is the core of L2 and also where everything begins. One could probably use a special wildcard device name, e.g. tcpdump allows for 'tcpdump -i any' to capture on all devices, so you could indicate something like 'tc filter add dev any blah' to have similar semantics in the realm of the netns w/o making it look too hacky for tc users perhaps. Maybe the best we can do here is make tc filters standalone as tc actions so that filters can exist before qdisc's and netdevices.
But this probably requires significant work to make it work with both the existing non-standalone filters and standalone ones bound to qdiscs. Yes, like Nikolay says, we have been discussing this as well. Nikolay's patch is a clever and, most importantly, non-invasive way today, given that the anchor point for tc rules is a netdev. We have also considered a separate implicit tc anchor device. Seems ugly just as well. :( Hmm, why not just have the two list pointers (ingress, egress list) in the netns struct, and when something configures them to be effectively non-zero, then devices in that netns could automatically get a clsact and inherit the lists from there, such that sch_handle_ingress() and sch_handle_egress() require exactly zero changes in the fast path. You could then go and say that either you would make changes to clsact for individual devices immutable when they use the 'shared' list pointers, or duplicate the configs when being altered from the global one. Would push the complexity to the control path only, at least. Just a brief thought.
Re: linux-next: manual merge of the pci tree with the net tree
Hi Sinan, On Tue, 5 Sep 2017 18:54:38 -0400 Sinan Kaya wrote: > > On 9/4/2017 12:54 AM, Stephen Rothwell wrote: > > Just a reminder that this conflict still exists. > > I suppose this was a message for Bjorn, right? Correct. It was just a reminder that he may have to tell Linus about the conflict when he asks Linus to merge his tree (depending on how complex the conflict is and if he gets in before Dave gets the net-next tree merged). -- Cheers, Stephen Rothwell
Re: linux-next: manual merge of the pci tree with the net tree
On 9/4/2017 12:54 AM, Stephen Rothwell wrote: > Just a reminder that this conflict still exists. I suppose this was a message for Bjorn, right? I looked at the change and it looked good to me. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH net-next RFC 1/2] tun: enable NAPI for TUN/TAP driver
On Tue, 5 Sep 2017 15:35:50 -0700 Petar Penkov wrote: > Changes TUN driver to use napi_gro_receive() upon receiving packets > rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables > these changes and operation is not affected if the flag is disabled. > SKBs are constructed upon packet arrival and are queued to be > processed later. > > The new path was evaluated with a benchmark with the following setup: > Open two tap devices and a receiver thread that reads in a loop for > each device. Start one sender thread and pin all threads to different > CPUs. Send 1M minimum UDP packets to each device and measure sending > time for each of the sending methods: > napi_gro_receive(): 4.90s > netif_rx_ni(): 4.90s > netif_receive_skb(): 7.20s > > Signed-off-by: Petar Penkov > Cc: Eric Dumazet > Cc: Mahesh Bandewar > Cc: Willem de Bruijn > Cc: da...@davemloft.net > Cc: ppen...@stanford.edu Why is this optional? It adds two code paths, both of which need to be tested.
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On 09/06/2017 12:01 AM, Roopa Prabhu wrote:
> On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang wrote:
>> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote:
>>> Hi all,
>>> This RFC adds a new mode for clsact which designates a device's egress
>>> classifier as global per netns. The packets that are not classified for
>>> a particular device will be classified using the global classifier.
>>> We have needed a global classifier for some time now for various
>>> purposes and setting the single bridge or loopback/vrf device as the
>
> Can you elaborate a bit more on the ... "we have needed a global
> classifier for some time now for various purposes".
>
>>> global classifier device is acceptable for us. Doing it this way avoids
>>> the act/cls device and queue dependencies.
>>>
>>> This is strictly an RFC patch just to show the intent, if we agree on
>>> the details the proposed patch will have support for both ingress and
>>> egress, and will be using a static key to avoid the fast path test when
>>> no global classifier has been configured.
>>>
>>> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
>>> $ tc qdisc add dev lo clsact global
>>> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop
>>>
>>> the last filter will be global for all devices that don't have a
>>> specific egress_cl_list (i.e. have clsact configured).
>>
>> Sorry this is too ugly.
>
> +1
>
>> netdevice is still implied in your command line even if you treat it
>> as global. It is essentially hard to bypass netdevice layer since
>> netdevice is the core of L2 and also where everything begins.
>>
>> Maybe the best we can do here is make tc filters standalone as tc
>> actions so that filters can exist before qdisc's and netdevices. But
>> this probably requires significant work to make it work with both
>> existing non-standalone filters and standalone ones bound to qdisc's.
>
> yes, like Nikolay says we have been discussing this as well. Nikolay's
> patch is a clever and most importantly non-invasive way today given
> the anchor point for tc rules is a netdev.
> we have also considered a separate implicit tc anchor device.

Seems ugly just as well. :( Hmm, why not just have the two list pointers (ingress list, egress list) in the netns struct, and when something configures them to be effectively non-zero, devices in that netns could automatically get a clsact and inherit the lists from there, such that sch_handle_ingress() and sch_handle_egress() require exactly zero changes in the fast path. You could then go and say that either you make changes to clsact for individual devices immutable when they use the 'shared' list pointers, or you duplicate the configs when they are altered from the global one. That would push the complexity to the control path only, at least. Just a brief thought.
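The shared-list suggestion above (two classifier list pointers in the netns struct, inherited by devices) can be sketched roughly as follows. This is only an illustration of the idea, not a real patch: none of these type or field names exist in the kernel, and the real struct net obviously cannot be redefined like this.

```c
/* Hypothetical sketch of the per-netns shared classifier lists suggested
 * above (all names invented for illustration). The netns would grow two
 * shared classifier list pointers, guarded by a static key so the fast
 * path pays nothing when no global classifier is configured. */
struct netns_shared_cls {
	struct tcf_proto __rcu *ingress_cl_list;  /* shared ingress filters */
	struct tcf_proto __rcu *egress_cl_list;   /* shared egress filters */
};

static struct static_key netns_shared_cls_needed /* __read_mostly */;

/* Control path only: when the shared lists become non-empty, walk the
 * devices in the netns, auto-create clsact where missing, and point each
 * device's {ingress,egress}_cl_list at the shared ones. That way
 * sch_handle_ingress()/sch_handle_egress() need zero fast-path changes.
 * Per-device clsact config would either be made immutable while the
 * shared lists are in use, or the shared config would be duplicated on
 * modification. */
```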
[PATCH net-next RFC 0/2] Improve code coverage of syzkaller
This patch series is intended to improve code coverage of syzkaller on the early receive path, specifically including the flow dissector, GRO, and GRO-with-frags parts of the networking stack. Syzkaller exercises the stack through the TUN driver and this is therefore where the changes reside. Current coverage through netif_receive_skb() is limited as it does not touch on any of the aforementioned code paths. Furthermore, for full coverage, it is necessary to have more flexibility over the linear and non-linear data of the skbs.

The following patches address this by providing the user (syzkaller) with the ability to send via napi_gro_receive() and napi_gro_frags(). Additionally, syzkaller can specify how many fragments there are and how much data per fragment there is. This is done by exploiting the convenient structure of iovecs. Finally, this patch series adds support for exercising the flow dissector during fuzzing.

The code path including napi_gro_receive() can be enabled via the CONFIG_TUN_NAPI compile-time flag, and can be used by users other than syzkaller. The remainder of the changes in this patch series give the user significantly more control over packets entering the kernel. To avoid potential security vulnerabilities, the ability to send custom skbs and the flow dissector code paths are hidden behind a run-time flag IFF_NAPI_FRAGS that is advertised and accepted only if CONFIG_TUN_NAPI is enabled.

The patch series will be followed with changes to packetdrill, where these additions to the TUN driver are exercised and demonstrated. This will give the ability to write regression tests for specific parts of the early networking stack.

Patch 1/ Add NAPI struct per receive queue, enable NAPI, and use napi_gro_receive()
Patch 2/ Use NAPI skb and napi_gro_frags(), exercise flow dissector, and allow custom skbs.

Petar Penkov (2):
  tun: enable NAPI for TUN/TAP driver
  tun: enable napi_gro_frags() for TUN/TAP driver

 drivers/net/Kconfig         |   8 ++
 drivers/net/tun.c           | 251 +---
 include/uapi/linux/if_tun.h |   1 +
 3 files changed, 246 insertions(+), 14 deletions(-)

--
2.14.1.581.gf28d330327-goog
[PATCH net-next RFC 2/2] tun: enable napi_gro_frags() for TUN/TAP driver
Add a TUN/TAP receive mode that exercises the napi_gro_frags() interface. This mode is available only in TAP mode, as the interface expects packets with Ethernet headers. Furthermore, packets follow the layout of the iov_iter that was received. The first iovec is the linear data, and every one after the first is a fragment. If there are more fragments than the max number, drop the packet. Additionally, invoke eth_get_headlen() to exercise flow dissector code and to verify that the header resides in the linear data.

The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option. This is imposed because this mode is intended for testing via tools like syzkaller and packetdrill, and the increased flexibility it provides can introduce security vulnerabilities.

Signed-off-by: Petar Penkov
Cc: Eric Dumazet
Cc: Mahesh Bandewar
Cc: Willem de Bruijn
Cc: da...@davemloft.net
Cc: ppen...@stanford.edu
---
 drivers/net/tun.c           | 135 ++--
 include/uapi/linux/if_tun.h |   1 +
 2 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d5c824e3ec42..2ba9809ab6cd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -120,8 +121,15 @@ do { \
 #define TUN_VNET_LE 0x8000
 #define TUN_VNET_BE 0x4000

+#if IS_ENABLED(CONFIG_TUN_NAPI)
+#define TUN_FEATURES_EXTRA IFF_NAPI_FRAGS
+#else
+#define TUN_FEATURES_EXTRA 0
+#endif
+
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
-		      IFF_MULTI_QUEUE)
+		      IFF_MULTI_QUEUE | TUN_FEATURES_EXTRA)
+
 #define GOODCOPY_LEN 128

 #define FLT_EXACT_COUNT 8
@@ -173,6 +181,7 @@ struct tun_file {
 	unsigned int ifindex;
 	};
 	struct napi_struct napi;
+	struct mutex napi_mutex;	/* Protects access to the above napi */
 	struct list_head next;
 	struct tun_struct *detached;
 	struct skb_array tx_array;
@@ -276,6 +285,7 @@ static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile)
 		netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
			       NAPI_POLL_WEIGHT);
 		napi_enable(&tfile->napi);
+		mutex_init(&tfile->napi_mutex);
 	}
 }
@@ -291,6 +301,11 @@ static void tun_napi_del(struct tun_file *tfile)
 		netif_napi_del(&tfile->napi);
 }

+static bool tun_napi_frags_enabled(const struct tun_struct *tun)
+{
+	return READ_ONCE(tun->flags) & IFF_NAPI_FRAGS;
+}
+
 #ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
@@ -1034,7 +1049,8 @@ static void tun_poll_controller(struct net_device *dev)
	 * supports polling, which enables bridge devices in virt setups to
	 * still use netconsole
	 * If NAPI is enabled, however, we need to schedule polling for all
-	 * queues.
+	 * queues unless we are using napi_gro_frags(), which we call in
+	 * process context and not in NAPI context.
	 */
	if (IS_ENABLED(CONFIG_TUN_NAPI)) {
@@ -1042,6 +1058,9 @@ static void tun_poll_controller(struct net_device *dev)
 		struct tun_file *tfile;
 		int i;

+		if (tun_napi_frags_enabled(tun))
+			return;
+
 		rcu_read_lock();
 		for (i = 0; i < tun->numqueues; i++) {
 			tfile = rcu_dereference(tun->tfiles[i]);
@@ -1264,6 +1283,64 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
 	return mask;
 }

+static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
+					    size_t len,
+					    const struct iov_iter *it)
+{
+	struct sk_buff *skb;
+	size_t linear;
+	int err;
+	int i;
+
+	if (it->nr_segs > MAX_SKB_FRAGS + 1)
+		return ERR_PTR(-ENOMEM);
+
+	local_bh_disable();
+	skb = napi_get_frags(&tfile->napi);
+	local_bh_enable();
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	linear = iov_iter_single_seg_count(it);
+	err = __skb_grow(skb, linear);
+	if (err)
+		goto free;
+
+	skb->len = len;
+	skb->data_len = len - linear;
+	skb->truesize += skb->data_len;
+
+	for (i = 1; i < it->nr_segs; i++) {
+		size_t fragsz = it->iov[i].iov_len;
+		unsigned long offset;
+		struct page *page;
+		void *data;
+
+		if (fragsz == 0 || fragsz > PAGE_SIZE) {
+			err = -EINVAL;
+			goto free;
+		}
+
+		local_bh_disable();
+		data =
[PATCH net-next RFC 1/2] tun: enable NAPI for TUN/TAP driver
Changes TUN driver to use napi_gro_receive() upon receiving packets rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables these changes; operation is not affected if the flag is disabled. SKBs are constructed upon packet arrival and are queued to be processed later.

The new path was evaluated with a benchmark with the following setup: Open two tap devices and a receiver thread that reads in a loop for each device. Start one sender thread and pin all threads to different CPUs. Send 1M minimum-sized UDP packets to each device and measure the sending time for each of the sending methods:

  napi_gro_receive():  4.90s
  netif_rx_ni():       4.90s
  netif_receive_skb(): 7.20s

Signed-off-by: Petar Penkov
Cc: Eric Dumazet
Cc: Mahesh Bandewar
Cc: Willem de Bruijn
Cc: da...@davemloft.net
Cc: ppen...@stanford.edu
---
 drivers/net/Kconfig |   8
 drivers/net/tun.c   | 120 +++-
 2 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 83a1616903f8..34850d71ddd1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -307,6 +307,14 @@ config TAP
 	  This option is selected by any driver implementing tap user space
 	  interface for a virtual interface to re-use core tap functionality.

+config TUN_NAPI
+	bool "NAPI support on tx path for TUN/TAP driver"
+	default n
+	depends on TUN
+	---help---
+	  This option allows the TUN/TAP driver to use NAPI to pass packets to
+	  the kernel when receiving packets from user space via write()/send().
+
 config TUN_VNET_CROSS_LE
 	bool "Support for cross-endian vnet headers on little-endian kernels"
 	default n

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 06e8f0bb2dab..d5c824e3ec42 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -172,6 +172,7 @@ struct tun_file {
 	u16 queue_index;
 	unsigned int ifindex;
 	};
+	struct napi_struct napi;
 	struct list_head next;
 	struct tun_struct *detached;
 	struct skb_array tx_array;
@@ -229,6 +230,67 @@ struct tun_struct {
 	struct bpf_prog __rcu *xdp_prog;
 };

+static int tun_napi_receive(struct napi_struct *napi, int budget)
+{
+	struct tun_file *tfile = container_of(napi, struct tun_file, napi);
+	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+	struct sk_buff_head process_queue;
+	struct sk_buff *skb;
+	int received = 0;
+
+	__skb_queue_head_init(&process_queue);
+
+	spin_lock(&queue->lock);
+	skb_queue_splice_tail_init(queue, &process_queue);
+	spin_unlock(&queue->lock);
+
+	while (received < budget && (skb = __skb_dequeue(&process_queue))) {
+		napi_gro_receive(napi, skb);
+		++received;
+	}
+
+	if (!skb_queue_empty(&process_queue)) {
+		spin_lock(&queue->lock);
+		skb_queue_splice(&process_queue, queue);
+		spin_unlock(&queue->lock);
+	}
+
+	return received;
+}
+
+static int tun_napi_poll(struct napi_struct *napi, int budget)
+{
+	unsigned int received;
+
+	received = tun_napi_receive(napi, budget);
+
+	if (received < budget)
+		napi_complete_done(napi, received);
+
+	return received;
+}
+
+static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile)
+{
+	if (IS_ENABLED(CONFIG_TUN_NAPI)) {
+		netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
+			       NAPI_POLL_WEIGHT);
+		napi_enable(&tfile->napi);
+	}
+}
+
+static void tun_napi_disable(struct tun_file *tfile)
+{
+	if (IS_ENABLED(CONFIG_TUN_NAPI))
+		napi_disable(&tfile->napi);
+}
+
+static void tun_napi_del(struct tun_file *tfile)
+{
+	if (IS_ENABLED(CONFIG_TUN_NAPI))
+		netif_napi_del(&tfile->napi);
+}
+
 #ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
@@ -541,6 +603,11 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 	tun = rtnl_dereference(tfile->tun);

+	if (tun && clean) {
+		tun_napi_disable(tfile);
+		tun_napi_del(tfile);
+	}
+
 	if (tun && !tfile->detached) {
 		u16 index = tfile->queue_index;
 		BUG_ON(index >= tun->numqueues);
@@ -598,6 +665,7 @@ static void tun_detach_all(struct net_device *dev)
 	for (i = 0; i < n; i++) {
 		tfile = rtnl_dereference(tun->tfiles[i]);
 		BUG_ON(!tfile);
+		tun_napi_disable(tfile);
 		tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
 		tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 		RCU_INIT_POINTER(tfile->tun, NULL);
@@ -613,6 +681,7 @@ static void tun_detach_all(struct net_device *dev)
 	synchronize_net();
 	for (i
Re: [PATCH net-next] bpf: fix numa_node validation
On Mon, Sep 04, 2017 at 10:41:02PM -0700, Eric Dumazet wrote: > From: Eric Dumazet> > syzkaller reported crashes in bpf map creation or map update [1] > > Problem is that nr_node_ids is a signed integer, > NUMA_NO_NODE is also an integer, so it is very tempting > to declare numa_node as a signed integer. > > This means the typical test to validate a user provided value : > > if (numa_node != NUMA_NO_NODE && > (numa_node >= nr_node_ids || > !node_online(numa_node))) > > must be written : > > if (numa_node != NUMA_NO_NODE && > ((unsigned int)numa_node >= nr_node_ids || Thanks for fixing it. > !node_online(numa_node))) > > > [1] > kernel BUG at mm/slab.c:3256! > invalid opcode: [#1] SMP KASAN > Dumping ftrace buffer: >(ftrace buffer empty) > Modules linked in: > CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > task: 8801d2bc60c0 task.stack: 8801c0c9 > RIP: 0010:cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 > RSP: 0018:8801c0c97638 EFLAGS: 00010096 > RAX: 8b7b RBX: 01080220 RCX: > RDX: 8b7b RSI: 01080220 RDI: 8801dac00040 > RBP: 8801c0c976c0 R08: R09: > R10: 8801c0c97620 R11: 0001 R12: 8801dac00040 > R13: 8801dac00040 R14: R15: 8b7b > FS: 02119940() GS:8801db20() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 20001fec CR3: 0001d298 CR4: 001406f0 > Call Trace: > __do_kmalloc_node mm/slab.c:3688 [inline] > __kmalloc_node+0x33/0x70 mm/slab.c:3696 > kmalloc_node include/linux/slab.h:535 [inline] > alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740 > htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820 > map_update_elem kernel/bpf/syscall.c:587 [inline] > SYSC_bpf kernel/bpf/syscall.c:1468 [inline] > SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443 > entry_SYSCALL_64_fastpath+0x1f/0xbe > RIP: 0033:0x440409 > RSP: 002b:7ffd1f1792b8 EFLAGS: 0246 ORIG_RAX: 0141 > RAX: ffda RBX: 004002c8 RCX: 00440409 > RDX: 0020 RSI: 20006000 RDI: 0002 > RBP: 0086 R08: R09: > 
R10: R11: 0246 R12: 00401d70 > R13: 00401e00 R14: R15: > Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff > 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b > 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 > RIP: cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: 8801c0c97638 > ---[ end trace d745f355da2e33ce ]--- > Kernel panic - not syncing: Fatal exception > > Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation") > Signed-off-by: Eric Dumazet > Cc: Martin KaFai Lau > Cc: Alexei Starovoitov > Cc: Daniel Borkmann > --- > kernel/bpf/syscall.c |3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index > 021a05d9d80095303bdfed51ee85bd9067582774..70ad8e220343c7825c8e331f19c1f65c78fdb796 > 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -323,7 +323,8 @@ static int map_create(union bpf_attr *attr) > return -EINVAL; > > if (numa_node != NUMA_NO_NODE && > - (numa_node >= nr_node_ids || !node_online(numa_node))) > + ((unsigned int)numa_node >= nr_node_ids || > + !node_online(numa_node))) > return -EINVAL; > > /* find map type and init map: hashtable vs rbtree vs bloom vs ... */ > >
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On 17-09-05 06:01 PM, Roopa Prabhu wrote:
> yes, like Nikolay says we have been discussing this as well. Nikolay's
> patch is a clever and most importantly non-invasive way today given
> the anchor point for tc rules is a netdev.
> we have also considered a separate implicit tc anchor device.

IMO, Jiri has done all the necessary work already with the concept of blocks. We don't really need the netdev to be the attachment point. You can add to a block in many locations in the kernel by constructing the proper "coordinates" in the tcmsg, i.e. this:

struct tcmsg {
	unsigned char  tcm_family;
	unsigned char  tcm__pad1;
	unsigned short tcm__pad2;
	int            tcm_ifindex;
	__u32          tcm_handle;
	__u32          tcm_parent;
};

If you were to set tcm_ifindex to -1 (since that is not a legit ifindex) then all we need to do is define a parent for a different location. Current locations tied to netdevs are:

#define TC_H_ROOT        (0xFFFFFFFFU)
#define TC_H_INGRESS     (0xFFFFFFF1U)
#define TC_H_CLSACT      TC_H_INGRESS
#define TC_H_MIN_INGRESS 0xFFF2U
#define TC_H_MIN_EGRESS  0xFFF3U

You should be able to, say, add a location which maps to pre-routing or post-routing etc., and this would work as well...

cheers,
jamal
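A netdev-independent anchor along the lines Jamal describes might look like this in uapi terms. Everything here is illustrative only: none of these defines exist, and the names and values are made up for the sketch.

```c
/* Hypothetical sketch: keep the tcmsg "coordinates" unchanged, use an
 * impossible ifindex as the "no device" marker, and add new parent
 * minors next to the existing netdev-bound ingress/egress ones. */
#define TCM_IFINDEX_BLOCK	(-1)	/* hypothetical: not a legit ifindex */

#define TC_H_MIN_PREROUTING	0xFFF4U	/* hypothetical new location */
#define TC_H_MIN_POSTROUTING	0xFFF5U	/* hypothetical new location */

/* A rule at such a location would then be addressed purely by
 * (tcm_ifindex == TCM_IFINDEX_BLOCK, tcm_parent minor), with the
 * shared-block machinery resolving it instead of a netdev. */
```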
Re: [PATCH 1/3] security: bpf: Add eBPF LSM hooks to security module
On Fri, Sep 1, 2017 at 5:50 AM, Stephen Smalley wrote:
> On Thu, 2017-08-31 at 13:56 -0700, Chenbo Feng wrote:
>> From: Chenbo Feng
>>
>> Introduce 5 LSM hooks to provide finer granularity controls on eBPF
>> related operations including create eBPF maps, modify and read eBPF
>> maps content and load eBPF programs to the kernel. Hooks use the new
>> security pointer inside the eBPF map struct to store the owner's
>> security information and the different security modules can perform
>> different checks based on the information stored inside the security
>> field.
>>
>> Signed-off-by: Chenbo Feng
>> ---
>>  include/linux/lsm_hooks.h | 41 +
>>  include/linux/security.h  | 36
>>  security/security.c       | 28
>>  3 files changed, 105 insertions(+)
>>
>> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
>> index ce02f76a6188..3aaf9a08a983 100644
>> --- a/include/linux/lsm_hooks.h
>> +++ b/include/linux/lsm_hooks.h
>> @@ -1353,6 +1353,32 @@
>>   * @inode we wish to get the security context of.
>>   * @ctx is a pointer in which to place the allocated security context.
>>   * @ctxlen points to the place to put the length of @ctx.
>> + *
>> + * Security hooks for using the eBPF maps and programs functionalities
>> + * through eBPF syscalls.
>> + *
>> + * @bpf_map_create:
>> + *	Check permissions prior to creating a new bpf map.
>> + *	Return 0 if the permission is granted.
>> + *
>> + * @bpf_map_modify:
>> + *	Check permission prior to insert, update and delete map content.
>> + *	@map pointer to the struct bpf_map that contains map information.
>> + *	Return 0 if the permission is granted.
>> + *
>> + * @bpf_map_read:
>> + *	Check permission prior to read a bpf map content.
>> + *	@map pointer to the struct bpf_map that contains map information.
>> + *	Return 0 if the permission is granted.
>> + *
>> + * @bpf_prog_load:
>> + *	Check permission prior to load eBPF program.
>> + *	Return 0 if the permission is granted.
>> + *
>> + * @bpf_post_create:
>> + *	Initialize the bpf object security field inside struct bpf_maps and
>> + *	it is used for future security checks.
>> + *
>>   */
>>  union security_list_options {
>>  	int (*binder_set_context_mgr)(struct task_struct *mgr);
>> @@ -1685,6 +1711,14 @@ union security_list_options {
>>  			    struct audit_context *actx);
>>  	void (*audit_rule_free)(void *lsmrule);
>>  #endif /* CONFIG_AUDIT */
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +	int (*bpf_map_create)(void);
>> +	int (*bpf_map_read)(struct bpf_map *map);
>> +	int (*bpf_map_modify)(struct bpf_map *map);
>> +	int (*bpf_prog_load)(void);
>> +	int (*bpf_post_create)(struct bpf_map *map);
>> +#endif /* CONFIG_BPF_SYSCALL */
>>  };
>>
>>  struct security_hook_heads {
>> @@ -1905,6 +1939,13 @@ struct security_hook_heads {
>>  	struct list_head audit_rule_match;
>>  	struct list_head audit_rule_free;
>>  #endif /* CONFIG_AUDIT */
>> +#ifdef CONFIG_BPF_SYSCALL
>> +	struct list_head bpf_map_create;
>> +	struct list_head bpf_map_read;
>> +	struct list_head bpf_map_modify;
>> +	struct list_head bpf_prog_load;
>> +	struct list_head bpf_post_create;
>> +#endif /* CONFIG_BPF_SYSCALL */
>>  } __randomize_layout;
>>
>>  /*
>> diff --git a/include/linux/security.h b/include/linux/security.h
>> index 458e24bea2d4..0656a4f74d14 100644
>> --- a/include/linux/security.h
>> +++ b/include/linux/security.h
>> @@ -31,6 +31,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  struct linux_binprm;
>>  struct cred;
>> @@ -1735,6 +1736,41 @@ static inline void securityfs_remove(struct dentry *dentry)
>>
>>  #endif
>>
>> +#ifdef CONFIG_BPF_SYSCALL
>> +#ifdef CONFIG_SECURITY
>> +int security_map_create(void);
>> +int security_map_modify(struct bpf_map *map);
>> +int security_map_read(struct bpf_map *map);
>> +int security_prog_load(void);
>> +int security_post_create(struct bpf_map *map);
>> +#else
>> +static inline int security_map_create(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int security_map_read(struct bpf_map *map)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int security_map_modify(struct bpf_map *map)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int security_prog_load(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int security_post_create(struct bpf_map *map)
>> +{
>> +	return 0;
>> +}
>> +#endif /* CONFIG_SECURITY */
>> +#endif /* CONFIG_BPF_SYSCALL */
>
> These should be named consistently with the ones in lsm_hooks.h and
> should unambiguously indicate that these are hooks for bpf
> objects/operations, i.e. security_bpf_map_create(),
> security_bpf_map_read(), etc.

Thanks for pointing out, will fix this.

> Do you need this level of granularity?
Re: [PATCH 0/3] drivers: net: xgene: Misc bug fixes
From: Iyappan Subramanian
Date: Tue, 5 Sep 2017 11:16:29 -0700

> This patch set fixes bugs related to handling the case for ACPI, for
> reading and programming tx/rx delay values.
>
> Signed-off-by: Iyappan Subramanian

Series applied, thank you.
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang wrote:
> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote:
>> Hi all,
>> This RFC adds a new mode for clsact which designates a device's egress
>> classifier as global per netns. The packets that are not classified for
>> a particular device will be classified using the global classifier.
>> We have needed a global classifier for some time now for various
>> purposes and setting the single bridge or loopback/vrf device as the
>> global classifier device is acceptable for us. Doing it this way avoids
>> the act/cls device and queue dependencies.
>>
>> This is strictly an RFC patch just to show the intent, if we agree on
>> the details the proposed patch will have support for both ingress and
>> egress, and will be using a static key to avoid the fast path test when no
>> global classifier has been configured.
>>
>> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
>> $ tc qdisc add dev lo clsact global
>> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop
>>
>> the last filter will be global for all devices that don't have a
>> specific egress_cl_list (i.e. have clsact configured).
>
> Sorry this is too ugly.
>
> netdevice is still implied in your command line even if you treat it
> as global. It is essentially hard to bypass netdevice layer since
> netdevice is the core of L2 and also where everything begins.
>
> Maybe the best we can do here is make tc filters standalone
> as tc actions so that filters can exist before qdisc's and netdevices.
> But this probably requires significant works to make it working
> with both existing non-standalone and bindings standalones
> with qdisc's.

yes, like Nikolay says we have been discussing this as well. Nikolay's patch is a clever and most importantly non-invasive way today given the anchor point for tc rules is a netdev. we have also considered a separate implicit tc anchor device.
lo seemed like a good fit for all global rules given it is already available. And it is not uncommon to hang off global network config on the loopback interface.
Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map
On Thu, Aug 31, 2017 at 7:05 PM, Alexei Starovoitov wrote:
> On Thu, Aug 31, 2017 at 01:56:34PM -0700, Chenbo Feng wrote:
>> From: Chenbo Feng
>>
>> Introduce a pointer into struct bpf_map to hold the security information
>> about the map. The actual security struct varies based on the security
>> models implemented. Place the LSM hooks before each of the unrestricted
>> eBPF operations; the map_update_elem and map_delete_elem operations are
>> checked by security_map_modify. The map_lookup_elem and map_get_next_key
>> operations are checked by security_map_read.
>>
>> Signed-off-by: Chenbo Feng
>
> ...
>
>> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>>  	if (IS_ERR(map))
>>  		return PTR_ERR(map);
>>
>> +	err = security_map_read(map);
>> +	if (err)
>> +		return -EACCES;
>> +
>>  	key = memdup_user(ukey, map->key_size);
>>  	if (IS_ERR(key)) {
>>  		err = PTR_ERR(key);
>> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
>>  	if (IS_ERR(map))
>>  		return PTR_ERR(map);
>>
>> +	err = security_map_modify(map);
>
> I don't feel these extra hooks are really thought through.
> With such hook you'll disallow map_update for given map. That's it.
> The key/values etc won't be used in such security decision.
> In such case you don't need such hooks in update/lookup at all.
> Only in map_creation and object_get calls where FD can be received.
> In other words I suggest to follow standard unix practices:
> Do permissions checks in open() and allow read/write() if FD is valid.
> Same here. Do permission checks in prog_load/map_create/obj_pin/get
> and that will be enough to jail bpf subsystem.
> bpf cmds that need to be fast (like lookup and update) should not
> have security hooks.

Thanks for pointing this out. I agree we should only do checks on creating the object and passing it from one process to another. And if we can still distinguish the read/write operation when obtaining the file fd, that would be great.
But that may require us to add a new mode field into the bpf_map struct and to change the attribute struct when doing the bpf syscall. What do you think about this approach? Alternatively, we can do simple checks for bpf_obj_create and bpf_obj_use when creating the object and passing the object respectively, but this solution cannot distinguish map modify from read.
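Alexei's "check at FD acquisition, like open()" model, combined with the read/write question raised here, might be sketched as follows. All names here (`sec_mode`, the hook signatures, `struct bpf_map_sec`) are hypothetical; this is neither the proposed patch nor an existing kernel API.

```c
/* Hypothetical sketch only: check permissions once, where the map FD is
 * created or obtained, record the granted access mode, and keep the
 * fast-path commands (lookup/update) free of LSM calls entirely. */
struct bpf_map_sec {
	void	*security;	/* per-LSM blob, as in the proposed patch */
	fmode_t	sec_mode;	/* FMODE_READ/FMODE_WRITE granted at get time */
};

/* Called only from map_create / obj_get / fd-passing paths: */
int security_bpf_map_get(struct bpf_map_sec *map, fmode_t requested);

/* The fast path then just tests the cached mode, no hook invocation: */
static inline int bpf_map_may_access(const struct bpf_map_sec *map,
				     fmode_t mode)
{
	return (map->sec_mode & mode) == mode ? 0 : -EACCES;
}
```

This mirrors the standard unix pattern Alexei points to: the expensive decision happens once at open()-equivalent time, and per-operation checks degrade to a flag test on an FD the process already proved it may hold.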
Re: [PATCH][next v2] rocker: fix kcalloc parameter order
From: Zahari Doychev
Date: Tue, 5 Sep 2017 21:49:58 +0200

> The function calls to kcalloc use the wrong parameter order and incorrect
> flags values. GFP_KERNEL is now used instead of flags and the order is
> corrected.
>
> The change was done using the following coccinelle script:
>
> @@
> expression E1,E2;
> type T;
> @@
>
> -kcalloc(E1, E2, sizeof(T))
> +kcalloc(E2, sizeof(T), GFP_KERNEL)
>
> Signed-off-by: Zahari Doychev

Applied, thank you.
Re: [PATCH net] rds: Fix non-atomic operation on shared flag variable
From: Håkon Bugge
Date: Tue, 5 Sep 2017 17:42:01 +0200

> The bits in m_flags in struct rds_message are used for a plurality of
> reasons, and from different contexts. To avoid any missing updates to
> m_flags, use the atomic set_bit() instead of the non-atomic equivalent.
>
> Signed-off-by: Håkon Bugge
> Reviewed-by: Knut Omang
> Reviewed-by: Wei Lin Guay

Applied.
Re: [PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock
From: Jakub Kicinski
Date: Tue, 5 Sep 2017 08:31:23 -0700

> The new TC IDR code uses GFP_KERNEL under spin lock. Which leads
> to: ...
> Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
> (as suggested by Eric Dumazet), and change the allocation
> flags under spin lock.
>
> Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR")
> Signed-off-by: Jakub Kicinski
> Reviewed-by: Simon Horman

Applied, thanks.
Re: [PATCH net V2] vhost_net: correctly check tx avail during rx busy polling
From: Jason Wang
Date: Tue, 5 Sep 2017 09:22:05 +0800

> In the past we checked tx avail through vhost_enable_notify(), which is
> wrong since it only checks whether or not the guest has filled more
> available buffers since the last avail idx synchronization, which was
> just done by vhost_vq_avail_empty() before. What we really want is to
> check for pending buffers in the avail ring. Fix this by calling
> vhost_vq_avail_empty() instead.
>
> This issue could be noticed by doing a netperf TCP_RR benchmark as
> client from the guest (but not the host). With this fix, TCP_RR from
> guest to localhost restores from 1375.91 trans per sec to 55235.28
> trans per sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz).
>
> Fixes: 030881372460 ("vhost_net: basic polling support")
> Signed-off-by: Jason Wang

Applied and queued up for -stable, thanks.
Re: [PATCH net-next] rxrpc: Make service connection lookup always check for retry
From: David Howells
Date: Mon, 04 Sep 2017 15:28:28 +0100

> When an RxRPC service packet comes in, the target connection is looked up
> by an rb-tree search under RCU and a read-locked seqlock; the seqlock retry
> check is, however, currently skipped if we got a match, but probably
> shouldn't be, in case the connection we found gets replaced whilst we're
> doing a search.
>
> Make the lookup procedure always go through need_seqretry(), even if the
> lookup was successful. This makes sure we always pick up on a write-lock
> event.
>
> On the other hand, since we don't take a ref on the object, but rely on RCU
> to prevent its destruction after dropping the seqlock, I'm not sure this is
> necessary.
>
> Signed-off-by: David Howells

Applied, thanks David.
Re: [PATCH v3 net-next 0/2] Use correct sk->sk_prot for IPV6
From: Ilya Lesokhin
Date: Mon, 4 Sep 2017 13:13:59 +0300

> The tls ulp overrides sk->prot with new tls specific proto structs.
> The tls specific structs were previously based on the ipv4 specific
> tcp_prot struct. As a result, attaching the tls ulp to an ipv6 tcp
> socket replaced some ipv6 callbacks with the ipv4 equivalents.
>
> This patch adds ipv6 tls proto structs and uses them when attached to
> ipv6 sockets.
>
> Changes since v2:
> - Dropped the patch to fix the IPV6_ADDRFORM setsockopt. There was some
>   disagreement about the correct way of fixing it, and this series does
>   not depend on it.
>
> Changes since v1:
> - TLS now depends on IPV6. This fixes compilation issues when TLS is
>   built-in and IPV6 is a module. The downside should be small as it is
>   unlikely that there are kernel TLS users who can't afford to include
>   IPV6 in their kernel.

This is not acceptable. Forcing such a huge piece of infrastructure like ipv6 to be statically built just because someone wants to enable TLS is not going to pass. Every distribution out there will enable all features, including TLS, so effectively you are forcing every distribution to build ipv6 non-modular. Sorry, you will have to fix this in a way that allows TLS and IPV6 to be modular.
Re: [PATCH net-next v7] net: stmmac: Delete dead code for MDIO registration
From: Romain Perier
Date: Mon, 4 Sep 2017 10:41:36 +0200

> This code is no longer used; the logging function was changed by commit
> fbca164776e4 ("net: stmmac: Use the right logging function in
> stmmac_mdio_register").
> It was previously showing information about the type of the IRQ: whether
> it's polled, ignored or a normal interrupt. As we don't want information
> loss, I have moved this code to phy_attached_print().
>
> Fixes: fbca164776e4 ("net: stmmac: Use the right logging function in
> stmmac_mdio_register")
> Signed-off-by: Romain Perier

Applied.
Re: [net v2] gianfar: Fix Tx flow control deactivation
From: Claudiu Manoil
Date: Mon, 4 Sep 2017 10:45:28 +0300

> The wrong register is checked for the Tx flow control bit; it should
> have been maccfg1, not maccfg2.
> This went unnoticed for so long probably because the impact is
> hardly visible, not to mention the tangled code from adjust_link().
> First, link flow control (i.e. handling of Rx/Tx link-level pause frames)
> is disabled by default (needs to be enabled via 'ethtool -A').
> Secondly, maccfg2 always returns 0 for tx_flow_oldval (except for a few
> old boards), which results in Tx flow control remaining always on
> once activated.
>
> Fixes: 45b679c9a3ccd9e34f28e6ec677b812a860eb8eb ("gianfar: Implement PAUSE frame generation support")
>
> Signed-off-by: Claudiu Manoil

Please do not put an empty line between Fixes: and other tags. All tags are equal and should be in one uninterrupted block of text.

Applied, and queued up for -stable, thanks.
Re: [PATCH net-next] cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
From: Ganesh Goudar
Date: Mon, 4 Sep 2017 11:25:34 +0530

> MPS_TX_INT_CAUSE[Bubble] is a normal condition for T6, hence
> ignore this interrupt for T6.
>
> Signed-off-by: Ganesh Goudar
> Signed-off-by: Casey Leedom

Applied.
Re: [PATCH net-next] cxgb4: Fix pause frame count in t4_get_port_stats
From: Ganesh Goudar
Date: Mon, 4 Sep 2017 11:17:36 +0530

> MPS_STAT_CTL[CountPauseStatTx] and MPS_STAT_CTL[CountPauseStatRx]
> only control whether or not Pause Frames will be counted as part
> of the 64-Byte Tx/Rx Frame counters. These bits do not control
> whether Pause Frames are counted in the Total Tx/Rx Frames/Bytes
> counters.
>
> Signed-off-by: Ganesh Goudar
> Signed-off-by: Casey Leedom

Applied.
Re: [PATCH net-next] cxgb4: fix memory leak
From: Ganesh Goudar Date: Mon, 4 Sep 2017 11:16:28 +0530 > do not reuse the loop counter which is used to iterate over > the ports, so that sched_tbl will be freed for all the ports. > > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net-next 2/2] tun: rename generic_xdp to skb_xdp
From: Jason Wang Date: Mon, 4 Sep 2017 11:36:09 +0800 > Rename "generic_xdp" to "skb_xdp" to avoid confusing it with the > generic XDP which will be done at netif_receive_skb(). > > Cc: Daniel Borkmann > Signed-off-by: Jason Wang Applied.
Re: [PATCH net-next 1/2] tun: reserve extra headroom only when XDP is set
From: Jason Wang Date: Mon, 4 Sep 2017 11:36:08 +0800 > We reserve headroom unconditionally which could cause unnecessary > stress on socket memory accounting because of increased truesize. Fix > this by only reserving extra headroom when XDP is set. > > Cc: Jakub Kicinski > Signed-off-by: Jason Wang Applied.
Re: [PATCH net-next] net: cpsw: Don't handle SIOC[GS]HWTSTAMP when CPTS is disabled
Hi On 08/31/2017 02:48 AM, Richard Cochran wrote: > On Wed, Aug 30, 2017 at 02:47:45PM -0700, David Miller wrote: >> It should not be required to disable a Kconfig option just to get PHY >> timestamping to work properly. > > Well, if the MAC driver handles the ioctl and enables time stamping, > then the PHY driver's time stamping remains disabled. We don't have a > way to choose PHY time stamping at run time. > >> Rather, if the CPTS code returns -EOPNOTSUPP we should try to >> fallthrough to the PHY library based methods. > > I agree that it would be better for the core (rather than the > individual drivers) to handle this case. I'd like to clarify one thing here - what is the preferable time-stamping device: PHY over MAC, or MAC over PHY? My understanding is it's PHY, and ethtool_get_ts_info() seems already implemented this way. > > There are a few callers of .ndo_do_ioctl to consider. Besides > dev_ifsioc() there is at least vlan_dev_ioctl() that needs to handle > the EOPNOTSUPP. Sorry, I've not tried to do a solution in net core, but the below patch updates the CPSW driver to select PHY time-stamping over MAC if supported, without disabling CPTS in Kconfig (at least it's expected to fix it) - not tested as I do not have HW with a dp83640 phy. --- From 51347692087732320f2f5615030f5f36ed3c7724 Mon Sep 17 00:00:00 2001 From: Grygorii Strashko Date: Tue, 5 Sep 2017 15:24:44 -0500 Subject: [PATCH] net: ethernet: cpsw: allow phy timestamping over mac Allow phy timestamping to be used over mac timestamping if supported.
Signed-off-by: Grygorii Strashko --- drivers/net/ethernet/ti/cpsw.c | 21 +++++++++++++++++---- drivers/net/ethernet/ti/cpts.c | 2 ++ 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 95ac926..8831cb9 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -291,6 +291,10 @@ struct cpsw_ss_regs { #define CPSW_MAX_BLKS_TX_SHIFT 4 #define CPSW_MAX_BLKS_RX 5 +#define HAS_PHY_TXTSTAMP(p) ((p) && (p)->drv && (p)->drv->txtstamp) +#define HAS_PHY_TSTAMP(p) ((p) && (p)->drv && \ + ((p)->drv->txtstamp || (p)->drv->rxtstamp)) + struct cpsw_host_regs { u32 max_blks; u32 blk_cnt; @@ -1600,6 +1604,8 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff *skb, { struct cpsw_priv *priv = netdev_priv(ndev); struct cpsw_common *cpsw = priv->cpsw; + int slave_no = cpsw_slave_index(cpsw, priv); + struct phy_device *phy = cpsw->slaves[slave_no].phy; struct cpts *cpts = cpsw->cpts; struct netdev_queue *txq; struct cpdma_chan *txch; @@ -1611,8 +1617,9 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff *skb, return NET_XMIT_DROP; } - if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP && - cpts_is_tx_enabled(cpts) && cpts_can_timestamp(cpts, skb)) + if ((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) && + cpts_can_timestamp(cpts, skb) && + (cpts_is_tx_enabled(cpts) || HAS_PHY_TXTSTAMP(phy))) skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; q_idx = skb_get_queue_mapping(skb); @@ -1810,20 +1817,26 @@ static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd) struct cpsw_priv *priv = netdev_priv(dev); struct cpsw_common *cpsw = priv->cpsw; int slave_no = cpsw_slave_index(cpsw, priv); + struct phy_device *phy = cpsw->slaves[slave_no].phy; + if (!netif_running(dev)) return -EINVAL; switch (cmd) { case SIOCSHWTSTAMP: + if (HAS_PHY_TSTAMP(phy)) + break; return cpsw_hwtstamp_set(dev, req); case SIOCGHWTSTAMP: + if (HAS_PHY_TSTAMP(phy)) + break; return cpsw_hwtstamp_get(dev, req); } - if (!cpsw->slaves[slave_no].phy) + if (!phy) return -EOPNOTSUPP; - return phy_mii_ioctl(cpsw->slaves[slave_no].phy, req, cmd); + return phy_mii_ioctl(phy, req, cmd); } static void cpsw_ndo_tx_timeout(struct net_device *ndev) diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c index c2121d2..f257f54 100644 --- a/drivers/net/ethernet/ti/cpts.c +++ b/drivers/net/ethernet/ti/cpts.c @@ -421,6 +421,8 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb) u64 ns; struct skb_shared_hwtstamps ssh; + if (!cpts->tx_enable) + return; if (!(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS)) return; ns = cpts_find_ts(cpts, skb, CPTS_EV_TX); -- 2.10.1 -- regards, -grygorii
Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro
From: Joe Perches Date: Tue, 05 Sep 2017 13:01:18 -0700 > On Tue, 2017-09-05 at 21:45 +0200, Thomas Meyer wrote: >> On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote: >> > From: Thomas Meyer >> > Date: Sun, 03 Sep 2017 14:19:31 +0200 >> > >> > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it >> > > yourself. >> > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e >> > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) >> > > /ARRAY_SIZE(\1)/g' and manual check/verification. >> > > >> > > Signed-off-by: Thomas Meyer >> > >> > This should be submitted to the Intel ethernet driver maintainers. >> >> Hi, >> >> my script checks the output of get_maintainer scripts and only sends to "open >> list" entries. >> >> The intel-wired-...@lists.osuosl.org is moderated, so that's why the patch >> wasn't send there. >> >> Strangely the lists for nouv...@lists.freedesktop.org and >> intel-gvt-...@lists.freedesktop.org appears as open lists in the MAINTAINERS >> file but seems to be also moderated lists... At least I got some reply that >> my >> message awaits approval. Maybe an update to the MAINTAINERS file is missing >> here? >> >> I may drop above check in my script and send to all mailing lists that >> get_maintainer.pl will return. > > There's a difference between moderated and subscriber-only > entries in MAINTAINERS. > > get_maintainers will by default list moderated lists and > not show subscriber-only lists unless using the -s switch. Furthermore, nothing prevented you from CC:'ing the maintainer, Jeff Kirsher.
Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
> The situation with encapsulation is even more complicated: > > We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP > constellation. If we do the fragmentation inside the vxlan tunnel and > carry over the skb hash to all resulting UDP/vxlan packets source ports, > we are fine and reordering on the receiver NIC won't happen in this > case. If the fragmentation happens on the outer UDP header, this will > result in reordering of the inner L2 flow. Unfortunately this depends on > how the vxlan tunnel was set up, how other devices do that and (I > believe so) on the kernel version. > This really isn't that complicated. The assumption that an IP network always delivers packets in order is simply wrong. The inventors of VXLAN must have known full well that when you use IP, packets can and eventually will be delivered out of order. This isn't just because of fragmentation, there are many other reasons that packets can be delivered OOO. This also must have been known when IP/GRE and any other protocol that carries L2 over IP was invented. If OOO is an issue for these protocols then they need to be fixed-- this is not a concern with the IP protocol nor the stack. Tom
Re: [PATCH net-next 2/4] ip/options: explicitly provide net ns to __ip_options_echo()
On Tue, 2017-09-05 at 10:18 -0700, Eric Dumazet wrote: > On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote: > > __ip_options_echo() uses the current network namespace, and > > currently retrieves it via skb->dst->dev. > > > > This commit adds an explicit 'net' argument to __ip_options_echo() > > and updates all the call sites to provide it, usually via a simpler > > sock_net(). > > > > After this change, __ip_options_echo() no longer needs to access > > skb->dst and we can drop a couple of hacks to preserve such > > info in the rx path. > > > > Signed-off-by: Paolo Abeni > > --- > > David, Paolo > > This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options: > explicitly provide net ns to __ip_options_echo()") > > needs to be backported to linux-4.13 stable version to avoid these kinds > of crashes [1] > > This is because of MSG_PEEK operation, hitting skb_consume_udp() while > skb is still in receive queue. > > Next read() finding again the skb then can see a NULL skb->dst > > Thanks !
> > [1] > general protection fault: [#1] SMP KASAN > Dumping ftrace buffer: >(ftrace buffer empty) > Modules linked in: > CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > task: 8801cd0a4380 task.stack: 8801cc498000 > RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 > RSP: 0018:8801cc49f628 EFLAGS: 00010246 > RAX: dc00 RBX: 8801cc49f928 RCX: > RDX: RSI: 0001 RDI: 0004 > RBP: 8801cc49f6b8 R08: 8801cc49f936 R09: ed0039893f28 > R10: 0003 R11: ed0039893f27 R12: 8801cc49f918 > R13: 8801ccbcf36c R14: 000f R15: 0018 > FS: 00979880() GS:8801db20() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 200c0ff0 CR3: 0001cc4ed000 CR4: 001406f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > ip_options_echo include/net/ip.h:574 [inline] > ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline] > ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207 > udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641 > inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793 > sock_recvmsg_nosec net/socket.c:792 [inline] > sock_recvmsg+0xc9/0x110 net/socket.c:799 > SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788 > SyS_recvfrom+0x40/0x50 net/socket.c:1760 > entry_SYSCALL_64_fastpath+0x1f/0xbe > RIP: 0033:0x444c89 > RSP: 002b:7ffd80c788e8 EFLAGS: 0286 ORIG_RAX: 002d > RAX: ffda RBX: RCX: 00444c89 > RDX: RSI: 20bc RDI: 0004 > RBP: 0082 R08: 200c0ff0 R09: 0010 > R10: R11: 0286 R12: 00402390 > R13: 00402420 R14: R15: > Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8 > 48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c > 02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc > RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: > 8801cc49f628 > ---[ end trace b30d95b284222843 ]--- > Kernel panic - not syncing: Fatal exception Thank you Eric for the report! Darn me, I seriously messed-up with the stateless consume. 
I think we can have similar issues with ipsec/secpath and MSG_PEEK, even if with a less catastrophic outcome. What about the following, which should cover both cases? (only compile tested, I'll test it tomorrow morning my time) --- diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d67a8182e5eb..63df75ae70ee 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -885,7 +885,7 @@ void kfree_skb(struct sk_buff *skb); void kfree_skb_list(struct sk_buff *segs); void skb_tx_error(struct sk_buff *skb); void consume_skb(struct sk_buff *skb); -void consume_stateless_skb(struct sk_buff *skb); +void __consume_stateless_skb(struct sk_buff *skb); void __kfree_skb(struct sk_buff *skb); extern struct kmem_cache *skbuff_head_cache; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index e07556606284..f2411a8744d7 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -753,14 +753,11 @@ EXPORT_SYMBOL(consume_skb); * consume_stateless_skb - free an skbuff, assuming it is stateless * @skb: buffer to free * - * Works like consume_skb(), but this variant assumes that all the head - * states have been already dropped. + * Alike consume_skb(), but this variant assumes that all the head + * states have been already dropped and usage count is one */ -void consume_stateless_skb(struct sk_buff *skb) +void __consume_stateless_skb(struct sk_buff
Re: [PATCH 2/2] b43legacy: fix uninitialized reads of ret by initializing the array to zero
On Tue, 5 Sep 2017 19:16:58 +0100 Colin King wrote: > From: Colin Ian King > > The u8 char array ret is not being initialized and elements outside > the range start to end contain just garbage values from the stack. > This results in a later scan of the array to read potentially > uninitialized values. Fix this by initializing the array to zero. > This seems to have been an issue since the very first commit. > > Detected by CoverityScan CID#139653 ("Uninitialized scalar variable") > > Signed-off-by: Colin Ian King > --- > drivers/net/wireless/broadcom/b43legacy/radio.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/wireless/broadcom/b43legacy/radio.c > b/drivers/net/wireless/broadcom/b43legacy/radio.c > index 9501420340a9..eab1c9387846 100644 > --- a/drivers/net/wireless/broadcom/b43legacy/radio.c > +++ b/drivers/net/wireless/broadcom/b43legacy/radio.c > @@ -280,7 +280,7 @@ u8 b43legacy_radio_aci_detect(struct b43legacy_wldev > *dev, u8 channel) > u8 b43legacy_radio_aci_scan(struct b43legacy_wldev *dev) > { > struct b43legacy_phy *phy = &dev->phy; > - u8 ret[13]; > + u8 ret[13] = { 0 }; > unsigned int channel = phy->channel; > unsigned int i; > unsigned int j; This fix seems to be correct. Thanks for finding and fixing the issue. Reviewed-by: Michael Buesch -- Michael
Re: [PATCH 1/2] b43: fix uninitialized reads of ret by initializing the array to zero
On Tue, 5 Sep 2017 19:15:50 +0100 Colin King wrote: > From: Colin Ian King > > The u8 char array ret is not being initialized and elements outside > the range start to end contain just garbage values from the stack. > This results in a later scan of the array to read potentially > uninitialized values. Fix this by initializing the array to zero. > This seems to have been an issue since the very first commit. > > Detected by CoverityScan CID#139652 ("Uninitialized scalar variable") > > Signed-off-by: Colin Ian King > --- > drivers/net/wireless/broadcom/b43/phy_g.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/wireless/broadcom/b43/phy_g.c > b/drivers/net/wireless/broadcom/b43/phy_g.c > index 822dcaa8ace6..f59c02166462 100644 > --- a/drivers/net/wireless/broadcom/b43/phy_g.c > +++ b/drivers/net/wireless/broadcom/b43/phy_g.c > @@ -2297,7 +2297,7 @@ static u8 b43_gphy_aci_detect(struct b43_wldev *dev, u8 > channel) > static u8 b43_gphy_aci_scan(struct b43_wldev *dev) > { > struct b43_phy *phy = &dev->phy; > - u8 ret[13]; > + u8 ret[13] = { 0 }; > unsigned int channel = phy->channel; > unsigned int i, j, start, end; > This fix seems to be correct. Thanks for finding and fixing the issue. Reviewed-by: Michael Buesch -- Michael
[PATCH] e1000e: changed some expensive calls of udelay to usleep_range
Calls to udelay are not preemptible by userspace, so userspace applications experience a large (~200us) latency when running on core0. Instead usleep_range can be used to be more friendly to userspace since it is preemptible. This is due to udelay using busy-wait loops while usleep_range uses hrtimers instead. It is recommended to use udelay when the delay is <10us since at that precision the overhead of usleep_range hrtimer setup causes issues. However, the replaced calls are for 50us and 100us so this should not be an issue. There is no open bug that this patch is fixing, but we see a good boost in zero loss performance of a specific user space application (dpdk l3fwd) when this patch is applied: we get from 32% of 10Gb line to 49%. Signed-off-by: Matthew Tan Signed-off-by: Michael Kardonik --- drivers/net/ethernet/intel/e1000e/phy.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c index de13aea..e318fdc 100644 --- a/drivers/net/ethernet/intel/e1000e/phy.c +++ b/drivers/net/ethernet/intel/e1000e/phy.c @@ -158,7 +158,7 @@ s32 e1000e_read_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 *data) * the lower time out */ for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) { - udelay(50); + usleep_range(50, 60); mdic = er32(MDIC); if (mdic & E1000_MDIC_READY) break; @@ -183,7 +183,7 @@ s32 e1000e_read_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 *data) * reading duplicate data in the next MDIC transaction.
*/ if (hw->mac.type == e1000_pch2lan) - udelay(100); + usleep_range(100, 110); return 0; } @@ -222,7 +222,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data) * the lower time out */ for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) { - udelay(50); + usleep_range(50, 60); mdic = er32(MDIC); if (mdic & E1000_MDIC_READY) break; @@ -246,7 +246,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data) * reading duplicate data in the next MDIC transaction. */ if (hw->mac.type == e1000_pch2lan) - udelay(100); + usleep_range(100, 110); return 0; } -- 2.7.4
Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro
On Tue, 2017-09-05 at 21:45 +0200, Thomas Meyer wrote: > On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote: > > From: Thomas Meyer > > Date: Sun, 03 Sep 2017 14:19:31 +0200 > > > > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it > > > yourself. > > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e > > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) > > > /ARRAY_SIZE(\1)/g' and manual check/verification. > > > > > > Signed-off-by: Thomas Meyer > > > > This should be submitted to the Intel ethernet driver maintainers. > > Hi, > > my script checks the output of get_maintainer scripts and only sends to "open > list" entries. > > The intel-wired-...@lists.osuosl.org is moderated, so that's why the patch > wasn't send there. > > Strangely the lists for nouv...@lists.freedesktop.org and > intel-gvt-...@lists.freedesktop.org appears as open lists in the MAINTAINERS > file but seems to be also moderated lists... At least I got some reply that my > message awaits approval. Maybe an update to the MAINTAINERS file is missing > here? > > I may drop above check in my script and send to all mailing lists that > get_maintainer.pl will return. There's a difference between moderated and subscriber-only entries in MAINTAINERS. get_maintainers will by default list moderated lists and not show subscriber-only lists unless using the -s switch.
[PATCH][next v2] rocker: fix kcalloc parameter order
The function calls to kcalloc use the wrong parameter order and incorrect flags values. GFP_KERNEL is used instead of flags now and the order is corrected. The change was done using the following coccinelle script: @@ expression E1,E2; type T; @@ -kcalloc(E1, E2, sizeof(T)) +kcalloc(E2, sizeof(T), GFP_KERNEL) Signed-off-by: Zahari Doychev --- drivers/net/ethernet/rocker/rocker_ofdpa.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c index da4e26b53a52..0653b70723a3 100644 --- a/drivers/net/ethernet/rocker/rocker_ofdpa.c +++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c @@ -1177,7 +1177,7 @@ static int ofdpa_group_l2_fan_out(struct ofdpa_port *ofdpa_port, entry->group_id = group_id; entry->group_count = group_count; - entry->group_ids = kcalloc(flags, group_count, sizeof(u32)); + entry->group_ids = kcalloc(group_count, sizeof(u32), GFP_KERNEL); if (!entry->group_ids) { kfree(entry); return -ENOMEM; @@ -1456,7 +1456,7 @@ static int ofdpa_port_vlan_flood_group(struct ofdpa_port *ofdpa_port, int err = 0; int i; - group_ids = kcalloc(flags, port_count, sizeof(u32)); + group_ids = kcalloc(port_count, sizeof(u32), GFP_KERNEL); if (!group_ids) return -ENOMEM; -- 2.13.0
Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro
On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote: > From: Thomas Meyer > Date: Sun, 03 Sep 2017 14:19:31 +0200 > > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it > > yourself. > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) > > /ARRAY_SIZE(\1)/g' and manual check/verification. > > > > Signed-off-by: Thomas Meyer > > This should be submitted to the Intel ethernet driver maintainers. Hi, my script checks the output of get_maintainer scripts and only sends to "open list" entries. The intel-wired-...@lists.osuosl.org is moderated, so that's why the patch wasn't sent there. Strangely the lists for nouv...@lists.freedesktop.org and intel-gvt-...@lists.freedesktop.org appear as open lists in the MAINTAINERS file but seem to be also moderated lists... At least I got some reply that my message awaits approval. Maybe an update to the MAINTAINERS file is missing here? I may drop the above check in my script and send to all mailing lists that get_maintainer.pl will return. > > Thank you.
Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage
On Tue, 5 Sep 2017, Sebastian Andrzej Siewior wrote: > On 2017-09-05 09:12:40 [+0200], Thomas Gleixner wrote: > > Sorry, no. That bisect is completely bogus. The commit in question merely > > replaces the unsupported clockid with a valid one. > > The bisect is correct. It just has problems to express itself properly. So > the table says: > > | WARNING:at_kernel/time/hrtimer.c:#hrtimer_init | 7 | || | > > | BUG:kernel_reboot-without-warning_in_test_stage | 0 | 12 | 6 | 6 | > > > which means _before_ your commit it counted a warning in hrtimer_init() > (an unsupported clock id was used). With your commit, the warning was > gone and I *think* the userland then printed > "BUG:kernel_reboot-without-warning_in_test_stage" because it had no > warning. > It seems that the bot learned to live with that warning which was around > for more than three years. Now that you removed it, it seems to be a > mistake to do so because nobody complained about it so far. Thanks for the translation. I'll never learn to decode these reports. tglx
Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
Hi Tom, Tom Herbert writes: > On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa > wrote: >> Tom Herbert writes: >> >>> There is absolutely no requirement in IP that packets are delivered in >>> order-- there never has been and there never will be! If the ULP, like >>> Ethernet encapsulation, requires in order deliver then it needs to >>> implement that itself like TCP, GRE, and other protocols ensure that >>> with sequence numbers and reassembly. All of these hoops we do make >>> sure that packets always follow the same path and are always in order >>> are done for benefit of middlebox devices like stateful firewalls that >>> have force us to adopt their concept of network architecture-- in the >>> long run this is self-defeating and kills our ability to innovate. >>> >>> I'm not saying that we shouldn't consider legacy devices, but we >>> should scrutinize new development or solutions that perpetuate >>> incorrect design or bad assumptions. >> >> So configure RSS per port and ensure no fragments are send to those >> ports. This is possible and rather easy to do. It solves the problem >> with legacy software and it spreads out packets for your applications. >> >> It is not perfect but it is working and solves both problems. >> > Hannes, > > I don't see how that solves anything. The purpose of RSS is to > distribute the load of protocol packets across queues. This needs to > work for UDP applications. For instance, if I were building a QUIC > server I'd want the sort of flow distribution that a TCP server would > give. You can't do that by configuring a few ports in the device. I seriously am with you and I think you are on the right track. I would also much rather like to see fragmentation *not* happening and would like to see hardware logic *not* parsing deep down into packets and protocols. We can agree on that!
Albeit I don't understand the problem with my approach: If you know that you are running a QUIC server and know your environment (safe against reordering), run `ethtool -N ethX rx-flow-hash udp4 sdfn' on the box and you would get spreading of UDP packets based on the UDP src+dst port numbers to the rx queues. Otherwise only source and destination addresses are being considered, which is the safe default (maybe this is sometimes enough). Intel network cards warn you if you actually switch the mode: -- >8 -- drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c: "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); drivers/net/ethernet/intel/igb/igb_ethtool.c: "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c: e_warn(drv, "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); -- 8< -- I argue that we can't change the default. Furthermore I tried to argue that in a cloud where you don't know which applications you run within the encapsulated networks, it is very hard to predict which protocols can be considered safe against those effects. I agree, that most of the time, it probably should not matter. But given the rule about kernel backwards compatibility I do have doubts that this change would go unnoticed. I am the devil's advocate here and argue it is on us to show that is safe before we change the semantics. > If I were to suggest any HW change it would be to not do DPI on > fragments (MF or offset is set). This ensures that all packets of the > fragment train get hashed to the same queue and is on fact what RPS > has been doing for years without any complaints. The same UDP flow could consist out of fragments and non-fragmented packets. Even if hardware would act according to what you wrote, reordering is very much possible within the same UDP flow. In case of QUIC, as far as I know, it uses destination ports 80 and 443. 
Like we already notify NICs about vxlan and geneve port numbers, we could selectively enable RSS for UDP ports on those destination port numbers to tell the NIC it won't receive fragments on those ports and thus do RSS. The same way the NIC could be notified if reordering on this port is possible. This could be done for example via a lightweight setsockopt. The situation with encapsulation is even more complicated: We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP constellation. If we do the fragmentation inside the vxlan tunnel and carry over the skb hash to all resulting UDP/vxlan packets source ports, we are fine and reordering on the receiver NIC won't happen in this case. If the fragmentation happens on the outer UDP header, this will result in reordering of the inner L2 flow. Unfortunately this depends on how the vxlan tunnel was set up, how other devices do that and (I believe so) on the kernel version. Given the fact that we tell the
Re: [PATCH net-next v2 RESEND 0/4] net: dsa: Allow switch drivers to indicate number of TX queues
From: Florian Fainelli Date: Sun, 3 Sep 2017 20:26:59 -0700 > This patch series extracts the parts of the patch set that are likely not to > be > controversial and actually bringing multi-queue support to DSA-created network > devices. > > With these patches, we can now use sch_multiq as documented under > Documentation/networking/multiqueue.txt and let applications decide the switch > port output queue they want to use. Currently only Broadcom tags utilize that > information. > > Resending based on David's feedback regarding the patches not in patchwork. > > Changes in v2: > - use a proper define for the number of TX queues in bcm_sf2.c (Andrew) > > Changes from RFC: > > - dropped the ability to configure RX queues since we don't do anything with > those just yet > - dropped the patches that dealt with binding the DSA slave network devices > queues with their master network devices queues this will be worked on > separately. Series applied, thanks.
Re: [PATCH net-next] bridge: switchdev: Use an helper to clear forward mark
From: Ido Schimmel Date: Sun, 3 Sep 2017 17:44:13 +0300 > Instead of using ifdef in the C file. > > Signed-off-by: Ido Schimmel > Suggested-by: Nikolay Aleksandrov > Tested-by: Yotam Gigi Applied.
Re: [PATCH 7/10] net/mlx4_core: Use ARRAY_SIZE macro
From: Thomas Meyer Date: Sun, 03 Sep 2017 14:19:31 +0200 > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it > yourself. > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) > /ARRAY_SIZE(\1)/g' and manual check/verification. > > Signed-off-by: Thomas Meyer Applied.
Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro
From: Thomas Meyer Date: Sun, 03 Sep 2017 14:19:31 +0200 > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it > yourself. > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) > /ARRAY_SIZE(\1)/g' and manual check/verification. > > Signed-off-by: Thomas Meyer This should be submitted to the Intel ethernet driver maintainers. Thank you.
[PATCH v2] radix-tree: must check __radix_tree_preload() return value
From: Eric Dumazet __radix_tree_preload() only disables preemption if no error is returned. So we really need to make sure callers always check the return value. idr_preload() contract is to always disable preemption, so we need to add a missing preempt_disable() if an error happened. Similarly, ida_pre_get() only needs to call preempt_enable() in the case no error happened. Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable") Signed-off-by: Eric Dumazet Cc: [4.11+] --- v2: addressed Linus feedback, not adding useless/confusing 'ret' variables. lib/radix-tree.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 3527eb364964..afb3cb4d44b6 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node) * To make use of this facility, the radix tree must be initialised without * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE(). */ -static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) +static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) { struct radix_tree_preload *rtp; struct radix_tree_node *node; @@ -2104,7 +2104,8 @@ EXPORT_SYMBOL(radix_tree_tagged); */ void idr_preload(gfp_t gfp_mask) { - __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); + if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE)) + preempt_disable(); } EXPORT_SYMBOL(idr_preload); @@ -2118,13 +2119,13 @@ EXPORT_SYMBOL(idr_preload); */ int ida_pre_get(struct ida *ida, gfp_t gfp) { - __radix_tree_preload(gfp, IDA_PRELOAD_SIZE); /* * The IDA API has no preload_end() equivalent. Instead, * ida_get_new() can return -EAGAIN, prompting the caller * to return to the ida_pre_get() step. */ - preempt_enable(); + if (!__radix_tree_preload(gfp, IDA_PRELOAD_SIZE)) + preempt_enable(); if (!this_cpu_read(ida_bitmap)) { struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
Re: [PATCH v2 net-next 0/2] flow_dissector: Flow dissector fixes
From: Tom Herbert Date: Fri, 1 Sep 2017 14:04:10 -0700 > This patch set fixes some basic issues with __skb_flow_dissect function. > > Items addressed: > - Cleanup control flow in the function; in particular eliminate a > bunch of goto's and implement a simplified control flow model > - Add limits for number of encapsulations and headers that can be > dissected > > v2: > - Simplify the logic for limits on flow dissection. Just set the > limit based on the number of headers the flow dissector can > process. The accounted headers include encapsulation headers, > extension headers, or other shim headers. > > Tested: > > Ran normal traffic, GUE, and VXLAN traffic. Applied, thanks Tom.
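The v2 limit logic ("set the limit based on the number of headers the flow dissector can process") amounts to a per-packet header budget shared by encapsulation, extension, and shim headers. A minimal sketch of that shape — the cap value and function names here are hypothetical, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FLOW_DISSECT_HDRS 15	/* hypothetical cap on headers processed */

/* Each encapsulation, extension, or shim header accounts against one
 * shared budget; dissection stops once the budget is exhausted. */
static bool dissect_one_header(int *hdr_count)
{
	if (++(*hdr_count) > MAX_FLOW_DISSECT_HDRS)
		return false;	/* too many headers: give up */
	return true;
}

/* Returns true if a packet with n stacked headers dissects fully. */
static bool dissect(int n_headers)
{
	int count = 0;

	for (int i = 0; i < n_headers; i++)
		if (!dissect_one_header(&count))
			return false;
	return true;
}
```

A single counter like this bounds the work done on pathological packets without tracking encapsulation levels separately.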
Re: [PATCH] radix-tree: must check __radix_tree_preload() return value
On Tue, 2017-09-05 at 11:07 -0700, Linus Torvalds wrote: > On Tue, Sep 5, 2017 at 10:59 AM, Eric Dumazet wrote: > > @@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged); > > */ > > void idr_preload(gfp_t gfp_mask) > > { > > - __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); > > + int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); > > + > > + if (ret) > > + preempt_disable(); > > } > > EXPORT_SYMBOL(idr_preload); > > Is there a reason for the "ret" variable that is entirely mis-named, > since it's never actually used as a return value? > > (Sure. it's the return value of a function, but that is entirely > useless and pointless information, and adds no value. We should name > variables by the data they contain or how they are used, not by "it > was the return value of a function"). > > In other words, why isn't this just > > if (__radix_tree_preload(..)) > preempt_disable(); > > which is shorter and clearer and not confusing? ... > Same issue, but this time strengthened by an additional "why doesn't > this just use that idr_preload function then?" question.. Yep, I will send a v2, thanks.
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On 9/5/17 9:18 PM, Cong Wang wrote: > On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote: >> Hi all, >> This RFC adds a new mode for clsact which designates a device's egress classifier as global per netns. The packets that are not classified for a particular device will be classified using the global classifier. We have needed a global classifier for some time now for various purposes and setting the single bridge or loopback/vrf device as the global classifier device is acceptable for us. Doing it this way avoids the act/cls device and queue dependencies. >> >> This is strictly an RFC patch just to show the intent, if we agree on the details the proposed patch will have support for both ingress and egress, and will be using a static key to avoid the fast path test when no global classifier has been configured. >> >> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact): >> $ tc qdisc add dev lo clsact global >> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop >> >> the last filter will be global for all devices that don't have a specific egress_cl_list (i.e. have clsact configured). > Sorry this is too ugly. netdevice is still implied in your command line even if you treat it as global. It is essentially hard to bypass netdevice layer since netdevice is the core of L2 and also where everything begins. This is only a quick RFC, that can be removed entirely if we limit it to the netns and its loopback device. Then we can drop the "device" keyword altogether. > Maybe the best we can do here is make tc filters standalone as tc actions so that filters can exist before qdisc's and netdevices. But this probably requires significant works to make it working with both existing non-standalone and bindings standalones with qdisc's. We've actually been discussing this option internally as well. I think we'll look into doing that regardless of this patch. > Note I don't look deeply into this, just one thought, at least this appears less ugly than yours.
What I did was aimed at simplicity and is merely a mode of clsact which doesn't have an impact if not configured. Every other solution requires a much more invasive change, note that doesn't mean I don't agree. :-)
Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode
On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov wrote: > Hi all, > This RFC adds a new mode for clsact which designates a device's egress > classifier as global per netns. The packets that are not classified for > a particular device will be classified using the global classifier. > We have needed a global classifier for some time now for various > purposes and setting the single bridge or loopback/vrf device as the > global classifier device is acceptable for us. Doing it this way avoids > the act/cls device and queue dependencies. > > This is strictly an RFC patch just to show the intent, if we agree on > the details the proposed patch will have support for both ingress and > egress, and will be using a static key to avoid the fast path test when no > global classifier has been configured. > > Example (need a modified tc that adds TCA_OPTIONS when using q_clsact): > $ tc qdisc add dev lo clsact global > $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action > drop > > the last filter will be global for all devices that don't have a > specific egress_cl_list (i.e. have clsact configured). Sorry this is too ugly. netdevice is still implied in your command line even if you treat it as global. It is essentially hard to bypass netdevice layer since netdevice is the core of L2 and also where everything begins. Maybe the best we can do here is make tc filters standalone as tc actions so that filters can exist before qdisc's and netdevices. But this probably requires significant works to make it working with both existing non-standalone and bindings standalones with qdisc's. Note I don't look deeply into this, just one thought, at least this appears less ugly than yours.
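The fallback semantics being debated — use the device's own classifier list when present, otherwise the netns-global one — can be sketched as follows. All struct and field names are hypothetical illustrations, not the proposed kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a device's egress classifier list, with a
 * per-netns global list used when the device has none configured. */
struct cls_list { int id; };

struct net {
	struct cls_list *global_egress_cl_list;
};

struct device {
	struct net *net;
	struct cls_list *egress_cl_list;
};

static struct cls_list *egress_classifiers(const struct device *dev)
{
	/* Per-device configuration wins; otherwise fall back to the
	 * netns-global classifier list (which may itself be absent). */
	if (dev->egress_cl_list)
		return dev->egress_cl_list;
	return dev->net->global_egress_cl_list;
}
```

With a static key guarding the global case, devices without a global classifier configured would pay no extra cost on the fast path.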
[PATCH 3/3] drivers: net: xgene: Remove return statement from void function
commit 183db4 ("drivers: net: xgene: Correct probe sequence handling") changed the return type of xgene_enet_check_phy_handle() to void. This patch removes the return statement from the last line. Signed-off-by: Iyappan Subramanian --- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index 655c0fc..3b889ef 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -1674,8 +1674,6 @@ static void xgene_enet_check_phy_handle(struct xgene_enet_pdata *pdata) ret = xgene_enet_phy_connect(pdata->ndev); if (!ret) pdata->mdio_driver = true; - - return; } static void xgene_enet_gpiod_get(struct xgene_enet_pdata *pdata) -- 2.7.4
[PATCH 1/3] drivers: net: xgene: Read tx/rx delay for ACPI
This patch fixes reading tx/rx delay values for ACPI. Signed-off-by: Iyappan Subramanian Signed-off-by: Quan Nguyen --- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index 6e253d9..655c0fc 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -1591,7 +1591,7 @@ static int xgene_get_tx_delay(struct xgene_enet_pdata *pdata) struct device *dev = &pdata->pdev->dev; int delay, ret; - ret = of_property_read_u32(dev->of_node, "tx-delay", &delay); + ret = device_property_read_u32(dev, "tx-delay", &delay); if (ret) { pdata->tx_delay = 4; return 0; @@ -1612,7 +1612,7 @@ static int xgene_get_rx_delay(struct xgene_enet_pdata *pdata) struct device *dev = &pdata->pdev->dev; int delay, ret; - ret = of_property_read_u32(dev->of_node, "rx-delay", &delay); + ret = device_property_read_u32(dev, "rx-delay", &delay); if (ret) { pdata->rx_delay = 2; return 0; -- 2.7.4
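The point of switching to device_property_read_u32() is that it reads from whichever firmware interface backs the device (DT or ACPI), with the driver defaulting when the property is absent. A userspace sketch of that fallback shape — the property-read stand-in and the valid range are assumptions for illustration, not the driver's actual constraints:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for device_property_read_u32(): succeeds only when the
 * (hypothetical) firmware node carries the property. */
static int fake_property_read_u32(bool present, unsigned int val,
				  unsigned int *out)
{
	if (!present)
		return -EINVAL;
	*out = val;
	return 0;
}

/* Same shape as the tx-delay read: default when the firmware gives
 * nothing, reject out-of-range values (range 0..7 assumed here). */
static int get_tx_delay(bool present, unsigned int fw_val,
			unsigned int *tx_delay)
{
	unsigned int delay;

	if (fake_property_read_u32(present, fw_val, &delay)) {
		*tx_delay = 4;		/* property absent: use default */
		return 0;
	}
	if (delay > 7)
		return -EINVAL;
	*tx_delay = delay;
	return 0;
}
```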
[PATCH 2/2] b43legacy: fix uninitialized reads of ret by initializing the array to zero
From: Colin Ian King The u8 char array ret is not being initialized and elements outside the range start to end contain just garbage values from the stack. This results in a later scan of the array to read potentially uninitialized values. Fix this by initializing the array to zero. This seems to have been an issue since the very first commit. Detected by CoverityScan CID#139653 ("Uninitialized scalar variable") Signed-off-by: Colin Ian King --- drivers/net/wireless/broadcom/b43legacy/radio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/b43legacy/radio.c b/drivers/net/wireless/broadcom/b43legacy/radio.c index 9501420340a9..eab1c9387846 100644 --- a/drivers/net/wireless/broadcom/b43legacy/radio.c +++ b/drivers/net/wireless/broadcom/b43legacy/radio.c @@ -280,7 +280,7 @@ u8 b43legacy_radio_aci_detect(struct b43legacy_wldev *dev, u8 channel) u8 b43legacy_radio_aci_scan(struct b43legacy_wldev *dev) { struct b43legacy_phy *phy = &dev->phy; - u8 ret[13]; + u8 ret[13] = { 0 }; unsigned int channel = phy->channel; unsigned int i; unsigned int j; -- 2.14.1
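Why the zero-initialization matters: the scan only writes the slots for channels in [start, end), but every slot is read afterwards. A small model of that access pattern (the function and values are illustrative, not the b43legacy code):

```c
#include <assert.h>

/* Model of the aci_scan() pattern: only channels [start, end) are
 * written, but all 13 slots are read afterwards.  Zero-initializing
 * the array makes the untouched slots well-defined instead of
 * indeterminate stack garbage. */
static int scan(unsigned char *ret, int start, int end)
{
	for (int i = start; i < end; i++)
		ret[i] = 1;	/* pretend channel i showed interference */

	int hits = 0;
	for (int i = 0; i < 13; i++)	/* reads every slot, written or not */
		hits += ret[i];
	return hits;
}
```

With `u8 ret[13];` the result depends on whatever the stack held; with `= { 0 }` the unwritten slots contribute nothing.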
[PATCH 0/3] drivers: net: xgene: Misc bug fixes
This patch set fixes bugs related to handling the ACPI case for reading and programming tx/rx delay values. Signed-off-by: Iyappan Subramanian --- Iyappan Subramanian (2): drivers: net: xgene: Read tx/rx delay for ACPI drivers: net: xgene: Remove return statement from void function Quan Nguyen (1): drivers: net: xgene: Configure tx/rx delay for ACPI drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 7 ++----- drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 6 ++---- 2 files changed, 4 insertions(+), 9 deletions(-) -- 2.7.4
[PATCH 2/3] drivers: net: xgene: Configure tx/rx delay for ACPI
From: Quan Nguyen This patch fixes configuring tx/rx delay values for ACPI. Signed-off-by: Quan Nguyen Signed-off-by: Iyappan Subramanian --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index e45b587..3188f55 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -468,7 +468,6 @@ static void xgene_enet_configure_clock(struct xgene_enet_pdata *pdata) static void xgene_gmac_set_speed(struct xgene_enet_pdata *pdata) { - struct device *dev = &pdata->pdev->dev; u32 icm0, icm2, mc2; u32 intf_ctl, rgmii, value; @@ -500,10 +499,8 @@ static void xgene_gmac_set_speed(struct xgene_enet_pdata *pdata) intf_ctl |= ENET_GHD_MODE; CFG_MACMODE_SET(&icm0, 2); CFG_WAITASYNCRD_SET(&icm2, 0); - if (dev->of_node) { - CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay); - CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay); - } + CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay); + CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay); rgmii |= CFG_SPEED_1250; xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value); -- 2.7.4
[PATCH 1/2] b43: fix uninitialized reads of ret by initializing the array to zero
From: Colin Ian King The u8 char array ret is not being initialized and elements outside the range start to end contain just garbage values from the stack. This results in a later scan of the array to read potentially uninitialized values. Fix this by initializing the array to zero. This seems to have been an issue since the very first commit. Detected by CoverityScan CID#139652 ("Uninitialized scalar variable") Signed-off-by: Colin Ian King --- drivers/net/wireless/broadcom/b43/phy_g.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/b43/phy_g.c b/drivers/net/wireless/broadcom/b43/phy_g.c index 822dcaa8ace6..f59c02166462 100644 --- a/drivers/net/wireless/broadcom/b43/phy_g.c +++ b/drivers/net/wireless/broadcom/b43/phy_g.c @@ -2297,7 +2297,7 @@ static u8 b43_gphy_aci_detect(struct b43_wldev *dev, u8 channel) static u8 b43_gphy_aci_scan(struct b43_wldev *dev) { struct b43_phy *phy = &dev->phy; - u8 ret[13]; + u8 ret[13] = { 0 }; unsigned int channel = phy->channel; unsigned int i, j, start, end; -- 2.14.1
Re: [PATCH] radix-tree: must check __radix_tree_preload() return value
On Tue, Sep 5, 2017 at 10:59 AM, Eric Dumazet wrote: > @@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged); > */ > void idr_preload(gfp_t gfp_mask) > { > - __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); > + int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); > + > + if (ret) > + preempt_disable(); > } > EXPORT_SYMBOL(idr_preload); Is there a reason for the "ret" variable that is entirely mis-named, since it's never actually used as a return value? (Sure. it's the return value of a function, but that is entirely useless and pointless information, and adds no value. We should name variables by the data they contain or how they are used, not by "it was the return value of a function"). In other words, why isn't this just if (__radix_tree_preload(..)) preempt_disable(); which is shorter and clearer and not confusing? > @@ -2118,13 +2121,14 @@ EXPORT_SYMBOL(idr_preload); > */ > int ida_pre_get(struct ida *ida, gfp_t gfp) > { > - __radix_tree_preload(gfp, IDA_PRELOAD_SIZE); > + int ret = __radix_tree_preload(gfp, IDA_PRELOAD_SIZE); > /* > * The IDA API has no preload_end() equivalent. Instead, > * ida_get_new() can return -EAGAIN, prompting the caller > * to return to the ida_pre_get() step. > */ > - preempt_enable(); > + if (!ret) > + preempt_enable(); Same issue, but this time strengthened by an additional "why doesn't this just use that idr_preload function then?" question.. Linus
[PATCH] radix-tree: must check __radix_tree_preload() return value
From: Eric Dumazet __radix_tree_preload() only disables preemption if no error is returned. So we really need to make sure callers always check the return value. idr_preload() contract is to always disable preemption, so we need to add a missing preempt_disable() if an error happened. Similarly, ida_pre_get() only needs to call preempt_enable() in the case no error happened. Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable") Signed-off-by: Eric Dumazet Cc: [4.11+] --- lib/radix-tree.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 3527eb364964..fac702039304 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node) * To make use of this facility, the radix tree must be initialised without * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE(). */ -static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) +static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) { struct radix_tree_preload *rtp; struct radix_tree_node *node; @@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged); */ void idr_preload(gfp_t gfp_mask) { - __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); + int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE); + + if (ret) + preempt_disable(); } EXPORT_SYMBOL(idr_preload); @@ -2118,13 +2121,14 @@ EXPORT_SYMBOL(idr_preload); */ int ida_pre_get(struct ida *ida, gfp_t gfp) { - __radix_tree_preload(gfp, IDA_PRELOAD_SIZE); + int ret = __radix_tree_preload(gfp, IDA_PRELOAD_SIZE); /* * The IDA API has no preload_end() equivalent. Instead, * ida_get_new() can return -EAGAIN, prompting the caller * to return to the ida_pre_get() step. */ - preempt_enable(); + if (!ret) + preempt_enable(); if (!this_cpu_read(ida_bitmap)) { struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
Re: [PATCH net-next 2/4] ip/options: explicitly provide net ns to __ip_options_echo()
On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote: > __ip_options_echo() uses the current network namespace, and > currently retrives it via skb->dst->dev. > > This commit adds an explicit 'net' argument to __ip_options_echo() > and update all the call sites to provide it, usually via a simpler > sock_net(). > > After this change, __ip_options_echo() no more needs to access > skb->dst and we can drop a couple of hack to preserve such > info in the rx path. > > Signed-off-by: Paolo Abeni> --- David, Paolo This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options: explicitly provide net ns to __ip_options_echo()") needs to be backported to linux-4.13 stable version to avoid these kind of crashes [1] This is because of MSG_PEEK operation, hitting skb_consume_udp() while skb is still in receive queue. Next read() finding again the skb then can see a NULL skb->dst Thanks ! [1] general protection fault: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: 8801cd0a4380 task.stack: 8801cc498000 RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: 0018:8801cc49f628 EFLAGS: 00010246 RAX: dc00 RBX: 8801cc49f928 RCX: RDX: RSI: 0001 RDI: 0004 RBP: 8801cc49f6b8 R08: 8801cc49f936 R09: ed0039893f28 R10: 0003 R11: ed0039893f27 R12: 8801cc49f918 R13: 8801ccbcf36c R14: 000f R15: 0018 FS: 00979880() GS:8801db20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 200c0ff0 CR3: 0001cc4ed000 CR4: 001406f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ip_options_echo include/net/ip.h:574 [inline] ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline] ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207 udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641 inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793 sock_recvmsg_nosec net/socket.c:792 [inline] sock_recvmsg+0xc9/0x110 
net/socket.c:799 SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788 SyS_recvfrom+0x40/0x50 net/socket.c:1760 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x444c89 RSP: 002b:7ffd80c788e8 EFLAGS: 0286 ORIG_RAX: 002d RAX: ffda RBX: RCX: 00444c89 RDX: RSI: 20bc RDI: 0004 RBP: 0082 R08: 200c0ff0 R09: 0010 R10: R11: 0286 R12: 00402390 R13: 00402420 R14: R15: Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8 48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c 02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: 8801cc49f628 ---[ end trace b30d95b284222843 ]--- Kernel panic - not syncing: Fatal exception
Re: [PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock
On Tue, 2017-09-05 at 08:31 -0700, Jakub Kicinski wrote: > The new TC IDR code uses GFP_KERNEL under spin lock. Which leads > to: > > [ 582.621091] BUG: sleeping function called from invalid context at > ../mm/slab.h:416 > [ 582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc > [ 582.636939] 2 locks held by tc/3379: > [ 582.641049] #0: (rtnl_mutex){+.+.+.}, at: [] > rtnetlink_rcv_msg+0x92e/0x1400 > [ 582.650958] #1: (&(>idrinfo->lock)->rlock){+.-.+.}, at: > [] tcf_idr_create+0x2f0/0x8e0 > [ 582.662217] Preemption disabled at: > [ 582.66] [] tcf_idr_create+0x2f0/0x8e0 > [ 582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: GW > 4.13.0-rc7-debug-00648-g43503a79b9f0 #287 > [ 582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 > 11/08/2016 > [ 582.691937] Call Trace: > ... > [ 582.742460] kmem_cache_alloc+0x286/0x540 > [ 582.747055] radix_tree_node_alloc.constprop.6+0x4a/0x450 > [ 582.753209] idr_get_free_cmn+0x627/0xf80 > ... > [ 582.815525] idr_alloc_cmn+0x1a8/0x270 > ... > [ 582.833804] tcf_idr_create+0x31b/0x8e0 > ... > > Try to preallocate the memory with idr_prealloc(GFP_KERNEL) > (as suggested by Eric Dumazet), and change the allocation > flags under spin lock. > > Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use > IDR") > Signed-off-by: Jakub Kicinski > Reviewed-by: Simon Horman > --- Acked-by: Eric Dumazet Not related to your patch, but it looks like idr_preload() has a bug added in 4.11. I will send a fix to lkml.
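The underlying pattern of the fix is standard: perform any allocation that may sleep before taking the spinlock, so the critical section itself never allocates. A userspace model of that discipline — an asserted flag stands in for the spinlock, and the names are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of the fix: do the sleeping allocation ("GFP_KERNEL")
 * with no lock held, then only link the preallocated node while the
 * lock is held, where sleeping would be a bug. */
struct node {
	int id;
	struct node *next;
};

static int lock_held;		/* stands in for the spinlock */
static struct node *head;

static void spin_lock(void)   { lock_held = 1; }
static void spin_unlock(void) { lock_held = 0; }

static struct node *alloc_node(int id)
{
	assert(!lock_held);	/* may sleep: must not hold the lock */
	struct node *n = malloc(sizeof(*n));
	if (n)
		n->id = id;
	return n;
}

static int insert(int id)
{
	struct node *n = alloc_node(id);  /* preallocate outside the lock */

	if (!n)
		return -1;
	spin_lock();		/* critical section: no allocations */
	n->next = head;
	head = n;
	spin_unlock();
	return 0;
}
```

The IDR preload API follows the same idea: the preload fills a per-CPU reserve with GFP_KERNEL up front, so the later allocation inside the lock can be satisfied atomically.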
[ANNOUNCE] Iproute2 for Linux 4.13
Update to iproute2 utility to support new features in Linux 4.13. This is a larger than usual release because of lots of updates for BPF and the new RDMA utility. Lots of cleanups and Coverity reported potential issues as well. Source: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.13.0.tar.gz Repository: git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git Report problems (or enhancements) to the netdev@vger.kernel.org mailing list. --- Alexander Alemayhu (1): examples/bpf: update list of examples Andreas Henriksson (1): ss: fix help/man TCP-STATE description for listening Arkadi Sharshevsky (1): bridge: Distinguish between externally learned vs offloaded FDBs Casey Callendrello (1): netns: make /var/run/netns bind-mount recursive Daniel Borkmann (8): bpf: remove obsolete samples bpf: support loading map in map from obj bpf: dump id/jited info for cls/act programs bpf: improve error reporting around tail calls bpf: fix mnt path when from env bpf: unbreak libelf linkage for bpf obj loader bpf: minor cleanups for bpf_trace_pipe bpf: consolidate dumps to use bpf_dump_prog_info David Ahern (2): lib: Dump ext-ack string by default libnetlink: Fix extack attribute parsing Girish Moodalbail (1): geneve: support for modifying geneve device Hangbin Liu (1): utils: return default family when rtm_family is not RTNL_FAMILY_IPMR/IP6MR Jakub Kicinski (3): bpf: print xdp offloaded mode bpf: add xdpdrv for requesting XDP driver mode bpf: allow requesting XDP HW offload Jiri Benc (2): tc: m_tunnel_key: reformat the usage text tc: m_tunnel_key: add csum/nocsum option Jiri Pirko (7): tc_filter: add support for chain index tc: actions: add helpers to parse and print control actions tc/actions: introduce support for goto chain action tc: flower: add support for tcp flags tc: gact: fix control action parsing tc: add support for TRAP action tc: don't print error message on miss when parsing action with default Leon Romanovsky (8): utils: Move BIT macro 
to common header rdma: Add basic infrastructure for RDMA tool rdma: Add dev object rdma: Add link object rdma: Add json and pretty outputs rdma: Implement json output for dev object rdma: Add json output to link object rdma: Add initial manual for the tool Martin KaFai Lau (1): bpf: Add support for IFLA_XDP_PROG_ID Matteo Croce (3): tc: fix typo in manpage netns: avoid directory traversal netns: more input validation Michal Kubecek (2): iplink: check for message truncation in iplink_get() iplink: double the buffer size also in iplink_get() Or Gerlitz (1): tc: flower: add support for matching on ip tos and ttl Phil Sutter (58): bpf: Make bytecode-file reading a little more robust Really fix get_addr() and get_prefix() error messages tc-simple: Fix documentation examples: Some shell fixes to cbq.init ifcfg: Quote left-hand side of [ ] expression tipc/node: Fix socket fd check in cmd_node_get_addr() iproute_lwtunnel: Argument to strerror must be positive iproute_lwtunnel: csum_mode value checking was ineffective ss: Don't leak fd in tcp_show_netlink_file() tc/em_ipset: Don't leak sockfd on error path ipvrf: Fix error path of vrf_switch() ifstat: Fix memleak in error case ifstat: Fix memleak in dump_kern_db() for json output ss: Fix potential memleak in unix_stats_print() tipc/bearer: Fix resource leak in error path devlink: No need for this self-assignment ipntable: No need to check and assign to parms_rta iproute: Fix for missing 'Oifs:' display lib/rt_names: Drop dead code in rtnl_rttable_n2a() ss: Skip useless check in parse_hostcond() ss: Drop useless assignment tc/m_gact: Drop dead code ipaddress: Avoid accessing uninitialized variable lcl iplink_can: Prevent overstepping array bounds ipmaddr: Avoid accessing uninitialized data ss: Use C99 initializer in netlink_show_one() netem/maketable: Check return value of fstat() tc/q_multiq: Don't pass garbage in TCA_OPTIONS iproute: Check mark value input iplink_vrf: Complain if main table is not found devlink: Check 
return code of strslashrsplit() lib/bpf: Don't leak fp in bpf_find_mntpt() ifstat, nstat: Check fdopen() return value tc/q_netem: Don't dereference possibly NULL pointer tc/tc_filter: Make sure filter name is not empty tipc/bearer: Prevent NULL pointer dereference ipntable: Avoid memory allocation for filter.name lib/fs: Fix format string in find_fs_mount() lib/inet_proto: Review inet_proto_{a2n,n2a}() lnstat_util: Simplify alloc_and_open() a bit tc/m_xt: Fix for
Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage
On 2017-09-05 09:12:40 [+0200], Thomas Gleixner wrote: > Sorry, no. That bisect is completely bogus. The commit in question merily > replaces the unsupported clockid with a valid one. The bisect is correct. It just has problems to express itself properly. So the table says: | WARNING:at_kernel/time/hrtimer.c:#hrtimer_init | 7 | || | | BUG:kernel_reboot-without-warning_in_test_stage | 0 | 12 | 6 | 6 | which means _before_ your commit it counted a warning in hrtimer_init() (an unsupported clock id was used). With your commit, the warning was gone and I *think* the userland then printed "BUG:kernel_reboot-without-warning_in_test_stage" because it had no warning. It seems that the bot learned to live with that warning which was around for more than three years. Now that you removed it, it seems to be a mistake to do so because nobody complained about it so far. > Thanks, > > tglx Sebastian
Re: [PATCH iproute2 master 2/2] bpf: consolidate dumps to use bpf_dump_prog_info
On 09/05/2017 06:35 PM, Stephen Hemminger wrote: [...] I applied this to master, and resolved conflicts with net-next. But the dump with JSON of xdp is now incomplete. Ok, I will check it out, and send a follow-up to make it complete again.
Re: [PATCH iproute2 master 2/2] bpf: consolidate dumps to use bpf_dump_prog_info
On Tue, 5 Sep 2017 02:24:32 +0200 Daniel Borkmann wrote: > Consolidate dump of prog info to use bpf_dump_prog_info() when possible. > Moving forward, we want to have a consistent output for BPF progs when > being dumped. E.g. in cls/act case we used to dump tag as a separate > netlink attribute before we had BPF_OBJ_GET_INFO_BY_FD bpf(2) command. > > Move dumping tag into bpf_dump_prog_info() as well, and only dump the > netlink attribute for older kernels. Also, reuse bpf_dump_prog_info() > for XDP case, so we can dump tag and whether program was jited, which > we currently don't show. > > Signed-off-by: Daniel Borkmann I applied this to master, and resolved conflicts with net-next. But the dump with JSON of xdp is now incomplete.
Re: [PATCH net] rds: Fix non-atomic operation on shared flag variable
On 9/5/2017 8:42 AM, Håkon Bugge wrote: The bits in m_flags in struct rds_message are used for a plurality of reasons, and from different contexts. To avoid any missing updates to m_flags, use the atomic set_bit() instead of the non-atomic equivalent. Signed-off-by: Håkon Bugge Reviewed-by: Knut Omang Reviewed-by: Wei Lin Guay --- net/rds/send.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index 41b9f0f..058a407 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp) len = ntohl(rm->m_inc.i_hdr.h_len); if (cp->cp_unacked_packets == 0 || cp->cp_unacked_bytes < len) { - __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); cp->cp_unacked_packets = rds_sysctl_max_unacked_packets; @@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn, * throughput hits a certain threshold. */ if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2) - __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); list_add_tail(&rm->m_sock_item, &rs->rs_send_queue); set_bit(RDS_MSG_ON_SOCK, &rm->m_flags); Indeed, these couple of instances remained for the m_flags. Patch looks good. Thanks !! Acked-by: Santosh Shilimkar
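The distinction being fixed is non-atomic vs atomic read-modify-write: __set_bit() is a plain load/or/store, so two contexts setting different bits concurrently can lose one update; set_bit() is an atomic RMW. A C11 model with atomic_fetch_or() playing the role of set_bit() — the bit numbers here are arbitrary, for the model only:

```c
#include <assert.h>
#include <stdatomic.h>

/* C11 model of set_bit(): an atomic read-modify-write, so concurrent
 * setters of different bits can never lose each other's update, unlike
 * a plain "flags |= bit" (the __set_bit() equivalent). */
static _Atomic unsigned long m_flags;

#define RDS_MSG_ACK_REQUIRED 1	/* arbitrary bit numbers for the model */
#define RDS_MSG_ON_SOCK      2

static void set_bit_model(int nr, _Atomic unsigned long *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}
```

Under concurrency, a lost update from the non-atomic version would mean, e.g., a message silently missing its ACK_REQUIRED flag, which matches why the patch converts the remaining __set_bit() calls.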
Re: [PATCH iproute2/master] tc actions: store and dump correct length of user cookies
On Tue, 5 Sep 2017 13:06:24 +0200 Simon Horman wrote: > Correct two errors which cancel each other out: > * Do not send twice the length of the actual cookie provided by the user to the > kernel > * Do not dump half the length of the cookie provided by the kernel > > As the cookie is now stored in the kernel at its correct length rather > than double that length, cookies of up to the maximum size of 16 bytes > may now be stored rather than a maximum of half that length. > > Output of dump is the same before and after this change, > but the data stored in the kernel is now exactly the cookie > rather than the cookie + as many trailing zeros. > > Before: > # tc filter add dev eth0 protocol ip parent : \ >flower ip_proto udp action drop \ >cookie 0123456789abcdef0123456789abcdef > RTNETLINK answers: Invalid argument > > After: > # tc filter add dev eth0 protocol ip parent : \ >flower ip_proto udp action drop \ >cookie 0123456789abcdef0123456789abcdef > # tc filter show dev eth0 ingress >eth_type ipv4 >ip_proto udp >not_in_hw >action order 1: gact action drop > random type none pass val 0 > index 1 ref 1 bind 1 installed 1 sec used 1 sec >Action statistics: >Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) >backlog 0b 0p requeues 0 >cookie len 16 0123456789abcdef0123456789abcdef > > Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies") > Cc: Jamal Hadi Salim > Signed-off-by: Simon Horman Applied thanks
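The bug was purely length accounting: the cookie is entered as a hex string, so the byte length handed to the kernel must be strlen/2, and doubling it made maximum-size (16-byte) cookies overflow the limit. A sketch of the parse with correct accounting — function names are illustrative, not the iproute2 code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TC_COOKIE_MAX_SIZE 16	/* kernel-side maximum cookie length */

static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Parse a hex cookie string into bytes.  The length reported to the
 * caller must be the *byte* count (strlen/2), not the character count:
 * sending the doubled length is what rejected 16-byte cookies. */
static int parse_cookie(const char *hex, unsigned char *buf, size_t *len)
{
	size_t n = strlen(hex);

	if (n % 2 || n / 2 > TC_COOKIE_MAX_SIZE)
		return -1;
	for (size_t i = 0; i < n; i += 2) {
		int hi = hex_nibble(hex[i]), lo = hex_nibble(hex[i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		buf[i / 2] = (unsigned char)(hi << 4 | lo);
	}
	*len = n / 2;
	return 0;
}
```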
Re: [PATCH iproute2 master 0/2] Two minor BPF updates
On Tue, 5 Sep 2017 02:24:30 +0200 Daniel Borkmann wrote: > Two minor updates including a small cleanup for dumping > the trace pipe and one for consolidating prog dumps for > tc and xdp to use bpf_prog_info_by_fd() when possible. > > Thanks! > > Daniel Borkmann (2): > bpf: minor cleanups for bpf_trace_pipe > bpf: consolidate dumps to use bpf_dump_prog_info > > include/bpf_util.h | 2 +- > ip/ipaddress.c | 6 -- > ip/iplink_xdp.c| 19 +++ > ip/xdp.h | 2 +- > lib/bpf.c | 31 ++- > tc/f_bpf.c | 8 > tc/m_bpf.c | 8 > 7 files changed, 47 insertions(+), 29 deletions(-) Applied, thanks Daniel
Re: [PATCH] soc: ti/knav_dma: include dmaengine header
From: Arnd Bergmann Date: Tue, 5 Sep 2017 10:31:35 +0200 > A header file cleanup apparently caused a build regression > with one driver using the knav infrastructure: > > In file included from drivers/net/ethernet/ti/netcp_core.c:30:0: > include/linux/soc/ti/knav_dma.h:129:30: error: field 'direction' has > incomplete type > enum dma_transfer_direction direction; > ^ > drivers/net/ethernet/ti/netcp_core.c: In function 'netcp_txpipe_open': > drivers/net/ethernet/ti/netcp_core.c:1349:21: error: 'DMA_MEM_TO_DEV' > undeclared (first use in this function); did you mean 'DMA_MEMORY_MAP'? > config.direction = DMA_MEM_TO_DEV; > ^~ > DMA_MEMORY_MAP > drivers/net/ethernet/ti/netcp_core.c:1349:21: note: each undeclared > identifier is reported only once for each function it appears in > drivers/net/ethernet/ti/netcp_core.c: In function > 'netcp_setup_navigator_resources': > drivers/net/ethernet/ti/netcp_core.c:1659:22: error: 'DMA_DEV_TO_MEM' > undeclared (first use in this function); did you mean 'DMA_DESC_HOST'? > config.direction = DMA_DEV_TO_MEM; > > As the header is no longer included implicitly through netdevice.h, > we should include it in the header that references the enum. > > Fixes: 0dd5759dbb1c ("net: remove dmaengine.h inclusion from netdevice.h") > Signed-off-by: Arnd Bergmann > --- > If the cleanup patch hasn't been submitted for mainline yet, please > add this fixup to the net-next tree, otherwise I'll merge it through > arm-soc. Applied to net-next.
Re: [PATCH] net/ncsi: fix ncsi_vlan_rx_{add,kill}_vid references
From: Arnd Bergmann Date: Tue, 5 Sep 2017 10:05:47 +0200 > We get a new link error in allmodconfig kernels after ftgmac100 > started using the ncsi helpers: > > ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] > undefined! > ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] > undefined! > > Related to that, we get another error when CONFIG_NET_NCSI is disabled: > > drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: > 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean > 'ncsi_start_dev'? > drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: > 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean > 'ncsi_vlan_rx_add_vid'? > > This fixes both problems at once, using a 'static inline' stub helper > for the disabled case, and exporting the functions when they are present. > > Fixes: 51564585d8c6 ("ftgmac100: Support NCSI VLAN filtering when available") > Fixes: 21acf63013ed ("net/ncsi: Configure VLAN tag filter") > Signed-off-by: Arnd Bergmann Applied.
Re: [PATCH net-next] bpf: fix numa_node validation
From: Eric Dumazet
Date: Mon, 04 Sep 2017 22:41:02 -0700

> From: Eric Dumazet
>
> syzkaller reported crashes in bpf map creation or map update [1]
>
> Problem is that nr_node_ids is a signed integer,
> NUMA_NO_NODE is also an integer, so it is very tempting
> to declare numa_node as a signed integer.
>
> This means the typical test to validate a user provided value:
>
> 	if (numa_node != NUMA_NO_NODE &&
> 	    (numa_node >= nr_node_ids ||
> 	     !node_online(numa_node)))
>
> must be written:
>
> 	if (numa_node != NUMA_NO_NODE &&
> 	    ((unsigned int)numa_node >= nr_node_ids ||
> 	     !node_online(numa_node)))
>
> [1] ...
>
> Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
> Signed-off-by: Eric Dumazet

Applied, thanks.
Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa wrote:
> Tom Herbert writes:
>
>> There is absolutely no requirement in IP that packets are delivered in
>> order -- there never has been and there never will be! If the ULP, like
>> Ethernet encapsulation, requires in-order delivery then it needs to
>> implement that itself, like TCP, GRE, and other protocols ensure that
>> with sequence numbers and reassembly. All of these hoops we jump through
>> to make sure that packets always follow the same path and are always in
>> order are done for the benefit of middlebox devices like stateful
>> firewalls that have forced us to adopt their concept of network
>> architecture -- in the long run this is self-defeating and kills our
>> ability to innovate.
>>
>> I'm not saying that we shouldn't consider legacy devices, but we
>> should scrutinize new development or solutions that perpetuate
>> incorrect design or bad assumptions.
>
> So configure RSS per port and ensure no fragments are sent to those
> ports. This is possible and rather easy to do. It solves the problem
> with legacy software and it spreads out packets for your applications.
>
> It is not perfect but it is working and solves both problems.

Hannes,

I don't see how that solves anything. The purpose of RSS is to distribute the load of protocol packets across queues. This needs to work for UDP applications. For instance, if I were building a QUIC server I'd want the sort of flow distribution that a TCP server would give. You can't do that by configuring a few ports in the device.

If I were to suggest any HW change, it would be to not do DPI on fragments (MF or offset is set). This ensures that all packets of the fragment train get hashed to the same queue, and is in fact what RPS has been doing for years without any complaints. But even before I'd make that recommendation, I'd really like to understand what the problem actually is.
The only thing I can garner from this discussion and the Intel patch is that when fragments are received OOO, that is perceived as a problem. But the protocol specification clearly says this is not a problem. So the questions are: who is negatively affected by this? Is this a problem because some artificial test that checks for everything to be in order is now failing? Is this affecting real users? Is this an issue in the stack or really with some implementation outside of the stack? If it is an implementation outside of the stack, then are we just bandaid'ing over someone else's incorrect implementation by patching the kernel (like would have been the case if we changed the kernel to interoperate with Facebook's switch that couldn't handle OOO in twstate)?

Thanks,
Tom

> Bye,
> Hannes
[PATCH 24/25 v2] net/cdc_ncm: Replace tasklet with softirq hrtimer
From: Thomas Gleixner

The bh tasklet is used to invoke the hrtimer work (cdc_ncm_tx_timer_cb) in softirq context. This can also be achieved without the tasklet, by using CLOCK_MONOTONIC_SOFT as the hrtimer base.

Signed-off-by: Thomas Gleixner
Signed-off-by: Anna-Maria Gleixner
Cc: Oliver Neukum
Cc: Greg Kroah-Hartman
Cc: linux-...@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Greg Kroah-Hartman
[bigeasy: using usb_get_intfdata() as suggested by Bjørn Mork]
Signed-off-by: Sebastian Andrzej Siewior
---
On 2017-08-31 15:57:04 [+0200], Bjørn Mork wrote:
> I believe the struct usbnet pointer is redundant. We already have lots
> of pointers back and forth here. This should work, but is not tested:
>
> struct usbnet *dev = usb_get_intfdata(ctx->control);

I think so, too. Still untested as I don't have a working gadget around.

v1…v2: Updated as suggested by Bjørn and added Greg's Acked-by.

 drivers/net/usb/cdc_ncm.c   | 36 +++-
 include/linux/usb/cdc_ncm.h |  1 -
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index 8f572b9f3625..42f7bd90e6a4 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -61,7 +61,6 @@ static bool prefer_mbim;
 module_param(prefer_mbim, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(prefer_mbim, "Prefer MBIM setting on dual NCM/MBIM functions");
 
-static void cdc_ncm_txpath_bh(unsigned long param);
 static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx);
 static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *hr_timer);
 static struct usb_driver cdc_ncm_driver;
@@ -777,10 +776,8 @@ int cdc_ncm_bind_common(struct usbnet *dev, struct usb_interface *intf, u8 data_
 	if (!ctx)
 		return -ENOMEM;
 
-	hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC_SOFT, HRTIMER_MODE_REL);
 	ctx->tx_timer.function = &cdc_ncm_tx_timer_cb;
-	ctx->bh.data = (unsigned long)dev;
-	ctx->bh.func = cdc_ncm_txpath_bh;
 	atomic_set(&ctx->stop, 0);
 	spin_lock_init(&ctx->mtx);
@@ -967,10 +964,7 @@ void cdc_ncm_unbind(struct usbnet *dev, struct usb_interface *intf)
 
 	atomic_set(&ctx->stop, 1);
 
-	if (hrtimer_active(&ctx->tx_timer))
-		hrtimer_cancel(&ctx->tx_timer);
-
-	tasklet_kill(&ctx->bh);
+	hrtimer_cancel(&ctx->tx_timer);
 
 	/* handle devices with combined control and data interface */
 	if (ctx->control == ctx->data)
@@ -1348,20 +1342,9 @@ static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx)
 			      HRTIMER_MODE_REL);
 }
 
-static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer)
+static void cdc_ncm_txpath_bh(struct cdc_ncm_ctx *ctx)
 {
-	struct cdc_ncm_ctx *ctx =
-		container_of(timer, struct cdc_ncm_ctx, tx_timer);
-
-	if (!atomic_read(&ctx->stop))
-		tasklet_schedule(&ctx->bh);
-	return HRTIMER_NORESTART;
-}
-
-static void cdc_ncm_txpath_bh(unsigned long param)
-{
-	struct usbnet *dev = (struct usbnet *)param;
-	struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0];
+	struct usbnet *dev = usb_get_intfdata(ctx->control);
 
 	spin_lock_bh(&ctx->mtx);
 	if (ctx->tx_timer_pending != 0) {
@@ -1379,6 +1362,17 @@ static void cdc_ncm_txpath_bh(unsigned long param)
 	}
 }
 
+static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer)
+{
+	struct cdc_ncm_ctx *ctx =
+		container_of(timer, struct cdc_ncm_ctx, tx_timer);
+
+	if (!atomic_read(&ctx->stop))
+		cdc_ncm_txpath_bh(ctx);
+
+	return HRTIMER_NORESTART;
+}
+
 struct sk_buff *
 cdc_ncm_tx_fixup(struct usbnet *dev, struct sk_buff *skb, gfp_t flags)
 {
diff --git a/include/linux/usb/cdc_ncm.h b/include/linux/usb/cdc_ncm.h
index 1a59699cf82a..62b506fddf8d 100644
--- a/include/linux/usb/cdc_ncm.h
+++ b/include/linux/usb/cdc_ncm.h
@@ -92,7 +92,6 @@ struct cdc_ncm_ctx {
 	struct usb_cdc_ncm_ntb_parameters ncm_parm;
 	struct hrtimer tx_timer;
-	struct tasklet_struct bh;
 
 	const struct usb_cdc_ncm_desc *func_desc;
 	const struct usb_cdc_mbim_desc *mbim_desc;
-- 
2.14.1
[PATCH net] rds: Fix non-atomic operation on shared flag variable
The bits in m_flags in struct rds_message are used for a plurality of reasons, and from different contexts. To avoid any missing updates to m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge
Reviewed-by: Knut Omang
Reviewed-by: Wei Lin Guay
---
 net/rds/send.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 41b9f0f..058a407 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
 			len = ntohl(rm->m_inc.i_hdr.h_len);
 			if (cp->cp_unacked_packets == 0 ||
 			    cp->cp_unacked_bytes < len) {
-				__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+				set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
 				cp->cp_unacked_packets =
 					rds_sysctl_max_unacked_packets;
@@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 	 * throughput hits a certain threshold.
 	 */
 	if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
-		__set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+		set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
 	list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
 	set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
-- 
2.9.3
[PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock
The new TC IDR code uses GFP_KERNEL under spin lock. Which leads to:

[  582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416
[  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
[  582.636939] 2 locks held by tc/3379:
[  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv_msg+0x92e/0x1400
[  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [] tcf_idr_create+0x2f0/0x8e0
[  582.662217] Preemption disabled at:
[  582.66] [] tcf_idr_create+0x2f0/0x8e0
[  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G        W 4.13.0-rc7-debug-00648-g43503a79b9f0 #287
[  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[  582.691937] Call Trace:
...
[  582.742460]  kmem_cache_alloc+0x286/0x540
[  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
[  582.753209]  idr_get_free_cmn+0x627/0xf80
...
[  582.815525]  idr_alloc_cmn+0x1a8/0x270
...
[  582.833804]  tcf_idr_create+0x31b/0x8e0
...

Try to preallocate the memory with idr_preload(GFP_KERNEL) (as suggested by Eric Dumazet), and change the allocation flags under spin lock.
Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR")
Signed-off-by: Jakub Kicinski
Reviewed-by: Simon Horman
---
 net/sched/act_api.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 0eb545bcb247..a306974e2fb4 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -296,10 +296,12 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	spin_lock_init(&p->tcfa_lock);
 	/* user doesn't specify an index */
 	if (!index) {
+		idr_preload(GFP_KERNEL);
 		spin_lock_bh(&idrinfo->lock);
 		err = idr_alloc_ext(idr, NULL, &idr_index, 1, 0,
-				    GFP_KERNEL);
+				    GFP_ATOMIC);
 		spin_unlock_bh(&idrinfo->lock);
+		idr_preload_end();
 		if (err) {
err3:
 			free_percpu(p->cpu_qstats);
@@ -307,10 +309,12 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 		}
 		p->tcfa_index = idr_index;
 	} else {
+		idr_preload(GFP_KERNEL);
 		spin_lock_bh(&idrinfo->lock);
 		err = idr_alloc_ext(idr, NULL, NULL, index, index + 1,
-				    GFP_KERNEL);
+				    GFP_ATOMIC);
 		spin_unlock_bh(&idrinfo->lock);
+		idr_preload_end();
 		if (err)
 			goto err3;
 		p->tcfa_index = index;
-- 
2.14.1