[Patch net] net_sched: fix a memory leak of filter chain

2017-09-05 Thread Cong Wang
tcf_chain_destroy() is called by tcf_block_put() and tcf_chain_put().
tcf_chain_put() is refcounted and paired with tcf_chain_get(), but
tcf_block_put() is not; it should be paired with tcf_block_get(), and
we still need to decrease the refcnt. However, tcf_block_put() is
special: it stores the chains too, so we have to detach them if it is
not the last user.

What's more, index 0 is not special at all, it should be treated
like other chains. This also makes the code more readable.
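The get/put pairing the patch restores can be sketched in plain C. The names below are illustrative stand-ins for the pattern, not the kernel API:

```c
#include <stdlib.h>

/* Minimal model of the pairing the patch enforces: every get bumps
 * refcnt, every put drops it, and the object is freed only when the
 * last user is gone; detaching from the block's chain list is a
 * separate step (list_del_init() in the real code). */
struct chain {
    int refcnt;
    int detached;   /* stands in for removal from block->chain_list */
};

struct chain *chain_get(struct chain *c)
{
    c->refcnt++;
    return c;
}

void chain_detach(struct chain *c)
{
    c->detached = 1;
}

/* Returns 1 if this put freed the chain, 0 if other users remain. */
int chain_put(struct chain *c)
{
    if (--c->refcnt == 0) {
        chain_detach(c);
        free(c);
        return 1;
    }
    return 0;
}
```

In this model, tcf_block_put() simply drops the reference it implicitly holds via tcf_block_get(), and any chain still referenced by a tc action is only detached, not freed.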

Fixes: 744a4cf63e52 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
Reported-by: Jakub Kicinski 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/cls_api.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 6c5ea84d2682..c6d25b29bcd4 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -213,17 +213,17 @@ static void tcf_chain_flush(struct tcf_chain *chain)
}
 }
 
-static void tcf_chain_destroy(struct tcf_chain *chain)
+static void tcf_chain_detach(struct tcf_chain *chain)
 {
/* May be already removed from the list by the previous call. */
if (!list_empty(&chain->list))
list_del_init(&chain->list);
+}
 
-   /* There might still be a reference held when we got here from
-* tcf_block_put. Wait for the user to drop reference before free.
-*/
-   if (!chain->refcnt)
-   kfree(chain);
+static void tcf_chain_destroy(struct tcf_chain *chain)
+{
+   tcf_chain_detach(chain);
+   kfree(chain);
 }
 
 struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
@@ -246,10 +246,7 @@ EXPORT_SYMBOL(tcf_chain_get);
 
 void tcf_chain_put(struct tcf_chain *chain)
 {
-   /* Destroy unused chain, with exception of chain 0, which is the
-* default one and has to be always present.
-*/
-   if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
+   if (--chain->refcnt == 0)
tcf_chain_destroy(chain);
 }
 EXPORT_SYMBOL(tcf_chain_put);
@@ -296,8 +293,11 @@ void tcf_block_put(struct tcf_block *block)
 
list_for_each_entry_safe(chain, tmp, &block->chain_list, list) {
tcf_chain_flush(chain);
-   tcf_chain_destroy(chain);
+   tcf_chain_put(chain);
}
+   /* If tc actions still hold the chain, just detach it. */
+   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   tcf_chain_detach(chain);
kfree(block);
 }
 EXPORT_SYMBOL(tcf_block_put);
-- 
2.13.0



Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Roopa Prabhu
On Tue, Sep 5, 2017 at 3:25 PM, Jamal Hadi Salim  wrote:
> On 17-09-05 06:01 PM, Roopa Prabhu wrote:
>
>>
>> yes, like Nikolay says we have been discussing this as well. Nikolay's
>> patch is a clever and most importantly non-invasive
>> way today given the anchor point for tc rules is a netdev. we have
>> also considered a separate implicit tc anchor device.
>>   lo seemed like a good fit for all global rules given it is already
>> available. And it is not uncommon to hang off global
>> network config on the loopback interface.
>>
>
> IMO, Jiri has done all the necessary work already with the concept of
> blocks. We dont really need the netdev to be the attachment point.
>
> You can add to a block in many locations in the kernel by
> constructing the proper "coordinates" in the tcmsg.
> i.e. this:
>   tcmsg {
> unsigned char   tcm_family;
> unsigned char   tcm__pad1;
> unsigned short  tcm__pad2;
> int tcm_ifindex;
> __u32   tcm_handle;
> __u32   tcm_parent;
> }
>
> If you were to set tcm_ifindex to -1 (since that is not a legit
> ifindex) then all we need to do is define a parent for a
> different location. Current locations tied to netdevs are:
> -
> #define TC_H_ROOT   (0xU)
> #define TC_H_INGRESS(0xFFF1U)
> #define TC_H_CLSACT TC_H_INGRESS
>
> #define TC_H_MIN_INGRESS0xFFF2U
> #define TC_H_MIN_EGRESS 0xFFF3U
> -
>
> You should be able to say add a location which maps to a pre-routing
> or post-routing etc; and this would work as well...
>

ok, interesting. Jiri's examples still showed netdev as attachment point.
we will explore some more. thanks.
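Jamal's "coordinates" idea above can be sketched as follows. The `targets_global_block` helper is hypothetical — a proposal from this thread, not existing kernel behavior:

```c
#include <stdint.h>

/* The tcmsg layout quoted in the discussion, plus a hypothetical
 * check sketching the proposal: tcm_ifindex == -1 can never be a
 * legitimate ifindex, so it could address a netns-global attachment
 * point instead of a netdev. */
struct tcmsg_sketch {
    unsigned char  tcm_family;
    unsigned char  tcm__pad1;
    unsigned short tcm__pad2;
    int            tcm_ifindex;
    uint32_t       tcm_handle;
    uint32_t       tcm_parent;
};

/* Hypothetical: would a request with this header bypass netdev
 * attachment and target a global block? */
int targets_global_block(const struct tcmsg_sketch *t)
{
    return t->tcm_ifindex == -1;
}
```

The remaining work under this scheme would be defining new TC_H_* parent values for the non-netdev locations (pre-routing, post-routing, etc.).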


Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Roopa Prabhu
On Tue, Sep 5, 2017 at 3:45 PM, Daniel Borkmann  wrote:
> On 09/06/2017 12:01 AM, Roopa Prabhu wrote:
>>
>> On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang 
>> wrote:
>>>
>>> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
>>>  wrote:

 Hi all,
 This RFC adds a new mode for clsact which designates a device's egress
 classifier as global per netns. The packets that are not classified for
 a particular device will be classified using the global classifier.
 We have needed a global classifier for some time now for various
 purposes and setting the single bridge or loopback/vrf device as the
>
>
> Can you elaborate a bit more on the ... "we have needed a global
> classifier for some time now for various purposes".

Most of our ACLs are global or use a wildcard, e.g. iptables supports
global rules without a dev. We do end up having hundreds of netdevs.
Another use case for the future is use of tc for policy based routing
which requires global rules.



[PATCH net] netlink: access nlk groups safely in netlink bind and getname

2017-09-05 Thread Xin Long
There is currently no lock protecting access to nlk ngroups/groups in
netlink bind and getname. The access is safe against nlk groups being
set in netlink_release, but not against netlink_realloc_groups called
by netlink_setsockopt.

netlink_lock_table is needed in both netlink bind and getname when
accessing nlk groups.
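The locking rule the patch establishes can be modeled in miniature. The lock is a plain flag here and all names are illustrative stand-ins, not the netlink code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal model of the hazard the patch closes: one path reallocates
 * the groups array (netlink_realloc_groups) while another reads it
 * (bind/getname); both must hold the same table lock. */
static int table_locked;
static unsigned long *groups;
static unsigned int ngroups;

static void lock_table(void)   { assert(!table_locked); table_locked = 1; }
static void unlock_table(void) { assert(table_locked);  table_locked = 0; }

/* Writer: grows the array under the lock, as setsockopt does. */
void groups_realloc(unsigned int n)
{
    lock_table();
    groups = realloc(groups, n * sizeof(*groups));
    if (n > ngroups)
        memset(groups + ngroups, 0, (n - ngroups) * sizeof(*groups));
    ngroups = n;
    unlock_table();
}

/* Reader takes the same lock, mirroring the patched netlink_getname(). */
unsigned long groups_read_first(void)
{
    unsigned long g;

    lock_table();
    g = groups ? groups[0] : 0;
    unlock_table();
    return g;
}
```

Without the reader-side lock, `groups_read_first()` could dereference a pointer freed mid-realloc by the writer.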

Acked-by: Florian Westphal 
Signed-off-by: Xin Long 
---
 net/netlink/af_netlink.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94a61e6..3278077 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -955,7 +955,7 @@ static int netlink_bind(struct socket *sock, struct 
sockaddr *addr,
struct net *net = sock_net(sk);
struct netlink_sock *nlk = nlk_sk(sk);
struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
-   int err;
+   int err = 0;
long unsigned int groups = nladdr->nl_groups;
bool bound;
 
@@ -983,6 +983,7 @@ static int netlink_bind(struct socket *sock, struct 
sockaddr *addr,
return -EINVAL;
}
 
+   netlink_lock_table();
if (nlk->netlink_bind && groups) {
int group;
 
@@ -993,7 +994,7 @@ static int netlink_bind(struct socket *sock, struct 
sockaddr *addr,
if (!err)
continue;
netlink_undo_bind(group, groups, sk);
-   return err;
+   goto unlock;
}
}
 
@@ -1006,12 +1007,13 @@ static int netlink_bind(struct socket *sock, struct 
sockaddr *addr,
netlink_autobind(sock);
if (err) {
netlink_undo_bind(nlk->ngroups, groups, sk);
-   return err;
+   goto unlock;
}
}
 
if (!groups && (nlk->groups == NULL || !(u32)nlk->groups[0]))
-   return 0;
+   goto unlock;
+   netlink_unlock_table();
 
netlink_table_grab();
netlink_update_subscriptions(sk, nlk->subscriptions +
@@ -1022,6 +1024,10 @@ static int netlink_bind(struct socket *sock, struct 
sockaddr *addr,
netlink_table_ungrab();
 
return 0;
+
+unlock:
+   netlink_unlock_table();
+   return err;
 }
 
 static int netlink_connect(struct socket *sock, struct sockaddr *addr,
@@ -1079,7 +1085,9 @@ static int netlink_getname(struct socket *sock, struct 
sockaddr *addr,
nladdr->nl_groups = netlink_group_mask(nlk->dst_group);
} else {
nladdr->nl_pid = nlk->portid;
+   netlink_lock_table();
nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
+   netlink_unlock_table();
}
return 0;
 }
-- 
2.1.0



[PATCH net] netlink: fix an use-after-free issue for nlk groups

2017-09-05 Thread Xin Long
ChunYu found a netlink use-after-free issue by syzkaller:

[28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr 8807185e2378
[28448.969918] Call Trace:
[...]
[28449.117207]  __nla_put+0x37/0x40
[28449.132027]  nla_put+0xf5/0x130
[28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
[28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
[28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
[28449.226834]  netlink_dump+0x488/0xbb0
[28449.298014]  __netlink_dump_start+0x4e8/0x760
[28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
[28449.413414]  sock_diag_rcv_msg+0x207/0x390
[28449.432409]  netlink_rcv_skb+0x149/0x380
[28449.467647]  sock_diag_rcv+0x2d/0x40
[28449.484362]  netlink_unicast+0x562/0x7b0
[28449.564790]  netlink_sendmsg+0xaa8/0xe60
[28449.661510]  sock_sendmsg+0xcf/0x110
[28449.865631]  __sys_sendmsg+0xf3/0x240
[28450.000964]  SyS_sendmsg+0x32/0x50
[28450.016969]  do_syscall_64+0x25c/0x6c0
[28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25

It was caused by a lack of synchronization between freeing nlk groups
in netlink_release and accessing them in sk_diag_dump_groups. A
similar issue also exists in netlink_seq_show().

This patch defers freeing nlk groups to deferred_put_nlk_sk.
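The shape of the fix — moving the free into the deferred callback — can be sketched like this, with counters standing in for observable kernel state (illustrative names only, not the netlink code):

```c
#include <stdlib.h>

/* Toy model of the fix: freeing nlk->groups is moved out of
 * netlink_release() and into the deferred (RCU) callback that runs
 * only after all readers are done, so a concurrent diag dump can no
 * longer see freed memory. */
static int groups_freed;
static int sock_freed;

struct nlk_sketch {
    unsigned long *groups;
    int refcnt;
};

/* Stand-in for deferred_put_nlk_sk(): free groups first, then drop
 * the socket reference and free the socket only on the last put. */
void deferred_put(struct nlk_sketch *nlk)
{
    free(nlk->groups);      /* the free the patch relocates */
    nlk->groups = NULL;
    groups_freed++;

    if (--nlk->refcnt)
        return;             /* someone still holds the socket */
    free(nlk);
    sock_freed++;
}
```

The key property is ordering: by the time this callback runs, the RCU grace period guarantees no reader still holds a pointer into `groups`.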

Reported-by: ChunYu Wang 
Acked-by: Florian Westphal 
Signed-off-by: Xin Long 
---
 net/netlink/af_netlink.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 5acee49..94a61e6 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -691,6 +691,9 @@ static void deferred_put_nlk_sk(struct rcu_head *head)
struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
struct sock *sk = &nlk->sk;
 
+   kfree(nlk->groups);
+   nlk->groups = NULL;
+
if (!refcount_dec_and_test(&sk->sk_refcnt))
return;
 
@@ -769,9 +772,6 @@ static int netlink_release(struct socket *sock)
netlink_table_ungrab();
}
 
-   kfree(nlk->groups);
-   nlk->groups = NULL;
-
local_bh_disable();
sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
local_bh_enable();
-- 
2.1.0



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-09-05 Thread Michael S. Tsirkin
On Tue, Sep 05, 2017 at 04:09:19PM +0200, Willem de Bruijn wrote:
> On Mon, Sep 4, 2017 at 5:03 AM, Jason Wang  wrote:
> >
> >
> > On 2017-09-02 00:17, Willem de Bruijn wrote:
> >
> > This is not a 50/50 split, which implies that some packets from the
> > large
> > packet flow are still converted to copying. Without the change the rate
> > without queue was 80k zerocopy vs 80k copy, so this choice of
> > (vq->num >> 2) appears too conservative.
> >
> > However, testing with (vq->num >> 1) was not as effective at mitigating
> > stalls. I did not save that data, unfortunately. Can run more tests on
> > fine
> > tuning this variable, if the idea sounds good.
> 
> 
>  Looks like there are still two cases left:
> >>>
> >>> To be clear, this patch is not intended to fix all issues. It is a small
> >>> improvement to avoid HoL blocking due to queued zerocopy skbs.
> >
> >
> > Right, just want to see if there's anything left.
> >
> >>>
> >>> The trade-off is that reverting to copying in these cases increases
> >>> cycle cost. I think that that is a trade-off worth making compared to
> >>> the alternative drop in throughput. It probably would be good to be
> >>> able to measure this without kernel instrumentation: export
> >>> counters similar to net->tx_zcopy_err and net->tx_packets (though
> >>> without reset to zero, as in vhost_net_tx_packet).
> >
> >
> > I think it's acceptable if extra cycles were spent if we detect HOL anyhow.
> >
> >>>
>  1) sndbuf is not INT_MAX
> >>>
> >>> You mean the case where the device stalls, later zerocopy notifications
> >>> are queued, but these are never cleaned in free_old_xmit_skbs,
> >>> because it requires a start_xmit and by now the (only) socket is out of
> >>> descriptors?
> >>
> >> Typo, sorry. I meant out of sndbuf.
> >
> >
> > I mean e.g. for tun. If its sndbuf is smaller than e.g. (vq->num >> 1) *
> > $pkt_size and all packets were held by some modules, a limitation like
> > vq->num >> 1 won't work since we hit sndbuf before it.
> 
> Good point.

I agree, however anyone using SNDBUF < infinity already hits HOL blocking
in some scenarios.


> >>
> >>> A watchdog would help somewhat. With tx-napi, this case cannot occur,
> >>> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
> >>>
>  2) tx napi is used for virtio-net
> >>>
> >>> I am not aware of any issue specific to the use of tx-napi?
> >
> >
> > It might not be clear here; I mean e.g. virtio_net (tx-napi) in guest +
> > vhost_net (zerocopy) in host. In this case, even if we switch to datacopy when
> > the ubuf count exceeds vq->num >> 1, we still complete tx buffers in order, so a tx
> > interrupt could be delayed for an indefinite time.
> 
> Copied buffers are completed immediately in handle_tx.
> 
> Do you mean when a process sends fewer packets than vq->num >> 1,
> so that all are queued? Yes, then the latency is indeed that of the last
> element leaving the qdisc.
> 
> >>>
>  1) could be a corner case, and for 2) what you suggest here may not
>  solve
>  the issue since it still does in-order completion.
> >>>
> >>> Somewhat tangential, but it might also help to break the in-order
> >>> completion processing in vhost_zerocopy_signal_used. Complete
> >>> all descriptors between done_idx and upend_idx. done_idx should
> >>> then only be forward to the oldest still not-completed descriptor.
> >>>
> >>> In the test I ran, where the oldest descriptors are held in a queue and
> >>> all newer ones are tail-dropped,
> >
> >
> > Do you mean the descriptors were tail-dropped by vhost?
> 
> Tail-dropped by netem. The dropped items are completed out of
> order by vhost before the held items.
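The out-of-order completion idea raised in this exchange could look roughly like this. `release_window` and the ring layout are assumptions for illustration, not vhost code:

```c
/* Sketch of the suggestion: rather than stopping at the first
 * still-pending descriptor, release every completed slot in
 * [done_idx, upend_idx) and advance done_idx only up to the oldest
 * slot that has not yet completed. */
#define RING 8

/* completed[i] = 1 when the zerocopy callback fired for slot i;
 * released[i] records which buffers this pass handed back. */
unsigned release_window(const int *completed, int *released,
                        unsigned done_idx, unsigned upend_idx)
{
    unsigned i;
    unsigned new_done = done_idx;
    int still_advancing = 1;

    for (i = done_idx; i != upend_idx; i = (i + 1) % RING) {
        if (completed[i]) {
            released[i] = 1;          /* free this buffer out of order */
            if (still_advancing)
                new_done = (i + 1) % RING;
        } else {
            still_advancing = 0;      /* oldest pending pins done_idx */
        }
    }
    return new_done;
}
```

This matches the scenario described: tail-dropped (newer) descriptors complete first and are released immediately, while `done_idx` stays pinned at the oldest held descriptor.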


Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples

2017-09-05 Thread Joel Fernandes
Hi Alexei,

On Tue, Sep 5, 2017 at 8:24 PM, Alexei Starovoitov
 wrote:
> On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote:
>> These patches fix issues seen when cross-compiling eBPF samples on arm64.
>> Compared to [1], I dropped the controversial inline-asm patch pending further
>> discussion on the right way to do it. However these patches are still a step 
>> in
>> the right direction and I wanted them to get in before the more controversial
>> bit.
>
> All patches in this set look good to me.
> When you repost without changes after net-next reopens pls keep my:
> Acked-by: Alexei Starovoitov 

Thanks a lot! will do.

Regards,
Joel


Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples

2017-09-05 Thread Alexei Starovoitov
On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote:
> These patches fix issues seen when cross-compiling eBPF samples on arm64.
> Compared to [1], I dropped the controversial inline-asm patch pending further
> discussion on the right way to do it. However these patches are still a step 
> in
> the right direction and I wanted them to get in before the more controversial
> bit.

All patches in this set look good to me.
When you repost without changes after net-next reopens pls keep my:
Acked-by: Alexei Starovoitov 



Re: [PATCH v3 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode

2017-09-05 Thread Alexei Starovoitov

On 9/4/17 1:36 AM, Haishuang Yan wrote:

In collect_md mode, if the tun dev is down, it can still call
__ip6_tnl_rcv to receive packets, and the rx statistics increase
improperly.

Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Cc: Alexei Starovoitov 
Signed-off-by: Haishuang Yan 

---
Change since v3:
  * Increment rx_dropped if tunnel device is not up, suggested by
  Pravin B Shelar
  * Fix wrong recipient address
---
 net/ipv6/ip6_tunnel.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 10a693a..e91d3b6 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -171,8 +171,11 @@ static struct net_device_stats *ip6_get_stats(struct 
net_device *dev)
}

t = rcu_dereference(ip6n->collect_md_tun);
-   if (t)
-   return t;
+   if (t) {
+   if (t->dev->flags & IFF_UP)
+   return t;
+   t->dev->stats.rx_dropped++;
+   }


Why increment the stats only for this drop case?
There are plenty of other conditions where packets
will be dropped in the ip6 tunnel. I think it's important
to present consistent behavior to the users,
so I'd increment drop stats either for all drop cases
or for none. And today it's none.
The !IFF_UP case should probably return NULL too.
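The reviewer's suggested shape — treating a down device as "no tunnel" and leaving drop accounting to one consistent place — might look like this, with simplified stand-in types rather than the ip6_tunnel code:

```c
#include <stddef.h>

/* Simplified stand-ins for net_device / ip6_tnl. */
#define IFF_UP_SKETCH 0x1

struct dev_sketch { unsigned flags; };
struct tnl_sketch { struct dev_sketch dev; };

/* Sketch: a collect_md tunnel whose device is down is simply not a
 * match (return NULL); no per-site drop counter is bumped here. */
struct tnl_sketch *collect_md_lookup(struct tnl_sketch *t)
{
    if (t && (t->dev.flags & IFF_UP_SKETCH))
        return t;
    return NULL;
}
```

With this shape the caller handles the "no tunnel found" path exactly as it does for every other miss, which is the consistency the review asks for.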



Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-05 Thread Alexander Duyck
On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert  wrote:
>> The situation with encapsulation is even more complicated:
>>
>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>> constellation. If we do the fragmentation inside the vxlan tunnel and
>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>> we are fine and reordering on the receiver NIC won't happen in this
>> case. If the fragmentation happens on the outer UDP header, this will
>> result in reordering of the inner L2 flow. Unfortunately this depends on
>> how the vxlan tunnel was set up, how other devices do that and (I
>> believe so) on the kernel version.
>>
> This really isn't that complicated. The assumption that an IP network
> always delivers packets in order is simply wrong. The inventors of
> VXLAN must have known full well that when you use IP, packets can and
> eventually will be delivered out of order. This isn't just because of
> fragmentation, there are many other reasons that packets can be
> delivered OOO. This also must have been known when IP/GRE and any
> other protocol that carries L2 over IP was invented. If OOO is an
> issue for these protocols then they need to be fixed-- this is not a
> concern with IP protocol nor the stack.
>
> Tom

As a little background on the original patch, I believe the
issue that was fixed by the patch was a video streaming application
that was sending/receiving a mix of fragmented and non-fragmented
packets. Receiving them out of order due to the fragmentation was
causing issues with stutters in the video and so we ended up disabling
UDP by default in the NICs listed. We decided to go that way as UDP
RSS was viewed as a performance optimization, while the out-of-order
problems were viewed as a functionality issue.

The default for i40e is to have UDP RSS hashing enabled if I recall
correctly. Basically as we move away from enterprise to cloud I
suspect that is going to be the case more and more since all the UDP
tunnels require either port recognition or UDP RSS to be enabled. For
now we carry the ability to enable UDP RSS if desired in the legacy
drivers, and I believe we have some white papers somewhere that
suggest enabling it if you are planning to use UDP based tunnels.

- Alex


Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2017-09-05

2017-09-05 Thread David Miller
From: Jeff Kirsher 
Date: Tue,  5 Sep 2017 18:04:16 -0700

> This series contains fixes for i40e only.

Pulled, thanks Jeff.


Re: [PATCH net-next v6 3/3] openvswitch: enable NSH support

2017-09-05 Thread Yang, Yi
On Tue, Sep 05, 2017 at 09:12:09PM +0800, Hannes Frederic Sowa wrote:
> "Yang, Yi"  writes:
> 
> > We can change this later if we really find a better way to handle this
> > because it isn't defined in include/uapi/linux/openvswitch.h, so I still
> > have backdoor to do this if needed :-)
> 
> Sorry, I can't follow you.
> 
> It doesn't matter if something is defined in uapi headers, the
> observable behavior matters. If you allow users to configure flows with
> specific fields, it should not stop working at a future point in time.

Anyway, this can be changed if it is really needed. So far the current
way is the best one we can come up with; we would gladly use a better
proposal if you have one. We have explained repeatedly that context
headers must be matched and set; this is the bottom line.

> 
> > For our sfc use case in Opendaylight, we use context[0] for tunnel ID,
> > context[1] for destination IP for reverse RSP, they are used to match
> > and set in OpenFlow table, you can't limit users to use them in such
> > ways.
> 
> So in your specific case you expect the masks to be completely stable
> because you defined a protocol on top of NSH, understood. And that is
> stable accross all possible paths. Understood as well.
> 
> > If you check GENEVE implementation, tun_metadata* can be set or matched
> > as any other match field.
> 
> Yes, I wrote that in my previous mail. I wonder why NSH context metadata
> is not in tun_metadata as well?

tun_metadata is tunnel metadata; GENEVE needs a tunnel port, but NSH
does not, so NSH can't directly use tun_metadata. For MD type 2 we
would need to do a lot of rework on tun_metadata to make it shared
between GENEVE and NSH, and I don't think that can happen in the near
term. So tun_metadata isn't an option for this now.

> 
> > Actually the most important information in NSH are just these context
> > headers, you can't limit imagination of users by removing them from flow
> > keys.
> >
> > My point is to bring miniflow into kernel data path to fix your concern,
> > this will benefit your employer directly :-)
> 
> Okay, interesting. It will probably not help if you still have a hash of
> a packet inside the flow table and use that for load balancing.
> 
> [...]
> 
> BTW I don't want to stop this patch, I am merely interested in how the
> bigger picture will look like in the end.
> 
> Bye,
> Hannes


[net 2/2] i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq

2017-09-05 Thread Jeff Kirsher
From: Jacob Keller 

When introducing the functions to read the NVM through the AdminQ, we
did not correctly mark the wb_desc.

Fixes: 7073f46e443e ("i40e: Add AQ commands for NVM Update for X722", 2015-06-05)
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 154ba49c2c6f..26d7e9fe6220 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -230,6 +230,7 @@ static i40e_status i40e_read_nvm_aq(struct i40e_hw *hw, u8 
module_pointer,
struct i40e_asq_cmd_details cmd_details;
 
memset(&cmd_details, 0, sizeof(cmd_details));
+   cmd_details.wb_desc = &hw->nvm_wb_desc;
 
/* Here we are checking the SR limit only for the flat memory model.
 * We cannot do it for the module-based model, as we did not acquire
-- 
2.14.1



[net 0/2][pull request] Intel Wired LAN Driver Updates 2017-09-05

2017-09-05 Thread Jeff Kirsher
This series contains fixes for i40e only.

These two patches fix an issue where our nvmupdate tool does not work on
RHEL 7.4 and newer kernels; in fact, the use of the nvmupdate tool on newer kernels can
cause the cards to be non-functional unless these patches are applied.

Anjali reworks the locking around accessing the NVM so that NVM acquire timeouts
do not occur; these timeouts were causing the failed firmware updates.

Jake correctly updates the wb_desc when reading the NVM through the AdminQ.

The following are changes since commit 6d9c153a0b84392406bc77600aa7d3ea365de041:
  net: dsa: loop: Do not unregister invalid fixed PHY
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 40GbE

Anjali Singhai Jain (1):
  i40e: avoid NVM acquire deadlock during NVM update

Jacob Keller (1):
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq

 drivers/net/ethernet/intel/i40e/i40e_nvm.c   | 99 +++-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h |  2 -
 2 files changed, 61 insertions(+), 40 deletions(-)

-- 
2.14.1



[net 1/2] i40e: avoid NVM acquire deadlock during NVM update

2017-09-05 Thread Jeff Kirsher
From: Anjali Singhai Jain 

X722 devices use the AdminQ to access the NVM, and this requires taking
the AdminQ lock. Because of this, we lock the AdminQ during
i40e_read_nvm(), which is also called in places where the lock is
already held, such as the firmware update path which wants to lock once
and then unlock when finished after performing several tasks.

Although this should have only affected X722 devices, commit
96a39aed25e6 ("i40e: Acquire NVM lock before reads on all devices",
2016-12-02) added locking for all NVM reads, regardless of device
family.

This resulted in us accidentally causing NVM acquire timeouts on all
devices, causing failed firmware updates which left the eeprom in
a corrupt state.

Create unsafe non-locked variants of i40e_read_nvm_word and
i40e_read_nvm_buffer, __i40e_read_nvm_word and __i40e_read_nvm_buffer
respectively. These variants will not take the NVM lock and are expected
to only be called in places where the NVM lock is already held if
needed.

Since the only caller of i40e_read_nvm_buffer() was in such a path,
remove it entirely in favor of the unsafe version. If necessary we can
always add it back in the future.

Additionally, we now need to hold the NVM lock in i40e_validate_checksum
because the call to i40e_calc_nvm_checksum now assumes that the NVM lock
is held. We can further move the call to read I40E_SR_SW_CHECKSUM_WORD
up a bit so that we do not need to acquire the NVM lock twice.

This should resolve the firmware update failures and also fix a
potential race that could have caused the driver to report an invalid
NVM checksum upon driver load.
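The locked-wrapper/unlocked-variant pattern this commit introduces can be modeled in a few lines. All names are illustrative stand-ins, not the i40e API:

```c
#include <assert.h>

/* Model of the pattern: __nvm_read_word() assumes the caller already
 * holds the NVM lock; nvm_read_word() is the locking wrapper for
 * one-off reads. Batch paths (like firmware update) acquire once and
 * call the __-variant repeatedly, avoiding the acquire deadlock. */
static int nvm_locked;                       /* acquire/release state */
static unsigned short shadow_ram[8] = { 0xBEEF, 0xCAFE };

void acquire_nvm(void) { assert(!nvm_locked); nvm_locked = 1; }
void release_nvm(void) { assert(nvm_locked);  nvm_locked = 0; }

/* Unsafe variant: caller must already hold the lock. */
int __nvm_read_word(unsigned offset, unsigned short *data)
{
    assert(nvm_locked);     /* the contract the patch documents */
    if (offset >= 8)
        return -1;
    *data = shadow_ram[offset];
    return 0;
}

/* Safe variant: acquires and releases the lock around a single read,
 * so it must never be called with the lock already held. */
int nvm_read_word(unsigned offset, unsigned short *data)
{
    int ret;

    acquire_nvm();
    ret = __nvm_read_word(offset, data);
    release_nvm();
    return ret;
}
```

The deadlock in the original code corresponds to calling the safe variant while already holding the lock; splitting the API makes the caller's lock state explicit.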

Reported-by: Stefan Assmann 
Fixes: 96a39aed25e6 ("i40e: Acquire NVM lock before reads on all devices", 
2016-12-02)
Signed-off-by: Anjali Singhai Jain 
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c   | 98 +++-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h |  2 -
 2 files changed, 60 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 800bd55d0159..154ba49c2c6f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -266,7 +266,7 @@ static i40e_status i40e_read_nvm_aq(struct i40e_hw *hw, u8 
module_pointer,
  * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
  * @data: word read from the Shadow RAM
  *
- * Reads one 16 bit word from the Shadow RAM using the GLNVM_SRCTL register.
+ * Reads one 16 bit word from the Shadow RAM using the AdminQ
  **/
 static i40e_status i40e_read_nvm_word_aq(struct i40e_hw *hw, u16 offset,
 u16 *data)
@@ -280,27 +280,49 @@ static i40e_status i40e_read_nvm_word_aq(struct i40e_hw 
*hw, u16 offset,
 }
 
 /**
- * i40e_read_nvm_word - Reads Shadow RAM
+ * __i40e_read_nvm_word - Reads nvm word, assumes caller does the locking
  * @hw: pointer to the HW structure
  * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
  * @data: word read from the Shadow RAM
  *
- * Reads one 16 bit word from the Shadow RAM using the GLNVM_SRCTL register.
+ * Reads one 16 bit word from the Shadow RAM.
+ *
+ * Do not use this function except in cases where the nvm lock is already
+ * taken via i40e_acquire_nvm().
+ **/
+static i40e_status __i40e_read_nvm_word(struct i40e_hw *hw,
+   u16 offset, u16 *data)
+{
+   i40e_status ret_code = 0;
+
+   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE)
+   ret_code = i40e_read_nvm_word_aq(hw, offset, data);
+   else
+   ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
+   return ret_code;
+}
+
+/**
+ * i40e_read_nvm_word - Reads nvm word and acquire lock if necessary
+ * @hw: pointer to the HW structure
+ * @offset: offset of the Shadow RAM word to read (0x00 - 0x001FFF)
+ * @data: word read from the Shadow RAM
+ *
+ * Reads one 16 bit word from the Shadow RAM.
  **/
 i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset,
   u16 *data)
 {
-   enum i40e_status_code ret_code = 0;
+   i40e_status ret_code = 0;
 
ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
-   if (!ret_code) {
-   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
-   ret_code = i40e_read_nvm_word_aq(hw, offset, data);
-   } else {
-   ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
-   }
-   i40e_release_nvm(hw);
-   }
+   if (ret_code)
+   return ret_code;
+
+   ret_code = __i40e_read_nvm_word(hw, offset, data);
+
+   i40e_release_nvm(hw);
+
return ret_code;
 }
 

Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-05 Thread Andrew Lunn
> The third and last issue will be explained in a followup email.

Hi DSA hackers

So there is the third issue. It affects just DSA, but it possible
affects all DSA drivers.

This patchset broke broadcast with the Marvell drivers. It could
break broadcast on other drivers as well.

What i found is that the Marvell chips don't flood broadcast frames
between bridged ports. What appears to happen is there is a fdb miss,
so it gets forwarded to the CPU port for the host to deal with. The
software bridge then floods it out all ports of the bridge.

But the set offload_fwd_mark patch changes this. The software bridge
now assumes the hardware has already flooded broadcast out all ports
of the switch as needed. So it does not do any flooding itself. As a
result, on Marvell devices, broadcast packets don't get flooded at
all.

The issue can be fixed. I just need to add an mdb entry for the
broadcast address to each port of the bridge in the switch, and the
CPU port.  But i don't know at what level to do this.

Should this be done at the DSA level, or at the driver level?  Do any
chips do broadcast flooding in hardware already? Hence they currently
see broadcast duplication? If i add a broadcast mdb at the DSA level,
and the chip is already hard wired to flood broadcast, is it going
to double flood?

Andrew



Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map

2017-09-05 Thread Alexei Starovoitov
On Tue, Sep 05, 2017 at 02:59:38PM -0700, Chenbo Feng wrote:
> On Thu, Aug 31, 2017 at 7:05 PM, Alexei Starovoitov
>  wrote:
> > On Thu, Aug 31, 2017 at 01:56:34PM -0700, Chenbo Feng wrote:
> >> From: Chenbo Feng 
> >>
> >> Introduce a pointer into struct bpf_map to hold the security information
> >> about the map. The actual security struct varies based on the security
> >> models implemented. Place the LSM hooks before each of the unrestricted
> >> eBPF operations, the map_update_elem and map_delete_elem operations are
> >> checked by security_map_modify. The map_lookup_elem and map_get_next_key
> >> operations are checked by security_map_read.
> >>
> >> Signed-off-by: Chenbo Feng 
> >
> > ...
> >
> >> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
> >>   if (IS_ERR(map))
> >>   return PTR_ERR(map);
> >>
> >> + err = security_map_read(map);
> >> + if (err)
> >> + return -EACCES;
> >> +
> >>   key = memdup_user(ukey, map->key_size);
> >>   if (IS_ERR(key)) {
> >>   err = PTR_ERR(key);
> >> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
> >>   if (IS_ERR(map))
> >>   return PTR_ERR(map);
> >>
> >> + err = security_map_modify(map);
> >
> > I don't feel these extra hooks are really thought through.
> > With such hook you'll disallow map_update for given map. That's it.
> > The key/values etc won't be used in such security decision.
> > In such case you don't need such hooks in update/lookup at all.
> > Only in map_creation and object_get calls where FD can be received.
> > In other words I suggest to follow standard unix practices:
> > Do permissions checks in open() and allow read/write() if FD is valid.
> > Same here. Do permission checks in prog_load/map_create/obj_pin/get
> > and that will be enough to jail bpf subsystem.
> > bpf cmds that need to be fast (like lookup and update) should not
> > have security hooks.
> >
> Thanks for pointing this out. I agree we should only do checks on
> creating and passing
> the object from one process to another. And if we can still
> distinguish the read/write operation
> when obtaining the file fd, that would be great. But that may require
> us to add a new mode
> field into bpf_map struct and change the attribute struct when doing
> the bpf syscall. How do you
> think about this approach? Or we can do simple checks for
> bpf_obj_create and bpf_obj_use when
> creating the object and passing the object respectively but this
> solution cannot distinguish map modify and
> read.

iirc the idea to have read only maps was brought up in the past
(unfortunately no one took a stab at implementing it), but imo
that's better than relying on lsm to provide such restriction
and more secure, since even if you disable map_update via lsm,
the user can craft a program just to update the map from it.
For bpffs we already test for inode_permission(inode, MAY_WRITE);
during BPF_OBJ_GET command and we can extend this facility further.
Also we can allow dropping 'write' permissions from the map
(internally implemented via flag in struct bpf_map), so
update/delete operations won't be allowed.
This flag will be checked by syscall during map_update/delete
and by the verifier if such read-only map is used by the program
being loaded, so map_update/helpers won't be allowed in
such program.
Would also be good to support read-only programs as well
(progs that are read-only from kernel state point of view)
They won't be able to modify packets, maps, etc.



Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-05 Thread Stephen Hemminger
On Wed,  6 Sep 2017 01:35:02 +0200
Andrew Lunn  wrote:

> After the very useful feedback from Nikolay, i threw away what i had,
> and started again. To recap:
> 
> The linux bridge supports IGMP snooping. It will listen to IGMP
> reports on bridge ports and keep track of which groups have been
> joined on an interface. It will then forward multicast based on this
> group membership.
> 
> When the bridge adds or removes groups from an interface, it uses
> switchdev to request the hardware add an mdb to a port, so the
> hardware can perform the selective forwarding between ports.
> 
> What is not covered by the current bridge code, is IGMP joins/leaves
> from the host on the brX interface. These are not reported via
> switchdev so that hardware knows the local host is interested in the
> multicast frames.
> 
> Luckily, the bridge does track joins/leaves on the brX interface. The
> code is obfuscated, which is why i missed it with my first attempt.
> So the first patch tries to remove this obfuscation. Currently,
> there are no notifications sent when the bridge interface joins a
> group. The second patch adds them. bridge monitor then shows
> joins/leaves in the same way as for other ports of the bridge.
> 
> Then starts the work passing down to the hardware that the host has
> joined/left a group. The existing switchdev mdb object cannot be used,
> since the semantics are different. The existing
> SWITCHDEV_OBJ_ID_PORT_MDB is used to indicate a specific multicast
> group should be forwarded out that port of the switch. However here we
> require the exact opposite. We want multicast frames for the group
> received on the port to be forwarded to the host. Hence add a new
> object SWITCHDEV_OBJ_ID_HOST_MDB, a multicast database entry to
> forward to the host. This new object is then propagated through the
> DSA layers. No DSA driver changes should be needed, this should just
> work...
> 
> Getting the frames to the bridge as requested turned up an issue or
> three. The offload_fwd_mark is not being set by DSA, so the bridge
> floods the received frames back to the switch ports, resulting in
> duplication since the hardware has already flooded the packet. Fixing
> that turned up an issue with the meaning of
> SWITCHDEV_ATTR_ID_PORT_PARENT_ID in DSA. A DSA fabric of three
> switches needs to look to the software bridge as a single
> switch. Otherwise the offload_fwd_mark does not work, and we get
> duplication on the non-ingress switch. But each switch returned a
> different value. And they were not unique.
> 
> The third and last issue will be explained in a followup email.
> 
> Open questions:
> 
> Is sending notifications going to break userspace?
> Is this new switchdev object O.K. for the few non-DSA switches that exist?
> Is the SWITCHDEV_ATTR_ID_PORT_PARENT_ID change acceptable?
> 
>Andrew
> 
> Andrew Lunn (8):
>   net: bridge: Rename mglist to host_joined
>   net: bridge: Send notification when host join/leaves a group
>   net: bridge: Add/del switchdev object on host join/leave
>   net: dsa: slave: Handle switchdev host mdb add/del
>   net: dsa: switch: handle host mdb add/remove
>   net: dsa: switch: Don't add CPU port to an mdb by default
>   net: dsa: set offload_fwd_mark on received packets
>   net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID
> 
>  include/net/switchdev.h   |  1 +
>  net/bridge/br_input.c |  2 +-
>  net/bridge/br_mdb.c   | 50 +---
>  net/bridge/br_multicast.c | 18 +++-
>  net/bridge/br_private.h   |  2 +-
>  net/dsa/dsa.c |  1 +
>  net/dsa/dsa_priv.h|  7 +
>  net/dsa/port.c| 26 +
>  net/dsa/slave.c   | 16 ---
>  net/dsa/switch.c  | 72 
> +++
>  net/switchdev/switchdev.c |  2 ++
>  11 files changed, 168 insertions(+), 29 deletions(-)
> 

This looks much cleaner. I don't have DSA hardware or infrastructure to look 
deeper.


[PATCH v2 rfc 1/8] net: bridge: Rename mglist to host_joined

2017-09-05 Thread Andrew Lunn
The boolean mglist indicates the host has joined a particular
multicast group on the bridge interface. It is badly named, obscuring
what it means. Rename it.

Signed-off-by: Andrew Lunn 
---
 net/bridge/br_input.c |  2 +-
 net/bridge/br_mdb.c   |  2 +-
 net/bridge/br_multicast.c | 14 +++---
 net/bridge/br_private.h   |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637f58c1226..0dd55a183391 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -179,7 +179,7 @@ int br_handle_frame_finish(struct net *net, struct sock 
*sk, struct sk_buff *skb
mdst = br_mdb_get(br, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(br, eth_hdr(skb))) {
-   if ((mdst && mdst->mglist) ||
+   if ((mdst && mdst->host_joined) ||
br_multicast_is_router(br)) {
local_rcv = true;
br->dev->stats.multicast++;
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index ca01def49af0..71885e251988 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -654,7 +654,7 @@ static int __br_mdb_del(struct net_bridge *br, struct 
br_mdb_entry *entry)
call_rcu_bh(&p->rcu, br_multicast_free_pg);
err = 0;
 
-   if (!mp->ports && !mp->mglist &&
+   if (!mp->ports && !mp->host_joined &&
netif_running(br->dev))
mod_timer(&mp->timer, jiffies);
break;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 8dc5c8d69bcd..c6b2b8d419e7 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -249,7 +249,7 @@ static void br_multicast_group_expired(unsigned long data)
if (!netif_running(br->dev) || timer_pending(&mp->timer))
goto out;
 
-   mp->mglist = false;
+   mp->host_joined = false;
 
if (mp->ports)
goto out;
@@ -292,7 +292,7 @@ static void br_multicast_del_pg(struct net_bridge *br,
  p->flags);
call_rcu_bh(&p->rcu, br_multicast_free_pg);
 
-   if (!mp->ports && !mp->mglist &&
+   if (!mp->ports && !mp->host_joined &&
netif_running(br->dev))
mod_timer(&mp->timer, jiffies);
 
@@ -775,7 +775,7 @@ static int br_multicast_add_group(struct net_bridge *br,
goto err;
 
if (!port) {
-   mp->mglist = true;
+   mp->host_joined = true;
mod_timer(&mp->timer, now + br->multicast_membership_interval);
goto out;
}
@@ -1451,7 +1451,7 @@ static int br_ip4_multicast_query(struct net_bridge *br,
 
max_delay *= br->multicast_last_member_count;
 
-   if (mp->mglist &&
+   if (mp->host_joined &&
    (timer_pending(&mp->timer) ?
 time_after(mp->timer.expires, now + max_delay) :
 try_to_del_timer_sync(&mp->timer) >= 0))
@@ -1535,7 +1535,7 @@ static int br_ip6_multicast_query(struct net_bridge *br,
goto out;
 
max_delay *= br->multicast_last_member_count;
-   if (mp->mglist &&
+   if (mp->host_joined &&
    (timer_pending(&mp->timer) ?
 time_after(mp->timer.expires, now + max_delay) :
 try_to_del_timer_sync(&mp->timer) >= 0))
@@ -1596,7 +1596,7 @@ br_multicast_leave_group(struct net_bridge *br,
br_mdb_notify(br->dev, port, group, RTM_DELMDB,
  p->flags);
 
-   if (!mp->ports && !mp->mglist &&
+   if (!mp->ports && !mp->host_joined &&
netif_running(br->dev))
mod_timer(&mp->timer, jiffies);
}
@@ -1636,7 +1636,7 @@ br_multicast_leave_group(struct net_bridge *br,
 br->multicast_last_member_interval;
 
if (!port) {
-   if (mp->mglist &&
+   if (mp->host_joined &&
    (timer_pending(&mp->timer) ?
 time_after(mp->timer.expires, time) :
 try_to_del_timer_sync(&mp->timer) >= 0)) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fd9ee73e0a6d..e79e0611908d 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -202,7 +202,7 @@ struct net_bridge_mdb_entry
struct rcu_head rcu;
struct timer_list   timer;
struct br_ipaddr;
-   boolmglist;
+   boolhost_joined;
 };
 
 struct net_bridge_mdb_htable
-- 
2.14.1



[PATCH v2 rfc 3/8] net: bridge: Add/del switchdev object on host join/leave

2017-09-05 Thread Andrew Lunn
When the host joins or leaves a multicast group, use switchdev to add
an object to the hardware to forward traffic for the group to the
host.

Signed-off-by: Andrew Lunn 
---
 include/net/switchdev.h   |  1 +
 net/bridge/br_mdb.c   | 39 +++
 net/switchdev/switchdev.c |  2 ++
 3 files changed, 42 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d767b7991887..b298a6b4c11d 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -75,6 +75,7 @@ enum switchdev_obj_id {
SWITCHDEV_OBJ_ID_UNDEFINED,
SWITCHDEV_OBJ_ID_PORT_VLAN,
SWITCHDEV_OBJ_ID_PORT_MDB,
+   SWITCHDEV_OBJ_ID_HOST_MDB,
 };
 
 struct switchdev_obj {
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 80fa91ccc50c..a9c03f3a3b88 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -291,6 +291,42 @@ static void br_mdb_complete(struct net_device *dev, int 
err, void *priv)
kfree(priv);
 }
 
+static void br_mdb_switchdev_host_port(struct net_device *dev,
+  struct net_device *lower_dev,
+  struct br_mdb_entry *entry, int type)
+{
+   struct switchdev_obj_port_mdb mdb = {
+   .obj = {
+   .id = SWITCHDEV_OBJ_ID_HOST_MDB,
+   .flags = SWITCHDEV_F_DEFER,
+   },
+   .vid = entry->vid,
+   };
+
+   if (entry->addr.proto == htons(ETH_P_IP))
+   ip_eth_mc_map(entry->addr.u.ip4, mdb.addr);
+#if IS_ENABLED(CONFIG_IPV6)
+   else
+   ipv6_eth_mc_map(&entry->addr.u.ip6, mdb.addr);
+#endif
+
+   mdb.obj.orig_dev = dev;
+   if (type == RTM_NEWMDB)
+   switchdev_port_obj_add(lower_dev, &mdb);
+   if (type == RTM_DELMDB)
+   switchdev_port_obj_del(lower_dev, &mdb);
+}
+
+static void br_mdb_switchdev_host(struct net_device *dev,
+ struct br_mdb_entry *entry, int type)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   netdev_for_each_lower_dev(dev, lower_dev, iter)
+   br_mdb_switchdev_host_port(dev, lower_dev, entry, type);
+}
+
 static void __br_mdb_notify(struct net_device *dev, struct net_bridge_port *p,
struct br_mdb_entry *entry, int type)
 {
@@ -330,6 +366,9 @@ static void __br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *p,
switchdev_port_obj_del(port_dev, &mdb);
}
 
+   if (!p)
+   br_mdb_switchdev_host(dev, entry, type);
+
skb = nlmsg_new(rtnl_mdb_nlmsg_size(), GFP_ATOMIC);
if (!skb)
goto errout;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 0531b41d1f2d..74b9d916a58b 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -345,6 +345,8 @@ static size_t switchdev_obj_size(const struct switchdev_obj 
*obj)
return sizeof(struct switchdev_obj_port_vlan);
case SWITCHDEV_OBJ_ID_PORT_MDB:
return sizeof(struct switchdev_obj_port_mdb);
+   case SWITCHDEV_OBJ_ID_HOST_MDB:
+   return sizeof(struct switchdev_obj_port_mdb);
default:
BUG();
}
-- 
2.14.1



[PATCH v2 rfc 6/8] net: dsa: switch: Don't add CPU port to an mdb by default

2017-09-05 Thread Andrew Lunn
Now that the host indicates when a multicast group should be forwarded
from the switch to the host, don't do it by default.

Signed-off-by: Andrew Lunn 
---
 net/dsa/switch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 326680039c52..147295d8de94 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -145,7 +145,7 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
if (ds->index == info->sw_index)
set_bit(info->port, group);
for (port = 0; port < ds->num_ports; port++)
-   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+   if (dsa_is_dsa_port(ds, port))
set_bit(port, group);
 
return dsa_switch_mdb_add_bitmap(ds, info, mdb, group);
-- 
2.14.1



[PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-05 Thread Andrew Lunn
After the very useful feedback from Nikolay, i threw away what i had,
and started again. To recap:

The linux bridge supports IGMP snooping. It will listen to IGMP
reports on bridge ports and keep track of which groups have been
joined on an interface. It will then forward multicast based on this
group membership.

When the bridge adds or removes groups from an interface, it uses
switchdev to request the hardware add an mdb to a port, so the
hardware can perform the selective forwarding between ports.

What is not covered by the current bridge code, is IGMP joins/leaves
from the host on the brX interface. These are not reported via
switchdev so that hardware knows the local host is interested in the
multicast frames.

Luckily, the bridge does track joins/leaves on the brX interface. The
code is obfuscated, which is why i missed it with my first attempt.
So the first patch tries to remove this obfuscation. Currently,
there are no notifications sent when the bridge interface joins a
group. The second patch adds them. bridge monitor then shows
joins/leaves in the same way as for other ports of the bridge.

Then starts the work passing down to the hardware that the host has
joined/left a group. The existing switchdev mdb object cannot be used,
since the semantics are different. The existing
SWITCHDEV_OBJ_ID_PORT_MDB is used to indicate a specific multicast
group should be forwarded out that port of the switch. However here we
require the exact opposite. We want multicast frames for the group
received on the port to be forwarded to the host. Hence add a new
object SWITCHDEV_OBJ_ID_HOST_MDB, a multicast database entry to
forward to the host. This new object is then propagated through the
DSA layers. No DSA driver changes should be needed, this should just
work...

Getting the frames to the bridge as requested turned up an issue or
three. The offload_fwd_mark is not being set by DSA, so the bridge
floods the received frames back to the switch ports, resulting in
duplication since the hardware has already flooded the packet. Fixing
that turned up an issue with the meaning of
SWITCHDEV_ATTR_ID_PORT_PARENT_ID in DSA. A DSA fabric of three
switches needs to look to the software bridge as a single
switch. Otherwise the offload_fwd_mark does not work, and we get
duplication on the non-ingress switch. But each switch returned a
different value. And they were not unique.

The third and last issue will be explained in a followup email.

Open questions:

Is sending notifications going to break userspace?
Is this new switchdev object O.K. for the few non-DSA switches that exist?
Is the SWITCHDEV_ATTR_ID_PORT_PARENT_ID change acceptable?

   Andrew

Andrew Lunn (8):
  net: bridge: Rename mglist to host_joined
  net: bridge: Send notification when host join/leaves a group
  net: bridge: Add/del switchdev object on host join/leave
  net: dsa: slave: Handle switchdev host mdb add/del
  net: dsa: switch: handle host mdb add/remove
  net: dsa: switch: Don't add CPU port to an mdb by default
  net: dsa: set offload_fwd_mark on received packets
  net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

 include/net/switchdev.h   |  1 +
 net/bridge/br_input.c |  2 +-
 net/bridge/br_mdb.c   | 50 +---
 net/bridge/br_multicast.c | 18 +++-
 net/bridge/br_private.h   |  2 +-
 net/dsa/dsa.c |  1 +
 net/dsa/dsa_priv.h|  7 +
 net/dsa/port.c| 26 +
 net/dsa/slave.c   | 16 ---
 net/dsa/switch.c  | 72 +++
 net/switchdev/switchdev.c |  2 ++
 11 files changed, 168 insertions(+), 29 deletions(-)

-- 
2.14.1



[PATCH v2 rfc 4/8] net: dsa: slave: Handle switchdev host mdb add/del

2017-09-05 Thread Andrew Lunn
Add code to handle switchdev host mdb add/del. As with normal mdb
add/del, send a notification to the switch layer.

Signed-off-by: Andrew Lunn 
---
 net/dsa/dsa_priv.h |  7 +++
 net/dsa/port.c | 26 ++
 net/dsa/slave.c|  6 ++
 3 files changed, 39 insertions(+)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 9c3eeb72462d..0ffe49f78d14 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -24,6 +24,8 @@ enum {
DSA_NOTIFIER_FDB_DEL,
DSA_NOTIFIER_MDB_ADD,
DSA_NOTIFIER_MDB_DEL,
+   DSA_NOTIFIER_HOST_MDB_ADD,
+   DSA_NOTIFIER_HOST_MDB_DEL,
DSA_NOTIFIER_VLAN_ADD,
DSA_NOTIFIER_VLAN_DEL,
 };
@@ -131,6 +133,11 @@ int dsa_port_mdb_add(struct dsa_port *dp,
 struct switchdev_trans *trans);
 int dsa_port_mdb_del(struct dsa_port *dp,
 const struct switchdev_obj_port_mdb *mdb);
+int dsa_host_mdb_add(struct dsa_port *dp,
+const struct switchdev_obj_port_mdb *mdb,
+struct switchdev_trans *trans);
+int dsa_host_mdb_del(struct dsa_port *dp,
+const struct switchdev_obj_port_mdb *mdb);
 int dsa_port_vlan_add(struct dsa_port *dp,
  const struct switchdev_obj_port_vlan *vlan,
  struct switchdev_trans *trans);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 659676ba3f8b..5b18b9fe2219 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -199,6 +199,32 @@ int dsa_port_mdb_del(struct dsa_port *dp,
return dsa_port_notify(dp, DSA_NOTIFIER_MDB_DEL, &info);
 }
 
+int dsa_host_mdb_add(struct dsa_port *dp,
+const struct switchdev_obj_port_mdb *mdb,
+struct switchdev_trans *trans)
+{
+   struct dsa_notifier_mdb_info info = {
+   .sw_index = dp->ds->index,
+   .port = dp->index,
+   .trans = trans,
+   .mdb = mdb,
+   };
+
+   return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_ADD, &info);
+}
+
+int dsa_host_mdb_del(struct dsa_port *dp,
+const struct switchdev_obj_port_mdb *mdb)
+{
+   struct dsa_notifier_mdb_info info = {
+   .sw_index = dp->ds->index,
+   .port = dp->index,
+   .mdb = mdb,
+   };
+
+   return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_DEL, &info);
+}
+
 int dsa_port_vlan_add(struct dsa_port *dp,
  const struct switchdev_obj_port_vlan *vlan,
  struct switchdev_trans *trans)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 78e78a6e6833..2e07be149415 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -330,6 +330,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
case SWITCHDEV_OBJ_ID_PORT_MDB:
err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
break;
+   case SWITCHDEV_OBJ_ID_HOST_MDB:
+   err = dsa_host_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
+   break;
case SWITCHDEV_OBJ_ID_PORT_VLAN:
err = dsa_port_vlan_add(dp, SWITCHDEV_OBJ_PORT_VLAN(obj),
trans);
@@ -353,6 +356,9 @@ static int dsa_slave_port_obj_del(struct net_device *dev,
case SWITCHDEV_OBJ_ID_PORT_MDB:
err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
break;
+   case SWITCHDEV_OBJ_ID_HOST_MDB:
+   err = dsa_host_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
+   break;
case SWITCHDEV_OBJ_ID_PORT_VLAN:
err = dsa_port_vlan_del(dp, SWITCHDEV_OBJ_PORT_VLAN(obj));
break;
-- 
2.14.1



[PATCH v2 rfc 8/8] net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

2017-09-05 Thread Andrew Lunn
SWITCHDEV_ATTR_ID_PORT_PARENT_ID is used by the software bridge when
determining which ports to flood a packet out. If the packet
originated from a switch, it assumes the switch has already flooded
the packet out the switch's ports, so the bridge should not flood the
packet itself out switch ports. Ports on the same switch are expected
to return the same parent ID when SWITCHDEV_ATTR_ID_PORT_PARENT_ID is
called.

DSA gets this wrong with clusters of switches. As far as the software
bridge is concerned, the cluster is all one switch. A packet from any
switch in the cluster can be assumed to have been flooded as needed out
all ports of the cluster, not just the switch it originated
from. Hence all ports of a cluster should return the same parent. The
old implementation did not, each switch in the cluster had its own ID.

Also wrong was that the ID was not unique if multiple DSA instances
are in operation.

Use the MAC address of the master interface as the parent ID. This is
the same for all switches in a cluster, and should be unique if there
are multiple clusters.

Signed-off-by: Andrew Lunn 
---
 net/dsa/slave.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2e07be149415..d2744b0dad6e 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -374,13 +374,15 @@ static int dsa_slave_port_attr_get(struct net_device *dev,
   struct switchdev_attr *attr)
 {
struct dsa_slave_priv *p = netdev_priv(dev);
-   struct dsa_switch *ds = p->dp->ds;
 
switch (attr->id) {
-   case SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
-   attr->u.ppid.id_len = sizeof(ds->index);
-   memcpy(&attr->u.ppid.id, &ds->index, attr->u.ppid.id_len);
+   case SWITCHDEV_ATTR_ID_PORT_PARENT_ID: {
+   struct net_device *master = dsa_master_netdev(p);
+
+   attr->u.ppid.id_len = ETH_ALEN;
+   ether_addr_copy(attr->u.ppid.id, master->dev_addr);
break;
+   }
case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT:
attr->u.brport_flags_support = 0;
break;
-- 
2.14.1



[PATCH v2 rfc 5/8] net: dsa: switch: handle host mdb add/remove

2017-09-05 Thread Andrew Lunn
When receiving notifications of the host mdb add/remove, add/remove an
mdb to the CPU port, so that traffic flows from a port to the CPU port
and hence to the host.

Signed-off-by: Andrew Lunn 
---
 net/dsa/switch.c | 72 ++--
 1 file changed, 60 insertions(+), 12 deletions(-)

diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index e6c06aa349a6..326680039c52 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -108,22 +108,14 @@ static int dsa_switch_fdb_del(struct dsa_switch *ds,
 info->vid);
 }
 
-static int dsa_switch_mdb_add(struct dsa_switch *ds,
- struct dsa_notifier_mdb_info *info)
+static int dsa_switch_mdb_add_bitmap(struct dsa_switch *ds,
+struct dsa_notifier_mdb_info *info,
+const struct switchdev_obj_port_mdb *mdb,
+unsigned long *group)
 {
-   const struct switchdev_obj_port_mdb *mdb = info->mdb;
struct switchdev_trans *trans = info->trans;
-   DECLARE_BITMAP(group, ds->num_ports);
int port, err;
 
-   /* Build a mask of Multicast group members */
-   bitmap_zero(group, ds->num_ports);
-   if (ds->index == info->sw_index)
-   set_bit(info->port, group);
-   for (port = 0; port < ds->num_ports; port++)
-   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
-   set_bit(port, group);
-
if (switchdev_trans_ph_prepare(trans)) {
if (!ds->ops->port_mdb_prepare || !ds->ops->port_mdb_add)
return -EOPNOTSUPP;
@@ -141,6 +133,40 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
return 0;
 }
 
+static int dsa_switch_mdb_add(struct dsa_switch *ds,
+ struct dsa_notifier_mdb_info *info)
+{
+   const struct switchdev_obj_port_mdb *mdb = info->mdb;
+   DECLARE_BITMAP(group, ds->num_ports);
+   int port;
+
+   /* Build a mask of Multicast group members */
+   bitmap_zero(group, ds->num_ports);
+   if (ds->index == info->sw_index)
+   set_bit(info->port, group);
+   for (port = 0; port < ds->num_ports; port++)
+   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+   set_bit(port, group);
+
+   return dsa_switch_mdb_add_bitmap(ds, info, mdb, group);
+}
+
+static int dsa_switch_host_mdb_add(struct dsa_switch *ds,
+  struct dsa_notifier_mdb_info *info)
+{
+   const struct switchdev_obj_port_mdb *mdb = info->mdb;
+   DECLARE_BITMAP(group, ds->num_ports);
+   int port;
+
+   /* Build a mask of Multicast group members */
+   bitmap_zero(group, ds->num_ports);
+   for (port = 0; port < ds->num_ports; port++)
+   if (dsa_is_cpu_port(ds, port))
+   set_bit(port, group);
+
+   return dsa_switch_mdb_add_bitmap(ds, info, mdb, group);
+}
+
 static int dsa_switch_mdb_del(struct dsa_switch *ds,
  struct dsa_notifier_mdb_info *info)
 {
@@ -155,6 +181,22 @@ static int dsa_switch_mdb_del(struct dsa_switch *ds,
return 0;
 }
 
+static int dsa_switch_host_mdb_del(struct dsa_switch *ds,
+  struct dsa_notifier_mdb_info *info)
+{
+   const struct switchdev_obj_port_mdb *mdb = info->mdb;
+   int port;
+
+   if (!ds->ops->port_mdb_del)
+   return -EOPNOTSUPP;
+
+   for (port = 0; port < ds->num_ports; port++)
+   if (dsa_is_cpu_port(ds, port))
+   ds->ops->port_mdb_del(ds, port, mdb);
+
+   return 0;
+}
+
 static int dsa_switch_vlan_add(struct dsa_switch *ds,
   struct dsa_notifier_vlan_info *info)
 {
@@ -230,6 +272,12 @@ static int dsa_switch_event(struct notifier_block *nb,
case DSA_NOTIFIER_MDB_DEL:
err = dsa_switch_mdb_del(ds, info);
break;
+   case DSA_NOTIFIER_HOST_MDB_ADD:
+   err = dsa_switch_host_mdb_add(ds, info);
+   break;
+   case DSA_NOTIFIER_HOST_MDB_DEL:
+   err = dsa_switch_host_mdb_del(ds, info);
+   break;
case DSA_NOTIFIER_VLAN_ADD:
err = dsa_switch_vlan_add(ds, info);
break;
-- 
2.14.1



[PATCH v2 rfc 7/8] net: dsa: set offload_fwd_mark on received packets

2017-09-05 Thread Andrew Lunn
The software bridge needs to know if a packet has already been bridged
in hardware to ports in the same offload domain, so that it does not
re-flood them and cause duplicates. This is particularly true for
multicast traffic which the host has requested.

By setting offload_fwd_mark in the skb the bridge will only flood to
ports in other offloads and other netifs.

Signed-off-by: Andrew Lunn 
---
 net/dsa/dsa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 03c58b0eb082..5732696ac71c 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -213,6 +213,7 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct 
net_device *dev,
skb_push(skb, ETH_HLEN);
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, skb->dev);
+   skb->offload_fwd_mark = 1;
 
s = this_cpu_ptr(p->stats64);
u64_stats_update_begin(&s->syncp);
-- 
2.14.1



[PATCH v2 rfc 2/8] net: bridge: Send notification when host join/leaves a group

2017-09-05 Thread Andrew Lunn
The host can join or leave a multicast group on the brX interface, as
indicated by IGMP snooping.  This is tracked within the bridge
multicast code. Send a notification when this happens, in the same way
a notification is sent when a port of the bridge joins/leaves a group
because of IGMP snooping.

Signed-off-by: Andrew Lunn 
---
 net/bridge/br_mdb.c   | 9 ++---
 net/bridge/br_multicast.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 71885e251988..80fa91ccc50c 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -316,7 +316,7 @@ static void __br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *p,
 #endif
 
mdb.obj.orig_dev = port_dev;
-   if (port_dev && type == RTM_NEWMDB) {
+   if (p && port_dev && type == RTM_NEWMDB) {
complete_info = kmalloc(sizeof(*complete_info), GFP_ATOMIC);
if (complete_info) {
complete_info->port = p;
@@ -326,7 +326,7 @@ static void __br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *p,
if (switchdev_port_obj_add(port_dev, &mdb))
kfree(complete_info);
}
-   } else if (port_dev && type == RTM_DELMDB) {
+   } else if (p && port_dev && type == RTM_DELMDB) {
switchdev_port_obj_del(port_dev, &mdb);
}
 
@@ -352,7 +352,10 @@ void br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *port,
struct br_mdb_entry entry;
 
memset(&entry, 0, sizeof(entry));
-   entry.ifindex = port->dev->ifindex;
+   if (port)
+   entry.ifindex = port->dev->ifindex;
+   else
+   entry.ifindex = dev->ifindex;
entry.addr.proto = group->proto;
entry.addr.u.ip4 = group->u.ip4;
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index c6b2b8d419e7..955f340fe719 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -250,6 +250,7 @@ static void br_multicast_group_expired(unsigned long data)
goto out;
 
mp->host_joined = false;
+   br_mdb_notify(br->dev, NULL, &mp->addr, RTM_DELMDB, 0);
 
if (mp->ports)
goto out;
@@ -775,7 +776,10 @@ static int br_multicast_add_group(struct net_bridge *br,
goto err;
 
if (!port) {
-   mp->host_joined = true;
+   if (!mp->host_joined) {
+   mp->host_joined = true;
+   br_mdb_notify(br->dev, NULL, &mp->addr, RTM_NEWMDB, 0);
+   }
mod_timer(&mp->timer, now + br->multicast_membership_interval);
goto out;
}
-- 
2.14.1



Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Daniel Borkmann

On 09/06/2017 12:45 AM, Daniel Borkmann wrote:

On 09/06/2017 12:01 AM, Roopa Prabhu wrote:

On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang  wrote:

On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
 wrote:

Hi all,
This RFC adds a new mode for clsact which designates a device's egress
classifier as global per netns. The packets that are not classified for
a particular device will be classified using the global classifier.
We have needed a global classifier for some time now for various
purposes and setting the single bridge or loopback/vrf device as the


Can you elaborate a bit more on the ... "we have needed a global
classifier for some time now for various purposes".


global classifier device is acceptable for us. Doing it this way avoids
the act/cls device and queue dependencies.

This is strictly an RFC patch just to show the intent, if we agree on
the details the proposed patch will have support for both ingress and
egress, and will be using a static key to avoid the fast path test when no
global classifier has been configured.

Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
$ tc qdisc add dev lo clsact global
$ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action 
drop

the last filter will be global for all devices that don't have a
specific egress_cl_list (i.e. have clsact configured).


Sorry this is too ugly.


+1


netdevice is still implied in your command line even if you treat it
as global. It is essentially hard to bypass netdevice layer since
netdevice is the core of L2 and also where everything begins.


One could probably use special wildcard device name, e.g. tcpdump
allows for 'tcpdump -i any' to capture on all devices, so you could
indicate something like 'tc filter add dev any blah' to have similar
semantics in the realm of the netns w/o making it look too hacky
for tc users perhaps.


Maybe the best we can do here is make tc filters standalone, like
tc actions, so that filters can exist before qdiscs and netdevices.
But this probably requires significant work to make it work with
both the existing non-standalone filters and standalone ones bound
to qdiscs.


Yes, like Nikolay says we have been discussing this as well. Nikolay's
patch is a clever and, most importantly, non-invasive way today, given
that the anchor point for tc rules is a netdev. We have also considered
a separate implicit tc anchor device.

Seems ugly just as well. :( Hmm, why not just having the two list
pointers (ingress, egress list) in the netns struct and when
something configures them to be effectively non-zero, then devices
in that netns could automatically get a clsact and inherit the
lists from there such that sch_handle_ingress() and sch_handle_egress()
require exactly zero changes in fast-path. You could then go and
say that either you would make changes to clsact for individual
devices immutable when they use the 'shared' list pointers, or then
duplicate the configs when being altered from the global one. Would
push the complexity to control path only at least. Just a brief
thought.


Re: linux-next: manual merge of the pci tree with the net tree

2017-09-05 Thread Stephen Rothwell
Hi Sinan,

On Tue, 5 Sep 2017 18:54:38 -0400 Sinan Kaya  wrote:
>
> On 9/4/2017 12:54 AM, Stephen Rothwell wrote:
> > Just a reminder that this conflict still exists.  
> 
> I suppose this was a message for Bjorn, right?

Correct.  It was just a reminder that he may have to tell Linus about
the conflict when he asks Linus to merge his tree (depending on how
complex the conflict is and if he gets in before Dave gets the net-next
tree merged).

-- 
Cheers,
Stephen Rothwell


Re: linux-next: manual merge of the pci tree with the net tree

2017-09-05 Thread Sinan Kaya
On 9/4/2017 12:54 AM, Stephen Rothwell wrote:
> Just a reminder that this conflict still exists.

I suppose this was a message for Bjorn, right?

I looked at the change and it looked good to me.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH net-next RFC 1/2] tun: enable NAPI for TUN/TAP driver

2017-09-05 Thread Stephen Hemminger
On Tue,  5 Sep 2017 15:35:50 -0700
Petar Penkov  wrote:

> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables
> these changes and operation is not affected if the flag is disabled.
> SKBs are constructed upon packet arrival and are queued to be
> processed later.
> 
> The new path was evaluated with a benchmark with the following setup:
> Open two tap devices and a receiver thread that reads in a loop for
> each device. Start one sender thread and pin all threads to different
> CPUs. Send 1M minimum UDP packets to each device and measure sending
> time for each of the sending methods:
>   napi_gro_receive(): 4.90s
>   netif_rx_ni():  4.90s
>   netif_receive_skb():7.20s
> 
> Signed-off-by: Petar Penkov 
> Cc: Eric Dumazet 
> Cc: Mahesh Bandewar 
> Cc: Willem de Bruijn 
> Cc: da...@davemloft.net
> Cc: ppen...@stanford.edu

Why is this optional? It adds two code paths both of which need
to be tested.


Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Daniel Borkmann

On 09/06/2017 12:01 AM, Roopa Prabhu wrote:

On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang  wrote:

On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
 wrote:

Hi all,
This RFC adds a new mode for clsact which designates a device's egress
classifier as global per netns. The packets that are not classified for
a particular device will be classified using the global classifier.
We have needed a global classifier for some time now for various
purposes and setting the single bridge or loopback/vrf device as the


Can you elaborate a bit more on the ... "we have needed a global
classifier for some time now for various purposes".


global classifier device is acceptable for us. Doing it this way avoids
the act/cls device and queue dependencies.

This is strictly an RFC patch just to show the intent, if we agree on
the details the proposed patch will have support for both ingress and
egress, and will be using a static key to avoid the fast path test when no
global classifier has been configured.

Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
$ tc qdisc add dev lo clsact global
$ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop

the last filter will be global for all devices that don't have a
specific egress_cl_list (i.e. have clsact configured).


Sorry this is too ugly.


+1


A netdevice is still implied in your command line even if you treat it
as global. It is essentially hard to bypass the netdevice layer, since
the netdevice is the core of L2 and also where everything begins.

Maybe the best we can do here is make tc filters standalone, like tc
actions, so that filters can exist before qdiscs and netdevices.
But this probably requires significant work to make it function with
both the existing non-standalone filters and standalone filters bound
to qdiscs.


yes, like Nikolay says we have been discussing this as well. Nikolay's
patch is a clever and, most importantly, non-invasive
way today given that the anchor point for tc rules is a netdev. We have
also considered a separate implicit tc anchor device.


Seems ugly just as well. :( Hmm, why not just have the two list
pointers (ingress, egress list) in the netns struct and when
something configures them to be effectively non-zero, then devices
in that netns could automatically get a clsact and inherit the
lists from there such that sch_handle_ingress() and sch_handle_egress()
require exactly zero changes in fast-path. You could then go and
say that either you would make changes to clsact for individual
devices immutable when they use the 'shared' list pointers, or then
duplicate the configs when being altered from the global one. Would
push the complexity to control path only at least. Just a brief
thought.


[PATCH net-next RFC 0/2] Improve code coverage of syzkaller

2017-09-05 Thread Petar Penkov
This patch series is intended to improve code coverage of syzkaller on
the early receive path, specifically including flow dissector, GRO,
and GRO with frags parts of the networking stack. Syzkaller exercises
the stack through the TUN driver and this is therefore where changes
reside. Current coverage through netif_receive_skb() is limited as it
does not touch on any of the aforementioned code paths. Furthermore,
for full coverage, it is necessary to have more flexibility over the
linear and non-linear data of the skbs.

The following patches address this by providing the user (syzkaller)
with the ability to send via napi_gro_receive() and napi_gro_frags().
Additionally, syzkaller can specify how many fragments there are and
how much data per fragment there is. This is done by exploiting the
convenient structure of iovecs. Finally, this patch series adds
support for exercising the flow dissector during fuzzing.

The code path including napi_gro_receive() can be enabled via the
CONFIG_TUN_NAPI compile-time flag, and can be used by users other than
syzkaller. The remainder of the changes in this patch series give the
user significantly more control over packets entering the kernel. To
avoid potential security vulnerabilities, hide the ability to send
custom skbs and the flow dissector code paths behind a run-time flag
IFF_NAPI_FRAGS that is advertised and accepted only if CONFIG_TUN_NAPI
is enabled.

The patch series will be followed with changes to packetdrill, where
these additions to the TUN driver are exercised and demonstrated.
This will give the ability to write regression tests for specific
parts of the early networking stack.

Patch 1/ Add NAPI struct per receive queue, enable NAPI, and use
 napi_gro_receive() 
Patch 2/ Use NAPI skb and napi_gro_frags(), exercise flow dissector,
 and allow custom skbs.
Petar Penkov (2):
  tun: enable NAPI for TUN/TAP driver
  tun: enable napi_gro_frags() for TUN/TAP driver

 drivers/net/Kconfig |   8 ++
 drivers/net/tun.c   | 251 +---
 include/uapi/linux/if_tun.h |   1 +
 3 files changed, 246 insertions(+), 14 deletions(-)

-- 
2.14.1.581.gf28d330327-goog



[PATCH net-next RFC 2/2] tun: enable napi_gro_frags() for TUN/TAP driver

2017-09-05 Thread Petar Penkov
Add a TUN/TAP receive mode that exercises the napi_gro_frags()
interface. This mode is available only in TAP mode, as the interface
expects packets with Ethernet headers.

Furthermore, packets follow the layout of the iovec_iter that was
received. The first iovec is the linear data, and every one after the
first is a fragment. If there are more fragments than the max number,
drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
dissector code and to verify that the header resides in the linear data.

The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
This is imposed because this mode is intended for testing via tools like
syzkaller and packetdrill, and the increased flexibility it provides can
introduce security vulnerabilities.

Signed-off-by: Petar Penkov 
Cc: Eric Dumazet 
Cc: Mahesh Bandewar 
Cc: Willem de Bruijn 
Cc: da...@davemloft.net
Cc: ppen...@stanford.edu
---
 drivers/net/tun.c   | 135 ++--
 include/uapi/linux/if_tun.h |   1 +
 2 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d5c824e3ec42..2ba9809ab6cd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -75,6 +75,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -120,8 +121,15 @@ do {	\
 #define TUN_VNET_LE 0x8000
 #define TUN_VNET_BE 0x4000
 
+#if IS_ENABLED(CONFIG_TUN_NAPI)
+#define TUN_FEATURES_EXTRA IFF_NAPI_FRAGS
+#else
+#define TUN_FEATURES_EXTRA 0
+#endif
+
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
- IFF_MULTI_QUEUE)
+ IFF_MULTI_QUEUE | TUN_FEATURES_EXTRA)
+
 #define GOODCOPY_LEN 128
 
 #define FLT_EXACT_COUNT 8
@@ -173,6 +181,7 @@ struct tun_file {
unsigned int ifindex;
};
struct napi_struct napi;
+   struct mutex napi_mutex;    /* Protects access to the above napi */
struct list_head next;
struct tun_struct *detached;
struct skb_array tx_array;
@@ -276,6 +285,7 @@ static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile)
	netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
		       NAPI_POLL_WEIGHT);
	napi_enable(&tfile->napi);
+	mutex_init(&tfile->napi_mutex);
}
 }
 
@@ -291,6 +301,11 @@ static void tun_napi_del(struct tun_file *tfile)
	netif_napi_del(&tfile->napi);
 }
 
+static bool tun_napi_frags_enabled(const struct tun_struct *tun)
+{
+   return READ_ONCE(tun->flags) & IFF_NAPI_FRAGS;
+}
+
 #ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
@@ -1034,7 +1049,8 @@ static void tun_poll_controller(struct net_device *dev)
 * supports polling, which enables bridge devices in virt setups to
 * still use netconsole
 * If NAPI is enabled, however, we need to schedule polling for all
-* queues.
+* queues unless we are using napi_gro_frags(), which we call in
+* process context and not in NAPI context.
 */
 
if (IS_ENABLED(CONFIG_TUN_NAPI)) {
@@ -1042,6 +1058,9 @@ static void tun_poll_controller(struct net_device *dev)
struct tun_file *tfile;
int i;
 
+   if (tun_napi_frags_enabled(tun))
+   return;
+
rcu_read_lock();
for (i = 0; i < tun->numqueues; i++) {
tfile = rcu_dereference(tun->tfiles[i]);
@@ -1264,6 +1283,64 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
return mask;
 }
 
+static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
+   size_t len,
+   const struct iov_iter *it)
+{
+   struct sk_buff *skb;
+   size_t linear;
+   int err;
+   int i;
+
+   if (it->nr_segs > MAX_SKB_FRAGS + 1)
+   return ERR_PTR(-ENOMEM);
+
+   local_bh_disable();
+   skb = napi_get_frags(&tfile->napi);
+   local_bh_enable();
+   if (!skb)
+   return ERR_PTR(-ENOMEM);
+
+   linear = iov_iter_single_seg_count(it);
+   err = __skb_grow(skb, linear);
+   if (err)
+   goto free;
+
+   skb->len = len;
+   skb->data_len = len - linear;
+   skb->truesize += skb->data_len;
+
+   for (i = 1; i < it->nr_segs; i++) {
+   size_t fragsz = it->iov[i].iov_len;
+   unsigned long offset;
+   struct page *page;
+   void *data;
+
+   if (fragsz == 0 || fragsz > PAGE_SIZE) {
+   err = -EINVAL;
+   goto free;
+   }
+
+   local_bh_disable();
+   data = 

[PATCH net-next RFC 1/2] tun: enable NAPI for TUN/TAP driver

2017-09-05 Thread Petar Penkov
Changes TUN driver to use napi_gro_receive() upon receiving packets
rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables
these changes and operation is not affected if the flag is disabled.
SKBs are constructed upon packet arrival and are queued to be
processed later.

The new path was evaluated with a benchmark with the following setup:
Open two tap devices and a receiver thread that reads in a loop for
each device. Start one sender thread and pin all threads to different
CPUs. Send 1M minimum UDP packets to each device and measure sending
time for each of the sending methods:
napi_gro_receive(): 4.90s
netif_rx_ni():  4.90s
netif_receive_skb():7.20s

Signed-off-by: Petar Penkov 
Cc: Eric Dumazet 
Cc: Mahesh Bandewar 
Cc: Willem de Bruijn 
Cc: da...@davemloft.net
Cc: ppen...@stanford.edu
---
 drivers/net/Kconfig |   8 
 drivers/net/tun.c   | 120 +++-
 2 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 83a1616903f8..34850d71ddd1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -307,6 +307,14 @@ config TAP
  This option is selected by any driver implementing tap user space
  interface for a virtual interface to re-use core tap functionality.
 
+config TUN_NAPI
+   bool "NAPI support on tx path for TUN/TAP driver"
+   default n
+   depends on TUN
+   ---help---
+ This option allows the TUN/TAP driver to use NAPI to pass packets to
+ the kernel when receiving packets from user space via write()/send().
+
 config TUN_VNET_CROSS_LE
bool "Support for cross-endian vnet headers on little-endian kernels"
default n
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 06e8f0bb2dab..d5c824e3ec42 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -172,6 +172,7 @@ struct tun_file {
u16 queue_index;
unsigned int ifindex;
};
+   struct napi_struct napi;
struct list_head next;
struct tun_struct *detached;
struct skb_array tx_array;
@@ -229,6 +230,67 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
 };
 
+static int tun_napi_receive(struct napi_struct *napi, int budget)
+{
+   struct tun_file *tfile = container_of(napi, struct tun_file, napi);
+   struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
+   struct sk_buff_head process_queue;
+   struct sk_buff *skb;
+   int received = 0;
+
+   __skb_queue_head_init(&process_queue);
+
+   spin_lock(&queue->lock);
+   skb_queue_splice_tail_init(queue, &process_queue);
+   spin_unlock(&queue->lock);
+
+   while (received < budget && (skb = __skb_dequeue(&process_queue))) {
+   napi_gro_receive(napi, skb);
+   ++received;
+   }
+
+   if (!skb_queue_empty(&process_queue)) {
+   spin_lock(&queue->lock);
+   skb_queue_splice(&process_queue, queue);
+   spin_unlock(&queue->lock);
+   }
+
+   return received;
+}
+
+static int tun_napi_poll(struct napi_struct *napi, int budget)
+{
+   unsigned int received;
+
+   received = tun_napi_receive(napi, budget);
+
+   if (received < budget)
+   napi_complete_done(napi, received);
+
+   return received;
+}
+
+static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile)
+{
+   if (IS_ENABLED(CONFIG_TUN_NAPI)) {
+   netif_napi_add(tun->dev, &tfile->napi, tun_napi_poll,
+  NAPI_POLL_WEIGHT);
+   napi_enable(&tfile->napi);
+   }
+}
+
+static void tun_napi_disable(struct tun_file *tfile)
+{
+   if (IS_ENABLED(CONFIG_TUN_NAPI))
+   napi_disable(&tfile->napi);
+}
+
+static void tun_napi_del(struct tun_file *tfile)
+{
+   if (IS_ENABLED(CONFIG_TUN_NAPI))
+   netif_napi_del(&tfile->napi);
+}
+
 #ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
@@ -541,6 +603,11 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 
tun = rtnl_dereference(tfile->tun);
 
+   if (tun && clean) {
+   tun_napi_disable(tfile);
+   tun_napi_del(tfile);
+   }
+
if (tun && !tfile->detached) {
u16 index = tfile->queue_index;
BUG_ON(index >= tun->numqueues);
@@ -598,6 +665,7 @@ static void tun_detach_all(struct net_device *dev)
for (i = 0; i < n; i++) {
tfile = rtnl_dereference(tun->tfiles[i]);
BUG_ON(!tfile);
+   tun_napi_disable(tfile);
tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
RCU_INIT_POINTER(tfile->tun, NULL);
@@ -613,6 +681,7 @@ static void tun_detach_all(struct net_device *dev)
synchronize_net();
for (i 

Re: [PATCH net-next] bpf: fix numa_node validation

2017-09-05 Thread Martin KaFai Lau
On Mon, Sep 04, 2017 at 10:41:02PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet 
>
> syzkaller reported crashes in bpf map creation or map update [1]
>
> Problem is that nr_node_ids is a signed integer,
> NUMA_NO_NODE is also an integer, so it is very tempting
> to declare numa_node as a signed integer.
>
> This means the typical test to validate a user provided value :
>
> if (numa_node != NUMA_NO_NODE &&
> (numa_node >= nr_node_ids ||
>  !node_online(numa_node)))
>
> must be written :
>
> if (numa_node != NUMA_NO_NODE &&
> ((unsigned int)numa_node >= nr_node_ids ||
Thanks for fixing it.

>  !node_online(numa_node)))
>
>
> [1]
> kernel BUG at mm/slab.c:3256!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> task: 8801d2bc60c0 task.stack: 8801c0c9
> RIP: 0010:cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
> RSP: 0018:8801c0c97638 EFLAGS: 00010096
> RAX: 8b7b RBX: 01080220 RCX: 
> RDX: 8b7b RSI: 01080220 RDI: 8801dac00040
> RBP: 8801c0c976c0 R08:  R09: 
> R10: 8801c0c97620 R11: 0001 R12: 8801dac00040
> R13: 8801dac00040 R14:  R15: 8b7b
> FS:  02119940() GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20001fec CR3: 0001d298 CR4: 001406f0
> Call Trace:
>  __do_kmalloc_node mm/slab.c:3688 [inline]
>  __kmalloc_node+0x33/0x70 mm/slab.c:3696
>  kmalloc_node include/linux/slab.h:535 [inline]
>  alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
>  htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
>  map_update_elem kernel/bpf/syscall.c:587 [inline]
>  SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
>  SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x440409
> RSP: 002b:7ffd1f1792b8 EFLAGS: 0246 ORIG_RAX: 0141
> RAX: ffda RBX: 004002c8 RCX: 00440409
> RDX: 0020 RSI: 20006000 RDI: 0002
> RBP: 0086 R08:  R09: 
> R10:  R11: 0246 R12: 00401d70
> R13: 00401e00 R14:  R15: 
> Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 
> 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 
> 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
> RIP: cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: 8801c0c97638
> ---[ end trace d745f355da2e33ce ]---
> Kernel panic - not syncing: Fatal exception
>
> Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
> Signed-off-by: Eric Dumazet 
> Cc: Martin KaFai Lau 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
>  kernel/bpf/syscall.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 
> 021a05d9d80095303bdfed51ee85bd9067582774..70ad8e220343c7825c8e331f19c1f65c78fdb796
>  100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -323,7 +323,8 @@ static int map_create(union bpf_attr *attr)
>   return -EINVAL;
>
>   if (numa_node != NUMA_NO_NODE &&
> - (numa_node >= nr_node_ids || !node_online(numa_node)))
> + ((unsigned int)numa_node >= nr_node_ids ||
> +  !node_online(numa_node)))
>   return -EINVAL;
>
>   /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
>
>


Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Jamal Hadi Salim

On 17-09-05 06:01 PM, Roopa Prabhu wrote:



yes, like Nikolay says we have been discussing this as well. Nikolay's
patch is a clever and, most importantly, non-invasive
way today given that the anchor point for tc rules is a netdev. We have
also considered a separate implicit tc anchor device.
lo seemed like a good fit for all global rules given it is already
available. And it is not uncommon to hang off global
network config on the loopback interface.



IMO, Jiri has done all the necessary work already with the concept of
blocks. We dont really need the netdev to be the attachment point.

You can add to a block in many locations in the kernel by
constructing the proper "coordinates" in the tcmsg.
i.e this:
  tcmsg {
unsigned char   tcm_family;
unsigned char   tcm__pad1;
unsigned short  tcm__pad2;
int tcm_ifindex;
__u32   tcm_handle;
__u32   tcm_parent;
}

If you were to set tcm_ifindex to -1 (since that is not a legit
ifindex) then all we need to do is define a parent for a
different location. Current locations tied to netdevs are:
-
#define TC_H_ROOT   (0xU)
#define TC_H_INGRESS(0xFFF1U)
#define TC_H_CLSACT TC_H_INGRESS

#define TC_H_MIN_INGRESS0xFFF2U
#define TC_H_MIN_EGRESS 0xFFF3U
-

You should be able to say add a location which maps to a pre-routing
or post-routing etc; and this would work as well...

cheers,
jamal


Re: [PATCH 1/3] security: bpf: Add eBPF LSM hooks to security module

2017-09-05 Thread Chenbo Feng
On Fri, Sep 1, 2017 at 5:50 AM, Stephen Smalley  wrote:
> On Thu, 2017-08-31 at 13:56 -0700, Chenbo Feng wrote:
>> From: Chenbo Feng 
>>
>> Introduce 5 LSM hooks to provide finer granularity controls on eBPF
>> related operations including create eBPF maps, modify and read eBPF
>> maps
>> content and load eBPF programs to the kernel. Hooks use the new
>> security
>> pointer inside the eBPF map struct to store the owner's security
>> information and the different security modules can perform different
>> checks based on the information stored inside the security field.
>>
>> Signed-off-by: Chenbo Feng 
>> ---
>>  include/linux/lsm_hooks.h | 41
>> +
>>  include/linux/security.h  | 36 
>>  security/security.c   | 28 
>>  3 files changed, 105 insertions(+)
>>
>> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
>> index ce02f76a6188..3aaf9a08a983 100644
>> --- a/include/linux/lsm_hooks.h
>> +++ b/include/linux/lsm_hooks.h
>> @@ -1353,6 +1353,32 @@
>>   *   @inode we wish to get the security context of.
>>   *   @ctx is a pointer in which to place the allocated security
>> context.
>>   *   @ctxlen points to the place to put the length of @ctx.
>> + *
>> + * Security hooks for using the eBPF maps and programs
>> functionalities through
>> + * eBPF syscalls.
>> + *
>> + * @bpf_map_create:
>> + *   Check permissions prior to creating a new bpf map.
>> + *   Return 0 if the permission is granted.
>> + *
>> + * @bpf_map_modify:
>> + *   Check permission prior to insert, update and delete map
>> content.
>> + *   @map pointer to the struct bpf_map that contains map
>> information.
>> + *   Return 0 if the permission is granted.
>> + *
>> + * @bpf_map_read:
>> + *   Check permission prior to read a bpf map content.
>> + *   @map pointer to the struct bpf_map that contains map
>> information.
>> + *   Return 0 if the permission is granted.
>> + *
>> + * @bpf_prog_load:
>> + *   Check permission prior to load eBPF program.
>> + *   Return 0 if the permission is granted.
>> + *
>> + * @bpf_post_create:
>> + *   Initialize the bpf object security field inside struct
>> bpf_maps and
>> + *   it is used for future security checks.
>> + *
>>   */
>>  union security_list_options {
>>   int (*binder_set_context_mgr)(struct task_struct *mgr);
>> @@ -1685,6 +1711,14 @@ union security_list_options {
>>   struct audit_context *actx);
>>   void (*audit_rule_free)(void *lsmrule);
>>  #endif /* CONFIG_AUDIT */
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> + int (*bpf_map_create)(void);
>> + int (*bpf_map_read)(struct bpf_map *map);
>> + int (*bpf_map_modify)(struct bpf_map *map);
>> + int (*bpf_prog_load)(void);
>> + int (*bpf_post_create)(struct bpf_map *map);
>> +#endif /* CONFIG_BPF_SYSCALL */
>>  };
>>
>>  struct security_hook_heads {
>> @@ -1905,6 +1939,13 @@ struct security_hook_heads {
>>   struct list_head audit_rule_match;
>>   struct list_head audit_rule_free;
>>  #endif /* CONFIG_AUDIT */
>> +#ifdef CONFIG_BPF_SYSCALL
>> + struct list_head bpf_map_create;
>> + struct list_head bpf_map_read;
>> + struct list_head bpf_map_modify;
>> + struct list_head bpf_prog_load;
>> + struct list_head bpf_post_create;
>> +#endif /* CONFIG_BPF_SYSCALL */
>>  } __randomize_layout;
>>
>>  /*
>> diff --git a/include/linux/security.h b/include/linux/security.h
>> index 458e24bea2d4..0656a4f74d14 100644
>> --- a/include/linux/security.h
>> +++ b/include/linux/security.h
>> @@ -31,6 +31,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  struct linux_binprm;
>>  struct cred;
>> @@ -1735,6 +1736,41 @@ static inline void securityfs_remove(struct
>> dentry *dentry)
>>
>>  #endif
>>
>> +#ifdef CONFIG_BPF_SYSCALL
>> +#ifdef CONFIG_SECURITY
>> +int security_map_create(void);
>> +int security_map_modify(struct bpf_map *map);
>> +int security_map_read(struct bpf_map *map);
>> +int security_prog_load(void);
>> +int security_post_create(struct bpf_map *map);
>> +#else
>> +static inline int security_map_create(void)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int security_map_read(struct bpf_map *map)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int security_map_modify(struct bpf_map *map)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int security_prog_load(void)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int security_post_create(struct bpf_map *map)
>> +{
>> + return 0;
>> +}
>> +#endif /* CONFIG_SECURITY */
>> +#endif /* CONFIG_BPF_SYSCALL */
>
> These should be named consistently with the ones in lsm_hooks.h and
> should unambiguously indicate that these are hooks for bpf
> objects/operations, i.e. security_bpf_map_create(),
> security_bpf_map_read(), etc.
>
Thanks for pointing that out, will fix this.
> Do you need this level of granularity?
>

Re: [PATCH 0/3] drivers: net: xgene: Misc bug fixes

2017-09-05 Thread David Miller
From: Iyappan Subramanian 
Date: Tue,  5 Sep 2017 11:16:29 -0700

> This patch set fixes bugs related to handling the ACPI case, and to
> reading and programming tx/rx delay values.
> 
> Signed-off-by: Iyappan Subramanian 

Series applied, thank you.


Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Roopa Prabhu
On Tue, Sep 5, 2017 at 11:18 AM, Cong Wang  wrote:
> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
>  wrote:
>> Hi all,
>> This RFC adds a new mode for clsact which designates a device's egress
>> classifier as global per netns. The packets that are not classified for
>> a particular device will be classified using the global classifier.
>> We have needed a global classifier for some time now for various
>> purposes and setting the single bridge or loopback/vrf device as the
>> global classifier device is acceptable for us. Doing it this way avoids
>> the act/cls device and queue dependencies.
>>
>> This is strictly an RFC patch just to show the intent, if we agree on
>> the details the proposed patch will have support for both ingress and
>> egress, and will be using a static key to avoid the fast path test when no
>> global classifier has been configured.
>>
>> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
>> $ tc qdisc add dev lo clsact global
>> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action 
>> drop
>>
>> the last filter will be global for all devices that don't have a
>> specific egress_cl_list (i.e. have clsact configured).
>
> Sorry this is too ugly.
>
> netdevice is still implied in your command line even if you treat it
> as global. It is essentially hard to bypass netdevice layer since
> netdevice is the core of L2 and also where everything begins.
>
> Maybe the best we can do here is make tc filters standalone
> as tc actions so that filters can exist before qdisc's and netdevices.
> But this probably requires significant works to make it working
> with both existing non-standalone and bindings standalones
> with qdisc's.

yes, like Nikolay says we have been discussing this as well. Nikolay's
patch is a clever and, most importantly, non-invasive
way today given that the anchor point for tc rules is a netdev. We have
also considered a separate implicit tc anchor device.
lo seemed like a good fit for all global rules given it is already
available. And it is not uncommon to hang off global
network config on the loopback interface.


Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map

2017-09-05 Thread Chenbo Feng
On Thu, Aug 31, 2017 at 7:05 PM, Alexei Starovoitov
 wrote:
> On Thu, Aug 31, 2017 at 01:56:34PM -0700, Chenbo Feng wrote:
>> From: Chenbo Feng 
>>
>> Introduce a pointer into struct bpf_map to hold the security information
>> about the map. The actual security struct varies based on the security
>> models implemented. Place the LSM hooks before each of the unrestricted
>> eBPF operations, the map_update_elem and map_delete_elem operations are
>> checked by security_map_modify. The map_lookup_elem and map_get_next_key
>> operations are checked by security_map_read.
>>
>> Signed-off-by: Chenbo Feng 
>
> ...
>
>> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>>   if (IS_ERR(map))
>>   return PTR_ERR(map);
>>
>> + err = security_map_read(map);
>> + if (err)
>> + return -EACCES;
>> +
>>   key = memdup_user(ukey, map->key_size);
>>   if (IS_ERR(key)) {
>>   err = PTR_ERR(key);
>> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
>>   if (IS_ERR(map))
>>   return PTR_ERR(map);
>>
>> + err = security_map_modify(map);
>
> I don't feel these extra hooks are really thought through.
> With such hook you'll disallow map_update for given map. That's it.
> The key/values etc won't be used in such security decision.
> In such case you don't need such hooks in update/lookup at all.
> Only in map_creation and object_get calls where FD can be received.
> In other words I suggest to follow standard unix practices:
> Do permissions checks in open() and allow read/write() if FD is valid.
> Same here. Do permission checks in prog_load/map_create/obj_pin/get
> and that will be enough to jail bpf subsystem.
> bpf cmds that need to be fast (like lookup and update) should not
> have security hooks.
>
Thanks for pointing this out. I agree we should only do checks when
creating the object and when passing it from one process to another.
If we can still distinguish read from write operations when obtaining
the file fd, that would be great, but that may require adding a new
mode field to struct bpf_map and changing the attribute struct used in
the bpf syscall. What do you think about this approach? Alternatively,
we can do simple bpf_obj_create and bpf_obj_use checks when creating
and passing the object respectively, but that solution cannot
distinguish map modification from reading.


Re: [PATCH][next v2] rocker: fix kcalloc parameter order

2017-09-05 Thread David Miller
From: Zahari Doychev 
Date: Tue,  5 Sep 2017 21:49:58 +0200

> The function calls to kcalloc use wrong parameter order and incorrect flags
> values. GFP_KERNEL is used instead of flags now and the order is corrected.
> 
> The change was done using the following coccinelle script:
> 
> @@
> expression E1,E2;
> type T;
> @@
> 
> -kcalloc(E1, E2, sizeof(T))
> +kcalloc(E2, sizeof(T), GFP_KERNEL)
> 
> Signed-off-by: Zahari Doychev 

Applied, thank you.


Re: [PATCH net] rds: Fix non-atomic operation on shared flag variable

2017-09-05 Thread David Miller
From: Håkon Bugge 
Date: Tue,  5 Sep 2017 17:42:01 +0200

> The bits in m_flags in struct rds_message are used for a plurality of
> reasons, and from different contexts. To avoid any missing updates to
> m_flags, use the atomic set_bit() instead of the non-atomic equivalent.
> 
> Signed-off-by: Håkon Bugge 
> Reviewed-by: Knut Omang 
> Reviewed-by: Wei Lin Guay 

Applied.


Re: [PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock

2017-09-05 Thread David Miller
From: Jakub Kicinski 
Date: Tue,  5 Sep 2017 08:31:23 -0700

> The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
> to:
 ...
> Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
> (as suggested by Eric Dumazet), and change the allocation 
> flags under spin lock.
> 
> Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use 
> IDR")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 

Applied, thanks.


Re: [PATCH net V2] vhost_net: correctly check tx avail during rx busy polling

2017-09-05 Thread David Miller
From: Jason Wang 
Date: Tue,  5 Sep 2017 09:22:05 +0800

> In the past we checked tx avail through vhost_enable_notify(), which is
> wrong since it only checks whether the guest has filled more
> available buffers since the last avail idx synchronization, which was
> just done by vhost_vq_avail_empty(). What we really want is to check for
> pending buffers in the avail ring. Fix this by calling
> vhost_vq_avail_empty() instead.
> 
> This issue could be noticed by doing netperf TCP_RR benchmark as
> client from guest (but not host). With this fix, TCP_RR from guest to
> localhost restores from 1375.91 trans per sec to 55235.28 trans per
> sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz).
> 
> Fixes: 030881372460 ("vhost_net: basic polling support")
> Signed-off-by: Jason Wang 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next] rxrpc: Make service connection lookup always check for retry

2017-09-05 Thread David Miller
From: David Howells 
Date: Mon, 04 Sep 2017 15:28:28 +0100

> When an RxRPC service packet comes in, the target connection is looked up
> by an rb-tree search under RCU and a read-locked seqlock; the seqlock retry
> check is, however, currently skipped if we got a match, but probably
> shouldn't be in case the connection we found gets replaced whilst we're
> doing a search.
> 
> Make the lookup procedure always go through need_seqretry(), even if the
> lookup was successful.  This makes sure we always pick up on a write-lock
> event.
> 
> On the other hand, since we don't take a ref on the object, but rely on RCU
> to prevent its destruction after dropping the seqlock, I'm not sure this is
> necessary.
> 
> Signed-off-by: David Howells 

Applied, thanks David.


Re: [PATCH v3 net-next 0/2] Use correct sk->sk_prot for IPV6

2017-09-05 Thread David Miller
From: Ilya Lesokhin 
Date: Mon,  4 Sep 2017 13:13:59 +0300

> The tls ulp overrides sk->prot with new tls specific proto structs.
> The tls specific structs were previously based on the ipv4 specific
> tcp_prot struct.
> As a result, attaching the tls ulp to an ipv6 tcp socket replaced
> some ipv6 callbacks with the ipv4 equivalents.
> 
> This patch adds ipv6 tls proto structs and uses them when
> attached to ipv6 sockets.
> 
> Changes since v2:
> - Dropped patch to fix IPV6_ADDRFORM setsockopt
> There was some disagreement about the correct way of fixing it,
> and this series does not depend on it.
> 
> Changes since v1:
> - TLS now depends on IPV6
> This fixes compilation issues when TLS is built-in and IPV6 is a module.
> The downside should be small as it is unlikely that there are kernel TLS
> users who can't afford to include IPV6 in their kernel.

This is not acceptable.

Forcing such a huge piece of infrastructure like ipv6 to be statically
built just because someone wants to enable TLS is not going to pass.

Every distribution out there will enable all features, including TLS,
so effectively you are forcing every distribution to build ipv6
non-modular.

Sorry, you will have to fix this in a way that allows TLS and IPV6
to be modular.


Re: [PATCH net-next v7] net: stmmac: Delete dead code for MDIO registration

2017-09-05 Thread David Miller
From: Romain Perier 
Date: Mon,  4 Sep 2017 10:41:36 +0200

> This code is no longer used, the logging function was changed by commit
> fbca164776e4 ("net: stmmac: Use the right logging function in 
> stmmac_mdio_register").
> It was previously showing information about the type of the IRQ, whether
> it's polled, ignored or a normal interrupt. As we don't want information loss,
> I have moved this code to phy_attached_print().
> 
> Fixes: fbca164776e4 ("net: stmmac: Use the right logging function in 
> stmmac_mdio_register")
> Signed-off-by: Romain Perier 

Applied.


Re: [net v2] gianfar: Fix Tx flow control deactivation

2017-09-05 Thread David Miller
From: Claudiu Manoil 
Date: Mon, 4 Sep 2017 10:45:28 +0300

> The wrong register is checked for the Tx flow control bit,
> it should have been maccfg1 not maccfg2.
> This went unnoticed for so long probably because the impact is
> hardly visible, not to mention the tangled code from adjust_link().
> First, link flow control (i.e. handling of Rx/Tx link level pause frames)
> is disabled by default (needs to be enabled via 'ethtool -A').
> Secondly, maccfg2 always returns 0 for tx_flow_oldval (except for a few
> old boards), which results in Tx flow control remaining always on
> once activated.
> 
> Fixes: 45b679c9a3ccd9e34f28e6ec677b812a860eb8eb ("gianfar: Implement PAUSE 
> frame generation support")
> 
> Signed-off-by: Claudiu Manoil 

Please do not put an empty line between Fixes: and other tags.  All tags
are equal and should be in one uninterrupted block of text.

Applied, and queued up for -stable, thanks.


Re: [PATCH net-next] cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6

2017-09-05 Thread David Miller
From: Ganesh Goudar 
Date: Mon,  4 Sep 2017 11:25:34 +0530

> MPS_TX_INT_CAUSE[Bubble] is a normal condition for T6, hence
> ignore this interrupt for T6.
> 
> Signed-off-by: Ganesh Goudar 
> Signed-off-by: Casey Leedom 

Applied.


Re: [PATCH net-next] cxgb4: Fix pause frame count in t4_get_port_stats

2017-09-05 Thread David Miller
From: Ganesh Goudar 
Date: Mon,  4 Sep 2017 11:17:36 +0530

> MPS_STAT_CTL[CountPauseStatTx] and MPS_STAT_CTL[CountPauseStatRx]
> only control whether or not Pause Frames will be counted as part
> of the 64-Byte Tx/Rx Frame counters.  These bits do not control
> whether Pause Frames are counted in the Total Tx/Rx Frames/Bytes
> counters.
> 
> Signed-off-by: Ganesh Goudar 
> Signed-off-by: Casey Leedom 

Applied.


Re: [PATCH net-next] cxgb4: fix memory leak

2017-09-05 Thread David Miller
From: Ganesh Goudar 
Date: Mon,  4 Sep 2017 11:16:28 +0530

> do not reuse the loop counter which is used to iterate over
> the ports, so that sched_tbl will be freed for all the ports.
> 
> Signed-off-by: Ganesh Goudar 

Applied.


Re: [PATCH net-next 2/2] tun: rename generic_xdp to skb_xdp

2017-09-05 Thread David Miller
From: Jason Wang 
Date: Mon,  4 Sep 2017 11:36:09 +0800

> Rename "generic_xdp" to "skb_xdp" to avoid confusing it with the
> generic XDP which will be done at netif_receive_skb().
> 
> Cc: Daniel Borkmann 
> Signed-off-by: Jason Wang 

Applied.


Re: [PATCH net-next 1/2] tun: reserve extra headroom only when XDP is set

2017-09-05 Thread David Miller
From: Jason Wang 
Date: Mon,  4 Sep 2017 11:36:08 +0800

> We reserve headroom unconditionally, which could cause unnecessary
> stress on socket memory accounting because of the increased truesize.
> Fix this by only reserving extra headroom when XDP is set.
> 
> Cc: Jakub Kicinski 
> Signed-off-by: Jason Wang 

Applied.


Re: [PATCH net-next] net: cpsw: Don't handle SIOC[GS]HWTSTAMP when CPTS is disabled

2017-09-05 Thread Grygorii Strashko
Hi

On 08/31/2017 02:48 AM, Richard Cochran wrote:
> On Wed, Aug 30, 2017 at 02:47:45PM -0700, David Miller wrote:
>> It should not be required to disable a Kconfig option just to get PHY
>> timestamping to work properly.
> 
> Well, if the MAC driver handles the ioctl and enables time stamping,
> then the PHY driver's time stamping remains disabled.  We don't have a
> way to choose PHY time stamping at run time.
>   
>> Rather, if the CPTS code returns -EOPNOTSUPP we should try to
>> fallthrough to the PHY library based methods.
> 
> I agree that it would be better for the core (rather than the
> individual drivers) to handle this case.

I'd like to clarify one thing here: what is the preferable time-stamping
device, PHY over MAC, or MAC over PHY?
My understanding is that it's PHY, and ethtool_get_ts_info() seems to be
implemented this way already.

> 
> There are a few callers of .ndo_do_ioctl to consider.  Besides
> dev_ifsioc() there is at least vlan_dev_ioctl() that needs to handle
> the EOPNOTSUPP.

Sorry, I've not tried to do a solution in the net core, but the patch below
updates the CPSW driver to select PHY time-stamping over MAC if supported,
without disabling CPTS in Kconfig (at least it's expected to fix it). It is
not tested, as I do not have hardware with a dp83640 phy.

---
>From 51347692087732320f2f5615030f5f36ed3c7724 Mon Sep 17 00:00:00 2001
From: Grygorii Strashko 
Date: Tue, 5 Sep 2017 15:24:44 -0500
Subject: [PATCH] net: ethernet: cpsw: allow phy timestamping over mac

Allow phy timestamping to be used over mac timestamping if supported.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 21 +
 drivers/net/ethernet/ti/cpts.c |  2 ++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 95ac926..8831cb9 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -291,6 +291,10 @@ struct cpsw_ss_regs {
 #define CPSW_MAX_BLKS_TX_SHIFT 4
 #define CPSW_MAX_BLKS_RX   5
 
+#define HAS_PHY_TXTSTAMP(p) ((p) && (p)->drv && (p)->drv->txtstamp)
+#define HAS_PHY_TSTAMP(p) ((p) && (p)->drv && \
+   ((p)->drv->txtstamp || (p)->drv->rxtstamp))
+
 struct cpsw_host_regs {
u32 max_blks;
u32 blk_cnt;
@@ -1600,6 +1604,8 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff 
*skb,
 {
struct cpsw_priv *priv = netdev_priv(ndev);
struct cpsw_common *cpsw = priv->cpsw;
+   int slave_no = cpsw_slave_index(cpsw, priv);
+   struct phy_device *phy = cpsw->slaves[slave_no].phy;
struct cpts *cpts = cpsw->cpts;
struct netdev_queue *txq;
struct cpdma_chan *txch;
@@ -1611,8 +1617,9 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff 
*skb,
return NET_XMIT_DROP;
}
 
-   if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
-   cpts_is_tx_enabled(cpts) && cpts_can_timestamp(cpts, skb))
+   if ((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
+   cpts_can_timestamp(cpts, skb) &&
+   (cpts_is_tx_enabled(cpts) || HAS_PHY_TXTSTAMP(phy)))
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
q_idx = skb_get_queue_mapping(skb);
@@ -1810,20 +1817,26 @@ static int cpsw_ndo_ioctl(struct net_device *dev, 
struct ifreq *req, int cmd)
struct cpsw_priv *priv = netdev_priv(dev);
struct cpsw_common *cpsw = priv->cpsw;
int slave_no = cpsw_slave_index(cpsw, priv);
+   struct phy_device *phy = cpsw->slaves[slave_no].phy;
+
 
if (!netif_running(dev))
return -EINVAL;
 
switch (cmd) {
case SIOCSHWTSTAMP:
+   if (HAS_PHY_TSTAMP(phy))
+   break;
return cpsw_hwtstamp_set(dev, req);
case SIOCGHWTSTAMP:
+   if (HAS_PHY_TSTAMP(phy))
+   break;
return cpsw_hwtstamp_get(dev, req);
}
 
-   if (!cpsw->slaves[slave_no].phy)
+   if (!phy)
return -EOPNOTSUPP;
-   return phy_mii_ioctl(cpsw->slaves[slave_no].phy, req, cmd);
+   return phy_mii_ioctl(phy, req, cmd);
 }
 
 static void cpsw_ndo_tx_timeout(struct net_device *ndev)
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index c2121d2..f257f54 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -421,6 +421,8 @@ void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff 
*skb)
u64 ns;
struct skb_shared_hwtstamps ssh;
 
+   if (!cpts->tx_enable)
+   return;
if (!(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
return;
ns = cpts_find_ts(cpts, skb, CPTS_EV_TX);
-- 
2.10.1


-- 
regards,
-grygorii


Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro

2017-09-05 Thread David Miller
From: Joe Perches 
Date: Tue, 05 Sep 2017 13:01:18 -0700

> On Tue, 2017-09-05 at 21:45 +0200, Thomas Meyer wrote:
>> On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote:
>> > From: Thomas Meyer 
>> > Date: Sun, 03 Sep 2017 14:19:31 +0200
>> > 
>> > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
>> > > yourself.
>> > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
>> > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
>> > > and manual check/verification.
>> > > 
>> > > Signed-off-by: Thomas Meyer 
>> > 
>> > This should be submitted to the Intel ethernet driver maintainers.
>> 
>> Hi,
>> 
>> my script checks the output of the get_maintainer script and only sends to
>> "open list" entries.
>> 
>> The intel-wired-...@lists.osuosl.org list is moderated, so that's why the
>> patch wasn't sent there.
>> 
>> Strangely, the lists for nouv...@lists.freedesktop.org and
>> intel-gvt-...@lists.freedesktop.org appear as open lists in the MAINTAINERS
>> file but seem to also be moderated lists... At least I got some replies that
>> my message awaits approval. Maybe an update to the MAINTAINERS file is
>> missing here?
>> 
>> I may drop the above check in my script and send to all mailing lists that
>> get_maintainer.pl returns.
> 
> There's a difference between moderated and subscriber-only
> entries in MAINTAINERS.
> 
> get_maintainers will by default list moderated lists and
> not show subscriber-only lists unless using the -s switch.

Furthermore, nothing prevented you from CC:'ing the maintainer,
Jeff Kirsher.


Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-05 Thread Tom Herbert
> The situation with encapsulation is even more complicated:
>
> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
> constellation. If we do the fragmentation inside the vxlan tunnel and
> carry over the skb hash to all resulting UDP/vxlan packets source ports,
> we are fine and reordering on the receiver NIC won't happen in this
> case. If the fragmentation happens on the outer UDP header, this will
> result in reordering of the inner L2 flow. Unfortunately this depends on
> how the vxlan tunnel was set up, how other devices do that and (I
> believe so) on the kernel version.
>
This really isn't that complicated. The assumption that an IP network
always delivers packets in order is simply wrong. The inventors of
VXLAN must have known full well that when you use IP, packets can and
eventually will be delivered out of order. This isn't just because of
fragmentation, there are many other reasons that packets can be
delivered OOO. This also must have been known when IP/GRE and any
other protocol that carries L2 over IP was invented. If OOO is an
issue for these protocols then they need to be fixed-- this is not a
concern with IP protocol nor the stack.

Tom


Re: [PATCH net-next 2/4] ip/options: explicitly provide net ns to __ip_options_echo()

2017-09-05 Thread Paolo Abeni
On Tue, 2017-09-05 at 10:18 -0700, Eric Dumazet wrote:
> On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote:
> > __ip_options_echo() uses the current network namespace, and
> > currently retrives it via skb->dst->dev.
> > 
> > This commit adds an explicit 'net' argument to __ip_options_echo()
> > and update all the call sites to provide it, usually via a simpler
> > sock_net().
> > 
> > After this change, __ip_options_echo() no more needs to access
> > skb->dst and we can drop a couple of hack to preserve such
> > info in the rx path.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> 
> David, Paolo
> 
> This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options:
> explicitly provide net ns to __ip_options_echo()")
> 
> needs to be backported to linux-4.13 stable version to avoid these kind
> of crashes [1]
> 
> This is because of MSG_PEEK operation, hitting skb_consume_udp() while
> skb is still in receive queue.
> 
> Next read() finding again the skb then can see a NULL skb->dst
> 
> Thanks !
> 
> [1]
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> task: 8801cd0a4380 task.stack: 8801cc498000
> RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143
> RSP: 0018:8801cc49f628 EFLAGS: 00010246
> RAX: dc00 RBX: 8801cc49f928 RCX: 
> RDX:  RSI: 0001 RDI: 0004
> RBP: 8801cc49f6b8 R08: 8801cc49f936 R09: ed0039893f28
> R10: 0003 R11: ed0039893f27 R12: 8801cc49f918
> R13: 8801ccbcf36c R14: 000f R15: 0018
> FS:  00979880() GS:8801db20()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 200c0ff0 CR3: 0001cc4ed000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  ip_options_echo include/net/ip.h:574 [inline]
>  ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline]
>  ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207
>  udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641
>  inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
>  sock_recvmsg_nosec net/socket.c:792 [inline]
>  sock_recvmsg+0xc9/0x110 net/socket.c:799
>  SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788
>  SyS_recvfrom+0x40/0x50 net/socket.c:1760
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x444c89
> RSP: 002b:7ffd80c788e8 EFLAGS: 0286 ORIG_RAX: 002d
> RAX: ffda RBX:  RCX: 00444c89
> RDX:  RSI: 20bc RDI: 0004
> RBP: 0082 R08: 200c0ff0 R09: 0010
> R10:  R11: 0286 R12: 00402390
> R13: 00402420 R14:  R15: 
> Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8
> 48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c
> 02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc 
> RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP:
> 8801cc49f628
> ---[ end trace b30d95b284222843 ]---
> Kernel panic - not syncing: Fatal exception

Thank you Eric for the report! 

Darn me, I seriously messed up with the stateless consume.

I think we can have similar issues with ipsec/secpath and MSG_PEEK,
even if with a less catastrophic outcome.

What about the following, which should cover both cases? (only compile
tested, I'll test it tomorrow morning my time)
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d67a8182e5eb..63df75ae70ee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -885,7 +885,7 @@ void kfree_skb(struct sk_buff *skb);
 void kfree_skb_list(struct sk_buff *segs);
 void skb_tx_error(struct sk_buff *skb);
 void consume_skb(struct sk_buff *skb);
-void consume_stateless_skb(struct sk_buff *skb);
+void __consume_stateless_skb(struct sk_buff *skb);
 void  __kfree_skb(struct sk_buff *skb);
 extern struct kmem_cache *skbuff_head_cache;
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e07556606284..f2411a8744d7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -753,14 +753,11 @@ EXPORT_SYMBOL(consume_skb);
  * consume_stateless_skb - free an skbuff, assuming it is stateless
  * @skb: buffer to free
  *
- * Works like consume_skb(), but this variant assumes that all the head
- * states have been already dropped.
+ * Alike consume_skb(), but this variant assumes that all the head
+ * states have been already dropped and usage count is one
  */
-void consume_stateless_skb(struct sk_buff *skb)
+void __consume_stateless_skb(struct sk_buff 

Re: [PATCH 2/2] b43legacy: fix uninitialized reads of ret by initializing the array to zero

2017-09-05 Thread Michael Büsch
On Tue,  5 Sep 2017 19:16:58 +0100
Colin King  wrote:

> From: Colin Ian King 
> 
> The u8 char array ret is not being initialized and elements outside
> the range start to end contain just garbage values from the stack.
> This results in a later scan of the array to read potentially
> uninitialized values.  Fix this by initializing the array to zero.
> This seems to have been an issue since the very first commit.
> 
> Detected by CoverityScan CID#139653 ("Uninitialized scalar variable")
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/wireless/broadcom/b43legacy/radio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/broadcom/b43legacy/radio.c 
> b/drivers/net/wireless/broadcom/b43legacy/radio.c
> index 9501420340a9..eab1c9387846 100644
> --- a/drivers/net/wireless/broadcom/b43legacy/radio.c
> +++ b/drivers/net/wireless/broadcom/b43legacy/radio.c
> @@ -280,7 +280,7 @@ u8 b43legacy_radio_aci_detect(struct b43legacy_wldev 
> *dev, u8 channel)
>  u8 b43legacy_radio_aci_scan(struct b43legacy_wldev *dev)
>  {
>   struct b43legacy_phy *phy = >phy;
> - u8 ret[13];
> + u8 ret[13] = { 0 };
>   unsigned int channel = phy->channel;
>   unsigned int i;
>   unsigned int j;


This fix seems to be correct.
Thanks for finding and fixing the issue.

Reviewed-by: Michael Buesch 



-- 
Michael




Re: [PATCH 1/2] b43: fix uninitialized reads of ret by initializing the array to zero

2017-09-05 Thread Michael Büsch
On Tue,  5 Sep 2017 19:15:50 +0100
Colin King  wrote:

> From: Colin Ian King 
> 
> The u8 char array ret is not being initialized and elements outside
> the range start to end contain just garbage values from the stack.
> This results in a later scan of the array to read potentially
> uninitialized values.  Fix this by initializing the array to zero.
> This seems to have been an issue since the very first commit.
> 
> Detected by CoverityScan CID#139652 ("Uninitialized scalar variable")
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/wireless/broadcom/b43/phy_g.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/broadcom/b43/phy_g.c 
> b/drivers/net/wireless/broadcom/b43/phy_g.c
> index 822dcaa8ace6..f59c02166462 100644
> --- a/drivers/net/wireless/broadcom/b43/phy_g.c
> +++ b/drivers/net/wireless/broadcom/b43/phy_g.c
> @@ -2297,7 +2297,7 @@ static u8 b43_gphy_aci_detect(struct b43_wldev *dev, u8 
> channel)
>  static u8 b43_gphy_aci_scan(struct b43_wldev *dev)
>  {
>   struct b43_phy *phy = >phy;
> - u8 ret[13];
> + u8 ret[13] = { 0 };
>   unsigned int channel = phy->channel;
>   unsigned int i, j, start, end;
>  


This fix seems to be correct.
Thanks for finding and fixing the issue.

Reviewed-by: Michael Buesch 


-- 
Michael




[PATCH] e1000e: changed some expensive calls of udelay to usleep_range

2017-09-05 Thread Kardonik Michael
Calls to udelay are not preemptible by userspace, so userspace
applications experience a large (~200us) latency when running on core0.
Instead, usleep_range can be used to be more friendly to userspace,
since it is preemptible. This is because udelay uses busy-wait loops
while usleep_range uses hrtimers. It is recommended to use
udelay when the delay is <10us, since at that precision the overhead of
usleep_range's hrtimer setup causes issues. However, the replaced calls
are for 50us and 100us, so this should not be an issue.
There is no open bug that this patch is fixing, but we see a good
boost in zero-loss performance of a specific user space application
(dpdk l3fwd) when this patch is applied: we go from 32% of 10Gb line
rate to 49%.

Signed-off-by: Matthew Tan 
Signed-off-by: Michael Kardonik 

---
 drivers/net/ethernet/intel/e1000e/phy.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c 
b/drivers/net/ethernet/intel/e1000e/phy.c
index de13aea..e318fdc 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -158,7 +158,7 @@ s32 e1000e_read_phy_reg_mdic(struct e1000_hw *hw, u32 
offset, u16 *data)
 * the lower time out
 */
for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
-   udelay(50);
+   usleep_range(50, 60);
mdic = er32(MDIC);
if (mdic & E1000_MDIC_READY)
break;
@@ -183,7 +183,7 @@ s32 e1000e_read_phy_reg_mdic(struct e1000_hw *hw, u32 
offset, u16 *data)
 * reading duplicate data in the next MDIC transaction.
 */
if (hw->mac.type == e1000_pch2lan)
-   udelay(100);
+   usleep_range(100, 110);
 
return 0;
 }
@@ -222,7 +222,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 
offset, u16 data)
 * the lower time out
 */
for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
-   udelay(50);
+   usleep_range(50, 60);
mdic = er32(MDIC);
if (mdic & E1000_MDIC_READY)
break;
@@ -246,7 +246,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 
offset, u16 data)
 * reading duplicate data in the next MDIC transaction.
 */
if (hw->mac.type == e1000_pch2lan)
-   udelay(100);
+   usleep_range(100, 110);
 
return 0;
 }
-- 
2.7.4


Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro

2017-09-05 Thread Joe Perches
On Tue, 2017-09-05 at 21:45 +0200, Thomas Meyer wrote:
> On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote:
> > From: Thomas Meyer 
> > Date: Sun, 03 Sep 2017 14:19:31 +0200
> > 
> > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
> > > yourself.
> > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
> > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
> > > and manual check/verification.
> > > 
> > > Signed-off-by: Thomas Meyer 
> > 
> > This should be submitted to the Intel ethernet driver maintainers.
> 
> Hi,
> 
> my script checks the output of the get_maintainer script and only sends to
> "open list" entries.
> 
> The intel-wired-...@lists.osuosl.org list is moderated, so that's why the
> patch wasn't sent there.
> 
> Strangely, the lists for nouv...@lists.freedesktop.org and
> intel-gvt-...@lists.freedesktop.org appear as open lists in the MAINTAINERS
> file but seem to also be moderated lists... At least I got some replies that
> my message awaits approval. Maybe an update to the MAINTAINERS file is
> missing here?
> 
> I may drop the above check in my script and send to all mailing lists that
> get_maintainer.pl returns.

There's a difference between moderated and subscriber-only
entries in MAINTAINERS.

get_maintainers will by default list moderated lists and
not show subscriber-only lists unless using the -s switch.



[PATCH][next v2] rocker: fix kcalloc parameter order

2017-09-05 Thread Zahari Doychev
The function calls to kcalloc use wrong parameter order and incorrect flags
values. GFP_KERNEL is used instead of flags now and the order is corrected.

The change was done using the following coccinelle script:

@@
expression E1,E2;
type T;
@@

-kcalloc(E1, E2, sizeof(T))
+kcalloc(E2, sizeof(T), GFP_KERNEL)

Signed-off-by: Zahari Doychev 
---
 drivers/net/ethernet/rocker/rocker_ofdpa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c 
b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index da4e26b53a52..0653b70723a3 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1177,7 +1177,7 @@ static int ofdpa_group_l2_fan_out(struct ofdpa_port 
*ofdpa_port,
entry->group_id = group_id;
entry->group_count = group_count;
 
-   entry->group_ids = kcalloc(flags, group_count, sizeof(u32));
+   entry->group_ids = kcalloc(group_count, sizeof(u32), GFP_KERNEL);
if (!entry->group_ids) {
kfree(entry);
return -ENOMEM;
@@ -1456,7 +1456,7 @@ static int ofdpa_port_vlan_flood_group(struct ofdpa_port 
*ofdpa_port,
int err = 0;
int i;
 
-   group_ids = kcalloc(flags, port_count, sizeof(u32));
+   group_ids = kcalloc(port_count, sizeof(u32), GFP_KERNEL);
if (!group_ids)
return -ENOMEM;
 
-- 
2.13.0



Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro

2017-09-05 Thread Thomas Meyer
On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote:
> From: Thomas Meyer 
> Date: Sun, 03 Sep 2017 14:19:31 +0200
> 
> > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
> > yourself.
> > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
> > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
> > and manual check/verification.
> > 
> > Signed-off-by: Thomas Meyer 
> 
> This should be submitted to the Intel ethernet driver maintainers.
Hi,

my script checks the output of the get_maintainer script and only sends to
"open list" entries.

The intel-wired-...@lists.osuosl.org list is moderated, so that's why the
patch wasn't sent there.

Strangely, the lists for nouv...@lists.freedesktop.org and
intel-gvt-...@lists.freedesktop.org appear as open lists in the MAINTAINERS
file but seem to also be moderated lists... At least I got some replies that
my message awaits approval. Maybe an update to the MAINTAINERS file is
missing here?

I may drop the above check in my script and send to all mailing lists that
get_maintainer.pl returns.

> 
> Thank you.


Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage

2017-09-05 Thread Thomas Gleixner
On Tue, 5 Sep 2017, Sebastian Andrzej Siewior wrote:

> On 2017-09-05 09:12:40 [+0200], Thomas Gleixner wrote:
> > Sorry, no. That bisect is completely bogus. The commit in question merely
> > replaces the unsupported clockid with a valid one.
> 
> The bisect is correct. It just has trouble expressing itself properly. So
> the table says:
> 
> | WARNING:at_kernel/time/hrtimer.c:#hrtimer_init  | 7   | ||   |  
>   
> | BUG:kernel_reboot-without-warning_in_test_stage | 0   | 12  | 6  | 6 |  
>   
> 
> which means _before_ your commit it counted a warning in hrtimer_init()
> (an unsupported clock id was used). With your commit, the warning was
> gone and I *think* the userland then printed
> "BUG:kernel_reboot-without-warning_in_test_stage" because it had no
> warning.
> It seems that the bot learned to live with that warning which was around
> for more than three years. Now that you removed it, it seems to be a
> mistake to do so because nobody complained about it so far.

Thanks for the translation. I'll never learn to decode these reports.

   tglx


Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-05 Thread Hannes Frederic Sowa
Hi Tom,

Tom Herbert  writes:

> On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
>  wrote:
>> Tom Herbert  writes:
>>
>>> There is absolutely no requirement in IP that packets are delivered in
>>> order-- there never has been and there never will be! If the ULP, like
>>> Ethernet encapsulation, requires in order deliver then it needs to
>>> implement that itself like TCP, GRE, and other protocols ensure that
>>> with sequence numbers and reassembly. All of these hoops we do make
>>> sure that packets always follow the same path and are always in order
>>> are done for benefit of middlebox devices like stateful firewalls that
>>> have force us to adopt their concept of network architecture-- in the
>>> long run this is self-defeating and kills our ability to innovate.
>>>
>>> I'm not saying that we shouldn't consider legacy devices, but we
>>> should scrutinize new development or solutions that perpetuate
>>> incorrect design or bad assumptions.
>>
>> So configure RSS per port and ensure no fragments are send to those
>> ports. This is possible and rather easy to do. It solves the problem
>> with legacy software and it spreads out packets for your applications.
>>
>> It is not perfect but it is working and solves both problems.
>>
> Hannes,
>
> I don't see how that solves anything. The purpose of RSS is to
> distribute the load of protocol packets across queues. This needs to
> work for UDP applications. For instance, if I were building a QUIC
> server I'd want the sort of flow distribution that a TCP server would
> give. You can't do that by configuring a few ports in the device.

I seriously am with you and I think you are on the right track. I would
also much rather like to see fragmentation *not* happening and would
like to see hardware logic *not* parsing deep down into packets and
protocols. We can agree on that!

Albeit I don't understand the problem with my approach:

If you know that you are running a QUIC server and know your environment
(safe against reordering), run `ethtool -N ethX rx-flow-hash udp4 sdfn'
on the box and you would get spreading of UDP packets based on the UDP
src+dst port numbers to the rx queues. Otherwise only source and
destination addresses are being considered, which is the safe default
(maybe this is sometimes enough).

Intel network cards warn you if you actually switch the mode:

-- >8 --
drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c:   "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/igb/igb_ethtool.c:   "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c:   e_warn(drv, "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
-- 8< --

I argue that we can't change the default.

Furthermore I tried to argue that in a cloud where you don't know which
applications you run within the encapsulated networks, it is very hard
to predict which protocols can be considered safe against those
effects. I agree, that most of the time, it probably should not
matter. But given the rule about kernel backwards compatibility I do
have doubts that this change would go unnoticed.

I am the devil's advocate here and argue it is on us to show that it is
safe before we change the semantics.

> If I were to suggest any HW change it would be to not do DPI on
> fragments (MF or offset is set). This ensures that all packets of the
> fragment train get hashed to the same queue and is in fact what RPS
> has been doing for years without any complaints.

The same UDP flow could consist of fragments and non-fragmented
packets. Even if hardware would act according to what you wrote,
reordering is very much possible within the same UDP flow.

In case of QUIC, as far as I know, it uses destination ports 80 and
443. Like we already notify NICs about vxlan and geneve port numbers, we
could selectively enable RSS for UDP ports on those destination port
numbers to tell the NIC it won't receive fragments on those ports and
thus do RSS. The same way the NIC could be notified if reordering on
this port is possible. This could be done for example via a lightweight
setsockopt.

The situation with encapsulation is even more complicated:

We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
constellation. If we do the fragmentation inside the vxlan tunnel and
carry over the skb hash to all resulting UDP/vxlan packets' source ports,
we are fine and reordering on the receiver NIC won't happen in this
case. If the fragmentation happens on the outer UDP header, this will
result in reordering of the inner L2 flow. Unfortunately this depends on
how the vxlan tunnel was set up, how other devices do that and (I
believe so) on the kernel version.

Given the fact that we tell the 

Re: [PATCH net-next v2 RESEND 0/4] net: dsa: Allow switch drivers to indicate number of TX queues

2017-09-05 Thread David Miller
From: Florian Fainelli 
Date: Sun,  3 Sep 2017 20:26:59 -0700

> This patch series extracts the parts of the patch set that are likely not to be
> controversial and actually bringing multi-queue support to DSA-created network
> controversial and actually bringing multi-queue support to DSA-created network
> devices.
> 
> With these patches, we can now use sch_multiq as documented under
> Documentation/networking/multiqueue.txt and let applications decide the switch
> port output queue they want to use. Currently only Broadcom tags utilize that
> information.
> 
> Resending based on David's feedback regarding the patches not in patchwork.
> 
> Changes in v2:
> - use a proper define for the number of TX queues in bcm_sf2.c (Andrew)
> 
> Changes from RFC:
> 
> - dropped the ability to configure RX queues since we don't do anything with
>   those just yet
> - dropped the patches that dealt with binding the DSA slave network devices
>   queues with their master network devices queues; this will be worked on
>   separately.

Series applied, thanks.


Re: [PATCH net-next] bridge: switchdev: Use an helper to clear forward mark

2017-09-05 Thread David Miller
From: Ido Schimmel 
Date: Sun,  3 Sep 2017 17:44:13 +0300

> Instead of using ifdef in the C file.
> 
> Signed-off-by: Ido Schimmel 
> Suggested-by: Nikolay Aleksandrov 
> Tested-by: Yotam Gigi 

Applied.


Re: [PATCH 7/10] net/mlx4_core: Use ARRAY_SIZE macro

2017-09-05 Thread David Miller
From: Thomas Meyer 
Date: Sun, 03 Sep 2017 14:19:31 +0200

> Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
> yourself.
> Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
> 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
> and manual check/verification.
> 
> Signed-off-by: Thomas Meyer 

Applied.


Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro

2017-09-05 Thread David Miller
From: Thomas Meyer 
Date: Sun, 03 Sep 2017 14:19:31 +0200

> Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
> yourself.
> Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
> 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
> and manual check/verification.
> 
> Signed-off-by: Thomas Meyer 

This should be submitted to the Intel ethernet driver maintainers.

Thank you.


[PATCH v2] radix-tree: must check __radix_tree_preload() return value

2017-09-05 Thread Eric Dumazet
From: Eric Dumazet 

__radix_tree_preload() only disables preemption if no error is returned.

So we really need to make sure callers always check the return value.

idr_preload() contract is to always disable preemption, so we need
to add a missing preempt_disable() if an error happened.

Similarly, ida_pre_get() only needs to call preempt_enable() in the
case no error happened.

Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
Signed-off-by: Eric Dumazet 
Cc: [4.11+]
---
v2: addressed Linus feedback, not adding useless/confusing 'ret' variables.

 lib/radix-tree.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3527eb364964..afb3cb4d44b6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 {
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -2104,7 +2104,8 @@ EXPORT_SYMBOL(radix_tree_tagged);
  */
 void idr_preload(gfp_t gfp_mask)
 {
-   __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
+   if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE))
+   preempt_disable();
 }
 EXPORT_SYMBOL(idr_preload);
 
@@ -2118,13 +2119,13 @@ EXPORT_SYMBOL(idr_preload);
  */
 int ida_pre_get(struct ida *ida, gfp_t gfp)
 {
-   __radix_tree_preload(gfp, IDA_PRELOAD_SIZE);
/*
 * The IDA API has no preload_end() equivalent.  Instead,
 * ida_get_new() can return -EAGAIN, prompting the caller
 * to return to the ida_pre_get() step.
 */
-   preempt_enable();
+   if (!__radix_tree_preload(gfp, IDA_PRELOAD_SIZE))
+   preempt_enable();
 
if (!this_cpu_read(ida_bitmap)) {
struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);




Re: [PATCH v2 net-next 0/2] flow_dissector: Flow dissector fixes

2017-09-05 Thread David Miller
From: Tom Herbert 
Date: Fri,  1 Sep 2017 14:04:10 -0700

> This patch set fixes some basic issues with __skb_flow_dissect function.
> 
> Items addressed:
>   - Cleanup control flow in the function; in particular eliminate a
> bunch of goto's and implement a simplified control flow model
>   - Add limits for number of encapsulations and headers that can be
> dissected
> 
> v2:
>   - Simplify the logic for limits on flow dissection. Just set the
> limit based on the number of headers the flow dissector can
> process. The accounted headers include encapsulation headers,
> extension headers, or other shim headers.
> 
> Tested:
> 
> Ran normal traffic, GUE, and VXLAN traffic.

Applied, thanks Tom.


Re: [PATCH] radix-tree: must check __radix_tree_preload() return value

2017-09-05 Thread Eric Dumazet
On Tue, 2017-09-05 at 11:07 -0700, Linus Torvalds wrote:
> On Tue, Sep 5, 2017 at 10:59 AM, Eric Dumazet  wrote:
> > @@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged);
> >   */
> >  void idr_preload(gfp_t gfp_mask)
> >  {
> > -   __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
> > +   int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
> > +
> > +   if (ret)
> > +   preempt_disable();
> >  }
> >  EXPORT_SYMBOL(idr_preload);
> 
> Is there a reason for the "ret" variable that is entirely mis-named,
> since it's never actually used as a return value?
> 
> (Sure. it's the return value of a function, but that is entirely
> useless and pointless information, and adds no value. We should name
> variables by the data they contain or how they are used, not by "it
> was the return value of a function").
> 
> In other words, why isn't this just
> 
> if (__radix_tree_preload(..))
> preempt_disable();
> 
> which is shorter and clearer and not confusing?


...

> Same issue, but this time strengthened by an additional "why doesn't
> this just use that idr_preload function then?" question..
> 

Yep, I will send a v2, thanks.





Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Nikolay Aleksandrov

On 9/5/17 9:18 PM, Cong Wang wrote:

> On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
>  wrote:
>> Hi all,
>> This RFC adds a new mode for clsact which designates a device's egress
>> classifier as global per netns. The packets that are not classified for
>> a particular device will be classified using the global classifier.
>> We have needed a global classifier for some time now for various
>> purposes and setting the single bridge or loopback/vrf device as the
>> global classifier device is acceptable for us. Doing it this way avoids
>> the act/cls device and queue dependencies.
>>
>> This is strictly an RFC patch just to show the intent, if we agree on
>> the details the proposed patch will have support for both ingress and
>> egress, and will be using a static key to avoid the fast path test when no
>> global classifier has been configured.
>>
>> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
>> $ tc qdisc add dev lo clsact global
>> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action drop
>>
>> the last filter will be global for all devices that don't have a
>> specific egress_cl_list (i.e. have clsact configured).
>
> Sorry this is too ugly.
>
> netdevice is still implied in your command line even if you treat it
> as global. It is essentially hard to bypass netdevice layer since
> netdevice is the core of L2 and also where everything begins.

This is only a quick RFC, that can be removed entirely if we limit it to the
netns and its loopback device. Then we can drop the "device" keyword altogether.

> Maybe the best we can do here is make tc filters standalone
> as tc actions so that filters can exist before qdisc's and netdevices.
> But this probably requires significant works to make it working
> with both existing non-standalone and bindings standalones
> with qdisc's.

We've actually been discussing this option internally as well.
I think we'll look into doing that regardless of this patch.

> Note I don't look deeply into this, just one thought, at least this
> appears less ugly than yours.

What I did was aimed at simplicity and is merely a mode of clsact which
doesn't have an impact if not configured. Every other solution requires
a much more invasive change, note that doesn't mean I don't agree. :-)



Re: [RFC net-next] net: sch_clsact: add support for global per-netns classifier mode

2017-09-05 Thread Cong Wang
On Tue, Sep 5, 2017 at 5:48 AM, Nikolay Aleksandrov
 wrote:
> Hi all,
> This RFC adds a new mode for clsact which designates a device's egress
> classifier as global per netns. The packets that are not classified for
> a particular device will be classified using the global classifier.
> We have needed a global classifier for some time now for various
> purposes and setting the single bridge or loopback/vrf device as the
> global classifier device is acceptable for us. Doing it this way avoids
> the act/cls device and queue dependencies.
>
> This is strictly an RFC patch just to show the intent, if we agree on
> the details the proposed patch will have support for both ingress and
> egress, and will be using a static key to avoid the fast path test when no
> global classifier has been configured.
>
> Example (need a modified tc that adds TCA_OPTIONS when using q_clsact):
> $ tc qdisc add dev lo clsact global
> $ tc filter add dev lo egress protocol ip u32 match ip dst 4.3.2.1/32 action 
> drop
>
> the last filter will be global for all devices that don't have a
> specific egress_cl_list (i.e. have clsact configured).

Sorry this is too ugly.

netdevice is still implied in your command line even if you treat it
as global. It is essentially hard to bypass netdevice layer since
netdevice is the core of L2 and also where everything begins.

Maybe the best we can do here is make tc filters standalone
as tc actions so that filters can exist before qdisc's and netdevices.
But this probably requires significant works to make it working
with both existing non-standalone and bindings standalones
with qdisc's.

Note I don't look deeply into this, just one thought, at least this
appears less ugly than yours.


[PATCH 3/3] drivers: net: xgene: Remove return statement from void function

2017-09-05 Thread Iyappan Subramanian
commit 183db4 ("drivers: net: xgene: Correct probe sequence handling")
changed the return type of xgene_enet_check_phy_handle() to void.

This patch removes the return statement from the last line.

Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 655c0fc..3b889ef 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -1674,8 +1674,6 @@ static void xgene_enet_check_phy_handle(struct 
xgene_enet_pdata *pdata)
ret = xgene_enet_phy_connect(pdata->ndev);
if (!ret)
pdata->mdio_driver = true;
-
-   return;
 }
 
 static void xgene_enet_gpiod_get(struct xgene_enet_pdata *pdata)
-- 
2.7.4



[PATCH 1/3] drivers: net: xgene: Read tx/rx delay for ACPI

2017-09-05 Thread Iyappan Subramanian
This patch fixes reading tx/rx delay values for ACPI.

Signed-off-by: Iyappan Subramanian 
Signed-off-by: Quan Nguyen 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 6e253d9..655c0fc 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -1591,7 +1591,7 @@ static int xgene_get_tx_delay(struct xgene_enet_pdata 
*pdata)
struct device *dev = &pdata->pdev->dev;
int delay, ret;
 
-   ret = of_property_read_u32(dev->of_node, "tx-delay", );
+   ret = device_property_read_u32(dev, "tx-delay", );
if (ret) {
pdata->tx_delay = 4;
return 0;
@@ -1612,7 +1612,7 @@ static int xgene_get_rx_delay(struct xgene_enet_pdata 
*pdata)
struct device *dev = &pdata->pdev->dev;
int delay, ret;
 
-   ret = of_property_read_u32(dev->of_node, "rx-delay", );
+   ret = device_property_read_u32(dev, "rx-delay", );
if (ret) {
pdata->rx_delay = 2;
return 0;
-- 
2.7.4



[PATCH 2/2] b43legacy: fix unitialized reads of ret by initializing the array to zero

2017-09-05 Thread Colin King
From: Colin Ian King 

The u8 char array ret is not being initialized and elements outside
the range start to end contain just garbage values from the stack.
This results in a later scan of the array to read potentially
uninitialized values.  Fix this by initializing the array to zero.
This seems to have been an issue since the very first commit.

Detected by CoverityScan CID#139653 ("Uninitialized scalar variable")

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/broadcom/b43legacy/radio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/b43legacy/radio.c 
b/drivers/net/wireless/broadcom/b43legacy/radio.c
index 9501420340a9..eab1c9387846 100644
--- a/drivers/net/wireless/broadcom/b43legacy/radio.c
+++ b/drivers/net/wireless/broadcom/b43legacy/radio.c
@@ -280,7 +280,7 @@ u8 b43legacy_radio_aci_detect(struct b43legacy_wldev *dev, 
u8 channel)
 u8 b43legacy_radio_aci_scan(struct b43legacy_wldev *dev)
 {
struct b43legacy_phy *phy = &dev->phy;
-   u8 ret[13];
+   u8 ret[13] = { 0 };
unsigned int channel = phy->channel;
unsigned int i;
unsigned int j;
-- 
2.14.1



[PATCH 0/3] drivers: net: xgene: Misc bug fixes

2017-09-05 Thread Iyappan Subramanian
This patch set fixes bugs related to handling the case for ACPI for,
reading and programming tx/rx delay values.

Signed-off-by: Iyappan Subramanian 
---

Iyappan Subramanian (2):
  drivers: net: xgene: Read tx/rx delay for ACPI
  drivers: net: xgene: Remove return statement from void function

Quan Nguyen (1):
  drivers: net: xgene: Configure tx/rx delay for ACPI

 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c   | 7 ++-
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 6 ++
 2 files changed, 4 insertions(+), 9 deletions(-)

-- 
2.7.4



[PATCH 2/3] drivers: net: xgene: Configure tx/rx delay for ACPI

2017-09-05 Thread Iyappan Subramanian
From: Quan Nguyen 

This patch fixes configuring tx/rx delay values for ACPI.

Signed-off-by: Quan Nguyen 
Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
index e45b587..3188f55 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
@@ -468,7 +468,6 @@ static void xgene_enet_configure_clock(struct 
xgene_enet_pdata *pdata)
 
 static void xgene_gmac_set_speed(struct xgene_enet_pdata *pdata)
 {
-   struct device *dev = &pdata->pdev->dev;
u32 icm0, icm2, mc2;
u32 intf_ctl, rgmii, value;
 
@@ -500,10 +499,8 @@ static void xgene_gmac_set_speed(struct xgene_enet_pdata 
*pdata)
intf_ctl |= ENET_GHD_MODE;
CFG_MACMODE_SET(&icm0, 2);
CFG_WAITASYNCRD_SET(&icm2, 0);
-   if (dev->of_node) {
-   CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay);
-   CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay);
-   }
+   CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay);
+   CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay);
rgmii |= CFG_SPEED_1250;
 
xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value);
-- 
2.7.4



[PATCH 1/2] b43: fix unitialized reads of ret by initializing the array to zero

2017-09-05 Thread Colin King
From: Colin Ian King 

The u8 char array ret is not being initialized and elements outside
the range start to end contain just garbage values from the stack.
This results in a later scan of the array to read potentially
uninitialized values.  Fix this by initializing the array to zero.
This seems to have been an issue since the very first commit.

Detected by CoverityScan CID#139652 ("Uninitialized scalar variable")

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/broadcom/b43/phy_g.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/b43/phy_g.c 
b/drivers/net/wireless/broadcom/b43/phy_g.c
index 822dcaa8ace6..f59c02166462 100644
--- a/drivers/net/wireless/broadcom/b43/phy_g.c
+++ b/drivers/net/wireless/broadcom/b43/phy_g.c
@@ -2297,7 +2297,7 @@ static u8 b43_gphy_aci_detect(struct b43_wldev *dev, u8 
channel)
 static u8 b43_gphy_aci_scan(struct b43_wldev *dev)
 {
struct b43_phy *phy = &dev->phy;
-   u8 ret[13];
+   u8 ret[13] = { 0 };
unsigned int channel = phy->channel;
unsigned int i, j, start, end;
 
-- 
2.14.1



Re: [PATCH] radix-tree: must check __radix_tree_preload() return value

2017-09-05 Thread Linus Torvalds
On Tue, Sep 5, 2017 at 10:59 AM, Eric Dumazet  wrote:
> @@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged);
>   */
>  void idr_preload(gfp_t gfp_mask)
>  {
> -   __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
> +   int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
> +
> +   if (ret)
> +   preempt_disable();
>  }
>  EXPORT_SYMBOL(idr_preload);

Is there a reason for the "ret" variable that is entirely mis-named,
since it's never actually used as a return value?

(Sure. it's the return value of a function, but that is entirely
useless and pointless information, and adds no value. We should name
variables by the data they contain or how they are used, not by "it
was the return value of a function").

In other words, why isn't this just

if (__radix_tree_preload(..))
preempt_disable();

which is shorter and clearer and not confusing?

> @@ -2118,13 +2121,14 @@ EXPORT_SYMBOL(idr_preload);
>   */
>  int ida_pre_get(struct ida *ida, gfp_t gfp)
>  {
> -   __radix_tree_preload(gfp, IDA_PRELOAD_SIZE);
> +   int ret = __radix_tree_preload(gfp, IDA_PRELOAD_SIZE);
> /*
>  * The IDA API has no preload_end() equivalent.  Instead,
>  * ida_get_new() can return -EAGAIN, prompting the caller
>  * to return to the ida_pre_get() step.
>  */
> -   preempt_enable();
> +   if (!ret)
> +   preempt_enable();

Same issue, but this time strengthened by an additional "why doesn't
this just use that idr_preload function then?" question..

   Linus


[PATCH] radix-tree: must check __radix_tree_preload() return value

2017-09-05 Thread Eric Dumazet
From: Eric Dumazet 

__radix_tree_preload() only disables preemption if no error is returned.

So we really need to make sure callers always check the return value.

idr_preload() contract is to always disable preemption, so we need
to add a missing preempt_disable() if an error happened.

Similarly, ida_pre_get() only needs to call preempt_enable() in the
case no error happened.

Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
Signed-off-by: Eric Dumazet 
Cc: [4.11+]
---
 lib/radix-tree.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3527eb364964..fac702039304 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -463,7 +463,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
+static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 {
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -2104,7 +2104,10 @@ EXPORT_SYMBOL(radix_tree_tagged);
  */
 void idr_preload(gfp_t gfp_mask)
 {
-   __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
+   int ret = __radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE);
+
+   if (ret)
+   preempt_disable();
 }
 EXPORT_SYMBOL(idr_preload);
 
@@ -2118,13 +2121,14 @@ EXPORT_SYMBOL(idr_preload);
  */
 int ida_pre_get(struct ida *ida, gfp_t gfp)
 {
-   __radix_tree_preload(gfp, IDA_PRELOAD_SIZE);
+   int ret = __radix_tree_preload(gfp, IDA_PRELOAD_SIZE);
/*
 * The IDA API has no preload_end() equivalent.  Instead,
 * ida_get_new() can return -EAGAIN, prompting the caller
 * to return to the ida_pre_get() step.
 */
-   preempt_enable();
+   if (!ret)
+   preempt_enable();
 
if (!this_cpu_read(ida_bitmap)) {
struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);




Re: [PATCH net-next 2/4] ip/options: explicitly provide net ns to __ip_options_echo()

2017-09-05 Thread Eric Dumazet
On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote:
> __ip_options_echo() uses the current network namespace, and
> currently retrieves it via skb->dst->dev.
> 
> This commit adds an explicit 'net' argument to __ip_options_echo()
> and update all the call sites to provide it, usually via a simpler
> sock_net().
> 
> After this change, __ip_options_echo() no longer needs to access
> skb->dst and we can drop a couple of hacks to preserve such
> info in the rx path.
> 
> Signed-off-by: Paolo Abeni 
> ---

David, Paolo

This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options:
explicitly provide net ns to __ip_options_echo()")

needs to be backported to the linux-4.13 stable version to avoid this
kind of crash [1]

This is because of MSG_PEEK operation, hitting skb_consume_udp() while
skb is still in receive queue.

Next read() finding again the skb then can see a NULL skb->dst

Thanks !

[1]
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
task: 8801cd0a4380 task.stack: 8801cc498000
RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143
RSP: 0018:8801cc49f628 EFLAGS: 00010246
RAX: dc00 RBX: 8801cc49f928 RCX: 
RDX:  RSI: 0001 RDI: 0004
RBP: 8801cc49f6b8 R08: 8801cc49f936 R09: ed0039893f28
R10: 0003 R11: ed0039893f27 R12: 8801cc49f918
R13: 8801ccbcf36c R14: 000f R15: 0018
FS:  00979880() GS:8801db20()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 200c0ff0 CR3: 0001cc4ed000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 ip_options_echo include/net/ip.h:574 [inline]
 ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline]
 ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207
 udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641
 inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
 sock_recvmsg_nosec net/socket.c:792 [inline]
 sock_recvmsg+0xc9/0x110 net/socket.c:799
 SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788
 SyS_recvfrom+0x40/0x50 net/socket.c:1760
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x444c89
RSP: 002b:7ffd80c788e8 EFLAGS: 0286 ORIG_RAX: 002d
RAX: ffda RBX:  RCX: 00444c89
RDX:  RSI: 20bc RDI: 0004
RBP: 0082 R08: 200c0ff0 R09: 0010
R10:  R11: 0286 R12: 00402390
R13: 00402420 R14:  R15: 
Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8
48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c
02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc 
RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP:
8801cc49f628
---[ end trace b30d95b284222843 ]---
Kernel panic - not syncing: Fatal exception





Re: [PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock

2017-09-05 Thread Eric Dumazet
On Tue, 2017-09-05 at 08:31 -0700, Jakub Kicinski wrote:
> The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
> to:
> 
> [  582.621091] BUG: sleeping function called from invalid context at 
> ../mm/slab.h:416
> [  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
> [  582.636939] 2 locks held by tc/3379:
> [  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [] 
> rtnetlink_rcv_msg+0x92e/0x1400
> [  582.650958]  #1:  (&(>idrinfo->lock)->rlock){+.-.+.}, at: 
> [] tcf_idr_create+0x2f0/0x8e0
> [  582.662217] Preemption disabled at:
> [  582.66] [] tcf_idr_create+0x2f0/0x8e0
> [  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: GW   
> 4.13.0-rc7-debug-00648-g43503a79b9f0 #287
> [  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
> 11/08/2016
> [  582.691937] Call Trace:
> ...
> [  582.742460]  kmem_cache_alloc+0x286/0x540
> [  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
> [  582.753209]  idr_get_free_cmn+0x627/0xf80
> ...
> [  582.815525]  idr_alloc_cmn+0x1a8/0x270
> ...
> [  582.833804]  tcf_idr_create+0x31b/0x8e0
> ...
> 
> Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
> (as suggested by Eric Dumazet), and change the allocation 
> flags under spin lock.
> 
> Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use 
> IDR")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 
> ---


Acked-by: Eric Dumazet 

Not related to your patch, but it looks like idr_preload() has a bug
added in 4.11.  I will send a fix to lkml.






[ANNOUNCE] Iproute2 for Linux 4.13

2017-09-05 Thread Stephen Hemminger

Update to iproute2 utility to support new features in Linux 4.13.
This is a larger than usual release because of lots of updates for BPF
and the new RDMA utility. Lots of cleanups and Coverity reported
potential issues as well.

Source:
  https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.13.0.tar.gz

Repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---
Alexander Alemayhu (1):
  examples/bpf: update list of examples

Andreas Henriksson (1):
  ss: fix help/man TCP-STATE description for listening

Arkadi Sharshevsky (1):
  bridge: Distinguish between externally learned vs offloaded FDBs

Casey Callendrello (1):
  netns: make /var/run/netns bind-mount recursive

Daniel Borkmann (8):
  bpf: remove obsolete samples
  bpf: support loading map in map from obj
  bpf: dump id/jited info for cls/act programs
  bpf: improve error reporting around tail calls
  bpf: fix mnt path when from env
  bpf: unbreak libelf linkage for bpf obj loader
  bpf: minor cleanups for bpf_trace_pipe
  bpf: consolidate dumps to use bpf_dump_prog_info

David Ahern (2):
  lib: Dump ext-ack string by default
  libnetlink: Fix extack attribute parsing

Girish Moodalbail (1):
  geneve: support for modifying geneve device

Hangbin Liu (1):
  utils: return default family when rtm_family is not RTNL_FAMILY_IPMR/IP6MR

Jakub Kicinski (3):
  bpf: print xdp offloaded mode
  bpf: add xdpdrv for requesting XDP driver mode
  bpf: allow requesting XDP HW offload

Jiri Benc (2):
  tc: m_tunnel_key: reformat the usage text
  tc: m_tunnel_key: add csum/nocsum option

Jiri Pirko (7):
  tc_filter: add support for chain index
  tc: actions: add helpers to parse and print control actions
  tc/actions: introduce support for goto chain action
  tc: flower: add support for tcp flags
  tc: gact: fix control action parsing
  tc: add support for TRAP action
  tc: don't print error message on miss when parsing action with default

Leon Romanovsky (8):
  utils: Move BIT macro to common header
  rdma: Add basic infrastructure for RDMA tool
  rdma: Add dev object
  rdma: Add link object
  rdma: Add json and pretty outputs
  rdma: Implement json output for dev object
  rdma: Add json output to link object
  rdma: Add initial manual for the tool

Martin KaFai Lau (1):
  bpf: Add support for IFLA_XDP_PROG_ID

Matteo Croce (3):
  tc: fix typo in manpage
  netns: avoid directory traversal
  netns: more input validation

Michal Kubecek (2):
  iplink: check for message truncation in iplink_get()
  iplink: double the buffer size also in iplink_get()

Or Gerlitz (1):
  tc: flower: add support for matching on ip tos and ttl

Phil Sutter (58):
  bpf: Make bytecode-file reading a little more robust
  Really fix get_addr() and get_prefix() error messages
  tc-simple: Fix documentation
  examples: Some shell fixes to cbq.init
  ifcfg: Quote left-hand side of [ ] expression
  tipc/node: Fix socket fd check in cmd_node_get_addr()
  iproute_lwtunnel: Argument to strerror must be positive
  iproute_lwtunnel: csum_mode value checking was ineffective
  ss: Don't leak fd in tcp_show_netlink_file()
  tc/em_ipset: Don't leak sockfd on error path
  ipvrf: Fix error path of vrf_switch()
  ifstat: Fix memleak in error case
  ifstat: Fix memleak in dump_kern_db() for json output
  ss: Fix potential memleak in unix_stats_print()
  tipc/bearer: Fix resource leak in error path
  devlink: No need for this self-assignment
  ipntable: No need to check and assign to parms_rta
  iproute: Fix for missing 'Oifs:' display
  lib/rt_names: Drop dead code in rtnl_rttable_n2a()
  ss: Skip useless check in parse_hostcond()
  ss: Drop useless assignment
  tc/m_gact: Drop dead code
  ipaddress: Avoid accessing uninitialized variable lcl
  iplink_can: Prevent overstepping array bounds
  ipmaddr: Avoid accessing uninitialized data
  ss: Use C99 initializer in netlink_show_one()
  netem/maketable: Check return value of fstat()
  tc/q_multiq: Don't pass garbage in TCA_OPTIONS
  iproute: Check mark value input
  iplink_vrf: Complain if main table is not found
  devlink: Check return code of strslashrsplit()
  lib/bpf: Don't leak fp in bpf_find_mntpt()
  ifstat, nstat: Check fdopen() return value
  tc/q_netem: Don't dereference possibly NULL pointer
  tc/tc_filter: Make sure filter name is not empty
  tipc/bearer: Prevent NULL pointer dereference
  ipntable: Avoid memory allocation for filter.name
  lib/fs: Fix format string in find_fs_mount()
  lib/inet_proto: Review inet_proto_{a2n,n2a}()
  lnstat_util: Simplify alloc_and_open() a bit
  tc/m_xt: Fix for 

Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage

2017-09-05 Thread Sebastian Andrzej Siewior
On 2017-09-05 09:12:40 [+0200], Thomas Gleixner wrote:
> Sorry, no. That bisect is completely bogus. The commit in question merely
> replaces the unsupported clockid with a valid one.

The bisect is correct. It just has trouble expressing itself properly. So
the table says:

| WARNING:at_kernel/time/hrtimer.c:#hrtimer_init  | 7   | ||   |

| BUG:kernel_reboot-without-warning_in_test_stage | 0   | 12  | 6  | 6 |


which means _before_ your commit it counted a warning in hrtimer_init()
(an unsupported clock id was used). With your commit, the warning was
gone and I *think* the userland then printed
"BUG:kernel_reboot-without-warning_in_test_stage" because it had no
warning.
It seems the bot learned to live with that warning, which was around
for more than three years. Now that you removed it, the bot flags the
warning's absence as a problem, even though nobody ever complained about
the warning itself.

> Thanks,
> 
>   tglx

Sebastian


Re: [PATCH iproute2 master 2/2] bpf: consolidate dumps to use bpf_dump_prog_info

2017-09-05 Thread Daniel Borkmann

On 09/05/2017 06:35 PM, Stephen Hemminger wrote:
[...]

I applied this to master, and resolved conflicts with net-next.
But the dump with JSON of xdp is now incomplete.


Ok, I will check it out, and send a follow-up to make it
complete again.


Re: [PATCH iproute2 master 2/2] bpf: consolidate dumps to use bpf_dump_prog_info

2017-09-05 Thread Stephen Hemminger
On Tue,  5 Sep 2017 02:24:32 +0200
Daniel Borkmann  wrote:

> Consolidate dump of prog info to use bpf_dump_prog_info() when possible.
> Moving forward, we want to have a consistent output for BPF progs when
> being dumped. E.g. in cls/act case we used to dump tag as a separate
> netlink attribute before we had BPF_OBJ_GET_INFO_BY_FD bpf(2) command.
> 
> Move dumping tag into bpf_dump_prog_info() as well, and only dump the
> netlink attribute for older kernels. Also, reuse bpf_dump_prog_info()
> for XDP case, so we can dump tag and whether program was jited, which
> we currently don't show.
> 
> Signed-off-by: Daniel Borkmann 

I applied this to master, and resolved conflicts with net-next.
But the dump with JSON of xdp is now incomplete.



Re: [PATCH net] rds: Fix non-atomic operation on shared flag variable

2017-09-05 Thread Santosh Shilimkar

On 9/5/2017 8:42 AM, Håkon Bugge wrote:

The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---
  net/rds/send.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 41b9f0f..058a407 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
len = ntohl(rm->m_inc.i_hdr.h_len);
if (cp->cp_unacked_packets == 0 ||
cp->cp_unacked_bytes < len) {
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
  
  cp->cp_unacked_packets =

rds_sysctl_max_unacked_packets;
@@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 * throughput hits a certain threshold.
 */
if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
  
  		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);

set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);


Indeed, these couple of instances remained for the m_flags.
Patch looks good. Thanks !!

Acked-by: Santosh Shilimkar 


Re: [PATCH iproute2/master] tc actions: store and dump correct length of user cookies

2017-09-05 Thread Stephen Hemminger
On Tue,  5 Sep 2017 13:06:24 +0200
Simon Horman  wrote:

> Correct two errors which cancel each other out:
> * Do not send twice the length of the actual cookie provided by the user
>   to the kernel
> * Do not dump half the length of the cookie provided by the kernel
> 
> As the cookie is now stored in the kernel at its correct length rather
> than double that length, cookies of up to the maximum size of 16 bytes
> may now be stored, rather than a maximum of half that size.
> 
> Output of dump is the same before and after this change,
> but the data stored in the kernel is now exactly the cookie
> rather than the cookie plus trailing zeros.
> 
> Before:
>  # tc filter add dev eth0 protocol ip parent : \
>flower ip_proto udp action drop \
>cookie 0123456789abcdef0123456789abcdef
>  RTNETLINK answers: Invalid argument
> 
> After:
>  # tc filter add dev eth0 protocol ip parent : \
>flower ip_proto udp action drop \
>cookie 0123456789abcdef0123456789abcdef
>  # tc filter show dev eth0 ingress
>eth_type ipv4
>ip_proto udp
>not_in_hw
>action order 1: gact action drop
> random type none pass val 0
> index 1 ref 1 bind 1 installed 1 sec used 1 sec
>Action statistics:
>Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>backlog 0b 0p requeues 0
>cookie len 16 0123456789abcdef0123456789abcdef
> 
> Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies")
> Cc: Jamal Hadi Salim 
> Signed-off-by: Simon Horman 

Applied thanks


Re: [PATCH iproute2 master 0/2] Two minor BPF updates

2017-09-05 Thread Stephen Hemminger
On Tue,  5 Sep 2017 02:24:30 +0200
Daniel Borkmann  wrote:

> Two minor updates including a small cleanup for dumping
> the trace pipe and one for consolidating prog dumps for
> tc and xdp to use bpf_prog_info_by_fd() when possible.
> 
> Thanks!
> 
> Daniel Borkmann (2):
>   bpf: minor cleanups for bpf_trace_pipe
>   bpf: consolidate dumps to use bpf_dump_prog_info
> 
>  include/bpf_util.h |  2 +-
>  ip/ipaddress.c |  6 --
>  ip/iplink_xdp.c| 19 +++
>  ip/xdp.h   |  2 +-
>  lib/bpf.c  | 31 ++-
>  tc/f_bpf.c |  8 
>  tc/m_bpf.c |  8 
>  7 files changed, 47 insertions(+), 29 deletions(-)
> 

Applied, thanks Daniel


Re: [PATCH] soc: ti/knav_dma: include dmaengine header

2017-09-05 Thread David Miller
From: Arnd Bergmann 
Date: Tue,  5 Sep 2017 10:31:35 +0200

> A header file cleanup apparently caused a build regression
> with one driver using the knav infrastructure:
> 
> In file included from drivers/net/ethernet/ti/netcp_core.c:30:0:
> include/linux/soc/ti/knav_dma.h:129:30: error: field 'direction' has 
> incomplete type
>   enum dma_transfer_direction direction;
>   ^
> drivers/net/ethernet/ti/netcp_core.c: In function 'netcp_txpipe_open':
> drivers/net/ethernet/ti/netcp_core.c:1349:21: error: 'DMA_MEM_TO_DEV' 
> undeclared (first use in this function); did you mean 'DMA_MEMORY_MAP'?
>   config.direction = DMA_MEM_TO_DEV;
>  ^~
>  DMA_MEMORY_MAP
> drivers/net/ethernet/ti/netcp_core.c:1349:21: note: each undeclared 
> identifier is reported only once for each function it appears in
> drivers/net/ethernet/ti/netcp_core.c: In function 
> 'netcp_setup_navigator_resources':
> drivers/net/ethernet/ti/netcp_core.c:1659:22: error: 'DMA_DEV_TO_MEM' 
> undeclared (first use in this function); did you mean 'DMA_DESC_HOST'?
>   config.direction  = DMA_DEV_TO_MEM;
> 
> As the header is no longer included implicitly through netdevice.h,
> we should include it in the header that references the enum.
> 
> Fixes: 0dd5759dbb1c ("net: remove dmaengine.h inclusion from netdevice.h")
> Signed-off-by: Arnd Bergmann 
> ---
> If the cleanup patch hasn't been submitted for mainline yet, please
> add this fixup to the net-next tree, otherwise I'll merge it through
> arm-soc.

Applied to net-next.


Re: [PATCH] net/ncsi: fix ncsi_vlan_rx_{add,kill}_vid references

2017-09-05 Thread David Miller
From: Arnd Bergmann 
Date: Tue,  5 Sep 2017 10:05:47 +0200

> We get a new link error in allmodconfig kernels after ftgmac100
> started using the ncsi helpers:
> 
> ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
> undefined!
> ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
> undefined!
> 
> Related to that, we get another error when CONFIG_NET_NCSI is disabled:
> 
> drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 
> 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 
> 'ncsi_start_dev'?
> drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 
> 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 
> 'ncsi_vlan_rx_add_vid'?
> 
> This fixes both problems at once, using a 'static inline' stub helper
> for the disabled case, and exporting the functions when they are present.
> 
> Fixes: 51564585d8c6 ("ftgmac100: Support NCSI VLAN filtering when available")
> Fixes: 21acf63013ed ("net/ncsi: Configure VLAN tag filter")
> Signed-off-by: Arnd Bergmann 

Applied.


Re: [PATCH net-next] bpf: fix numa_node validation

2017-09-05 Thread David Miller
From: Eric Dumazet 
Date: Mon, 04 Sep 2017 22:41:02 -0700

> From: Eric Dumazet 
> 
> syzkaller reported crashes in bpf map creation or map update [1]
> 
> Problem is that nr_node_ids is a signed integer,
> NUMA_NO_NODE is also an integer, so it is very tempting
> to declare numa_node as a signed integer.
> 
> This means the typical test to validate a user provided value :
> 
> if (numa_node != NUMA_NO_NODE &&
> (numa_node >= nr_node_ids ||
>  !node_online(numa_node)))
> 
> must be written :
> 
> if (numa_node != NUMA_NO_NODE &&
> ((unsigned int)numa_node >= nr_node_ids ||
>  !node_online(numa_node)))
> 
> 
> [1]
 ...
> Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
> Signed-off-by: Eric Dumazet 

Applied, thanks.


Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-05 Thread Tom Herbert
On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
 wrote:
> Tom Herbert  writes:
>
>> There is absolutely no requirement in IP that packets are delivered in
>> order-- there never has been and there never will be! If the ULP, like
>> Ethernet encapsulation, requires in order deliver then it needs to
>> implement that itself like TCP, GRE, and other protocols ensure that
>> with sequence numbers and reassembly. All of these hoops we do make
>> sure that packets always follow the same path and are always in order
>> are done for benefit of middlebox devices like stateful firewalls that
>> have force us to adopt their concept of network architecture-- in the
>> long run this is self-defeating and kills our ability to innovate.
>>
>> I'm not saying that we shouldn't consider legacy devices, but we
>> should scrutinize new development or solutions that perpetuate
>> incorrect design or bad assumptions.
>
> So configure RSS per port and ensure no fragments are send to those
> ports. This is possible and rather easy to do. It solves the problem
> with legacy software and it spreads out packets for your applications.
>
> It is not perfect but it is working and solves both problems.
>
Hannes,

I don't see how that solves anything. The purpose of RSS is to
distribute the load of protocol packets across queues. This needs to
work for UDP applications. For instance, if I were building a QUIC
server I'd want the sort of flow distribution that a TCP server would
give. You can't do that by configuring a few ports in the device.

If I were to suggest any HW change it would be to not do DPI on
fragments (MF or offset is set). This ensures that all packets of the
fragment train get hashed to the same queue and is in fact what RPS
has been doing for years without any complaints.

But even before I'd make that recommendation, I'd really like
understand what the problem actually is. The only thing I can garner
from this discussion and the Intel patch is that when fragments are
received OOO that is perceived as a problem. But the protocol
specification clearly says this is not a problem. So the questions
are: who is negatively affected by this? Is this a problem because
some artificial test that checks for everything to be in order is now
failing? Is this affecting real users? Is this an issue in the stack
or really with some implementation outside of the stack? If it is an
implementation outside of the stack, then are we just bandaid'ing over
someone else's incorrect implementation by patching the kernel (like
would have been the case if we had changed the kernel to interoperate with
Facebook's switch that couldn't handle OOO in twstate).

Thanks,
Tom

> Bye,
> Hannes


[PATCH 24/25 v2] net/cdc_ncm: Replace tasklet with softirq hrtimer

2017-09-05 Thread Sebastian Andrzej Siewior
From: Thomas Gleixner 

The bh tasklet is used to invoke the hrtimer callback (cdc_ncm_tx_timer_cb)
in softirq context. This can also be achieved without the tasklet, by using
CLOCK_MONOTONIC_SOFT as the hrtimer base.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Anna-Maria Gleixner 
Cc: Oliver Neukum 
Cc: Greg Kroah-Hartman 
Cc: linux-...@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Greg Kroah-Hartman 
[bigeasy: using usb_get_intfdata() as suggested by Bjørn Mork]
Signed-off-by: Sebastian Andrzej Siewior 
---
On 2017-08-31 15:57:04 [+0200], Bjørn Mork wrote:
> I believe the struct usbnet pointer is redundant.  We already have lots
> of pointers back and forth here.  This should work, but is not tested:
> 
>   struct usbnet *dev = usb_get_intfdata(ctx->control);

I think so, too. Still untested as I don't have a working gadget around.

v1…v2: Updated as suggested by Bjørn and added Greg's Acked-by.

 drivers/net/usb/cdc_ncm.c   | 36 +++-
 include/linux/usb/cdc_ncm.h |  1 -
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index 8f572b9f3625..42f7bd90e6a4 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -61,7 +61,6 @@ static bool prefer_mbim;
 module_param(prefer_mbim, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(prefer_mbim, "Prefer MBIM setting on dual NCM/MBIM functions");
 
-static void cdc_ncm_txpath_bh(unsigned long param);
 static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx);
 static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *hr_timer);
 static struct usb_driver cdc_ncm_driver;
@@ -777,10 +776,8 @@ int cdc_ncm_bind_common(struct usbnet *dev, struct usb_interface *intf, u8 data_
if (!ctx)
return -ENOMEM;
 
-   hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC_SOFT, HRTIMER_MODE_REL);
	ctx->tx_timer.function = &cdc_ncm_tx_timer_cb;
-   ctx->bh.data = (unsigned long)dev;
-   ctx->bh.func = cdc_ncm_txpath_bh;
	atomic_set(&ctx->stop, 0);
	spin_lock_init(&ctx->mtx);
 
@@ -967,10 +964,7 @@ void cdc_ncm_unbind(struct usbnet *dev, struct usb_interface *intf)
 
	atomic_set(&ctx->stop, 1);
 
-   if (hrtimer_active(&ctx->tx_timer))
-           hrtimer_cancel(&ctx->tx_timer);
-
-   tasklet_kill(&ctx->bh);
+   hrtimer_cancel(&ctx->tx_timer);
 
/* handle devices with combined control and data interface */
if (ctx->control == ctx->data)
@@ -1348,20 +1342,9 @@ static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx)
HRTIMER_MODE_REL);
 }
 
-static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer)
+static void cdc_ncm_txpath_bh(struct cdc_ncm_ctx *ctx)
 {
-   struct cdc_ncm_ctx *ctx =
-   container_of(timer, struct cdc_ncm_ctx, tx_timer);
-
-   if (!atomic_read(&ctx->stop))
-           tasklet_schedule(&ctx->bh);
-   return HRTIMER_NORESTART;
-}
-
-static void cdc_ncm_txpath_bh(unsigned long param)
-{
-   struct usbnet *dev = (struct usbnet *)param;
-   struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0];
+   struct usbnet *dev = usb_get_intfdata(ctx->control);
 
	spin_lock_bh(&ctx->mtx);
if (ctx->tx_timer_pending != 0) {
@@ -1379,6 +1362,17 @@ static void cdc_ncm_txpath_bh(unsigned long param)
}
 }
 
+static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer)
+{
+   struct cdc_ncm_ctx *ctx =
+   container_of(timer, struct cdc_ncm_ctx, tx_timer);
+
+   if (!atomic_read(&ctx->stop))
+   cdc_ncm_txpath_bh(ctx);
+
+   return HRTIMER_NORESTART;
+}
+
 struct sk_buff *
 cdc_ncm_tx_fixup(struct usbnet *dev, struct sk_buff *skb, gfp_t flags)
 {
diff --git a/include/linux/usb/cdc_ncm.h b/include/linux/usb/cdc_ncm.h
index 1a59699cf82a..62b506fddf8d 100644
--- a/include/linux/usb/cdc_ncm.h
+++ b/include/linux/usb/cdc_ncm.h
@@ -92,7 +92,6 @@
 struct cdc_ncm_ctx {
struct usb_cdc_ncm_ntb_parameters ncm_parm;
struct hrtimer tx_timer;
-   struct tasklet_struct bh;
 
const struct usb_cdc_ncm_desc *func_desc;
const struct usb_cdc_mbim_desc *mbim_desc;
-- 
2.14.1



[PATCH net] rds: Fix non-atomic operation on shared flag variable

2017-09-05 Thread Håkon Bugge
The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
Reviewed-by: Wei Lin Guay 
---
 net/rds/send.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 41b9f0f..058a407 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -273,7 +273,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
len = ntohl(rm->m_inc.i_hdr.h_len);
if (cp->cp_unacked_packets == 0 ||
cp->cp_unacked_bytes < len) {
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
cp->cp_unacked_packets =
rds_sysctl_max_unacked_packets;
@@ -829,7 +829,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 * throughput hits a certain threshold.
 */
if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2)
-   __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
+   set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags);
 
	list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
	set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
-- 
2.9.3



[PATCH net v2] net: sched: don't use GFP_KERNEL under spin lock

2017-09-05 Thread Jakub Kicinski
The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
to:

[  582.621091] BUG: sleeping function called from invalid context at 
../mm/slab.h:416
[  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
[  582.636939] 2 locks held by tc/3379:
[  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [] 
rtnetlink_rcv_msg+0x92e/0x1400
[  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: 
[] tcf_idr_create+0x2f0/0x8e0
[  582.662217] Preemption disabled at:
[  582.66] [] tcf_idr_create+0x2f0/0x8e0
[  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: GW   
4.13.0-rc7-debug-00648-g43503a79b9f0 #287
[  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
11/08/2016
[  582.691937] Call Trace:
...
[  582.742460]  kmem_cache_alloc+0x286/0x540
[  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
[  582.753209]  idr_get_free_cmn+0x627/0xf80
...
[  582.815525]  idr_alloc_cmn+0x1a8/0x270
...
[  582.833804]  tcf_idr_create+0x31b/0x8e0
...

Try to preallocate the memory with idr_preload(GFP_KERNEL)
(as suggested by Eric Dumazet), and use GFP_ATOMIC for the
allocation under the spin lock.

Fixes: 65a206c01e8e ("net/sched: Change act_api and act_xxx modules to use IDR")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 net/sched/act_api.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 0eb545bcb247..a306974e2fb4 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -296,10 +296,12 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
	spin_lock_init(&p->tcfa_lock);
	/* user doesn't specify an index */
	if (!index) {
+	idr_preload(GFP_KERNEL);
	spin_lock_bh(&idrinfo->lock);
	err = idr_alloc_ext(idr, NULL, &idr_index, 1, 0,
-			GFP_KERNEL);
+			GFP_ATOMIC);
	spin_unlock_bh(&idrinfo->lock);
+	idr_preload_end();
	if (err) {
 err3:
	free_percpu(p->cpu_qstats);
@@ -307,10 +309,12 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
	p->tcfa_index = idr_index;
	} else {
+	idr_preload(GFP_KERNEL);
	spin_lock_bh(&idrinfo->lock);
	err = idr_alloc_ext(idr, NULL, NULL, index, index + 1,
-			GFP_KERNEL);
+			GFP_ATOMIC);
	spin_unlock_bh(&idrinfo->lock);
+	idr_preload_end();
	if (err)
		goto err3;
	p->tcfa_index = index;
-- 
2.14.1


