[PATCH iproute2 v2] macsec: Nr. of packets and octets for macsec tx stats were swapped.

2016-11-22 Thread daniel . hopf
Resent from other mail address due to our company mail [clients|servers]
stupidly forcing line-breaks on plain-text e-mails. Also changed the subject
format as suggested by Sabrina and Rami.

Acked-by: Rami Rosen 
Acked-by: Sabrina Dubroca 
Signed-off-by: Daniel Hopf 
---
 ip/ipmacsec.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index c9252bb..aa89a00 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -634,10 +634,10 @@ static void print_one_stat(const char **names, struct rtattr **attr, int idx,

 }

 static const char *txsc_stats_names[NUM_MACSEC_TXSC_STATS_ATTR] = {
-   [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_PROTECTED] = "OutOctetsProtected",
-   [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_ENCRYPTED] = "OutOctetsEncrypted",
-   [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_PROTECTED] = "OutPktsProtected",
-   [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_ENCRYPTED] = "OutPktsEncrypted",
+   [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_PROTECTED] = "OutPktsProtected",
+   [MACSEC_TXSC_STATS_ATTR_OUT_PKTS_ENCRYPTED] = "OutPktsEncrypted",
+   [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_PROTECTED] = "OutOctetsProtected",
+   [MACSEC_TXSC_STATS_ATTR_OUT_OCTETS_ENCRYPTED] = "OutOctetsEncrypted",
 };

 static void print_txsc_stats(const char *prefix, struct rtattr *attr)
--
2.9.3
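
For readers skimming the diff: the table is a C array with designated initializers, so each label is tied to an attribute index by name, not by position in the source. A swapped label therefore compiles cleanly and only shows up as mislabelled output. The following is a minimal standalone sketch of that pattern; the DEMO_* enum and the demo_* functions are hypothetical stand-ins, not the real MACSEC_TXSC_STATS_ATTR_* definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the MACSEC_TXSC_STATS_ATTR_* indices. */
enum demo_txsc_stat {
	DEMO_OUT_PKTS_PROTECTED,
	DEMO_OUT_PKTS_ENCRYPTED,
	DEMO_OUT_OCTETS_PROTECTED,
	DEMO_OUT_OCTETS_ENCRYPTED,
	DEMO_NUM_TXSC_STATS,
};

/* Each label is attached to its index explicitly, as in the patched table. */
static const char *demo_txsc_names[DEMO_NUM_TXSC_STATS] = {
	[DEMO_OUT_PKTS_PROTECTED]   = "OutPktsProtected",
	[DEMO_OUT_PKTS_ENCRYPTED]   = "OutPktsEncrypted",
	[DEMO_OUT_OCTETS_PROTECTED] = "OutOctetsProtected",
	[DEMO_OUT_OCTETS_ENCRYPTED] = "OutOctetsEncrypted",
};

/* Look up the label for an index; returns NULL when out of range. */
static const char *demo_txsc_name(int idx)
{
	if (idx < 0 || idx >= DEMO_NUM_TXSC_STATS)
		return NULL;
	return demo_txsc_names[idx];
}

/* Usage: packet counters must map to "OutPkts..." labels, octets to "OutOctets...". */
static void demo_txsc_example(void)
{
	assert(strncmp(demo_txsc_name(DEMO_OUT_PKTS_PROTECTED), "OutPkts", 7) == 0);
	assert(strncmp(demo_txsc_name(DEMO_OUT_OCTETS_PROTECTED), "OutOctets", 9) == 0);
}
```

A check like demo_txsc_example() would have caught the swap, since it compares the kind of counter named by the index with the kind named by the label.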



Re: [PATCH net 1/1] net sched filters: fix filter handle ID in tfilter_notify_chain()

2016-11-22 Thread Daniel Borkmann

On 11/23/2016 02:57 AM, Roman Mashak wrote:

Should pass valid filter handle, not the netlink flags.

Fixes: 30a391a13ab92 ("net sched filters: pass netlink message flags in event notification")
Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 


Acked-by: Daniel Borkmann 


Re: [PATCH net-next 1/1] ipv6: sr: add option to control lwtunnel support

2016-11-22 Thread Roopa Prabhu
On 11/22/16, 4:16 PM, Alexei Starovoitov wrote:
> On Wed, Nov 16, 2016 at 8:32 AM, David Miller  wrote:
>> From: David Lebrun 
>> Date: Tue, 15 Nov 2016 16:14:04 +0100
>>
>>> This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
>>> support of encapsulation with the lightweight tunnels. When this option
>>> is enabled, CONFIG_LWTUNNEL is automatically selected.
>>>
>>> Fix commit 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and 
>>> injection with lwtunnels")
>>>
>>> Without a proper option to control lwtunnel support for SR-IPv6, if
>>> CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
>>> of seg6_iptunnel_init() failure with EOPNOTSUPP:
>>>
>>> NET: Registered protocol family 10
>>> IPv6: Attempt to unregister permanent protocol 6
>>> IPv6: Attempt to unregister permanent protocol 136
>>> IPv6: Attempt to unregister permanent protocol 17
>>> NET: Unregistered protocol family 10
>>>
>>> Tested (compiling, booting, and loading ipv6 module when relevant)
>>> with possible combinations of CONFIG_IPV6={y,m,n},
>>> CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
>>>
>>> Reported-by: Lorenzo Colitti 
>>> Suggested-by: Roopa Prabhu 
>>> Signed-off-by: David Lebrun 
>> Applied.
> ipv6 seems to be still broken in the latest net-next
> when CONFIG_LWTUNNEL is not set:
> # ping 127.0.0.1
> ping: socket: Address family not supported by protocol
> # ping -4 127.0.0.1
> PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
> 64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.067 
> ms
>
> it works with CONFIG_LWTUNNEL=y
>
> Roopa, David, please take a look.
>
I can't seem to reproduce the problem you are seeing. still trying..
I don't have CONFIG_LWTUNNEL set nor any of the other SEG6 configs.
My CONFIG_IPV6 is on and compiled as a module. I have also tried disabling it.
If you can send me the config, I can try again. Looking back at the patches,
I do see a few things below ..but they may not fix your problem directly.

Though I had none of the IPv6 segment routing configs turned on,
I do see the "Segment Routing with IPv6" message at bootup.
I was looking at David's patches again and noticed a few things (I had missed
the last version):

In my review comment I was hinting at CONFIG_IPV6_SEG6 covering all of IPv6
segment routing, including the lwtunnel bits.

something like below:

config IPV6_SEG6
bool "IPv6: Segment Routing Header encapsulation support"
depends on LWTUNNEL && IPV6

DavidL, do you see a problem with doing it this way? With this, 'seg6.o' will be
part of CONFIG_IPV6_SEG6 and will not get initialized unless it is enabled,
which seems like the right thing to do.

DaveM had suggested compiling LWTUNNEL in by default. I can submit a patch for
that. But it is not clear to me yet why the right 'depends on' will not fix it.

thanks.



pull request: bluetooth 2016-11-23

2016-11-22 Thread Johan Hedberg
Hi Dave,

Sorry about the late pull request for 4.9, but we have one more
important Bluetooth patch that should make it to the release. It fixes
connection creation for Bluetooth LE controllers that do not have a
public address (only a random one).

Please let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit c9b8af1330198ae241cd545e1f040019010d44d9:

  flow_dissect: call init_default_flow_dissectors() earlier (2016-11-22 14:44:01 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth.git for-upstream

for you to fetch changes up to 39385cb5f3274735b03ed1f8e7ff517b02a0beed:

  Bluetooth: Fix using the correct source address type (2016-11-22 22:50:46 +0100)


Johan Hedberg (1):
  Bluetooth: Fix using the correct source address type

 include/net/bluetooth/hci_core.h |  2 +-
 net/bluetooth/6lowpan.c  |  4 ++--
 net/bluetooth/hci_conn.c | 26 --
 net/bluetooth/l2cap_core.c   |  2 +-
 net/bluetooth/rfcomm/tty.c   |  2 +-
 net/bluetooth/sco.c  |  2 +-
 6 files changed, 30 insertions(+), 8 deletions(-)





Re: [LKP] [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Linus Torvalds
On Tue, Nov 22, 2016 at 10:44 PM, Fengguang Wu  wrote:
>
> On Tue, Nov 22, 2016 at 02:04:42PM -0800, Linus Torvalds wrote:
>
>> I also noticed that the kernel test robot had screwed up the
>> participants list for some reason, and had
>>
>>  "Acked-by: Alexander Duyck , David S.
>> Miller" 
>>
>> as one of the participants. So there's some odd commit parsing issue
>> there somewhere. But Alexander seems to have seen this report despite
>> that, it just never went anywhere that I can tell.
>
>
> Yeah the robot will CC all "Acked-by" people in the bug reports.
>
> Shall we limit it to the below TO/CC list?

No. We do want to keep the Acked-by's on the cc.

But you missed the real problem.

It *didn't* cc the acked-by. Look closer. What happened was that it cc'd this:

 "Acked-by: Alexander Duyck , David S. Miller"

 

ie there is only _one_ email address (that of da...@davemloft.net),
and the whole "Acked-by: Alexander Duyck <...>" part is quoted as the
_name_ of that email address.

At least that's what the headers look like for me in the original report:

   From: kernel test robot 
   To: Eric Dumazet 
   Cc: l...@01.org, Linus Torvalds ,
LKML , Alexei Starovoitov
, Willem de Bruijn , "Acked-by:
Alexander Duyck , David S. Miller"


Notice the quoting of that last "name".

  Linus


Re: [LKP] [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Fengguang Wu

Hi Linus,

On Tue, Nov 22, 2016 at 02:04:42PM -0800, Linus Torvalds wrote:
[snip]


I also noticed that the kernel test robot had screwed up the
participants list for some reason, and had

 "Acked-by: Alexander Duyck , David S.
Miller" 

as one of the participants. So there's some odd commit parsing issue
there somewhere. But Alexander seems to have seen this report despite
that, it just never went anywhere that I can tell.


Yeah the robot will CC all "Acked-by" people in the bug reports.

Shall we limit it to the below TO/CC list?

   TO: author
   CC: committer (maintainer)
   CC: all Signed-off-by
   CC: all Reviewed-by
   CC: mailing lists, if the bug is found in a maintainer/well known tree

Regards,
Fengguang


On Tue, Nov 15, 2016 at 1:20 PM, kernel test robot
 wrote:


FYI, we noticed the following commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
commit 34fad54c2537f7c99d07375e50cb30aa3c23bd83 ("net: __skb_flow_dissect() must cap its return value")

in testcase: pbzip2
with following parameters:

nr_threads: 25%
blocksize: 900K
cpufreq_governor: performance



on test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz 
with 64G memory

caused below changes:


+------------------------------------------------------------------+------------+------------+
|                                                                  | 79774d6bfa | 34fad54c25 |
+------------------------------------------------------------------+------------+------------+
| boot_successes                                                   | 0          | 2          |
| boot_failures                                                    | 2          | 20         |
| invoked_oom-killer:gfp_mask=0x                                   | 2          | 2          |
| Mem-Info                                                         | 2          | 2          |
| Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes | 2          | 2          |
| kernel_BUG_at_include/linux/skbuff.h                             | 0          | 16         |
| invalid_opcode:#[##]SMP                                          | 0          | 16         |
| RIP:eth_type_trans                                               | 0          | 16         |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt            | 0          | 15         |
| calltrace:hub_event                                              | 0          | 1          |
| WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup                        | 0          | 2          |
| calltrace:parport_pc_init                                        | 0          | 2          |
| calltrace:SyS_finit_module                                       | 0          | 2          |
| WARNING:at_lib/kobject.c:#kobject_add_internal                   | 0          | 2          |
+------------------------------------------------------------------+------------+------------+



[   19.375251] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[   19.388892] Sending DHCP requests .
[   19.388892] [ cut here ]
[   19.388894] kernel BUG at include/linux/skbuff.h:1935!
[   19.388895] invalid opcode:  [#1] SMP
[   19.388896] Modules linked in:
[   19.388897] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.9.0-rc3-00320-g34fad54 #1
[   19.388898] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS 
SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   19.388899] task: 81e0e4c0 task.stack: 81e0
[   19.388904] RIP: 0010:[]  [] 
eth_type_trans+0xe8/0x140
[   19.388904] RSP: :88081e803db8  EFLAGS: 00010297
[   19.388905] RAX: 0152 RBX: 88080221f200 RCX: 1073
[   19.388905] RDX: 8808013afdc0 RSI: 880801114000 RDI: 880819407c00
[   19.388906] RBP: 88081e803e20 R08: 880801114000 R09: 0800
[   19.388907] R10: 8808013afec0 R11: ea003fd5a880 R12: 880819407c00
[   19.388907] R13: 881033408000 R14: c9000843e000 R15: 0158
[   19.388908] FS:  () GS:88081e80() 
knlGS:
[   19.388909] CS:  0010 DS:  ES:  CR0: 80050033
[   19.388910] CR2: 88103000 CR3: 01e07000 CR4: 001406f0
[   19.388910] Stack:
[   19.388912]  816905a7 ea003fd5a880 ea08 
88080221f050
[   19.388913]  88080221f000 00400160 ea003fd5a880 

[   19.388915]  0040  88080221f050 
88100d216000
[   19.388915] Call Trace:
[   19.388919]  
[   19.388919]  [] ? igb_clean_rx_irq+0x6a7/0x7d0
[   19.388921]  [] igb_poll+0x382/0x700
[   19.388922]  [] ? igb_poll+0x397/0x700
[   19.388925]  [] net_rx_action+0x217/0x360
[   19.388928]  [] __do_softirq+0x104/0x2ab
[   19.388931]  [] irq_exit+0xf1/0x100

Re: sendfile from 9p fs into af_alg

2016-11-22 Thread Al Viro
On Tue, Nov 22, 2016 at 08:55:59PM -0800, Alexei Starovoitov wrote:
> On Wed, Nov 23, 2016 at 04:46:26AM +, Al Viro wrote:
> > On Tue, Nov 22, 2016 at 07:58:29PM -0800, Alexei Starovoitov wrote:
> > > Hi Al,
> > > 
> > > it seems the following commit 523ac9afc73a ("switch 
> > > default_file_splice_read() to use of pipe-backed iov_iter")
> > > breaks sendfile from 9p fs into af_alg socket.
> > > sendfile into af_alg is used by iproute2/tc.
> > > I'm not sure whether it's 9p or crypto or vfs problem, but happy to test 
> > > any patches.
> > 
> > Could you try -rc6 (or anything that contains 680bb946a1ae04, for that
> > matter)?
> 
> already tested with that patch in the latest net-next. Still broken :(

Joy...  Which transport are you using there?  The interesting part is
whether it's zerocopy or non-zerocopy path in p9_client_read()...


Re: net/can: use-after-free in bcm_rx_thr_flush

2016-11-22 Thread Oliver Hartkopp

On 11/22/2016 06:37 PM, Andrey Konovalov wrote:

On Tue, Nov 22, 2016 at 6:29 PM, Oliver Hartkopp  wrote:

Hi Andrey,

thanks for the report.

Although I can't see the issue in the code ...



Oh, I can see it now m(

Will send a patch today.

Many thanks,
Oliver



Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it

2016-11-22 Thread Cong Wang
On Tue, Nov 22, 2016 at 3:36 PM, John Fastabend
 wrote:
> On 16-11-22 12:41 PM, Daniel Borkmann wrote:
>> On 11/22/2016 08:28 PM, Cong Wang wrote:
>>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko  wrote:
 Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote:
> Hmm, I don't think we want to have such an additional test in fast
> path for each and every classifier. Can we think of ways to avoid that?
>
> My question is, since we unlink individual instances from such
> tp-internal
> lists through RCU and release the instance through call_rcu() as
> well as
> the head (tp->root) via kfree_rcu() eventually, against what are we
> protecting
> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback?
> Something
> not respecting grace period?

 If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
 to null.
>>
>> But that's not really an answer to my question. ;)
>>
>>> We do need to respect the grace period if we touch the globally visible
>>> data structure tp in tcf_destroy(). Therefore Roi's patch is not
>>> fixing the
>>> right place.
>>
>> I think there may be multiple issues actually.
>>
>> At the time we go into tc_classify(), from ingress as well as egress side,
>> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
>> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
>> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
>> filters)
>> via rcu_dereference_bh() from reader side. Is there a reason why we don't
>> use call_rcu_bh() variant on destruction for all this instead?
>
> I can't think of any if its all under _bh we can convert the call_rcu to
> call_rcu_bh it just needs an audit.
>
>>
>> Just looking at cls_bpf and others, what protects
>> RCU_INIT_POINTER(tp->root,
>> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
>> tcf_destroy() cases. Still active readers under RCU BH can race against
>> this
>> (tp->root being NULL), as the commit identified. Only the get() callback
>> checks
>> for head against NULL, but both are serialized under rtnl, and the only
>> place
>> we call this is tc_ctl_tfilter(). Even if we create a new tp, head
>> should not
>> be NULL there, if it was assigned during the init() cb, but contains an
>> empty
>> list. (It's different for things like cls_cgroup, though.) So, I'm
>> wondering
>> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead
>> (unless I'm
>> missing something obvious)?
>
>
> Just took a look at this I think there are a couple possible solutions.
> The easiest is likely to fix all the call sites so that 'tp' is unlinked
> before calling the destroy() handlers AND not doing the NULL set. I only
> see one such call site where destroy is called before unlinking at the
> moment. This should enforce that after a grace period there is no path
> to reach the classifiers because 'tp' is unlinked. Calling destroy
> before unlinking 'tp' however could cause a small race between grace
> period of 'tp' and grace period of the filter.
>
> Another would be to only call the destroy path from the call_rcu path
> of the 'tp' object so that destroy is only ever called after the object
> is guaranteed to be unlinked from the tc_filter path.
>
> I think both solutions would be fine.
>
> Cong were you working on one of these? Or do you have another idea?

Yeah, this is basically what I think as well; however, both are hard.
On one hand, we can't detach the tp from the global singly-linked list
before tcf_destroy(), since we rely on its return value to make this decision.
On the other hand, since it is a singly-linked list, we would have to pass the
address of its previous pointer to the RCU callback to remove it, which seems
racy as well since we would modify a previous pointer that is still visible
globally...

Hmm, perhaps we really have to switch to a doubly-linked list, that is,
list_head. I need to double check. The semantics of ->destroy() need
revising too.

So yeah, my commit should be blamed. :-/
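
To make the singly-linked list point concrete, here is a plain userspace sketch with no RCU and no locking; struct demo_tp and demo_unlink() are hypothetical names, not the kernel's tcf_proto code. Unlinking needs the address of whichever pointer currently leads to the node, and that address is only known while walking the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a classifier instance on a singly-linked chain. */
struct demo_tp {
	int prio;
	struct demo_tp *next;
};

/*
 * Unlink the first node with the given prio and return it, or NULL.
 * 'back' is the address of the pointer that leads to the current node:
 * either the chain head or the previous node's 'next' field.
 */
static struct demo_tp *demo_unlink(struct demo_tp **chain, int prio)
{
	struct demo_tp **back;
	struct demo_tp *tp;

	for (back = chain; (tp = *back) != NULL; back = &tp->next) {
		if (tp->prio == prio) {
			*back = tp->next;	/* rewrite the predecessor's link */
			tp->next = NULL;
			return tp;
		}
	}
	return NULL;
}

/* Usage: remove the middle node, then the head, then a missing prio. */
static void demo_unlink_example(void)
{
	struct demo_tp c = { 30, NULL }, b = { 20, &c }, a = { 10, &b };
	struct demo_tp *head = &a;

	assert(demo_unlink(&head, 20) == &b && a.next == &c);
	assert(demo_unlink(&head, 10) == &a && head == &c);
	assert(demo_unlink(&head, 99) == NULL);
}
```

If 'back' were saved and used later from a deferred callback, the predecessor it points into might itself have been unlinked or freed by then. A doubly-linked node carries its own back link, which is why switching to list_head is being considered.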


Re: sendfile from 9p fs into af_alg

2016-11-22 Thread Alexei Starovoitov
On Wed, Nov 23, 2016 at 04:46:26AM +, Al Viro wrote:
> On Tue, Nov 22, 2016 at 07:58:29PM -0800, Alexei Starovoitov wrote:
> > Hi Al,
> > 
> > it seems the following commit 523ac9afc73a ("switch 
> > default_file_splice_read() to use of pipe-backed iov_iter")
> > breaks sendfile from 9p fs into af_alg socket.
> > sendfile into af_alg is used by iproute2/tc.
> > I'm not sure whether it's 9p or crypto or vfs problem, but happy to test 
> > any patches.
> 
> Could you try -rc6 (or anything that contains 680bb946a1ae04, for that
> matter)?

already tested with that patch in the latest net-next. Still broken :(



Re: sendfile from 9p fs into af_alg

2016-11-22 Thread Al Viro
On Tue, Nov 22, 2016 at 07:58:29PM -0800, Alexei Starovoitov wrote:
> Hi Al,
> 
> it seems the following commit 523ac9afc73a ("switch 
> default_file_splice_read() to use of pipe-backed iov_iter")
> breaks sendfile from 9p fs into af_alg socket.
> sendfile into af_alg is used by iproute2/tc.
> I'm not sure whether it's 9p or crypto or vfs problem, but happy to test any 
> patches.

Could you try -rc6 (or anything that contains 680bb946a1ae04, for that
matter)?


[PATCH net-next 1/2] openvswitch: Add a missing break statement.

2016-11-22 Thread Jarno Rajahalme
Add a break statement to prevent fall-through from
OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
actions setting ethernet addresses fail to validate with log messages
complaining about invalid tunnel attributes.

Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
Signed-off-by: Jarno Rajahalme 
---
 net/openvswitch/flow_netlink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..c87d359 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2195,6 +2195,7 @@ static int validate_set(const struct nlattr *a,
case OVS_KEY_ATTR_ETHERNET:
if (mac_proto != MAC_PROTO_ETHERNET)
return -EINVAL;
+   break;
 
case OVS_KEY_ATTR_TUNNEL:
if (masked)
-- 
2.1.4
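
The effect of the missing break can be reproduced in isolation. The sketch below is a reduced model, not the real validate_set(): the demo_* names are hypothetical, the tunnel branch is cut down to a single assumed check that rejects masked sets, and the with_break flag only exists to compare the behaviour before and after the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for OVS_KEY_ATTR_ETHERNET and OVS_KEY_ATTR_TUNNEL. */
enum demo_attr { DEMO_ATTR_ETHERNET, DEMO_ATTR_TUNNEL };

/* Returns 0 if the set action validates, -EINVAL otherwise. */
static int demo_validate_set(enum demo_attr attr, bool is_ethernet,
			     bool masked, bool with_break)
{
	switch (attr) {
	case DEMO_ATTR_ETHERNET:
		if (!is_ethernet)
			return -EINVAL;
		if (with_break)
			break;		/* patched behaviour */
		/* fall through (the bug: Ethernet sets get tunnel checks) */
	case DEMO_ATTR_TUNNEL:
		if (masked)		/* assumed tunnel-only restriction */
			return -EINVAL;
		break;
	}
	return 0;
}

/* Usage: a masked Ethernet set is valid only once the break is in place. */
static void demo_validate_example(void)
{
	assert(demo_validate_set(DEMO_ATTR_ETHERNET, true, true, false) == -EINVAL);
	assert(demo_validate_set(DEMO_ATTR_ETHERNET, true, true, true) == 0);
}
```

In this model the Ethernet action is rejected by a rule that was only meant for tunnel attributes, which matches the symptom described in the commit message.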



[PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.

2016-11-22 Thread Jarno Rajahalme
Do not set skb->protocol to be the ethertype of the L3 header, unless
the packet only has the L3 header.  For a non-hardware offloaded VLAN
frame skb->protocol needs to be one of the VLAN ethertypes.

Any VLAN offloading is undone on the OVS netlink interface.  Due to
this all VLAN packets sent to openvswitch module from userspace are
non-offloaded.

Incorrect skb->protocol value on a full-size non-offloaded VLAN skb
causes packet drop due to failing MTU check, as the VLAN header should
not be counted in when considering MTU in ovs_vport_send().

Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets")
Signed-off-by: Jarno Rajahalme 
---
 net/openvswitch/datapath.c |  1 -
 net/openvswitch/flow.c | 20 +++-
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 2d4c4d3..9c62b63 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -606,7 +606,6 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
rcu_assign_pointer(flow->sf_acts, acts);
packet->priority = flow->key.phy.priority;
packet->mark = flow->key.phy.skb_mark;
-   packet->protocol = flow->key.eth.type;
 
rcu_read_lock();
dp = get_dp_rcu(net, ovs_header->dp_ifindex);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..9be9fda 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -477,12 +477,17 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 }
 
 /**
- * key_extract - extracts a flow key from an Ethernet frame.
+ * key_extract - extracts a flow key from a packet with or without an
+ * Ethernet header.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
- * Ethernet header
+ * beginning of the packet.
  * @key: output flow key
  *
- * The caller must ensure that skb->len >= ETH_HLEN.
+ * 'key->mac_proto' must be initialized to indicate the frame type.
+ * For an L3 frame 'key->mac_proto' must equal 'MAC_PROTO_NONE', and the
+ * caller must ensure that 'skb->protocol' is set to the ethertype of the L3
+ * header.  Otherwise the presence of an Ethernet header is assumed and
+ * the caller must ensure that skb->len >= ETH_HLEN.
  *
  * Returns 0 if successful, otherwise a negative errno value.
  *
@@ -497,9 +502,6 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
  *  on output, then just past the IP header, if one is present and
  *  of a correct length, otherwise the same as skb->network_header.
  *  For other key->eth.type values it is left untouched.
- *
- *- skb->protocol: the type of the data starting at skb->network_header.
- *  Equals to key->eth.type.
  */
 static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 {
@@ -518,6 +520,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
return -EINVAL;
 
skb_reset_network_header(skb);
+   key->eth.type = skb->protocol;
} else {
eth = eth_hdr(skb);
ether_addr_copy(key->eth.src, eth->h_source);
@@ -531,15 +534,14 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
if (unlikely(parse_vlan(skb, key)))
return -ENOMEM;
 
-   skb->protocol = parse_ethertype(skb);
-   if (unlikely(skb->protocol == htons(0)))
+   key->eth.type = parse_ethertype(skb);
+   if (unlikely(key->eth.type == htons(0)))
return -ENOMEM;
 
skb_reset_network_header(skb);
__skb_push(skb, skb->data - skb_mac_header(skb));
}
skb_reset_mac_len(skb);
-   key->eth.type = skb->protocol;
 
/* Network layer. */
if (key->eth.type == htons(ETH_P_IP)) {
-- 
2.1.4



Re: [PATCH net 1/1] net sched filters: fix filter handle ID in tfilter_notify_chain()

2016-11-22 Thread Cong Wang
On Tue, Nov 22, 2016 at 5:57 PM, Roman Mashak  wrote:
> Should pass valid filter handle, not the netlink flags.
>
> Fixes: 30a391a13ab92 ("net sched filters: pass netlink message flags in event 
> notification")
> Signed-off-by: Roman Mashak 
> Signed-off-by: Jamal Hadi Salim 

Reported-by: Cong Wang 


sendfile from 9p fs into af_alg

2016-11-22 Thread Alexei Starovoitov
Hi Al,

it seems the following commit 523ac9afc73a ("switch default_file_splice_read()
to use of pipe-backed iov_iter") breaks sendfile from 9p fs into af_alg socket.
sendfile into af_alg is used by iproute2/tc.
I'm not sure whether it's 9p or crypto or vfs problem, but happy to test any
patches.

The following program is a reduced test from iproute2.
On broken kernels it fails as:
$ ./a.out some_file
Error from sendfile (8192 vs 9624 bytes): Success

It seems to work fine when 'some_file' is on ext4 or tmpfs, so could be 9p
related.

Thanks

/* The header names were stripped by the archive; these are the ones the code below needs. */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>

#include <linux/if_alg.h>

#ifndef AF_ALG
#define AF_ALG 38
#endif

/* Hash the file 'object' with SHA-1 through an AF_ALG socket into 'out'. */
static int obj_hash(const char *object, uint8_t *out, size_t len)
{
	struct sockaddr_alg alg = {
		.salg_family	= AF_ALG,
		.salg_type	= "hash",
		.salg_name	= "sha1",
	};
	int ret, cfd, ofd, ffd;
	struct stat stbuff;
	ssize_t size;

	if (!object || len != 20)
		return -EINVAL;

	cfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (cfd < 0) {
		fprintf(stderr, "Cannot get AF_ALG socket: %s\n",
			strerror(errno));
		return cfd;
	}

	ret = bind(cfd, (struct sockaddr *)&alg, sizeof(alg));
	if (ret < 0) {
		fprintf(stderr, "Error binding socket: %s\n", strerror(errno));
		goto out_cfd;
	}

	ofd = accept(cfd, NULL, 0);
	if (ofd < 0) {
		fprintf(stderr, "Error accepting socket: %s\n",
			strerror(errno));
		ret = ofd;
		goto out_cfd;
	}

	ffd = open(object, O_RDONLY);
	if (ffd < 0) {
		fprintf(stderr, "Error opening object %s: %s\n",
			object, strerror(errno));
		ret = ffd;
		goto out_ofd;
	}

	ret = fstat(ffd, &stbuff);
	if (ret < 0) {
		fprintf(stderr, "Error doing fstat: %s\n",
			strerror(errno));
		goto out_ffd;
	}

	/* This is the call that comes back short when the file is on 9p. */
	size = sendfile(ofd, ffd, NULL, stbuff.st_size);
	if (size != stbuff.st_size) {
		fprintf(stderr, "Error from sendfile (%zd vs %zu bytes): %s\n",
			size, (size_t)stbuff.st_size, strerror(errno));
		ret = -1;
		goto out_ffd;
	}

	size = read(ofd, out, len);
	if (size != len) {
		fprintf(stderr, "Error from read (%zd vs %zu bytes): %s\n",
			size, len, strerror(errno));
		ret = -1;
	} else {
		ret = 0;
	}
out_ffd:
	close(ffd);
out_ofd:
	close(ofd);
out_cfd:
	close(cfd);
	return ret;
}

int main(int ac, char **av)
{
	uint8_t hash[20] = {};

	if (ac != 2) {
		fprintf(stderr, "%s file\n", av[0]);
		return 1;
	}
	obj_hash(av[1], hash, sizeof(hash));
	printf("hash %llx\n", *(long long *)hash);
	return 0;
}


RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

2016-11-22 Thread Hayes Wang
Mark Lord [mailto:ml...@pobox.com]
> Sent: Friday, November 18, 2016 8:03 PM
[..]
> How does the RTL8152 know that the limit is 16KB,
> rather than some other number?  Is this a hardwired number
> in the hardware, or is it a parameter that the software
> sends to the chip during initialization?

It is the limitation of the hardware.

> I have a USB analyzer, but it is difficult to figure out how
> to program an appropriate trigger point for the capture,
> since the problem (with 16KB URBs) takes minutes to hours
> or even days to trigger.

That is good. Our hw engineers really want it. Maybe you could send
a specific packet and trigger on it. You could allocate an skb,
fill it with the data you prefer, and call

	skb_queue_tail(&tp->tx_queue, skb);

[...]
> The first issue is that a packet sometimes begins in one URB,
> and completes in the next URB, without an rx_desc at the start
> of the second URB.  This I have already reported earlier.

However, our hw engineer says it wouldn't happen. Our hw always
sends rx_desc + packet + padding. The hw wouldn't split it into
two or more transmissions. That is why I wonder who does it.

> But the driver, as written, sometimes accesses bytes outside
> of the 16KB URB buffer, because it trusts the non-existent
> rx_desc in these cases, and also because it accesses bytes
> from the rx_desc without first checking whether there is
> sufficient remaining space in the URB to hold an rx_desc.

I think I check them. According to the following code,

	list_for_each_safe(cursor, next, &rx_queue) {
		struct rx_desc *rx_desc;
		struct rx_agg *agg;
		int len_used = 0;
		struct urb *urb;
		u8 *rx_data;

		...

		rx_desc = agg->head;
		rx_data = agg->head;
		len_used += sizeof(struct rx_desc); //<-- add the size of next rx_desc

		while (urb->actual_length > len_used) {
			struct net_device *netdev = tp->netdev;
			struct net_device_stats *stats = &netdev->stats;
			unsigned int pkt_len;
			struct sk_buff *skb;

			pkt_len = le32_to_cpu(rx_desc->opts1) & RX_LEN_MASK;
			if (pkt_len < ETH_ZLEN)
				break;

			len_used += pkt_len;
			if (urb->actual_length < len_used)
				break;

			pkt_len -= CRC_SIZE;
			rx_data += sizeof(struct rx_desc);

			...

find_next_rx:
			rx_data = rx_agg_align(rx_data + pkt_len + CRC_SIZE);
			rx_desc = (struct rx_desc *)rx_data;
			len_used = (int)(rx_data - (u8 *)agg->head);
			len_used += sizeof(struct rx_desc); //<-- add the size of next rx_desc
		}

submit:
		...
	}

The while loop checks whether the next rx_desc is inside the urb
buffer, because len_used includes the size of the next rx_desc.
Then, inside the loop, len_used adds the packet size and is checked
against urb->actual_length again. Together these make sure that both the
rx_desc and the packet are inside the urb buffer. The only exception
would be urb->actual_length being larger than agg_buf_sz, but I don't
think that can happen.
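
The two length checks described above can be sketched outside the driver. This is a simplified userspace model, not r8152 code: the 8-byte descriptor, the 8-byte alignment, the little-endian length in the first four descriptor bytes, and all demo_* names are assumptions made for illustration, and no length mask is applied. The second check is written as a subtraction so that it cannot overflow; it is meant to be equivalent to adding the packet length and comparing against the received length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DESC_SIZE	8u	/* assumed descriptor size */
#define DEMO_MIN_PKT	4u	/* assumed minimum packet length */
#define DEMO_ALIGN	8u	/* assumed record alignment */

/* Read the assumed little-endian packet length from a descriptor. */
static uint32_t demo_pkt_len(const uint8_t *desc)
{
	return (uint32_t)desc[0] | ((uint32_t)desc[1] << 8) |
	       ((uint32_t)desc[2] << 16) | ((uint32_t)desc[3] << 24);
}

/* Count the packets that lie completely inside buf[0..actual_len). */
static int demo_count_packets(const uint8_t *buf, size_t actual_len)
{
	size_t off = 0;				/* offset of current descriptor */
	size_t len_used = DEMO_DESC_SIZE;	/* bytes used incl. that descriptor */
	int count = 0;

	/* Check 1: the descriptor itself must be inside the received data. */
	while (actual_len > len_used) {
		uint32_t pkt_len = demo_pkt_len(buf + off);

		if (pkt_len < DEMO_MIN_PKT)
			break;
		/* Check 2: the packet must also fit in what is left. */
		if (pkt_len > actual_len - len_used)
			break;
		count++;

		/* Advance to the next aligned descriptor. */
		off = (off + DEMO_DESC_SIZE + pkt_len + DEMO_ALIGN - 1) &
		      ~(size_t)(DEMO_ALIGN - 1);
		len_used = off + DEMO_DESC_SIZE;
	}
	return count;
}

/* Usage: two records (lengths 4 and 6); truncating the data drops the second. */
static void demo_count_example(void)
{
	uint8_t buf[32] = { 0 };

	buf[0] = 4;	/* first descriptor at offset 0 */
	buf[16] = 6;	/* second descriptor at offset 16 */
	assert(demo_count_packets(buf, 30) == 2);
	assert(demo_count_packets(buf, 29) == 1);
}
```

In this model a packet whose tail is missing from the buffer is never counted. It says nothing about data that continues in the next buffer without a descriptor, which is the case under discussion in this thread.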

Best Regards,
Hayes



Re: net/icmp: null-ptr-deref in icmp6_send

2016-11-22 Thread David Ahern
On 11/22/16 1:11 PM, Cong Wang wrote:
> I have no idea what commit 5d41ce29e tried to fix, but we already
> use skb->dev a few lines before l3mdev_master_ifindex(), so I don't
> understand why skb->dev could be NULL, maybe just for vrf dev?

skb->dev can be null depending on when icmp6_send / icmpv6_send is called.
Clearly I missed the ipv6_parse_hopopts -> icmpv6_param_prob path. I'll send a
fix when I get back from PTO.


[lkp] [net] 50d4a9ef15: INFO:suspicious_RCU_usage

2016-11-22 Thread kernel test robot

FYI, we noticed the following commit:

commit 50d4a9ef15cee93344653f6ca8f9bab62e76e972 ("net: ipv6: avoid errors due to per-cpu atomic alloc")
url: https://github.com/0day-ci/linux/commits/Mike-Manning/net-ipv6-avoid-errors-due-to-per-cpu-atomic-alloc/20161122-202055


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-i386 -enable-kvm -m 320M

caused below changes:


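+-----------------------------------------------------------------------------+------------+------------+
|                                                                             | 3b404a5198 | 50d4a9ef15 |
+-----------------------------------------------------------------------------+------------+------------+
| boot_successes                                                              | 2          | 0          |
| boot_failures                                                               | 4          | 8          |
| calltrace:init                                                              | 4          | 8          |
| IP-Config:Auto-configuration_of_network_failed                              | 4          | 8          |
| INFO:suspicious_RCU_usage                                                   | 0          | 8          |
| calltrace:addrconf_notify                                                   | 0          | 8          |
| calltrace:ip_auto_config                                                    | 0          | 8          |
| WARNING:at_kernel/locking/mutex.c:#mutex_lock_nested                        | 0          | 8          |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c | 0          | 1          |
| INFO:lockdep_is_turned_off                                                  | 0          | 1          |
| calltrace:SyS_ioctl                                                         | 0          | 1          |
+-----------------------------------------------------------------------------+------------+------------+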




[   20.158999] ### dt-test ### end of unittest - 149 passed, 0 failed
[   20.170208] 
[   20.172448] ===
[   20.178239] [ INFO: suspicious RCU usage. ]
[   20.183917] 4.9.0-rc6-00087-g50d4a9e #1 Not tainted
[   20.190554] ---
[   20.196388] kernel/sched/core.c:7729 Illegal context switch in RCU-bh 
read-side critical section!
[   20.210614] 
[   20.210614] other info that might help us debug this:
[   20.210614] 
[   20.221405] 
[   20.221405] rcu_scheduler_active = 1, debug_locks = 1
[   20.230284] 3 locks held by swapper/0/1:
[   20.235791]  #0: 
[   20.238261]  (
rtnl_mutex
[   20.242057] ){+.+.+.}
, at: 
[   20.246518] [<43f61b50>] rtnl_lock+0xf/0x11
[   20.252294]  #1: 
[   20.254819]  (
rcu_read_lock_bh
[   20.259422] ){..}
, at: 
[   20.263542] [<4401af1b>] ipv6_add_addr+0x47/0x43e
[   20.270211]  #2: 
[   20.272644]  (
addrconf_hash_lock
[   20.277542] ){+.}
, at: 
[   20.281681] [<4401afaa>] ipv6_add_addr+0xd6/0x43e
[   20.288459] 
[   20.288459] stack backtrace:
[   20.294554] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.9.0-rc6-00087-g50d4a9e #1
[   20.305143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[   20.317757]  5344bc84 43bf3a0a 5345 0001 5344bca0 43aa5c4f 4441008b 
4440e52f
[   20.330680]  4440fa0e 026c  5344bcb4 43a896a6 5345  

[   20.358226]  5344bcd0 43a89863 43b2b5c1 02080020 5344bcd4 4450ff80 4ffa5000 
5344bd00
[   20.367429] Call Trace:
[   20.370671]  [<43bf3a0a>] dump_stack+0x75/0xa9
[   20.376473]  [<43aa5c4f>] lockdep_rcu_suspicious+0xbb/0xc4
[   20.383600]  [<43a896a6>] ___might_sleep+0x82/0x1d2
[   20.389988]  [<43a89863>] __might_sleep+0x6d/0x74
[   20.396177]  [<43b2b5c1>] ? __slab_alloc+0x49/0x59
[   20.404660]  [<441541ee>] mutex_lock_nested+0x1e/0x2c5
[   20.411304]  [<43aa3263>] ? trace_hardirqs_on+0xb/0xd
[   20.417850]  [<43b14b56>] pcpu_alloc+0x84/0x42d
[   20.423785]  [<43f5a9a2>] ? dst_alloc+0x5f/0x6e
[   20.429678]  [<43b15308>] __alloc_percpu_gfp+0xb/0xd
[   20.436133]  [<44023f10>] ip6_dst_alloc+0x23/0x70
[   20.442194]  [<440265cf>] addrconf_dst_alloc+0x34/0xce
[   20.448861]  [<4401b02f>] ipv6_add_addr+0x15b/0x43e
[   20.455188]  [<4401eec6>] add_addr+0x19/0x5a
[   20.460731]  [<44020bf5>] addrconf_notify+0x565/0x93e
[   20.467366]  [<43f719a0>] ? pktgen_device_event+0x100/0x25c
[   20.474501]  [<43a7fd41>] notifier_call_chain+0x25/0x47
[   20.481300]  [<43a7fff4>] raw_notifier_call_chain+0xc/0xe
[   20.488305]  [<43f4ceef>] call_netdevice_notifiers_info+0x41/0x49
[   20.496217]  [<43f4ff3d>] call_netdevice_notifiers+0xc/0xe
[   20.503271]  [<43f53e20>]

[PATCH net-next] tuntap: remove unnecessary sk_receive_queue length check during xmit

2016-11-22 Thread Jason Wang
After commit 1576d9860599 ("tun: switch to use skb array for tx"),
sk_receive_queue is no longer used. So remove the unnecessary
sk_receive_queue length check during xmit.

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 64e694c..e2af2dd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -878,13 +878,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
struct net_device *dev)
sk_filter(tfile->socket.sk, skb))
goto drop;
 
-   /* Limit the number of packets queued by dividing txq length with the
-* number of queues.
-*/
-   if (skb_queue_len(>socket.sk->sk_receive_queue) * numqueues
- >= dev->tx_queue_len)
-   goto drop;
-
if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
goto drop;
 
-- 
2.7.4



[PATCH net 1/1] net sched filters: fix filter handle ID in tfilter_notify_chain()

2016-11-22 Thread Roman Mashak
Should pass valid filter handle, not the netlink flags.

Fixes: 30a391a13ab92 ("net sched filters: pass netlink message flags in event 
notification")
Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 net/sched/cls_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8e93d4a..b05d4a2 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct 
sk_buff *oskb,
 
for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
 it_chain = >next)
-   tfilter_notify(net, oskb, n, tp, n->nlmsg_flags, event, false);
+   tfilter_notify(net, oskb, n, tp, 0, event, false);
 }
 
 /* Select new prio value from the range, managed by kernel. */
-- 
1.9.1



net/arp: ARP cache aging failed.

2016-11-22 Thread yuehaibing

 Hi,

I've encountered an ARP cache aging failure in the 4.9 kernel. The topology
is as follows:


HOST1     -
  IP1 | Switch |IP2-| HOST2 |
 |    -
  --Bonding  |
   |   | IP3
 MAC1MAC2  | HOST3 |


HOST1 has a bonding interface that includes two NICs.

IP1:192.168.1.100/24
IP2:192.168.1.200/24
IP3:192.168.1.300/24

There is a large number of TCP transactions between HOST2 and HOST3.


HOST2 can ping HOST1 normally. However, the ping fails after HOST1's
bonding interface deactivates a working NIC.


On HOST2, use the following command:

watch "ip -s neigh show|grep 192.168.1.100"

I noticed that the aging counter of the old HOST1 ARP cache entry gradually
increases, then is reset before it reaches 30. This process repeats, so the
entry stays in the REACHABLE state. The entry is never updated with the new
HOST1 MAC, and thus pings are not sent to the correct HOST1 MAC.

Then I found that n->confirmed is refreshed in dst_neigh_output() when
dst->pending_confirm is set to 1.

include/net/dst.h

static inline void dst_confirm(struct dst_entry *dst)
{
	dst->pending_confirm = 1;
}

static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
				   struct sk_buff *skb)
{
	const struct hh_cache *hh;

	if (dst->pending_confirm) {
		unsigned long now = jiffies;

		dst->pending_confirm = 0;
		/* avoid dirtying neighbour */
		if (n->confirmed != now)
			n->confirmed = now;
	}
	...
}

dst_confirm can be called by tcp_ack in net/ipv4/tcp_input.c.

/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
	...
	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
		struct dst_entry *dst = __sk_dst_get(sk);
		if (dst)
			dst_confirm(dst);
	}
	...
}

In my topology, HOST1 and HOST3 share one route on HOST2, so a TCP connection
between HOST2 and HOST3 may call tcp_ack() and set dst->pending_confirm.

So dst_neigh_output() may wrongly refresh n->confirmed for the neighbour
entry that stands for HOST1, even though HOST1's MAC has changed.

The likelihood of this happening increases significantly when the ping and
the TCP transactions are bound to the same processor on HOST2.

It seems that the issue was introduced by commit
5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh
confirmation.").
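
The sequence described above can be reproduced in a small userspace model.
This is only a sketch: the model_* types and functions are hypothetical
stand-ins that copy the flag handling quoted from include/net/dst.h, and the
timestamps are made-up values rather than jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct model_dst { bool pending_confirm; };      /* one route, shared by peers */
struct model_neigh { unsigned long confirmed; }; /* one entry per next hop */

/* Mirrors dst_confirm(): records only that *some* peer made progress. */
static void model_dst_confirm(struct model_dst *dst)
{
	dst->pending_confirm = true;
}

/* Mirrors the confirmation part of dst_neigh_output(). */
static void model_neigh_output(struct model_dst *dst, struct model_neigh *n,
			       unsigned long now)
{
	assert(dst && n);
	if (dst->pending_confirm) {
		dst->pending_confirm = false;
		n->confirmed = now; /* whoever transmits next gets the credit */
	}
}

/*
 * Returns HOST1's "confirmed" timestamp after an ACK for HOST3 traffic,
 * followed by a ping to HOST1 at t=101 over the same route.
 */
static unsigned long model_run(void)
{
	struct model_dst route = { false };
	struct model_neigh host1 = { 10 }, host3 = { 10 };

	model_dst_confirm(&route);               /* tcp_ack() on the HOST3 flow */
	model_neigh_output(&route, &host1, 101); /* ping to HOST1 goes out first */
	model_neigh_output(&route, &host3, 102); /* flag was already consumed */
	(void)host3;
	return host1.confirmed;
}
```

In this model HOST1's entry ends up confirmed at t=101 although HOST1 never
answered anything, which matches the behaviour reported above: the flag lives
on the route, not on the neighbour that actually proved reachability.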







Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-22 Thread Casey Leedom
  I'm attempting to start the work necessary to implement Vidya's proposed new 
ethtool interface to manage Forward Error Correction settings on a link.  I'm 
confused by the ethtool FEC API and the degree/type of control it offers.  At 
the top of the patch we have:

Encoding: Types of encoding
Off    :  Turning off any encoding
RS     :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  Default FEC settings for drivers, and would represent
          asking the hardware to essentially go into a best effort mode.

but then later on we have:

+struct ethtool_fecparam {
+ __u32   cmd;
+ __u32   autoneg;
+ /* bitmask of FEC modes */
+ __u32   fec;
+ __u32   reserved;
+};

...

+enum ethtool_fec_config_bits {
+   ETHTOOL_FEC_NONE_BIT,
+   ETHTOOL_FEC_AUTO_BIT,
+   ETHTOOL_FEC_OFF_BIT,
+   ETHTOOL_FEC_RS_BIT,
+   ETHTOOL_FEC_BASER_BIT,
+};

...

+   ETHTOOL_LINK_MODE_FEC_NONE_BIT  = 47,
+   ETHTOOL_LINK_MODE_FEC_RS_BIT= 48,
+   ETHTOOL_LINK_MODE_FEC_BASER_BIT   = 49,

The last ethtool Link Mode bits seem to imply a separable FEC on/off with 
individual control for RS and BASER.  How would the "Auto" from the top be 
encoded within these Link Mode bits?  And I don't see any reference to the 
ethtool_fec_config_bits in the kernel or ethtool patches so I'm not sure what 
they're supposed to reference.  Can you clarify the above?  I.e. can you offer 
a small template example of what a driver implementation might look like 
interpreting the incoming Link Mode Bits?
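
For illustration only, a driver-side interpretation of the fec bitmask might
look like the sketch below. The bit names are copied from the RFC's enum;
SKETCH_FEC(), sketch_pick_fec() and the preference order (RS, then BaseR,
then off) are assumptions made for this example, not anything the patch
specifies.

```c
#include <stdint.h>

/* Bit positions copied from the RFC's enum ethtool_fec_config_bits. */
enum ethtool_fec_config_bits {
	ETHTOOL_FEC_NONE_BIT,
	ETHTOOL_FEC_AUTO_BIT,
	ETHTOOL_FEC_OFF_BIT,
	ETHTOOL_FEC_RS_BIT,
	ETHTOOL_FEC_BASER_BIT,
};

/* Hypothetical helper macro, not part of the RFC. */
#define SKETCH_FEC(bit) (UINT32_C(1) << (bit))

/*
 * Hypothetical policy: return the single FEC mode bit the driver would
 * program, or 0 if the request cannot be satisfied (a real driver would
 * probably return -EINVAL in that case).
 */
static uint32_t sketch_pick_fec(uint32_t requested, uint32_t hw_supported)
{
	const uint32_t rs = SKETCH_FEC(ETHTOOL_FEC_RS_BIT);
	const uint32_t baser = SKETCH_FEC(ETHTOOL_FEC_BASER_BIT);
	const uint32_t off = SKETCH_FEC(ETHTOOL_FEC_OFF_BIT);

	/* "Auto" is modelled as: any mode is acceptable, driver chooses. */
	if (requested & SKETCH_FEC(ETHTOOL_FEC_AUTO_BIT))
		requested |= rs | baser | off;

	if (requested & hw_supported & rs)
		return rs;
	if (requested & hw_supported & baser)
		return baser;
	if (requested & off)
		return off; /* turning FEC off needs no hardware support */
	return 0;
}
```

Under these assumptions, "Auto" on hardware that only does BaseR selects
BaseR, while an explicit RS request on the same hardware is rejected.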

  And do we expect that there will be new FECs in the future?

Casey

Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Vishwanathapura, Niranjana

On Tue, Nov 22, 2016 at 05:04:37PM -0600, Christoph Lameter wrote:
> On Tue, 22 Nov 2016, Vishwanathapura, Niranjana wrote:
>
> > Ok, I do understand Jason's point that we should probably not put this driver
> > under drivers/infiniband/sw/.., as this driver is not a HCA.
> > It is an ULP similar to ipoib, built on top of Omni-path irrespective of
> > whether we register a hfi_vnic_bus or a direct custom interface with HFI1.
> > This ULP will transmit and receive Omni-path packets over the fabric, and is
> > dependent on IB MAD interface and the HFI1 driver.
>
> This is something that encapsulates IP (v4 right?) in something else.
> Would belong into
>
> linux/net/ipv4
>
> You already have similar implementations there
>
> See f.e. ipip.c, ip_tunnel.c and lots more (try
> ls linux/net/ipv4/*tunnel*)
>
> If this is more like a device then it would belong into
>
> linux/drivers/net/hfi or so (see also linux/drivers/net/ppp, plip,
> loopback, etc etc)

It is an Ethernet packet encapsulated in an Omni-path header by the hfi_vnic
driver. The packets are sent and received over the wire by the HFI1 device
driven by the HFI1 driver. The encapsulation information is obtained via the
IB MAD control interface.


Niranjana






Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-11-22 Thread John Stultz
On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski  wrote:
> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov
>  wrote:
>> On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:
>>>
>>> I hate to say it, but I think I may see a problem.  Current
>>> developments are afoot to make cgroups do more than resource control.
>>> For example, there's Landlock and there's Daniel's ingress/egress
>>> filter thing.  Current cgroup controllers can mostly just DoS their
>>> controlled processes.  These new controllers (or controller-like
>>> things) can exfiltrate data and change semantics.
>>>
>>> Does anyone have a security model in mind for these controllers and
>>> the cgroups that they're attached to?  I'm reasonably confident that
>>> CAP_SYS_RESOURCE is not the answer...
>>
>> and specifically the answer is... ?
>> Also would be great if you start with specifying the question first
>> and the problem you're trying to solve.
>>
>
> I don't have a good answer right now.  Here are some constraints, though:
>
> 1. An insufficiently privileged process should not be able to move a
> victim into a dangerous cgroup.
>
> 2. An insufficiently privileged process should not be able to move
> itself into a dangerous cgroup and then use execve to gain privilege
> such that the execve'd program can be compromised.
>
> 3. An insufficiently privileged process should not be able to make an
> existing cgroup dangerous in a way that could compromise a victim in
> that cgroup.
>
> 4. An insufficiently privileged process should not be able to make a
> cgroup dangerous in a way that bypasses protections that would
> otherwise protect execve() as used by itself or some other process in
> that cgroup.
>
> Keep in mind that "dangerous" may apply to a cgroup's descendents in
> addition to the cgroup being controlled.

Sorry for taking a while to get back to you here.  I'm a little
befuddled as to what next steps I should consider (and honestly, I'm
not totally sure I really grok your concern here, particularly what
you mean with "dangerous cgroups").

So is going back to the CAP_CGROUP_MIGRATE approach (to properly
separate "sufficiently" from "insufficiently privileged") better?

Or something closer to the original method Android used of each cgroup
having an allow_attach() check which could determine what is
sufficiently privileged for the respective level of danger the cgroup
might pose?

Or just stepping back, what method would you imagine to be reasonable
to allow a specified task to migrate other tasks between cgroups
without it having to be root/suid?

thanks
-john


[PATCH net-next 1/2] samples/bpf: fix sockex2 example

2016-11-22 Thread Alexei Starovoitov
Since llvm commit "Do not expand UNDEF SDNode during insn selection lowering",
llvm generates code that uses uninitialized registers for cases
where C code actually uses uninitialized data.
So this sockex2 example is technically broken.
Fix it by fully initializing the on-stack variable.
Also increase the verifier buffer limit, since verifier output
may not fit in 64k for this sockex2 code depending on llvm version.
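
A minimal userspace sketch of the initialization issue, assuming a
hypothetical sketch_flow_keys structure and a dissector that fills only some
fields (neither is the real sockex2 code). Standard C spells the zero
initializer "= {0}"; the patch below uses the GNU "= {}" form.

```c
/* Hypothetical stand-in for struct bpf_flow_keys; field names are illustrative. */
struct sketch_flow_keys {
	unsigned int src;
	unsigned int dst;
	unsigned short port16[2];
	unsigned char ip_proto;
};

/* A dissector that fills only some fields, as an early-exit path might. */
static void sketch_dissect(struct sketch_flow_keys *flow)
{
	flow->ip_proto = 6;
}

/* Reads a field the dissector never wrote; well defined only because the
 * whole structure was zero-initialized first. */
static unsigned int sketch_lookup_key(void)
{
	struct sketch_flow_keys flow = {0};

	sketch_dissect(&flow);
	return flow.src;
}
```

Without the initializer, the read of flow.src would use indeterminate data,
which is the kind of access the newer llvm no longer papers over.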

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/libbpf.h   | 2 +-
 samples/bpf/sockex2_kern.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb61b64a..de96a935068d 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -18,7 +18,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
-#define LOG_BUF_SIZE 65536
+#define LOG_BUF_SIZE (256 * 1024)
 extern char bpf_log_buf[LOG_BUF_SIZE];
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
index 44e5846c988f..f58acfc92556 100644
--- a/samples/bpf/sockex2_kern.c
+++ b/samples/bpf/sockex2_kern.c
@@ -198,7 +198,7 @@ struct bpf_map_def SEC("maps") hash_map = {
 SEC("socket2")
 int bpf_prog2(struct __sk_buff *skb)
 {
-   struct bpf_flow_keys flow;
+   struct bpf_flow_keys flow = {};
struct pair *value;
u32 key;
 
-- 
2.8.0



[PATCH net-next 2/2] samples/bpf: fix bpf loader

2016-11-22 Thread Alexei Starovoitov
llvm can emit relocations into sections other than program code
(like debug info sections). Ignore them while parsing the ELF file.
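
The filter can be sketched as a standalone predicate. This is an
illustration under assumptions: it uses Elf64_Shdr from <elf.h> instead of
the GElf_Shdr type the loader gets from libelf, and sketch_is_prog_section()
is a hypothetical name.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>

/* True when the section a relocation applies to holds program instructions. */
static bool sketch_is_prog_section(const Elf64_Shdr *target)
{
	assert(target);
	return target->sh_type == SHT_PROGBITS &&
	       (target->sh_flags & SHF_EXECINSTR);
}
```

A relocation whose target section is PROGBITS but not executable, such as a
debug info section, fails the check and would be skipped.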

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/bpf_load.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 97913e109b14..62f54d6eb8bf 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -317,6 +317,10 @@ int load_bpf_file(char *path)
_prog, _prog))
continue;
 
+   if (shdr_prog.sh_type != SHT_PROGBITS ||
+   !(shdr_prog.sh_flags & SHF_EXECINSTR))
+   continue;
+
insns = (struct bpf_insn *) data_prog->d_buf;
 
processed_sec[shdr.sh_info] = true;
-- 
2.8.0



Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Jason Gunthorpe
On Tue, Nov 22, 2016 at 07:05:05PM -0500, ira.weiny wrote:
> On Tue, Nov 22, 2016 at 10:04:07AM -0700, Jason Gunthorpe wrote:
> > On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> > > There are many example drivers in kernel which are using bus_register() in
> > > an initcall.
> > 
> > There really are not, certainly not in major subsystems.
> 
> I see 2 drivers in the Block subsystem which do this:
> 
> 
> 19   5354  /nfs/site/home/iweiny/linux-stable/drivers/block/cciss.c 
> <>
>   err = bus_register(_bus_type);
> 20   6447 /nfs/site/home/iweiny/linux-stable/drivers/block/rbd.c 
> <>
>   ret = bus_register(_bus_type);
> 
> 2 drivers in the drm subsystem which do this:
> 
> 
> 29   1155  /nfs/site/home/iweiny/linux-stable/drivers/gpu/drm/drm_mipi_dsi.c 
> <>
>   return bus_register(_dsi_bus_type);
> 30242 /nfs/site/home/iweiny/linux-stable/drivers/gpu/host1x/dev.c 
> <>
>   err = bus_register(_bus_type);

IMHO this is all obscure or legacy stuff (eg cciss lost its bus when
it was reworked into hpsa). Who knows about that SOC stuff, maybe
there really is a special on-chip bus under those drivers.

The point is using a bus as a generic interconnect between two driver
modules seems very rare, and is not what we have historically ever
done in drivers/infiniband - all our split drivers use a trivial
register scheme. eg see cxgb4_register_uld/mlx4_register_interface/etc.
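
For readers unfamiliar with that pattern, a "trivial register scheme" can be
sketched roughly as follows. Every name here is hypothetical; the code only
illustrates the general idea behind functions such as
mlx4_register_interface() and is not taken from any driver. Locking and
removal are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper-layer interface; none of these names exist in the kernel. */
struct sketch_interface {
	void (*add)(int dev_id);
	void (*remove)(int dev_id);
	struct sketch_interface *next;
};

static struct sketch_interface *sketch_intf_list;
static int sketch_known_devs[4];
static size_t sketch_nr_devs;

/* Example upper driver callback: just counts the devices it was told about. */
static int sketch_vnic_seen;
static void sketch_vnic_add(int dev_id)
{
	(void)dev_id;
	sketch_vnic_seen++;
}

/* Called by an upper driver: remember it and replay already-known devices. */
static void sketch_register_interface(struct sketch_interface *intf)
{
	size_t i;

	assert(intf && intf->add);
	intf->next = sketch_intf_list;
	sketch_intf_list = intf;
	for (i = 0; i < sketch_nr_devs; i++)
		intf->add(sketch_known_devs[i]);
}

/* Called by the lower driver at probe time: notify every registered upper. */
static int sketch_add_device(int dev_id)
{
	struct sketch_interface *intf;

	if (sketch_nr_devs == 4)
		return -1; /* table full in this toy model */
	sketch_known_devs[sketch_nr_devs++] = dev_id;
	for (intf = sketch_intf_list; intf; intf = intf->next)
		intf->add(dev_id);
	return 0;
}
```

Either load order works in this sketch: a device added before the upper
driver registers is replayed at registration time, and one added afterwards
is announced immediately.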

Should a multi-function driver use a bus or class to connect its
parts? Who knows. Maybe Greg KH/etc has an opinion. But that is not
what we have been doing, it doesn't seem very simplifying, and
this series doesn't even make module auto-loading work...

Since doing this creates a bunch of uapis (again, from a driver, ugh) it
seems like a bad idea without more support as 'the right way'

.. and yes, it would be nice to have a lightweight mechanism to
replace those register functions that could handle module auto loading
too, and maybe that is a 'multi part driver bus/class' or somesuch
... This is really a topic for the device core maintainers, IMHO.

> > > We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> > > without involving a bus.
> > 
> > hfi is already registering on the infiniband class, just use that.
> 
> I don't understand what you mean here?

Get the struct ib_device for the hfi and then do something to get hfi
specific function calls.

Or work it backwards with a _register function..

> [*] As an aside why does the ib_core not use this methodology?  It dawned on
> me that this may be a better way to fix our module load problems.  However, I
> have not looked into details.

ib_core is a class, which is appropriate. RDMA devices are not busses.

Jason


Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration

2016-11-22 Thread Florian Fainelli
On 11/22/2016 02:08 PM, Jiri Pirko wrote:
> Tue, Nov 22, 2016 at 06:48:29PM CET, and...@lunn.ch wrote:
>> Hi Ido
>>
>>> First of all, I want to be sure that when we say "CPU port", we're
>>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>>> the device and the host, through which all packets trapped to the host
>>> go through. So, when a packet is trapped, the driver reads its Rx
>>> descriptor, checks through which port it ingressed, resolves its netdev,
>>> sets skb->dev accordingly and injects it to the Rx path via
>>> netif_receive_skb(). The CPU port itself isn't represented using a
>>> netdev.
>>
>> With DSA, we have a real physical ethernet network interface for the
>> 'cpu' port. It connects to one of the ports of the switch. Frames on
> 
> Every port should be visible as a netdevice, including cpu port.
> Would it make sense to have representors for those?

The CPU port is kind of already visible with DSA since you need the
switch to be attached to a normal Ethernet MAC driver (later referenced
as eth0 for simplicity). Since eth0 is going to potentially receive/send
switch tagged traffic, and the model is to terminate the interfaces at
the port level, this interface does not really have any meaningful use
from a data exchange, apart from multiplexing/demultiplexing switch tags
(when enabled).

If we did create a "cpu" network device, this interface would not be
able to send/receive traffic either, because the per-port network
interfaces are terminated at their level, and the conduit interface is
just used for transmitting/receiving switch tagged traffic. It does have
value as a controlling interface only though.

As a controlling interface, this can be helpful, but we need to decide
which side of the switch this CPU interface would represent: is it the
switch's view of the CPU port, or is it the Ethernet MAC's view of the
switch's CPU port attached to it (especially true with discrete switch
chips)?

If we did use eth0 as a controlling interface, we need to somehow be
able to overload (in an object-oriented fashion) the netdev_ops,
ethtool_ops and switchdev_ops for that interface so as to make it
participate in the switch configuration (we actually do this already for
ethtool statistics, but this is ugly).
-- 
Florian


Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread ira.weiny
On Tue, Nov 22, 2016 at 10:04:07AM -0700, Jason Gunthorpe wrote:
> On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> > There are many example drivers in kernel which are using bus_register() in
> > an initcall.
> 
> There really are not, certainly not in major subsystems.

I see 2 drivers in the Block subsystem which do this:


19   5354  /nfs/site/home/iweiny/linux-stable/drivers/block/cciss.c 
<>
err = bus_register(_bus_type);
20   6447 /nfs/site/home/iweiny/linux-stable/drivers/block/rbd.c 
<>
ret = bus_register(_bus_type);

2 drivers in the drm subsystem which do this:


29   1155  /nfs/site/home/iweiny/linux-stable/drivers/gpu/drm/drm_mipi_dsi.c 
<>
return bus_register(_dsi_bus_type);
30242 /nfs/site/home/iweiny/linux-stable/drivers/gpu/host1x/dev.c 
<>
err = bus_register(_bus_type);

And I think there are a couple others.

I'm not sure what these devices/buses do but they are registering their own bus
while being in another major subsystem.  Is what we are doing really so
crazy/wrong?


>
> > We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> > without involving a bus.
> 
> hfi is already registering on the infiniband class, just use that.
> 

I don't understand what you mean here?

The bus_register provides a really clean way for the hfi1 driver and hfi_vnic
driver to find each other.  This includes being able to support hfi1 with or
without hfi_vnic being loaded.  Note that without configuration from the "EM"
Ethernet Manager the hfi_vnic does not export a net device.

Why wouldn't we use this core kernel support?[*]

> > But using the existing bus model gave a lot of in-built flexibility in
> > decoupling devices from the drivers.
> 
> If you want to have your own bus then you need your own hfi
> subsystem. drivers/infiniband is not a dumping ground..
> 

We don't consider drivers/infiniband a "dumping ground".  There is a
requirement on ib_mad from the hfi_vnic driver.

Ira

[*] As an aside why does the ib_core not use this methodology?  It dawned on
me that this may be a better way to fix our module load problems.  However, I
have not looked into details.

> Jason


Re: [PATCH net-next 1/1] ipv6: sr: add option to control lwtunnel support

2016-11-22 Thread Alexei Starovoitov
On Wed, Nov 16, 2016 at 8:32 AM, David Miller  wrote:
> From: David Lebrun 
> Date: Tue, 15 Nov 2016 16:14:04 +0100
>
>> This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
>> support of encapsulation with the lightweight tunnels. When this option
>> is enabled, CONFIG_LWTUNNEL is automatically selected.
>>
>> Fix commit 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and 
>> injection with lwtunnels")
>>
>> Without a proper option to control lwtunnel support for SR-IPv6, if
>> CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
>> of seg6_iptunnel_init() failure with EOPNOTSUPP:
>>
>> NET: Registered protocol family 10
>> IPv6: Attempt to unregister permanent protocol 6
>> IPv6: Attempt to unregister permanent protocol 136
>> IPv6: Attempt to unregister permanent protocol 17
>> NET: Unregistered protocol family 10
>>
>> Tested (compiling, booting, and loading ipv6 module when relevant)
>> with possible combinations of CONFIG_IPV6={y,m,n},
>> CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
>>
>> Reported-by: Lorenzo Colitti 
>> Suggested-by: Roopa Prabhu 
>> Signed-off-by: David Lebrun 
>
> Applied.

ipv6 seems to be still broken in the latest net-next
when CONFIG_LWTUNNEL is not set:
# ping 127.0.0.1
ping: socket: Address family not supported by protocol
# ping -4 127.0.0.1
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.067 ms

it works with CONFIG_LWTUNNEL=y

Roopa, David, please take a look.

Thanks!


Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Andrew Lunn
On Tue, Nov 22, 2016 at 11:49:18AM -0800, Vishwanathapura, Niranjana wrote:
> Ok, I do understand Jason's point that we should probably not put
> this driver under drivers/infiniband/sw/.., as this driver is not a
> HCA.
> It is an ULP similar to ipoib, built on top of Omni-path
> irrespective of whether we register a hfi_vnic_bus or a direct
> custom interface with HFI1.
> This ULP will transmit and receive Omni-path packets over the
> fabric, and is dependent on IB MAD interface and the HFI1 driver.
> 
> Doug,
> Will it be acceptable if we put it under 'drivers/infiniband/ulp/hfi_vnic'?

How about turning this whole discussion around. 

This is a network driver. So ask the network maintainer where he
wants it. Send the patch to David Miller and netdev with the
question: where does this code belong?

Andrew


[PATCH net-next] mlx4: reorganize struct mlx4_en_tx_ring

2016-11-22 Thread Eric Dumazet
From: Eric Dumazet 

Goal is to reorganize this critical structure to increase performance.

ndo_start_xmit() should only dirty one cache line, and access as few
cache lines as possible.

Add sp_ (Slow Path) prefix to fields that are not used in fast path,
to make clear what is going on.

After this patch pahole reports something much better, as all
ndo_start_xmit() needed fields are packed into two cache lines instead
of seven or eight
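
The grouping idea can be checked in userspace with a toy structure. This is
a sketch under assumptions: sketch_tx_ring is a hypothetical, much smaller
ring, and C11 _Alignas stands in for the kernel's cache-line alignment
annotations.

```c
#include <stddef.h>

#define SKETCH_CACHELINE 64

/* Hypothetical ring; the fields are illustrative, not the mlx4 layout. */
struct sketch_tx_ring {
	/* Written by the completion path. */
	unsigned int cons;
	unsigned long wake_queue;

	/* Written by the transmit path: starts on its own cache line. */
	_Alignas(SKETCH_CACHELINE) unsigned int prod;
	unsigned long bytes;
	unsigned long packets;

	/* Slow path (sp_ prefix): touched at setup and teardown only. */
	_Alignas(SKETCH_CACHELINE) unsigned int sp_stride;
	unsigned int sp_cqn;
};

/* Index of the cache line that holds a given byte offset. */
static size_t sketch_cacheline_of(size_t offset)
{
	return offset / SKETCH_CACHELINE;
}
```

Feeding offsetof() results into sketch_cacheline_of() shows which line each
field lands on, the same information pahole prints for the real structure.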

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */
	struct netdev_queue *      tx_queue;             /*  0x10   0x8 */
	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /*  0x18   0x8 */
	struct mlx4_en_rx_ring *   recycle_ring;         /*  0x20   0x8 */

	/* XXX 24 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */
	unsigned int               tx_dropped;           /*  0x44   0x4 */
	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	struct mlx4_bf             bf;                   /*  0x70  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	__be32                     doorbell_qpn;         /*  0x88   0x4 */
	__be32                     mr_key;               /*  0x8c   0x4 */
	u32                        size;                 /*  0x90   0x4 */
	u32                        size_mask;            /*  0x94   0x4 */
	u32                        full_size;            /*  0x98   0x4 */
	u32                        buf_size;             /*  0x9c   0x4 */
	void *                     buf;                  /*  0xa0   0x8 */
	struct mlx4_en_tx_info *   tx_info;              /*  0xa8   0x8 */
	int                        qpn;                  /*  0xb0   0x4 */
	u8                         queue_index;          /*  0xb4   0x1 */
	bool                       bf_enabled;           /*  0xb5   0x1 */
	bool                       bf_alloced;           /*  0xb6   0x1 */
	u8                         hwtstamp_tx_type;     /*  0xb7   0x1 */
	u8 *                       bounce_buf;           /*  0xb8   0x8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	long unsigned int          queue_stopped;        /*  0xc0   0x8 */
	struct mlx4_hwq_resources  sp_wqres;             /*  0xc8  0x58 */
	/* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */
	struct mlx4_qp             sp_qp;                /* 0x120  0x30 */
	/* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */
	struct mlx4_qp_context     sp_context;           /* 0x150  0xf8 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	cpumask_t                  sp_affinity_mask;     /* 0x248  0x20 */
	enum mlx4_qp_state         sp_qp_state;          /* 0x268   0x4 */
	u16                        sp_stride;            /* 0x26c   0x2 */
	u16                        sp_cqn;               /* 0x26e   0x2 */

	/* size: 640, cachelines: 10, members: 36 */
	/* sum members: 600, holes: 1, sum holes: 24 */
	/* padding: 16 */
};

Instead of this silly placement :

struct mlx4_en_tx_ring {
	u32                        last_nr_txbb;         /*     0   0x4 */
	u32                        cons;                 /*   0x4   0x4 */
	long unsigned int          wake_queue;           /*   0x8   0x8 */

	/* XXX 48 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	u32                        prod;                 /*  0x40   0x4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          bytes;                /*  0x48   0x8 */
	long unsigned int          packets;              /*  0x50   0x8 */
	long unsigned int          tx_csum;              /*  0x58   0x8 */
	long unsigned int          tso_packets;          /*  0x60   0x8 */
	long unsigned int          xmit_more;            /*  0x68   0x8 */
	unsigned int               tx_dropped;           /*  0x70   0x4 */

	/* XXX 4 bytes hole, try to pack */

	struct mlx4_bf             bf;                   /*  0x78  0x18 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
long 

[PATCH ethtool] ethtool: Fix the "advertise" parameter logic.

2016-11-22 Thread Michael Chan
From: Michael Chan 

The current code ignores the value of the advertise parameter.  For example,

ethtool -s ethx advertise 0x1000

The full_advertising_wanted parameter of 0x1000 is not passed to the kernel.
The reason is that advertising_wanted is NULL in this case, and ethtool
will think that the user has given no advertisement input and so it will
proceed to pass all supported advertisement speeds to the kernel.

The older legacy ethtool with similar logic worked because
advertising_wanted was an integer and could take on -1 and 0.  It would pass
the full_advertising_wanted value if advertising_wanted == -1.

This fix is to pass all supported advertisement speeds only when both
advertising_wanted == NULL && full_advertising_wanted == NULL.
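
The corrected condition can be written as a standalone predicate. This is a
sketch: the function name and the uint32_t pointer types are assumptions made
for illustration, and SKETCH_AUTONEG_ENABLE mirrors the value of
AUTONEG_ENABLE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_AUTONEG_ENABLE 1 /* same value as AUTONEG_ENABLE */

/*
 * True only when the user asked for autonegotiation and gave no
 * advertisement input of either kind, so all supported modes should be
 * advertised.
 */
static bool sketch_advertise_all_supported(int autoneg_wanted,
					   const uint32_t *advertising_wanted,
					   const uint32_t *full_advertising_wanted)
{
	return autoneg_wanted == SKETCH_AUTONEG_ENABLE &&
	       advertising_wanted == NULL &&
	       full_advertising_wanted == NULL;
}
```

With "advertise 0x1000", full_advertising_wanted is non-NULL, so the
predicate is false and the user's mask is no longer overwritten.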

Signed-off-by: Michael Chan 
---
 ethtool.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7715823 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -2971,7 +2971,8 @@ static int do_sset(struct cmd_context *ctx)
fprintf(stderr, "\n");
}
if (autoneg_wanted == AUTONEG_ENABLE &&
-   advertising_wanted == NULL) {
+   advertising_wanted == NULL &&
+   full_advertising_wanted == NULL) {
unsigned int i;
 
/* Auto negotiation enabled, but with
-- 
1.8.4.5



Re: [RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-11-22 Thread Casey Leedom
  And by the way, we currently have two ethtool APIs which pump in an 
Auto-Negotiation indication -- set_link_ksettings() and set_pauseparam().  Now 
we're talking about adding a third, set_fecparam().  Are all of the calls to 
these three APIs supposed to agree on the concept of Auto-Negotiations?  I.e. 
what's it mean if set_link_ksettings() gets called with 
link_ksettings->base.autoneg == AUTONEG_ENABLE but set_pauseparam() gets called 
with epause->autoneg == AUTONEG_DISABLE?  And now adding set_fecparam() into 
the system with a similar ability to specify the state of Auto-Negotiation is 
even more confusing.

Casey

[PATCH RFC v1] ethtool: implement helper to get flow_type value

2016-11-22 Thread Jacob Keller
Often a driver wants to store the flow type and thus it must mask the
extra fields. This is a task that could grow more complex as more flags
are added in the future. Add a helper function that masks the flags for
marking additional fields.

Modify drivers in drivers/net/ethernet that currently check for FLOW_EXT
and FLOW_MAC_EXT to use the helper. Currently this is only the mellanox
drivers.

I chose not to modify other drivers as I'm actually unsure whether we
should always mask the flow type even for drivers which don't recognize
the newer flags. On the one hand, today's drivers (generally)
automatically fail when a new flag is used because they won't mask it
and their checks against flow_type will not match. On the other hand, it
means another place that you have to update when you begin implementing
a flag.
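
The trade-off can be seen in a userspace sketch of such a helper. The three
constants use the values from include/uapi/linux/ethtool.h;
sketch_flow_spec_type() and SKETCH_FUTURE_FLAG are hypothetical names, and
the body simply mirrors the mask expression the drivers use today.

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/ethtool.h. */
#define TCP_V4_FLOW   0x01
#define FLOW_EXT      0x80000000
#define FLOW_MAC_EXT  0x40000000

/* Hypothetical flag that some future kernel might add. */
#define SKETCH_FUTURE_FLAG 0x20000000

/* Strip the flags known today and return the bare flow type. */
static inline uint32_t sketch_flow_spec_type(uint32_t flow_type)
{
	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
}
```

As long as the helper does not mask SKETCH_FUTURE_FLAG, a flow type carrying
it matches no case label and the driver rejects it (fail closed). Once the
helper is taught to strip the new flag, every caller silently accepts it,
which is the fail-open concern described here.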

An alternative is to have the driver store a set of flags that it knows
about, and then have ethtool core do the check for us to discard frames.
I haven't implemented this quite yet.

Signed-off-by: Jacob Keller 
---
I plan on using this helper when fixing the mask code for ntuple filters
in the Intel i40e driver. I wanted to see whether this approach was
acceptable, and whether we should implement additional checks. The
primary reason is that today's drivers are "fail closed" in that a new
flag type will probably fail on drivers due to checking for flow types
they recognize. Since drivers only remove the masked bits they recognize
this works. However, this gets cumbersome if new additional flags get
added in the future. I would like some sort of helper, but if we
encourage its use, and a new flag gets added, the helper will then
unfortunately make the driver "fail open" in that a new flag will get
ignored as the driver won't know to return -EINVAL.

I think the right solution will be to add some sort of checks in core
ethtool which we can basically set the recognized flags in some way for
all drivers such that the ethtool core can drop requests for flows with
unknown flag types. I'm unsure how to implement this though.

Thoughts?

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c |  6 +++---
 include/uapi/linux/ethtool.h| 11 ---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 487a58f9c192..d8f9839ce2a3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1270,7 +1270,7 @@ static int mlx4_en_validate_flow(struct net_device *dev,
return -EINVAL;
}
 
-   switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+   switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
case TCP_V4_FLOW:
case UDP_V4_FLOW:
if (cmd->fs.m_u.tcp_ip4_spec.tos)
@@ -1493,7 +1493,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct 
net_device *dev,
if (err)
return err;
 
-   switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+   switch (ethtool_get_flow_spec_type(cmd->fs.flow_type)) {
case ETHER_FLOW:
spec_l2 = kzalloc(sizeof(*spec_l2), GFP_KERNEL);
if (!spec_l2)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index 3691451c728c..066e6c5cf38b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -63,7 +63,7 @@ static struct mlx5e_ethtool_table *get_flow_table(struct 
mlx5e_priv *priv,
int table_size;
int prio;
 
-   switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+   switch (ethtool_get_flow_spec_type(fs->flow_type)) {
case TCP_V4_FLOW:
case UDP_V4_FLOW:
max_tuples = ETHTOOL_NUM_L3_L4_FTS;
@@ -147,7 +147,7 @@ static int set_flow_attrs(u32 *match_c, u32 *match_v,
 outer_headers);
void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
 outer_headers);
-   u32 flow_type = fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
+   u32 flow_type = ethtool_get_flow_spec_type(fs->flow_type);
struct ethtool_tcpip4_spec *l4_mask;
struct ethtool_tcpip4_spec *l4_val;
struct ethtool_usrip4_spec *l3_mask;
@@ -393,7 +393,7 @@ static int validate_flow(struct mlx5e_priv *priv,
fs->ring_cookie != RX_CLS_FLOW_DISC)
return -EINVAL;
 
-   switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+   switch (ethtool_get_flow_spec_type(fs->flow_type)) {
case ETHER_FLOW:
eth_mask = &fs->m_u.ether_spec;
if (!is_zero_ether_addr(eth_mask->h_dest))
diff --git 

Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it

2016-11-22 Thread John Fastabend
On 16-11-22 12:41 PM, Daniel Borkmann wrote:
> On 11/22/2016 08:28 PM, Cong Wang wrote:
>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko  wrote:
>>> Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote:
>>>> Hmm, I don't think we want to have such an additional test in fast
>>>> path for each and every classifier. Can we think of ways to avoid that?
>>>>
>>>> My question is, since we unlink individual instances from such
>>>> tp-internal lists through RCU and release the instance through
>>>> call_rcu() as well as the head (tp->root) via kfree_rcu() eventually,
>>>> against what are we protecting setting RCU_INIT_POINTER(tp->root, NULL)
>>>> in ->destroy() callback? Something not respecting grace period?
>>>
>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
>>> to null.
> 
> But that's not really an answer to my question. ;)
> 
>> We do need to respect the grace period if we touch the globally visible
>> data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing
>> the right place.
> 
> I think there may be multiple issues actually.
> 
> At the time we go into tc_classify(), from ingress as well as egress side,
> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner
> filters) via rcu_dereference_bh() from reader side. Is there a reason why
> we don't use call_rcu_bh() variant on destruction for all this instead?

I can't think of any. If it's all under _bh, we can convert the call_rcu()
calls to call_rcu_bh(); it just needs an audit.

> 
> Just looking at cls_bpf and others, what protects RCU_INIT_POINTER(tp->root,
> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
> tcf_destroy() cases. Still active readers under RCU BH can race against this
> (tp->root being NULL), as the commit identified. Only the get() callback
> checks for head against NULL, but both are serialized under rtnl, and the
> only place we call this is tc_ctl_tfilter(). Even if we create a new tp,
> head should not be NULL there, if it was assigned during the init() cb, but
> contains an empty list. (It's different for things like cls_cgroup, though.)
> So, I'm wondering if the RCU_INIT_POINTER(tp->root, NULL) can just be removed
> instead (unless I'm missing something obvious)?


Just took a look at this. I think there are a couple of possible solutions.
The easiest is likely to fix all the call sites so that 'tp' is unlinked
before the destroy() handlers are called, AND to stop setting the pointer to
NULL. I only see one call site at the moment where destroy is called before
unlinking. This should enforce that after a grace period there is no path
to reach the classifiers, because 'tp' is unlinked. Calling destroy
before unlinking 'tp', however, could cause a small race between the grace
period of 'tp' and the grace period of the filter.

Another would be to only call the destroy path from the call_rcu path
of the 'tp' object so that destroy is only ever called after the object
is guaranteed to be unlinked from the tc_filter path.

I think both solutions would be fine.

Cong, were you working on one of these? Or do you have another idea?
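
The ordering behind both options (unlink 'tp' first, run the destroy work
only after the grace period) can be illustrated with a single-threaded toy
model. This is only a sketch under stated assumptions: it uses no real RCU,
every toy_* name is made up for illustration, and toy_grace_period_elapsed()
merely stands in for the point at which a call_rcu() callback would run.

```c
#include <stddef.h>

struct toy_tp {
	struct toy_tp *next;         /* chain that readers walk */
	struct toy_tp *pending_next; /* private list of unlinked entries */
	void *root;                  /* classifier state readers dereference */
};

struct toy_chain {
	struct toy_tp *head;
	struct toy_tp *pending;
};

/* Step 1: unlink 'tp' so no new reader can reach it, then queue it. */
static void toy_unlink_and_defer(struct toy_chain *c, struct toy_tp *tp)
{
	struct toy_tp **pp;

	for (pp = &c->head; *pp; pp = &(*pp)->next) {
		if (*pp == tp) {
			*pp = tp->next; /* tp->next stays intact for old readers */
			tp->pending_next = c->pending;
			c->pending = tp;
			return;
		}
	}
}

/* Step 2: stands in for the deferred callback after the grace period. */
static void toy_grace_period_elapsed(struct toy_chain *c)
{
	struct toy_tp *tp = c->pending;

	c->pending = NULL;
	while (tp) {
		struct toy_tp *n = tp->pending_next;

		tp->root = NULL; /* destroy work: tp is no longer on the chain */
		tp = n;
	}
}

/* A reader walking the chain; counts entries whose root is NULL. */
static int toy_count_null_roots(const struct toy_chain *c)
{
	const struct toy_tp *tp;
	int n = 0;

	for (tp = c->head; tp; tp = tp->next)
		if (!tp->root)
			n++;
	return n;
}

/* Scenario: remove the middle entry, reading before and after each step. */
static int toy_scenario(void)
{
	int state = 0;
	struct toy_tp c3 = { NULL, NULL, &state };
	struct toy_tp c2 = { &c3, NULL, &state };
	struct toy_tp c1 = { &c2, NULL, &state };
	struct toy_chain chain = { &c1, NULL };
	int seen = toy_count_null_roots(&chain);

	toy_unlink_and_defer(&chain, &c2);
	seen += toy_count_null_roots(&chain);
	toy_grace_period_elapsed(&chain);
	seen += toy_count_null_roots(&chain);
	return seen;
}
```

In this model a reader that starts from the chain head never observes a
NULL root, because the entry leaves the chain before its state is torn
down. The toy does not model readers that were already holding a pointer to
'tp' when it was unlinked; covering those is what the grace period is for.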


> 
>> Also I don't know why you blame my commit, this problem should already
>> exist prior to my commit, probably date back to John's RCU patches.
> 
> It seems so.



Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Linus Torvalds
On Tue, Nov 22, 2016 at 2:28 PM, Eric Dumazet  wrote:
>
> This is fixed by :
> https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=c9b8af1330198ae241cd545e1f040019010d44d9

Thanks guys. This was one of the less esoteric-looking regressions, so
I'm happy to hear it's solved.

 Linus


Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Christoph Lameter
On Tue, 22 Nov 2016, Vishwanathapura, Niranjana wrote:

> Ok, I do understand Jason's point that we should probably not put this driver
> under drivers/infiniband/sw/.., as this driver is not a HCA.
> It is an ULP similar to ipoib, built on top of Omni-path irrespective of
> whether we register a hfi_vnic_bus or a direct custom interface with HFI1.
> This ULP will transmit and receive Omni-path packets over the fabric, and is
> dependent on IB MAD interface and the HFI1 driver.

This is something that encapsulates IP (v4, right?) in something else.
It would belong in

linux/net/ipv4

You already have similar implementations there.

See e.g. ipip.c, ip_tunnel.c and lots more (try
ls linux/net/ipv4/*tunnel*
)

If this is more like a device, then it would belong in

linux/drivers/net/hfi or so (see also linux/drivers/net/ppp, plip,
loopback, etc.)





Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Andre Noll
On Tue, Nov 22, 14:04, Linus Torvalds wrote
>  what's the situation on this issue? The bisection looks a bit odd,
> but the commit in question does end up changing the key_control->thoff
> value for the failure case, so maybe that in turn ends up screwing up
> a later skb_pull.
> 
> I'm not seeing anything that might fix this in the last networking
> pull, but I may have missed something.

I think that's the bug Eric has fixed today. See thread

[PATCH net] flow_dissect: call init_default_flow_dissectors() earlier

David has queued up the fix and will send it your way shortly.

Andre
-- 
Max Planck Institute for Developmental Biology
Spemannstraße 35, 72076 Tübingen, Germany. Phone: (+49) 7071 601 829
http://people.tuebingen.mpg.de/maan/




Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Eric Dumazet
On Tue, Nov 22, 2016 at 2:04 PM, Linus Torvalds
 wrote:
> David, Eric,
>
>  what's the situation on this issue? The bisection looks a bit odd,
> but the commit in question does end up changing the key_control->thoff
> value for the failure case, so maybe that in turn ends up screwing up
> a later skb_pull.
>
> I'm not seeing anything that might fix this in the last networking
> pull, but I may have missed something.
>
> I also noticed that the kernel test robot had screwed up the
> participants list for some reason, and had
>
>   "Acked-by: Alexander Duyck , David S.
> Miller" 
>
> as one of the participants. So there's some odd commit parsing issue
> there somewhere. But Alexander seems to have seen this report despite
> that, it just never went anywhere that I can tell.
>
> Linus
>

This is fixed by :
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=c9b8af1330198ae241cd545e1f040019010d44d9

Thanks


Re: [PATCH] net: dsa: mv88e6xxx: egress all frames

2016-11-22 Thread Vivien Didelot
Hi Andrew, Stefan,

Andrew Lunn  writes:

> What you might find useful is
>
> https://github.com/vivien/linux.git 161b96bd7d16d21b0f046c935b70c3b2d277ccc2
>
> although it might need some changes for recent commits.
>
> With that, you can see deeper into the switches registers.

FYI, I have rebased it on top of the latest net-next (f9aa9dc7d2d0):

https://github.com/vivien/linux.git dsa/dev

Thanks,

Vivien


Re: [PATCH 2/2] net: qcom/emac: add support for the Qualcomm Technologies QDF2400

2016-11-22 Thread Timur Tabi

On 11/21/2016 04:58 PM, Timur Tabi wrote:

The QDF2432 and the QDF2400 have slightly different internal PHYs,
so there are some programming differences.  Some of the registers in
the QDF2400 have moved, and some registers require different values
during initialization.

Because of the differences, the internal PHY on the QDF2400 has a new
ACPI HID, QCOM8072.

Signed-off-by: Timur Tabi


There seems to be some disagreement internally as to whether a new HID 
is the right approach.  Please hold off on applying patch [2/2] for now.


Patch [1/2] can be applied, however, if it passes review.

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!

2016-11-22 Thread Linus Torvalds
David, Eric,

 what's the situation on this issue? The bisection looks a bit odd,
but the commit in question does end up changing the key_control->thoff
value for the failure case, so maybe that in turn ends up screwing up
a later skb_pull.

I'm not seeing anything that might fix this in the last networking
pull, but I may have missed something.

I also noticed that the kernel test robot had screwed up the
participants list for some reason, and had

  "Acked-by: Alexander Duyck , David S.
Miller" 

as one of the participants. So there's some odd commit parsing issue
there somewhere. But Alexander seems to have seen this report despite
that, it just never went anywhere that I can tell.

Linus

On Tue, Nov 15, 2016 at 1:20 PM, kernel test robot
 wrote:
>
> FYI, we noticed the following commit:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit 34fad54c2537f7c99d07375e50cb30aa3c23bd83 ("net: __skb_flow_dissect() 
> must cap its return value")
>
> in testcase: pbzip2
> with following parameters:
>
> nr_threads: 25%
> blocksize: 900K
> cpufreq_governor: performance
>
>
>
> on test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 
> 2.70GHz with 64G memory
>
> caused below changes:
>
>
> +------------------------------------------------------------------+------------+------------+
> |                                                                  | 79774d6bfa | 34fad54c25 |
> +------------------------------------------------------------------+------------+------------+
> | boot_successes                                                   | 0          | 2          |
> | boot_failures                                                    | 2          | 20         |
> | invoked_oom-killer:gfp_mask=0x                                   | 2          | 2          |
> | Mem-Info                                                         | 2          | 2          |
> | Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes | 2          | 2          |
> | kernel_BUG_at_include/linux/skbuff.h                             | 0          | 16         |
> | invalid_opcode:#[##]SMP                                          | 0          | 16         |
> | RIP:eth_type_trans                                               | 0          | 16         |
> | Kernel_panic-not_syncing:Fatal_exception_in_interrupt            | 0          | 15         |
> | calltrace:hub_event                                              | 0          | 1          |
> | WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup                        | 0          | 2          |
> | calltrace:parport_pc_init                                        | 0          | 2          |
> | calltrace:SyS_finit_module                                       | 0          | 2          |
> | WARNING:at_lib/kobject.c:#kobject_add_internal                   | 0          | 2          |
> +------------------------------------------------------------------+------------+------------+
>
>
>
> [   19.375251] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> [   19.388892] Sending DHCP requests .
> [   19.388892] [ cut here ]
> [   19.388894] kernel BUG at include/linux/skbuff.h:1935!
> [   19.388895] invalid opcode:  [#1] SMP
> [   19.388896] Modules linked in:
> [   19.388897] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 4.9.0-rc3-00320-g34fad54 #1
> [   19.388898] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS 
> SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [   19.388899] task: 81e0e4c0 task.stack: 81e0
> [   19.388904] RIP: 0010:[]  [] 
> eth_type_trans+0xe8/0x140
> [   19.388904] RSP: :88081e803db8  EFLAGS: 00010297
> [   19.388905] RAX: 0152 RBX: 88080221f200 RCX: 
> 1073
> [   19.388905] RDX: 8808013afdc0 RSI: 880801114000 RDI: 
> 880819407c00
> [   19.388906] RBP: 88081e803e20 R08: 880801114000 R09: 
> 0800
> [   19.388907] R10: 8808013afec0 R11: ea003fd5a880 R12: 
> 880819407c00
> [   19.388907] R13: 881033408000 R14: c9000843e000 R15: 
> 0158
> [   19.388908] FS:  () GS:88081e80() 
> knlGS:
> [   19.388909] CS:  0010 DS:  ES:  CR0: 80050033
> [   19.388910] CR2: 88103000 CR3: 01e07000 CR4: 
> 001406f0
> [   19.388910] Stack:
> [   19.388912]  816905a7 ea003fd5a880 ea08 
> 88080221f050
> [   19.388913]  88080221f000 00400160 ea003fd5a880 
> 
> [   19.388915]  0040  88080221f050 
> 88100d216000
> [   19.388915] Call Trace:
> [   19.388919]  
> [   19.388919]  [] ? igb_clean_rx_irq+0x6a7/0x7d0
> [   19.388921]  [] igb_poll+0x382/0x700
> [   

Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration

2016-11-22 Thread Jiri Pirko
Tue, Nov 22, 2016 at 06:48:29PM CET, and...@lunn.ch wrote:
>Hi Ido
> 
>> First of all, I want to be sure that when we say "CPU port", we're
>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>> the device and the host, through which all packets trapped to the host
>> go through. So, when a packet is trapped, the driver reads its Rx
>> descriptor, checks through which port it ingressed, resolves its netdev,
>> sets skb->dev accordingly and injects it to the Rx path via
>> netif_receive_skb(). The CPU port itself isn't represented using a
>> netdev.
>
>With DSA, we have a real physical ethernet network interface for the
>'cpu' port. It connects to one of the ports of the switch. Frames on

Every port should be visible as a netdevice, including the CPU port.
Would it make sense to have representors for those?

>this interface have an extra header, indicating which switch port it
>came from, and we do a similar resolving it to a slave netdev, strip
>of the header and injecting it into the receiver path via
>netif_receive_skb().
>
>   Andrew


[PATCH net-next v2] ethtool: Protect {get,set}_phy_tunable with PHY device mutex

2016-11-22 Thread Florian Fainelli
PHY drivers should be able to rely on the caller of {get,set}_tunable to
have acquired the PHY device mutex, in order to both serialize against
concurrent calls of these functions, but also against PHY state machine
changes. All ethtool PHY-level functions do this, except
{get,set}_tunable, so we make them consistent here as well.

We need to update the Microsemi PHY driver in the same commit to avoid
introducing either deadlocks, or lack of proper locking.
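
The convention described above (the caller takes the lock, the driver
callback assumes it is held) can be sketched in userspace with pthreads.
This is an illustration only: struct toy_phy and every toy_* function are
made up, and a pthread mutex stands in for the kernel mutex in phydev->lock.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct toy_phy {
	pthread_mutex_t lock;
	uint8_t downshift;
	int (*get_tunable)(struct toy_phy *phy, uint8_t *val);
};

/*
 * Driver callback. It relies on the caller holding phy->lock and must
 * not take the lock itself: with a non-recursive mutex, locking here
 * as well would deadlock.
 */
static int toy_downshift_get(struct toy_phy *phy, uint8_t *val)
{
	*val = phy->downshift;
	return 0;
}

/* Core entry point: serializes every call into the driver. */
static int toy_core_get_tunable(struct toy_phy *phy, uint8_t *val)
{
	int ret;

	pthread_mutex_lock(&phy->lock);
	ret = phy->get_tunable(phy, val);
	pthread_mutex_unlock(&phy->lock);
	return ret;
}

/* Small scenario: read the tunable back through the core entry point. */
static int toy_locking_demo(void)
{
	struct toy_phy phy = { .downshift = 4,
			       .get_tunable = toy_downshift_get };
	uint8_t val = 0;
	int ret;

	pthread_mutex_init(&phy.lock, NULL);
	ret = toy_core_get_tunable(&phy, &val);
	pthread_mutex_destroy(&phy.lock);
	return ret ? -1 : val;
}
```

Moving the locking to the single core entry point is why the driver-side
mutex calls have to go in the same commit: leaving both in place would
deadlock, and removing both would leave the callback unprotected.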

Fixes: 968ad9da7e0e ("ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE")
Fixes: 310d9ad57ae0 ("net: phy: Add downshift get/set support in Microsemi PHYs driver")
Signed-off-by: Florian Fainelli 
---
Changes in v2:

- also patch drivers/net/phy/mscc.c in the same commit

 drivers/net/phy/mscc.c | 16 +---
 net/core/ethtool.c |  4 
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 92018ba6209e..7a3740c7bf6d 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -115,10 +115,9 @@ static int vsc85xx_downshift_get(struct phy_device 
*phydev, u8 *count)
int rc;
u16 reg_val;
 
-   mutex_lock(&phydev->lock);
rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
if (rc != 0)
-   goto out_unlock;
+   goto out;
 
reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
reg_val &= DOWNSHIFT_CNTL_MASK;
@@ -128,9 +127,7 @@ static int vsc85xx_downshift_get(struct phy_device *phydev, 
u8 *count)
*count = ((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2;
rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
 
-out_unlock:
-   mutex_unlock(&phydev->lock);
-
+out:
return rc;
 }
 
@@ -150,23 +147,20 @@ static int vsc85xx_downshift_set(struct phy_device 
*phydev, u8 count)
count = (((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
}
 
-   mutex_lock(&phydev->lock);
rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
if (rc != 0)
-   goto out_unlock;
+   goto out;
 
reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
reg_val &= ~(DOWNSHIFT_CNTL_MASK);
reg_val |= count;
rc = phy_write(phydev, MSCC_PHY_ACTIPHY_CNTL, reg_val);
if (rc != 0)
-   goto out_unlock;
+   goto out;
 
rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
 
-out_unlock:
-   mutex_unlock(&phydev->lock);
-
+out:
return rc;
 }
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index e9b4556751ff..0adb3bec5b5a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -2466,7 +2466,9 @@ static int get_phy_tunable(struct net_device *dev, void 
__user *useraddr)
data = kmalloc(tuna.len, GFP_USER);
if (!data)
return -ENOMEM;
+   mutex_lock(&phydev->lock);
ret = phydev->drv->get_tunable(phydev, &tuna, data);
+   mutex_unlock(&phydev->lock);
if (ret)
goto out;
useraddr += sizeof(tuna);
@@ -2501,7 +2503,9 @@ static int set_phy_tunable(struct net_device *dev, void 
__user *useraddr)
ret = -EFAULT;
if (copy_from_user(data, useraddr, tuna.len))
goto out;
+   mutex_lock(&phydev->lock);
ret = phydev->drv->set_tunable(phydev, &tuna, data);
+   mutex_unlock(&phydev->lock);
 
 out:
kfree(data);
-- 
2.9.3



Re: [PATCH net-next] ethtool: Protect {get,set}_phy_tunable with PHY device mutex

2016-11-22 Thread Florian Fainelli
On 11/22/2016 12:13 PM, Florian Fainelli wrote:
> PHY drivers should be able to rely on the caller of {get,set}_tunable to
> have acquired the PHY device mutex, in order to both serialize against
> concurrent calls of these functions, but also against PHY state machine
> changes. All ethtool PHY-level functions do this, except
> {get,set}_tunable, so we make them consistent here as well.
> 
> Fixes: 968ad9da7e0e ("ethtool: Implements 
> ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE")
> Signed-off-by: Florian Fainelli 

David, please discard; this is going to create problems for the
Microsemi PHY driver since it also acquires phydev->lock (the patch has
been marked accordingly in patchwork).
Thanks!
-- 
Florian


[PATCH net-next V2 2/7] net/mlx5e: Support HW (offloaded) and SW counters for SRIOV switchdev mode

2016-11-22 Thread Saeed Mahameed
From: Or Gerlitz 

Switchdev driver net-device port statistics should follow the model introduced
in commit a5ea31f57309 'Merge branch net-offloaded-stats'.

For VF reps we return the SRIOV eswitch vport stats as the usual ones and SW
stats if asked. For the PF, if we're in the switchdev mode, we return the
uplink stats and SW stats if asked, otherwise as before. The uplink stats are
implemented using the PPCNT 802_3 counters which are already being read/cached
by the driver.

Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  31 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   | 111 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   1 +
 4 files changed, 128 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ac09767..ebf5dbc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -874,6 +874,7 @@ int mlx5e_add_sqs_fwd_rules(struct mlx5e_priv *priv);
 void mlx5e_remove_sqs_fwd_rules(struct mlx5e_priv *priv);
 int mlx5e_attr_get(struct net_device *dev, struct switchdev_attr *attr);
 void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_update_hw_rep_counters(struct mlx5e_priv *priv);
 
 int mlx5e_create_direct_rqts(struct mlx5e_priv *priv);
 void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
@@ -890,12 +891,16 @@ struct net_device *mlx5e_create_netdev(struct 
mlx5_core_dev *mdev,
 void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, struct mlx5e_priv *priv);
 int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev);
 void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device 
*netdev);
-struct rtnl_link_stats64 *
-mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats);
 u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout);
 void mlx5e_add_vxlan_port(struct net_device *netdev,
  struct udp_tunnel_info *ti);
 void mlx5e_del_vxlan_port(struct net_device *netdev,
  struct udp_tunnel_info *ti);
 
+int mlx5e_get_offload_stats(int attr_id, const struct net_device *dev,
+   void *sp);
+bool mlx5e_has_offload_stats(const struct net_device *dev, int attr_id);
+
+bool mlx5e_is_uplink_rep(struct mlx5e_priv *priv);
+bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6957608..8e8d809 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -470,16 +470,6 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
kfree(rq->mpwqe.info);
 }
 
-static bool mlx5e_is_vf_vport_rep(struct mlx5e_priv *priv)
-{
-   struct mlx5_eswitch_rep *rep = (struct mlx5_eswitch_rep *)priv->ppriv;
-
-   if (rep && rep->vport != FDB_UPLINK_VPORT)
-   return true;
-
-   return false;
-}
-
 static int mlx5e_create_rq(struct mlx5e_channel *c,
   struct mlx5e_rq_param *param,
   struct mlx5e_rq *rq)
@@ -2664,7 +2654,7 @@ static int mlx5e_ndo_setup_tc(struct net_device *dev, u32 
handle,
return mlx5e_setup_tc(dev, tc->tc);
 }
 
-struct rtnl_link_stats64 *
+static struct rtnl_link_stats64 *
 mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
@@ -2672,13 +2662,20 @@ mlx5e_get_stats(struct net_device *dev, struct 
rtnl_link_stats64 *stats)
struct mlx5e_vport_stats *vstats = &priv->stats.vport;
struct mlx5e_pport_stats *pstats = &priv->stats.pport;
 
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
+   if (mlx5e_is_uplink_rep(priv)) {
+   stats->rx_packets = PPORT_802_3_GET(pstats, a_frames_received_ok);
+   stats->rx_bytes   = PPORT_802_3_GET(pstats, a_octets_received_ok);
+   stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
+   stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
+   } else {
+   stats->rx_packets = sstats->rx_packets;
+   stats->rx_bytes   = sstats->rx_bytes;
+   stats->tx_packets = sstats->tx_packets;
+   stats->tx_bytes   = sstats->tx_bytes;
+   stats->tx_dropped = sstats->tx_queue_dropped;
+   }
 
stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
-   

[PATCH net-next V2 3/7] net/mlx5e: Support VF vport link state control for SRIOV switchdev mode

2016-11-22 Thread Saeed Mahameed
From: Or Gerlitz 

Reflect the administrative link changes done on the VF representor to the
VF e-switch vport. This means that running ip link set down/up on
the VF rep will modify the e-switch vport state, which in turn will make
proper VF drivers set their carrier accordingly.

Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 33 ++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index e0d1a56..5e33f6b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -236,6 +236,35 @@ void mlx5e_nic_rep_unload(struct mlx5_eswitch *esw,
mlx5e_tc_init(priv);
 }
 
+static int mlx5e_rep_open(struct net_device *dev)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+   struct mlx5_eswitch_rep *rep = priv->ppriv;
+   struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+   int err;
+
+   err = mlx5e_open(dev);
+   if (err)
+   return err;
+
+   err = mlx5_eswitch_set_vport_state(esw, rep->vport, 
MLX5_ESW_VPORT_ADMIN_STATE_UP);
+   if (!err)
+   netif_carrier_on(dev);
+
+   return 0;
+}
+
+static int mlx5e_rep_close(struct net_device *dev)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+   struct mlx5_eswitch_rep *rep = priv->ppriv;
+   struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+
+   (void)mlx5_eswitch_set_vport_state(esw, rep->vport, 
MLX5_ESW_VPORT_ADMIN_STATE_DOWN);
+
+   return mlx5e_close(dev);
+}
+
 static int mlx5e_rep_get_phys_port_name(struct net_device *dev,
char *buf, size_t len)
 {
@@ -349,8 +378,8 @@ static const struct switchdev_ops mlx5e_rep_switchdev_ops = 
{
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_rep = {
-   .ndo_open= mlx5e_open,
-   .ndo_stop= mlx5e_close,
+   .ndo_open= mlx5e_rep_open,
+   .ndo_stop= mlx5e_rep_close,
.ndo_start_xmit  = mlx5e_xmit,
.ndo_get_phys_port_name  = mlx5e_rep_get_phys_port_name,
.ndo_setup_tc= mlx5e_rep_ndo_setup_tc,
-- 
2.7.4



[PATCH net-next V2 4/7] devlink: Add E-Switch inline mode control

2016-11-22 Thread Saeed Mahameed
From: Roi Dayan 

Some HWs need the VF driver to put part of the packet headers on the
TX descriptor so the e-switch can do proper matching and steering.

The supported modes: none, link, network, transport.
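
As a small illustration of the four modes, the sketch below maps between
mode values and the names listed above. It is not taken from the patch: the
sketch_* identifiers are hypothetical, and only the ordering (none, link,
network, transport) follows the enum added to the uapi header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the four modes, in the order listed above. */
enum sketch_inline_mode {
	SKETCH_INLINE_NONE,
	SKETCH_INLINE_LINK,
	SKETCH_INLINE_NETWORK,
	SKETCH_INLINE_TRANSPORT,
};

static const char *const sketch_inline_names[] = {
	"none", "link", "network", "transport",
};

#define SKETCH_NUM_MODES \
	(sizeof(sketch_inline_names) / sizeof(sketch_inline_names[0]))

/* Mode value to name; NULL for a value outside the known range. */
static const char *sketch_inline_mode_name(unsigned int mode)
{
	return mode < SKETCH_NUM_MODES ? sketch_inline_names[mode] : NULL;
}

/* Name to mode value; -1 if the name is not one of the four modes. */
static int sketch_inline_mode_parse(const char *name)
{
	size_t i;

	for (i = 0; i < SKETCH_NUM_MODES; i++)
		if (strcmp(name, sketch_inline_names[i]) == 0)
			return (int)i;
	return -1;
}
```

Rejecting unknown names and out-of-range values explicitly keeps a
userspace tool from passing an undefined mode down to the driver.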

Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 include/net/devlink.h|  2 ++
 include/uapi/linux/devlink.h |  8 +
 net/core/devlink.c   | 70 
 3 files changed, 61 insertions(+), 19 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 211bd3c..d29e5fc 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -92,6 +92,8 @@ struct devlink_ops {
 
int (*eswitch_mode_get)(struct devlink *devlink, u16 *p_mode);
int (*eswitch_mode_set)(struct devlink *devlink, u16 mode);
+   int (*eswitch_inline_mode_get)(struct devlink *devlink, u8 
*p_inline_mode);
+   int (*eswitch_inline_mode_set)(struct devlink *devlink, u8 inline_mode);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 915bfa7..9014c33 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -102,6 +102,13 @@ enum devlink_eswitch_mode {
DEVLINK_ESWITCH_MODE_SWITCHDEV,
 };
 
+enum devlink_eswitch_inline_mode {
+   DEVLINK_ESWITCH_INLINE_MODE_NONE,
+   DEVLINK_ESWITCH_INLINE_MODE_LINK,
+   DEVLINK_ESWITCH_INLINE_MODE_NETWORK,
+   DEVLINK_ESWITCH_INLINE_MODE_TRANSPORT,
+};
+
 enum devlink_attr {
/* don't change the order or add anything between, this is ABI! */
DEVLINK_ATTR_UNSPEC,
@@ -133,6 +140,7 @@ enum devlink_attr {
DEVLINK_ATTR_SB_OCC_CUR,/* u32 */
DEVLINK_ATTR_SB_OCC_MAX,/* u32 */
DEVLINK_ATTR_ESWITCH_MODE,  /* u16 */
+   DEVLINK_ATTR_ESWITCH_INLINE_MODE,   /* u8 */
 
/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index c14f8b6..2b5bf9e 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1394,26 +1394,45 @@ static int devlink_nl_cmd_sb_occ_max_clear_doit(struct 
sk_buff *skb,
 
 static int devlink_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
enum devlink_command cmd, u32 portid,
-   u32 seq, int flags, u16 mode)
+   u32 seq, int flags)
 {
+   const struct devlink_ops *ops = devlink->ops;
void *hdr;
+   int err = 0;
+   u16 mode;
+   u8 inline_mode;
 
hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd);
if (!hdr)
return -EMSGSIZE;
 
-   if (devlink_nl_put_handle(msg, devlink))
-   goto nla_put_failure;
+   err = devlink_nl_put_handle(msg, devlink);
+   if (err)
+   goto out;
 
-   if (nla_put_u16(msg, DEVLINK_ATTR_ESWITCH_MODE, mode))
-   goto nla_put_failure;
+   err = ops->eswitch_mode_get(devlink, &mode);
+   if (err)
+   goto out;
+   err = nla_put_u16(msg, DEVLINK_ATTR_ESWITCH_MODE, mode);
+   if (err)
+   goto out;
+
+   if (ops->eswitch_inline_mode_get) {
+   err = ops->eswitch_inline_mode_get(devlink, &inline_mode);
+   if (err)
+   goto out;
+   err = nla_put_u8(msg, DEVLINK_ATTR_ESWITCH_INLINE_MODE,
+inline_mode);
+   if (err)
+   goto out;
+   }
 
genlmsg_end(msg, hdr);
return 0;
 
-nla_put_failure:
+out:
genlmsg_cancel(msg, hdr);
-   return -EMSGSIZE;
+   return err;
 }
 
 static int devlink_nl_cmd_eswitch_mode_get_doit(struct sk_buff *skb,
@@ -1422,22 +1441,17 @@ static int devlink_nl_cmd_eswitch_mode_get_doit(struct 
sk_buff *skb,
struct devlink *devlink = info->user_ptr[0];
const struct devlink_ops *ops = devlink->ops;
struct sk_buff *msg;
-   u16 mode;
int err;
 
if (!ops || !ops->eswitch_mode_get)
return -EOPNOTSUPP;
 
-   err = ops->eswitch_mode_get(devlink, &mode);
-   if (err)
-   return err;
-
msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
if (!msg)
return -ENOMEM;
 
err = devlink_eswitch_fill(msg, devlink, DEVLINK_CMD_ESWITCH_MODE_GET,
-  info->snd_portid, info->snd_seq, 0, mode);
+  info->snd_portid, info->snd_seq, 0);
 
if (err) {
nlmsg_free(msg);
@@ -1453,15 +1467,32 @@ static int devlink_nl_cmd_eswitch_mode_set_doit(struct 
sk_buff *skb,
struct devlink *devlink = info->user_ptr[0];
const struct devlink_ops *ops = devlink->ops;
u16 mode;
+

[PATCH net-next V2 5/7] net/mlx5: Enable to query min inline for a specific vport

2016-11-22 Thread Saeed Mahameed
From: Roi Dayan 

Also move the inline capabilities enum to a shared header, vport.h.

Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  6 --
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 11 +--
 drivers/net/ethernet/mellanox/mlx5/core/vport.c   | 14 --
 include/linux/mlx5/vport.h| 10 --
 4 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ebf5dbc..a2b32ed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -150,12 +150,6 @@ static inline int mlx5_max_log_rq_size(int wq_type)
}
 }
 
-enum {
-   MLX5E_INLINE_MODE_L2,
-   MLX5E_INLINE_MODE_VPORT_CONTEXT,
-   MLX5_INLINE_MODE_NOT_REQUIRED,
-};
-
 struct mlx5e_tx_wqe {
struct mlx5_wqe_ctrl_seg ctrl;
struct mlx5_wqe_eth_seg  eth;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8e8d809..19403d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -957,7 +957,7 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
sq->bf_buf_size = (1 << MLX5_CAP_GEN(mdev, log_bf_reg_size)) / 2;
sq->max_inline  = param->max_inline;
sq->min_inline_mode =
-   MLX5_CAP_ETH(mdev, wqe_inline_mode) == 
MLX5E_INLINE_MODE_VPORT_CONTEXT ?
+   MLX5_CAP_ETH(mdev, wqe_inline_mode) == 
MLX5_CAP_INLINE_MODE_VPORT_CONTEXT ?
param->min_inline_mode : 0;
 
err = mlx5e_alloc_sq_db(sq, cpu_to_node(c->cpu));
@@ -3417,14 +3417,13 @@ static void mlx5e_query_min_inline(struct mlx5_core_dev 
*mdev,
   u8 *min_inline_mode)
 {
switch (MLX5_CAP_ETH(mdev, wqe_inline_mode)) {
-   case MLX5E_INLINE_MODE_L2:
+   case MLX5_CAP_INLINE_MODE_L2:
*min_inline_mode = MLX5_INLINE_MODE_L2;
break;
-   case MLX5E_INLINE_MODE_VPORT_CONTEXT:
-   mlx5_query_nic_vport_min_inline(mdev,
-   min_inline_mode);
+   case MLX5_CAP_INLINE_MODE_VPORT_CONTEXT:
+   mlx5_query_nic_vport_min_inline(mdev, 0, min_inline_mode);
break;
-   case MLX5_INLINE_MODE_NOT_REQUIRED:
+   case MLX5_CAP_INLINE_MODE_NOT_REQUIRED:
*min_inline_mode = MLX5_INLINE_MODE_NONE;
break;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 525f17a..269e440 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -113,15 +113,17 @@ static int mlx5_modify_nic_vport_context(struct 
mlx5_core_dev *mdev, void *in,
return mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
 }
 
-void mlx5_query_nic_vport_min_inline(struct mlx5_core_dev *mdev,
-u8 *min_inline_mode)
+int mlx5_query_nic_vport_min_inline(struct mlx5_core_dev *mdev,
+   u16 vport, u8 *min_inline)
 {
u32 out[MLX5_ST_SZ_DW(query_nic_vport_context_out)] = {0};
+   int err;
 
-   mlx5_query_nic_vport_context(mdev, 0, out, sizeof(out));
-
-   *min_inline_mode = MLX5_GET(query_nic_vport_context_out, out,
-   nic_vport_context.min_wqe_inline_mode);
+   err = mlx5_query_nic_vport_context(mdev, vport, out, sizeof(out));
+   if (!err)
+   *min_inline = MLX5_GET(query_nic_vport_context_out, out,
+  nic_vport_context.min_wqe_inline_mode);
+   return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_min_inline);
 
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 451b0bd..ec35157 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -36,6 +36,12 @@
 #include 
 #include 
 
+enum {
+   MLX5_CAP_INLINE_MODE_L2,
+   MLX5_CAP_INLINE_MODE_VPORT_CONTEXT,
+   MLX5_CAP_INLINE_MODE_NOT_REQUIRED,
+};
+
 u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod, u16 vport);
 u8 mlx5_query_vport_admin_state(struct mlx5_core_dev *mdev, u8 opmod,
u16 vport);
@@ -43,8 +49,8 @@ int mlx5_modify_vport_admin_state(struct mlx5_core_dev *mdev, 
u8 opmod,
  u16 vport, u8 state);
 int mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev,
 u16 vport, u8 *addr);
-void mlx5_query_nic_vport_min_inline(struct mlx5_core_dev *mdev,
-u8 *min_inline);
+int 

[PATCH net-next V2 7/7] net/mlx5e: Enforce min inline mode when offloading flows

2016-11-22 Thread Saeed Mahameed
From: Roi Dayan 

A flow should be offloaded only if the matches are
allowed according to min inline mode.

Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 46 +++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4b99112..4d06fab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -279,8 +279,10 @@ static int parse_tunnel_attr(struct mlx5e_priv *priv,
return 0;
 }
 
-static int parse_cls_flower(struct mlx5e_priv *priv, struct mlx5_flow_spec 
*spec,
-   struct tc_cls_flower_offload *f)
+static int __parse_cls_flower(struct mlx5e_priv *priv,
+ struct mlx5_flow_spec *spec,
+ struct tc_cls_flower_offload *f,
+ u8 *min_inline)
 {
void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
   outer_headers);
@@ -289,6 +291,8 @@ static int parse_cls_flower(struct mlx5e_priv *priv, struct 
mlx5_flow_spec *spec
u16 addr_type = 0;
u8 ip_proto = 0;
 
+   *min_inline = MLX5_INLINE_MODE_L2;
+
if (f->dissector->used_keys &
~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
  BIT(FLOW_DISSECTOR_KEY_BASIC) |
@@ -362,6 +366,9 @@ static int parse_cls_flower(struct mlx5e_priv *priv, struct 
mlx5_flow_spec *spec
 mask->ip_proto);
MLX5_SET(fte_match_set_lyr_2_4, headers_v, ip_protocol,
 key->ip_proto);
+
+   if (mask->ip_proto)
+   *min_inline = MLX5_INLINE_MODE_IP;
}
 
if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
@@ -432,6 +439,9 @@ static int parse_cls_flower(struct mlx5e_priv *priv, struct 
mlx5_flow_spec *spec
memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, headers_v,
dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
   &key->dst, sizeof(key->dst));
+
+   if (mask->src || mask->dst)
+   *min_inline = MLX5_INLINE_MODE_IP;
}
 
if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
@@ -457,6 +467,10 @@ static int parse_cls_flower(struct mlx5e_priv *priv, 
struct mlx5_flow_spec *spec
memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, headers_v,
dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
   &key->dst, sizeof(key->dst));
+
+   if (ipv6_addr_type(&mask->src) != IPV6_ADDR_ANY ||
+   ipv6_addr_type(&mask->dst) != IPV6_ADDR_ANY)
+   *min_inline = MLX5_INLINE_MODE_IP;
}
 
if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
@@ -497,11 +511,39 @@ static int parse_cls_flower(struct mlx5e_priv *priv, 
struct mlx5_flow_spec *spec
   "Only UDP and TCP transport are 
supported\n");
return -EINVAL;
}
+
+   if (mask->src || mask->dst)
+   *min_inline = MLX5_INLINE_MODE_TCP_UDP;
}
 
return 0;
 }
 
+static int parse_cls_flower(struct mlx5e_priv *priv,
+   struct mlx5_flow_spec *spec,
+   struct tc_cls_flower_offload *f)
+{
+   struct mlx5_core_dev *dev = priv->mdev;
+   struct mlx5_eswitch *esw = dev->priv.eswitch;
+   struct mlx5_eswitch_rep *rep = priv->ppriv;
+   u8 min_inline;
+   int err;
+
+   err = __parse_cls_flower(priv, spec, f, &min_inline);
+
+   if (!err && esw->mode == SRIOV_OFFLOADS &&
+   rep->vport != FDB_UPLINK_VPORT) {
+   if (min_inline > esw->offloads.inline_mode) {
+   netdev_warn(priv->netdev,
+   "Flow is not offloaded due to min inline 
setting, required %d actual %d\n",
+   min_inline, esw->offloads.inline_mode);
+   return -EOPNOTSUPP;
+   }
+   }
+
+   return err;
+}
+
 static int parse_tc_nic_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
u32 *action, u32 *flow_tag)
 {
-- 
2.7.4
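
The check added by this patch can be sketched outside the driver as follows. This is a minimal illustration, assuming only what the diff shows: the required inline level starts at L2, rises to IP when the flow matches on ip_proto or IP addresses, rises to TCP/UDP when it matches on ports, and the flow is rejected when the required level exceeds the configured one. The names `inline_mode`, `flow_match`, `required_inline_mode` and `flow_can_offload` are hypothetical and are not part of mlx5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the MLX5_INLINE_MODE_* values. Only their
 * ordering matters here: a deeper header level has a larger value. */
enum inline_mode {
	INLINE_MODE_NONE,
	INLINE_MODE_L2,
	INLINE_MODE_IP,
	INLINE_MODE_TCP_UDP,
};

/* Hypothetical summary of which header fields a flow matches on. */
struct flow_match {
	bool ip_proto;	/* non-zero ip_proto mask */
	bool ip_addrs;	/* non-zero IPv4/IPv6 address mask */
	bool l4_ports;	/* non-zero TCP/UDP port mask */
};

/* Deepest header level the hardware must see inline to match this flow. */
static enum inline_mode required_inline_mode(const struct flow_match *m)
{
	enum inline_mode mode = INLINE_MODE_L2;

	assert(m != NULL);
	if (m->ip_proto || m->ip_addrs)
		mode = INLINE_MODE_IP;
	if (m->l4_ports)
		mode = INLINE_MODE_TCP_UDP;
	return mode;
}

/* A flow may be offloaded only if the configured level covers its needs. */
static bool flow_can_offload(const struct flow_match *m,
			     enum inline_mode configured)
{
	return required_inline_mode(m) <= configured;
}
```

With the configured level at IP, a flow that matches on TCP ports would be refused, which corresponds to the -EOPNOTSUPP path in the patch.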



[PATCH net-next V2 0/7] Mellanox 100G mlx5 SRIOV switchdev update

2016-11-22 Thread Saeed Mahameed
Hi Dave,

This series from Roi and Or further enhances the new SRIOV switchdev mode.

Roi's patches deal with allowing users to configure through devlink
the level of inline headers that the VF should be setting in order for
the eswitch HW to do proper matching. We also enforce that the matching
required for offloaded TC rules is aligned with that level on the PF driver.

Or's patches deal with allowing the user to control the VF operational
link state through admin directives on the mlx5 VF rep link. Also in this series
is an implementation of HW and SW counters for the mlx5 VF rep which is aligned
with the design set by commit a5ea31f57309 'Merge branch net-offloaded-stats'.

v1 --> v2:
* constified the net-device param of get offloaded stats ndo in mlxsw
  (pointed out by 0-day screaming at us...)
* added Or's Reviewed-by tags for Roi's patches

This series was generated against commit
e796f49d826a ("net: ieee802154: constify ieee802154_ops structures")

Thanks,
Saeed.

Or Gerlitz (3):
  net: Add net-device param to the get offloaded stats ndo
  net/mlx5e: Support HW (offloaded) and SW counters for SRIOV switchdev
mode
  net/mlx5e: Support VF vport link state control for SRIOV switchdev
mode

Roi Dayan (4):
  devlink: Add E-Switch inline mode control
  net/mlx5: Enable to query min inline for a specific vport
  net/mlx5: E-Switch, Add control for inline mode
  net/mlx5e: Enforce min inline mode when offloading flows

 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  15 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  42 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   | 144 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  46 ++-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |   4 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 141 
 drivers/net/ethernet/mellanox/mlx5/core/main.c |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/vport.c|  14 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   2 +-
 include/linux/mlx5/vport.h |  10 +-
 include/linux/netdevice.h  |   4 +-
 include/net/devlink.h  |   2 +
 include/uapi/linux/devlink.h   |   8 ++
 net/core/devlink.c |  70 +++---
 net/core/rtnetlink.c   |   4 +-
 17 files changed, 438 insertions(+), 72 deletions(-)

-- 
2.7.4



[PATCH net-next V2 6/7] net/mlx5: E-Switch, Add control for inline mode

2016-11-22 Thread Saeed Mahameed
From: Roi Dayan 

Implement devlink show and set of HW inline-mode.
The supported modes: none, link, network, transport.
We currently support one mode for all vports so set is done on all vports.
When eswitch is first initialized the inline-mode is queried from the FW.

Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |   4 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 141 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c |   2 +
 4 files changed, 148 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 9734ac8..d6807c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1798,6 +1798,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
esw->total_vports = total_vports;
esw->enabled_vports = 0;
esw->mode = SRIOV_NONE;
+   esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
 
dev->priv.eswitch = esw;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 40482e8..cf1aa56 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -200,6 +200,7 @@ struct mlx5_esw_offload {
struct mlx5_flow_group *vport_rx_group;
struct mlx5_eswitch_rep *vport_reps;
DECLARE_HASHTABLE(encap_tbl, 8);
+   u8 inline_mode;
 };
 
 struct mlx5_eswitch {
@@ -309,6 +310,9 @@ void mlx5_eswitch_sqs2vport_stop(struct mlx5_eswitch *esw,
 
 int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode);
 int mlx5_devlink_eswitch_mode_get(struct devlink *devlink, u16 *mode);
+int mlx5_devlink_eswitch_inline_mode_set(struct devlink *devlink, u8 mode);
+int mlx5_devlink_eswitch_inline_mode_get(struct devlink *devlink, u8 *mode);
+int mlx5_eswitch_inline_mode_get(struct mlx5_eswitch *esw, int nvfs, u8 *mode);
 void mlx5_eswitch_register_vport_rep(struct mlx5_eswitch *esw,
 int vport_index,
 struct mlx5_eswitch_rep *rep);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 731f286..5c01550 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -657,6 +657,14 @@ static int esw_offloads_start(struct mlx5_eswitch *esw)
if (err1)
esw_warn(esw->dev, "Failed setting eswitch back to 
legacy, err %d\n", err);
}
+   if (esw->offloads.inline_mode == MLX5_INLINE_MODE_NONE) {
+   if (mlx5_eswitch_inline_mode_get(esw,
+num_vfs,
+&esw->offloads.inline_mode)) {
+   esw->offloads.inline_mode = MLX5_INLINE_MODE_L2;
+   esw_warn(esw->dev, "Inline mode is different between 
vports\n");
+   }
+   }
return err;
 }
 
@@ -771,6 +779,50 @@ static int esw_mode_to_devlink(u16 mlx5_mode, u16 *mode)
return 0;
 }
 
+static int esw_inline_mode_from_devlink(u8 mode, u8 *mlx5_mode)
+{
+   switch (mode) {
+   case DEVLINK_ESWITCH_INLINE_MODE_NONE:
+   *mlx5_mode = MLX5_INLINE_MODE_NONE;
+   break;
+   case DEVLINK_ESWITCH_INLINE_MODE_LINK:
+   *mlx5_mode = MLX5_INLINE_MODE_L2;
+   break;
+   case DEVLINK_ESWITCH_INLINE_MODE_NETWORK:
+   *mlx5_mode = MLX5_INLINE_MODE_IP;
+   break;
+   case DEVLINK_ESWITCH_INLINE_MODE_TRANSPORT:
+   *mlx5_mode = MLX5_INLINE_MODE_TCP_UDP;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int esw_inline_mode_to_devlink(u8 mlx5_mode, u8 *mode)
+{
+   switch (mlx5_mode) {
+   case MLX5_INLINE_MODE_NONE:
+   *mode = DEVLINK_ESWITCH_INLINE_MODE_NONE;
+   break;
+   case MLX5_INLINE_MODE_L2:
+   *mode = DEVLINK_ESWITCH_INLINE_MODE_LINK;
+   break;
+   case MLX5_INLINE_MODE_IP:
+   *mode = DEVLINK_ESWITCH_INLINE_MODE_NETWORK;
+   break;
+   case MLX5_INLINE_MODE_TCP_UDP:
+   *mode = DEVLINK_ESWITCH_INLINE_MODE_TRANSPORT;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode)
 {
struct mlx5_core_dev *dev;
@@ -815,6 +867,95 @@ int mlx5_devlink_eswitch_mode_get(struct 

[PATCH net-next V2 1/7] net: Add net-device param to the get offloaded stats ndo

2016-11-22 Thread Saeed Mahameed
From: Or Gerlitz 

Some drivers need to inspect the net-device to decide whether offload
stats are supported. To be used in a downstream mlx5 commit.

Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 +-
 include/linux/netdevice.h  | 4 ++--
 net/core/rtnetlink.c   | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 4a1f9d5..e0d7d5a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -857,7 +857,7 @@ mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
return 0;
 }
 
-static bool mlxsw_sp_port_has_offload_stats(int attr_id)
+static bool mlxsw_sp_port_has_offload_stats(const struct net_device *dev, int 
attr_id)
 {
switch (attr_id) {
case IFLA_OFFLOAD_XSTATS_CPU_HIT:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e84800e..ae32a27 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -925,7 +925,7 @@ struct netdev_xdp {
  * 3. Update dev->stats asynchronously and atomically, and define
  *neither operation.
  *
- * bool (*ndo_has_offload_stats)(int attr_id)
+ * bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id)
  * Return true if this device supports offload stats of this attr_id.
  *
  * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
@@ -1165,7 +1165,7 @@ struct net_device_ops {
 
struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
 struct rtnl_link_stats64 
*storage);
-   bool(*ndo_has_offload_stats)(int attr_id);
+   bool(*ndo_has_offload_stats)(const struct 
net_device *dev, int attr_id);
int (*ndo_get_offload_stats)(int attr_id,
 const struct 
net_device *dev,
 void *attr_data);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index db313ec..f5a8d8a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3665,7 +3665,7 @@ static int rtnl_get_offload_stats(struct sk_buff *skb, 
struct net_device *dev,
if (!size)
continue;
 
-   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   if (!dev->netdev_ops->ndo_has_offload_stats(dev, attr_id))
continue;
 
attr = nla_reserve_64bit(skb, attr_id, size,
@@ -3706,7 +3706,7 @@ static int rtnl_get_offload_stats_size(const struct 
net_device *dev)
 
for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
 attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
-   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   if (!dev->netdev_ops->ndo_has_offload_stats(dev, attr_id))
continue;
size = rtnl_get_offload_stats_attr_size(attr_id);
nla_size += nla_total_size_64bit(size);
-- 
2.7.4
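
Why the extra parameter helps can be shown with a toy callback. This is only a sketch under stated assumptions: `toy_netdev`, its `is_vf_rep` flag and `TOY_XSTATS_CPU_HIT` are hypothetical stand-ins, and the logic is not taken from the mlx5 or mlxsw drivers. The point is that a callback taking only `attr_id` cannot give different answers for different devices, while one that also receives the device can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state a driver might want to consult. */
struct toy_netdev {
	bool is_vf_rep;
};

/* Hypothetical attribute id, standing in for IFLA_OFFLOAD_XSTATS_CPU_HIT. */
enum { TOY_XSTATS_CPU_HIT = 1 };

/* New-style callback: the device is passed in, so the answer may depend on
 * driver-internal state and not only on the attribute id. */
static bool toy_has_offload_stats(const struct toy_netdev *dev, int attr_id)
{
	assert(dev != NULL);

	switch (attr_id) {
	case TOY_XSTATS_CPU_HIT:
		return dev->is_vf_rep;
	}

	return false;
}
```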



Re: [PATCH net-next 4/5] net: phy: bcm7xxx: Add support for downshift/Wirespeed

2016-11-22 Thread Florian Fainelli
On 11/22/2016 12:57 PM, Andrew Lunn wrote:
>>> Maybe we should think about this locking a bit. It is normal for the
>>> lock to be held when using ops in the phy driver structure. The
>>> exception is suspend/resume. Maybe we should also take the lock before
>>> calling the phydev->drv->get_tunable() and phydev->drv->set_tunable()?
>>
>> Yes, that certainly seems like a good approach to me, let me cook a
>> patch doing that.
> 
> Hi Florian
> 
> There are a couple of mutex locks/unlocks you will need to remove from
> mscc.c when you centralize this mutex.

Good point, thanks, let me review the mscc PHY driver and propose a more
proper fix.
-- 
Florian


Re: [PATCH net-next 1/4] net: mvneta: Convert to be 64 bits compatible

2016-11-22 Thread Arnd Bergmann
On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
> +#ifdef CONFIG_64BIT
> +   void *data_tmp;
> +
> +   /* In Neta HW only 32 bits data is supported, so in order to
> +* obtain whole 64 bits address from RX descriptor, we store
> +* the upper 32 bits when allocating buffer, and put it back
> +* when using buffer cookie for accessing packet in memory.
> +* Frags should be allocated from single 'memory' region,
> +* hence common upper address half should be sufficient.
> +*/
> +   data_tmp = mvneta_frag_alloc(pp->frag_size);
> +   if (data_tmp) {
> +   pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> +   mvneta_frag_free(pp->frag_size, data_tmp);
> +   }
> 

How does this work when the region spans a n*4GB address boundary?

Arnd
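
The concern can be made concrete with a small user-space sketch. It assumes the scheme described in the quoted comment: the upper 32 bits are cached once from a probe allocation, and later addresses are rebuilt from that cached half plus a 32-bit cookie. The helpers `upper_32_bits` and `lower_32_bits` mirror the kernel macros of the same name; `rebuild_addr` and `round_trip_ok` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t upper_32_bits(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t lower_32_bits(uint64_t v) { return (uint32_t)v; }

/* Rebuild a 64-bit address from the cached high half and a 32-bit cookie. */
static uint64_t rebuild_addr(uint64_t data_high, uint32_t cookie)
{
	/* The cached half must not carry any low bits. */
	assert((data_high & 0xffffffffu) == 0);
	return data_high | cookie;
}

/* Returns 1 if an address survives the store-low/restore-high round trip
 * when the high half was cached from the probe address. */
static int round_trip_ok(uint64_t probe, uint64_t addr)
{
	uint64_t data_high = (uint64_t)upper_32_bits(probe) << 32;

	return rebuild_addr(data_high, lower_32_bits(addr)) == addr;
}
```

If the probe buffer sits just below a 4GB boundary and a later buffer sits just above it, the two have different upper halves and the rebuilt address points to the wrong place, which is the case the question asks about.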


Re: [PATCH net-next 4/5] net: phy: bcm7xxx: Add support for downshift/Wirespeed

2016-11-22 Thread Andrew Lunn
> > Maybe we should think about this locking a bit. It is normal for the
> > lock to be held when using ops in the phy driver structure. The
> > exception is suspend/resume. Maybe we should also take the lock before
> > calling the phydev->drv->get_tunable() and phydev->drv->set_tunable()?
> 
> Yes, that certainly seems like a good approach to me, let me cook a
> patch doing that.

Hi Florian

There are a couple of mutex locks/unlocks you will need to remove from
mscc.c when you centralize this mutex.

   Andrew


Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it

2016-11-22 Thread Daniel Borkmann

On 11/22/2016 08:28 PM, Cong Wang wrote:

On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko  wrote:

Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote:

Hmm, I don't think we want to have such an additional test in fast
path for each and every classifier. Can we think of ways to avoid that?

My question is, since we unlink individual instances from such tp-internal
lists through RCU and release the instance through call_rcu() as well as
the head (tp->root) via kfree_rcu() eventually, against what are we protecting
setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
not respecting grace period?


If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
to null.


But that's not really an answer to my question. ;)


We do need to respect the grace period if we touch the globally visible
data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing the
right place.


I think there may be multiple issues actually.

At the time we go into tc_classify(), from ingress as well as egress side,
we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we
everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where
we use kfree_rcu() on tp, although we iterate tps (and implicitly inner filters)
via rcu_dereference_bh() from reader side. Is there a reason why we don't
use call_rcu_bh() variant on destruction for all this instead?

Just looking at cls_bpf and others, what protects RCU_INIT_POINTER(tp->root,
NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in
tcf_destroy() cases. Still active readers under RCU BH can race against this
(tp->root being NULL), as the commit identified. Only the get() callback checks
for head against NULL, but both are serialized under rtnl, and the only place
we call this is tc_ctl_tfilter(). Even if we create a new tp, head should not
be NULL there, if it was assigned during the init() cb, but contains an empty
list. (It's different for things like cls_cgroup, though.) So, I'm wondering
if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead (unless I'm
missing something obvious)?


Also I don't know why you blame my commit, this problem should already
exist prior to my commit, probably date back to John's RCU patches.


It seems so.


[PATCH ethtool v4 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift

2016-11-22 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Add ethtool get and set tunable to access PHY drivers.

Ethtool Help: ethtool -h for PHY tunables
ethtool --set-phy-tunable DEVNAME  Set PHY tunable
[ downshift on|off [count N] ]
ethtool --get-phy-tunable DEVNAME  Get PHY tunable
[ downshift ]

Ethtool ex:
  ethtool --set-phy-tunable eth0 downshift on
  ethtool --set-phy-tunable eth0 downshift off
  ethtool --set-phy-tunable eth0 downshift on count 2

  ethtool --get-phy-tunable eth0 downshift

Signed-off-by: Raju Lakkaraju 
Signed-off-by: Allan W. Nielsen 
Acked-by: Florian Fainelli 
Tested-by: Florian Fainelli 
---
 ethtool.8.in |  40 +
 ethtool.c| 144 +++
 2 files changed, 184 insertions(+)

diff --git a/ethtool.8.in b/ethtool.8.in
index 9631847..5c36c06 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -340,6 +340,18 @@ ethtool \- query or control network driver and hardware 
settings
 .B2 tx-lpi on off
 .BN tx-timer
 .BN advertise
+.HP
+.B ethtool \-\-set\-phy\-tunable
+.I devname
+.RB [
+.B downshift
+.A1 on off
+.BN count
+.RB ]
+.HP
+.B ethtool \-\-get\-phy\-tunable
+.I devname
+.RB [ downshift ]
 .
 .\" Adjust lines (i.e. full justification) and hyphenate.
 .ad
@@ -947,6 +959,34 @@ Values are as for
 Sets the amount of time the device should stay in idle mode prior to asserting
 its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled.
 .RE
+.TP
+.B \-\-set\-phy\-tunable
+Sets the PHY tunable parameters.
+.RS 4
+.TP
+.A2 downshift on off
+Specifies whether downshift should be enabled
+.TS
+nokeep;
+lB l.
+.BI count \ N
+Sets the PHY downshift re-tries count.
+.TE
+.PD
+.RE
+.TP
+.B \-\-get\-phy\-tunable
+Gets the PHY tunable parameters.
+.RS 4
+.TP
+.B downshift
+For operation in cabling environments that are incompatible with 1000BASE-T,
+PHY device provides an automatic link speed downshift operation.
+Link speed downshift after N failed 1000BASE-T auto-negotiation attempts.
+Downshift is useful where cable does not have the 4 pairs instance.
+
+Gets the PHY downshift count/status.
+.RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
 .SH AUTHOR
diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7dcd005 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx)
return 0;
 }
 
+static int do_get_phy_tunable(struct cmd_context *ctx)
+{
+   int argc = ctx->argc;
+   char **argp = ctx->argp;
+   int err, i;
+   u8 downshift_changed = 0;
+
+   if (argc < 1)
+   exit_bad_args();
+   for (i = 0; i < argc; i++) {
+   if (!strcmp(argp[i], "downshift")) {
+   downshift_changed = 1;
+   i += 1;
+   if (i < argc)
+   exit_bad_args();
+   } else  {
+   exit_bad_args();
+   }
+   }
+
+   if (downshift_changed) {
+   struct ethtool_tunable ds;
+   u8 count = 0;
+
+   ds.cmd = ETHTOOL_PHY_GTUNABLE;
+   ds.id = ETHTOOL_PHY_DOWNSHIFT;
+   ds.type_id = ETHTOOL_TUNABLE_U8;
+   ds.len = 1;
+   ds.data[0] = &count;
+   err = send_ioctl(ctx, &ds);
+   if (err < 0) {
+   perror("Cannot Get PHY downshift count");
+   return 87;
+   }
+   count = *((u8 *)&ds.data[0]);
+   if (count)
+   fprintf(stdout, "Downshift count: %d\n", count);
+   else
+   fprintf(stdout, "Downshift disabled\n");
+   }
+
+   return err;
+}
+
+static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on)
+{
+   if (ctx->argc < 2)
+   return 0;
+
+   if (strcmp(*ctx->argp, name))
+   return 0;
+
+   if (!strcmp(*(ctx->argp + 1), "on")) {
+   *on = 1;
+   } else if (!strcmp(*(ctx->argp + 1), "off")) {
+   *on = 0;
+   } else {
+   fprintf(stderr, "Invalid boolean\n");
+   exit_bad_args();
+   }
+
+   ctx->argc -= 2;
+   ctx->argp += 2;
+
+   return 1;
+}
+
+static int parse_named_u8(struct cmd_context *ctx, const char *name, u8 *val)
+{
+   if (ctx->argc < 2)
+   return 0;
+
+   if (strcmp(*ctx->argp, name))
+   return 0;
+
+   *val = get_uint_range(*(ctx->argp + 1), 0, 0xff);
+
+   ctx->argc -= 2;
+   ctx->argp += 2;
+
+   return 1;
+}
+
+static int do_set_phy_tunable(struct cmd_context *ctx)
+{
+   int err = 0;
+   u8 ds_cnt = DOWNSHIFT_DEV_DEFAULT_COUNT;
+   u8 ds_changed = 0, ds_has_cnt = 0, ds_enable = 0;
+
+   if (ctx->argc == 0)
+  

[PATCH ethtool v4 1/2] ethtool-copy.h:sync with net

2016-11-22 Thread Allan W. Nielsen
From: Raju Lakkaraju 

This covers kernel changes up to:

commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju 
Date:   Wed Nov 9 16:33:09 2016 +0530

ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables

For operation in cabling environments that are incompatible with
1000BASE-T, PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T.  This feature is useful in setting up in
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju 
Signed-off-by: Allan W. Nielsen 

Signed-off-by: Allan W. Nielsen 
Acked-by: Florian Fainelli 
---
 ethtool-copy.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
void*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT0xff
+#define DOWNSHIFT_DEV_DISABLE  0
+
+enum phy_tunable_id {
+   ETHTOOL_PHY_ID_UNSPEC,
+   ETHTOOL_PHY_DOWNSHIFT,
+   /*
+* Add your fresh new phy tunable attribute above and remember to update
+* phy_tunable_strings[] in net/core/ethtool.c
+*/
+   __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
 /**
  * struct ethtool_regs - hardware register dump
  * @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+   ETH_SS_PHY_TUNABLES,
 };
 
 /**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
 
 #define ETHTOOL_GLINKSETTINGS  0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS  0x004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE   0x004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
-- 
2.7.3



[PATCH ethtool v4 0/2] Adding downshift support to ethtool

2016-11-22 Thread Allan W. Nielsen
(downshift feature is applied in the net-next tree - d3c19c0a72)

This series adds support for downshift (using phy-tunables).

Downshifting can either be turned on/off, or it can be configured to a
specific count.

"count" is optional.

Change set:
v1:
- Initial version of set/get phy tunable with downshift feature.
v2:
- (ethtool) Syntax is changed from "--set-phy-tunable downshift on|off|%d"
  to "--set-phy-tunable [downshift on|off [count N]]" - as requested by
  Andrew.
v3:
- Fixed Spelling in "ethtool-copy.h:sync with net" 
- Fixed "if send_ioctl() returns an error, print the error message and then
  still print the value of count".
v4:
- Fixing spelling in the example included in the commit message
- Improve the description in the man-page

Raju Lakkaraju (2):
  ethtool-copy.h:sync with net
  Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY
downshift

 ethtool-copy.h |  18 +++-
 ethtool.8.in   |  40 
 ethtool.c  | 144 +
 3 files changed, 201 insertions(+), 1 deletion(-)

-- 
2.7.3
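
The "downshift on|off [count N]" syntax described above maps onto a single u8 value. The following is a rough sketch of that mapping, not the parser from the patch: `parse_downshift` is a hypothetical helper, and rejecting "off count N" is an assumption made here for simplicity. The two DOWNSHIFT_DEV_* values are the ones added to ethtool-copy.h in patch 1/2.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
#define DOWNSHIFT_DEV_DISABLE		0

/* Parse "downshift on|off [count N]" into the u8 handed to the PHY driver.
 * Returns 0 on success and -1 on a syntax or range error. */
static int parse_downshift(int argc, char **argv, uint8_t *count)
{
	unsigned long n;
	char *end;

	assert(count != NULL);

	if (argc < 2 || strcmp(argv[0], "downshift"))
		return -1;

	if (!strcmp(argv[1], "off")) {
		if (argc != 2)
			return -1;	/* assumption: no count with "off" */
		*count = DOWNSHIFT_DEV_DISABLE;
		return 0;
	}

	if (strcmp(argv[1], "on"))
		return -1;

	if (argc == 2) {
		/* "on" without a count: let the device pick its default */
		*count = DOWNSHIFT_DEV_DEFAULT_COUNT;
		return 0;
	}

	if (argc != 4 || strcmp(argv[2], "count"))
		return -1;

	errno = 0;
	n = strtoul(argv[3], &end, 0);
	if (errno || end == argv[3] || *end != '\0' || n > 0xff)
		return -1;

	*count = (uint8_t)n;
	return 0;
}
```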



Re: [PATCH net-next 4/5] net: phy: bcm7xxx: Add support for downshift/Wirespeed

2016-11-22 Thread Florian Fainelli
On 11/22/2016 12:02 PM, Andrew Lunn wrote:
>> +static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
>> +struct ethtool_tunable *tuna,
>> +const void *data)
>> +{
>> +u8 count = *(u8 *)data;
>> +int ret;
>> +
>> +switch (tuna->id) {
>> +case ETHTOOL_PHY_DOWNSHIFT:
>> +ret = bcm_phy_downshift_set(phydev, count);
>> +break;
>> +default:
>> +return -EOPNOTSUPP;
>> +}
>> +
>> +if (ret)
>> +return ret;
>> +
>> +/* Disable EEE advertisment since this prevents the PHY
>> + * from successfully linking up, trigger auto-negotiation restart
>> + * to let the MAC decide what to do.
>> + */
>> +ret = bcm_phy_set_eee(phydev, count == DOWNSHIFT_DEV_DISABLE);
>> +if (ret)
>> +return ret;
>> +
>> +return genphy_restart_aneg(phydev);
>> +}
> 
> Hi Florian
> 
> Is the locking O.K. here? The core code does not take the phy lock.
> But i think your shadow register accesses at least need to be
> protected by the lock?

There should be some kind of protection, but I was expecting it to be
done at the caller level, so that when {get,set}_tunable run, they are
serialized with respect to each other, clearly, by looking at the code,
this is not the case.

> 
> Maybe we should think about this locking a bit. It is normal for the
> lock to be held when using ops in the phy driver structure. The
> exception is suspend/resume. Maybe we should also take the lock before
> calling the phydev->drv->get_tunable() and phydev->drv->set_tunable()?

Yes, that certainly seems like a good approach to me, let me cook a
patch doing that.
-- 
Florian


[PATCH net-next] ethtool: Protect {get,set}_phy_tunable with PHY device mutex

2016-11-22 Thread Florian Fainelli
PHY drivers should be able to rely on the caller of {get,set}_tunable to
have acquired the PHY device mutex, in order to both serialize against
concurrent calls of these functions, but also against PHY state machine
changes. All ethtool PHY-level functions do this, except
{get,set}_tunable, so we make them consistent here as well.

Fixes: 968ad9da7e0e ("ethtool: Implements 
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE")
Signed-off-by: Florian Fainelli 
---
 net/core/ethtool.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index e9b4556751ff..0adb3bec5b5a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -2466,7 +2466,9 @@ static int get_phy_tunable(struct net_device *dev, void 
__user *useraddr)
data = kmalloc(tuna.len, GFP_USER);
if (!data)
return -ENOMEM;
+   mutex_lock(&phydev->lock);
ret = phydev->drv->get_tunable(phydev, &tuna, data);
+   mutex_unlock(&phydev->lock);
if (ret)
goto out;
useraddr += sizeof(tuna);
@@ -2501,7 +2503,9 @@ static int set_phy_tunable(struct net_device *dev, void 
__user *useraddr)
ret = -EFAULT;
if (copy_from_user(data, useraddr, tuna.len))
goto out;
+   mutex_lock(&phydev->lock);
ret = phydev->drv->set_tunable(phydev, &tuna, data);
+   mutex_unlock(&phydev->lock);
 
 out:
kfree(data);
-- 
2.9.3
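
The pattern in this patch, where the core takes the device lock around the driver callback so that drivers need not lock themselves, can be sketched in user space. This is an illustration under stated assumptions: `toy_phy`, `toy_set_tunable` and `core_set_phy_tunable` are hypothetical names, a pthread mutex stands in for the kernel mutex, and the `lock_depth` counter exists only so the toy callback can notice an unlocked call.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct toy_phy {
	pthread_mutex_t lock;
	int lock_depth;		/* bookkeeping for the sketch only */
	uint8_t downshift;
	int (*set_tunable)(struct toy_phy *phy, uint8_t val);
};

/* Driver callback: relies on the caller holding phy->lock. */
static int toy_set_tunable(struct toy_phy *phy, uint8_t val)
{
	if (phy->lock_depth == 0)
		return -1;	/* called without the lock held */
	phy->downshift = val;
	return 0;
}

/* Core entry point: serializes every callback under the device lock. */
static int core_set_phy_tunable(struct toy_phy *phy, uint8_t val)
{
	int ret;

	assert(phy != NULL && phy->set_tunable != NULL);

	pthread_mutex_lock(&phy->lock);
	phy->lock_depth++;
	ret = phy->set_tunable(phy, val);
	phy->lock_depth--;
	pthread_mutex_unlock(&phy->lock);

	return ret;
}
```

Centralizing the lock in the caller also means a driver that already takes the same lock inside its callback would have to drop that locking, which is the follow-up Andrew raises for mscc.c.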



Re: [PATCH net-next 4/5] net: phy: bcm7xxx: Add support for downshift/Wirespeed

2016-11-22 Thread Andrew Lunn
> +static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
> + struct ethtool_tunable *tuna,
> + const void *data)
> +{
> + u8 count = *(u8 *)data;
> + int ret;
> +
> + switch (tuna->id) {
> + case ETHTOOL_PHY_DOWNSHIFT:
> + ret = bcm_phy_downshift_set(phydev, count);
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> +
> + if (ret)
> + return ret;
> +
> + /* Disable EEE advertisment since this prevents the PHY
> +  * from successfully linking up, trigger auto-negotiation restart
> +  * to let the MAC decide what to do.
> +  */
> + ret = bcm_phy_set_eee(phydev, count == DOWNSHIFT_DEV_DISABLE);
> + if (ret)
> + return ret;
> +
> + return genphy_restart_aneg(phydev);
> +}

Hi Florian

Is the locking O.K. here? The core code does not take the phy lock.
But i think your shadow register accesses at least need to be
protected by the lock?

Maybe we should think about this locking a bit. It is normal for the
lock to be held when using ops in the phy driver structure. The
exception is suspend/resume. Maybe we should also take the lock before
calling the phydev->drv->get_tunable() and phydev->drv->set_tunable()?

  Andrew


Re: [PATCH net] flow_dissect: call init_default_flow_dissectors() earlier

2016-11-22 Thread Andre Noll
On Tue, Nov 22, 11:17, Eric Dumazet wrote
> -late_initcall_sync(init_default_flow_dissectors);
> +core_initcall(init_default_flow_dissectors);

Indeed, that fixed it. Feel free to add

Tested-by: Andre Noll 

Thanks a lot
Andre
-- 
Max Planck Institute for Developmental Biology
Spemannstraße 35, 72076 Tübingen, Germany. Phone: (+49) 7071 601 829
http://people.tuebingen.mpg.de/maan/




Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Vishwanathapura, Niranjana
Ok, I do understand Jason's point that we should probably not put this driver 
under drivers/infiniband/sw/.., as this driver is not an HCA.
It is a ULP similar to ipoib, built on top of Omni-path irrespective of 
whether we register a hfi_vnic_bus or a direct custom interface with HFI1.
This ULP will transmit and receive Omni-path packets over the fabric, and is 
dependent on IB MAD interface and the HFI1 driver.


Doug,
Will it be acceptable if we put it under 'drivers/infiniband/ulp/hfi_vnic'?

Niranjana



Re: [PATCH net] flow_dissect: call init_default_flow_dissectors() earlier

2016-11-22 Thread David Miller
From: Eric Dumazet 
Date: Tue, 22 Nov 2016 11:17:30 -0800

> From: Eric Dumazet 
> 
> Andre Noll reported panics after my recent fix (commit 34fad54c2537
> "net: __skb_flow_dissect() must cap its return value")
> 
> After some more headaches, Alexander root caused the problem to
> init_default_flow_dissectors() being called too late, in case
> a network driver like IGB is not a module and receives DHCP message
> very early.
> 
> Fix is to call init_default_flow_dissectors() much earlier,
> as it is a core infrastructure and does not depend on another
> kernel service.
> 
> Fixes: 06635a35d13d4 ("flow_dissect: use programable dissector in 
> skb_flow_dissect and friends")
> Signed-off-by: Eric Dumazet 
> Reported-by: Andre Noll 
> Diagnosed-by: Alexander Duyck 

Applied and queued up for -stable, I'll try to fast-track this.


[PATCH net-next 3/5] net: phy: broadcom: Allow enabling or disabling of EEE

2016-11-22 Thread Florian Fainelli
In preparation for adding support for Wirespeed/downshift, we need to
change bcm_phy_eee_enable() to allow enabling or disabling EEE, so make
the function take an extra enable/disable boolean parameter and rename
it to illustrate it sets EEE, not necessarily just enables it.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm-cygnus.c  |  2 +-
 drivers/net/phy/bcm-phy-lib.c | 14 ++
 drivers/net/phy/bcm-phy-lib.h |  2 +-
 drivers/net/phy/bcm7xxx.c |  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/bcm-cygnus.c b/drivers/net/phy/bcm-cygnus.c
index 49bbc6826883..196400cddf68 100644
--- a/drivers/net/phy/bcm-cygnus.c
+++ b/drivers/net/phy/bcm-cygnus.c
@@ -104,7 +104,7 @@ static int bcm_cygnus_config_init(struct phy_device *phydev)
return rc;
 
/* Advertise EEE */
-   rc = bcm_phy_enable_eee(phydev);
+   rc = bcm_phy_set_eee(phydev, true);
if (rc)
return rc;
 
diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index d742894816f6..3156ce6d5861 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -195,7 +195,7 @@ int bcm_phy_enable_apd(struct phy_device *phydev, bool 
dll_pwr_down)
 }
 EXPORT_SYMBOL_GPL(bcm_phy_enable_apd);
 
-int bcm_phy_enable_eee(struct phy_device *phydev)
+int bcm_phy_set_eee(struct phy_device *phydev, bool enable)
 {
int val;
 
@@ -205,7 +205,10 @@ int bcm_phy_enable_eee(struct phy_device *phydev)
if (val < 0)
return val;
 
-   val |= LPI_FEATURE_EN | LPI_FEATURE_EN_DIG1000X;
+   if (enable)
+   val |= LPI_FEATURE_EN | LPI_FEATURE_EN_DIG1000X;
+   else
+   val &= ~(LPI_FEATURE_EN | LPI_FEATURE_EN_DIG1000X);
 
phy_write_mmd_indirect(phydev, BRCM_CL45VEN_EEE_CONTROL,
   MDIO_MMD_AN, (u32)val);
@@ -216,14 +219,17 @@ int bcm_phy_enable_eee(struct phy_device *phydev)
if (val < 0)
return val;
 
-   val |= (MDIO_AN_EEE_ADV_100TX | MDIO_AN_EEE_ADV_1000T);
+   if (enable)
+   val |= (MDIO_AN_EEE_ADV_100TX | MDIO_AN_EEE_ADV_1000T);
+   else
+   val &= ~(MDIO_AN_EEE_ADV_100TX | MDIO_AN_EEE_ADV_1000T);
 
phy_write_mmd_indirect(phydev, BCM_CL45VEN_EEE_ADV,
   MDIO_MMD_AN, (u32)val);
 
return 0;
 }
-EXPORT_SYMBOL_GPL(bcm_phy_enable_eee);
+EXPORT_SYMBOL_GPL(bcm_phy_set_eee);
 
 int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count)
 {
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index 3f492e629094..a117f657c6d7 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -36,7 +36,7 @@ int bcm_phy_config_intr(struct phy_device *phydev);
 
 int bcm_phy_enable_apd(struct phy_device *phydev, bool dll_pwr_down);
 
-int bcm_phy_enable_eee(struct phy_device *phydev);
+int bcm_phy_set_eee(struct phy_device *phydev, bool enable);
 
 int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count);
 
diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 9636da0b6efc..b7789e879670 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -199,7 +199,7 @@ static int bcm7xxx_28nm_config_init(struct phy_device 
*phydev)
if (ret)
return ret;
 
-   ret = bcm_phy_enable_eee(phydev);
+   ret = bcm_phy_set_eee(phydev, true);
if (ret)
return ret;
 
-- 
2.9.3
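
The enable/disable change in this patch boils down to a set-or-clear update of a bit mask on a register value that was read back first. A small standalone sketch of that pattern follows; the DEMO_* bit positions are placeholders chosen for illustration, not the real LPI_FEATURE_* or MDIO_AN_EEE_ADV_* values, and demo_update_bits() is a hypothetical helper rather than a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, NOT the real register layout. */
#define DEMO_FEATURE_EN		(1u << 0)
#define DEMO_FEATURE_EN_DIG	(1u << 1)

/* Set the bits in mask when enable is true, clear them otherwise.
 * All other bits of val are preserved.
 */
uint16_t demo_update_bits(uint16_t val, uint16_t mask, bool enable)
{
	assert(mask != 0);

	if (enable)
		val |= mask;
	else
		val &= (uint16_t)~mask;

	return val;
}
```

Calling demo_update_bits(val, DEMO_FEATURE_EN | DEMO_FEATURE_EN_DIG, false) clears both feature bits and leaves the rest of val untouched, which mirrors the new else branches in the hunks above.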



[PATCH net-next 5/5] net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link change

2016-11-22 Thread Florian Fainelli
In case the link changes and EEE is enabled or disabled, always try to
re-negotiate this with the link partner.

Fixes: 450b05c15f9c ("net: dsa: bcm_sf2: add support for controlling EEE")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index e3ee27ce13dd..9ec33b51a0ed 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -588,6 +588,7 @@ static void bcm_sf2_sw_adjust_link(struct dsa_switch *ds, 
int port,
   struct phy_device *phydev)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
+   struct ethtool_eee *p = &priv->port_sts[port].eee;
u32 id_mode_dis = 0, port_mode;
const char *str = NULL;
u32 reg;
@@ -662,6 +663,9 @@ static void bcm_sf2_sw_adjust_link(struct dsa_switch *ds, 
int port,
reg |= DUPLX_MODE;
 
core_writel(priv, reg, CORE_STS_OVERRIDE_GMIIP_PORT(port));
+
+   if (!phydev->is_pseudo_fixed_link)
+   p->eee_enabled = bcm_sf2_eee_init(ds, port, phydev);
 }
 
 static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
-- 
2.9.3



[PATCH net-next 0/5] net: phy: broadcom: Wirespeed/downshift support

2016-11-22 Thread Florian Fainelli
Hi all,

This patch series adds support for the Broadcom Wirespeed, aka downshift feature
utilizing the recently added ethtool PHY tunables.

Tested with two Gigabit link partners with a 4-wire cable having only 2 pairs
connected.

The last patch in the series is a fix that was required for testing and should
make it to -stable; I can submit it separately against net if you prefer, David.

Thanks!

Florian Fainelli (5):
  net: phy: broadcom: Move bcm54xx_auxctl_{read,write} to common library
  net: phy: broadcom: Add support code for downshift/Wirespeed
  net: phy: broadcom: Allow enabling or disabling of EEE
  net: phy: bcm7xxx: Add support for downshift/Wirespeed
  net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link change

 drivers/net/dsa/bcm_sf2.c |   4 ++
 drivers/net/phy/bcm-cygnus.c  |   2 +-
 drivers/net/phy/bcm-phy-lib.c | 117 --
 drivers/net/phy/bcm-phy-lib.h |  10 +++-
 drivers/net/phy/bcm7xxx.c |  51 +-
 drivers/net/phy/broadcom.c|  15 --
 include/linux/brcmphy.h   |  10 
 7 files changed, 187 insertions(+), 22 deletions(-)

-- 
2.9.3



[PATCH net-next 4/5] net: phy: bcm7xxx: Add support for downshift/Wirespeed

2016-11-22 Thread Florian Fainelli
Add support for configuring the downshift/Wirespeed enable/disable
toggles and specify a link retry value ranging from 1 to 9. Since the
integrated BCM7xxx have issues when wirespeed is enabled and EEE is also
enabled, we do disable EEE if wirespeed is enabled.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm7xxx.c | 51 ++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index b7789e879670..5b3be4c67be8 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -167,6 +167,7 @@ static int bcm7xxx_28nm_config_init(struct phy_device 
*phydev)
 {
u8 rev = PHY_BRCM_7XXX_REV(phydev->dev_flags);
u8 patch = PHY_BRCM_7XXX_PATCH(phydev->dev_flags);
+   u8 count;
int ret = 0;
 
pr_info_once("%s: %s PHY revision: 0x%02x, patch: %d\n",
@@ -199,7 +200,12 @@ static int bcm7xxx_28nm_config_init(struct phy_device 
*phydev)
if (ret)
return ret;
 
-   ret = bcm_phy_set_eee(phydev, true);
+   ret = bcm_phy_downshift_get(phydev, &count);
+   if (ret)
+   return ret;
+
+   /* Only enable EEE if Wirespeed/downshift is disabled */
+   ret = bcm_phy_set_eee(phydev, count == DOWNSHIFT_DEV_DISABLE);
if (ret)
return ret;
 
@@ -303,6 +309,47 @@ static int bcm7xxx_suspend(struct phy_device *phydev)
return 0;
 }
 
+static int bcm7xxx_28nm_get_tunable(struct phy_device *phydev,
+   struct ethtool_tunable *tuna,
+   void *data)
+{
+   switch (tuna->id) {
+   case ETHTOOL_PHY_DOWNSHIFT:
+   return bcm_phy_downshift_get(phydev, (u8 *)data);
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
+   struct ethtool_tunable *tuna,
+   const void *data)
+{
+   u8 count = *(u8 *)data;
+   int ret;
+
+   switch (tuna->id) {
+   case ETHTOOL_PHY_DOWNSHIFT:
+   ret = bcm_phy_downshift_set(phydev, count);
+   break;
+   default:
+   return -EOPNOTSUPP;
+   }
+
+   if (ret)
+   return ret;
+
+   /* Disable EEE advertisement since this prevents the PHY
+* from successfully linking up, trigger auto-negotiation restart
+* to let the MAC decide what to do.
+*/
+   ret = bcm_phy_set_eee(phydev, count == DOWNSHIFT_DEV_DISABLE);
+   if (ret)
+   return ret;
+
+   return genphy_restart_aneg(phydev);
+}
+
 #define BCM7XXX_28NM_GPHY(_oui, _name) \
 {  \
.phy_id = (_oui),   \
@@ -315,6 +362,8 @@ static int bcm7xxx_suspend(struct phy_device *phydev)
.config_aneg= genphy_config_aneg,   \
.read_status= genphy_read_status,   \
.resume = bcm7xxx_28nm_resume,  \
+   .get_tunable= bcm7xxx_28nm_get_tunable, \
+   .set_tunable= bcm7xxx_28nm_set_tunable, \
 }
 
 #define BCM7XXX_40NM_EPHY(_oui, _name) \
-- 
2.9.3



[PATCH net-next 1/5] net: phy: broadcom: Move bcm54xx_auxctl_{read,write} to common library

2016-11-22 Thread Florian Fainelli
We are going to need these functions to implement support for Broadcom
Wirespeed, aka downshift.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm-phy-lib.c | 17 +
 drivers/net/phy/bcm-phy-lib.h |  3 +++
 drivers/net/phy/broadcom.c| 15 ---
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index df0416db0b88..18e11b3a0f41 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -50,6 +50,23 @@ int bcm_phy_read_exp(struct phy_device *phydev, u16 reg)
 }
 EXPORT_SYMBOL_GPL(bcm_phy_read_exp);
 
+int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum)
+{
+   /* The register must be written to both the Shadow Register Select and
+* the Shadow Read Register Selector
+*/
+   phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum |
+ regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT);
+   return phy_read(phydev, MII_BCM54XX_AUX_CTL);
+}
+EXPORT_SYMBOL_GPL(bcm54xx_auxctl_read);
+
+int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
+{
+   return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
+}
+EXPORT_SYMBOL(bcm54xx_auxctl_write);
+
 int bcm_phy_write_misc(struct phy_device *phydev,
   u16 reg, u16 chl, u16 val)
 {
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index b2091c88b44d..31cb4fdf5d5a 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -19,6 +19,9 @@
 int bcm_phy_write_exp(struct phy_device *phydev, u16 reg, u16 val);
 int bcm_phy_read_exp(struct phy_device *phydev, u16 reg);
 
+int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val);
+int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum);
+
 int bcm_phy_write_misc(struct phy_device *phydev,
   u16 reg, u16 chl, u16 value);
 int bcm_phy_read_misc(struct phy_device *phydev,
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index b1e32e9be1b3..409b365f12b1 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -30,21 +30,6 @@ MODULE_DESCRIPTION("Broadcom PHY driver");
 MODULE_AUTHOR("Maciej W. Rozycki");
 MODULE_LICENSE("GPL");
 
-static int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum)
-{
-   /* The register must be written to both the Shadow Register Select and
-* the Shadow Read Register Selector
-*/
-   phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum |
- regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT);
-   return phy_read(phydev, MII_BCM54XX_AUX_CTL);
-}
-
-static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
-{
-   return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
-}
-
 static int bcm54810_config(struct phy_device *phydev)
 {
int rc, val;
-- 
2.9.3



[PATCH net-next 2/5] net: phy: broadcom: Add support code for downshift/Wirespeed

2016-11-22 Thread Florian Fainelli
Broadcom's Wirespeed feature allows us to configure how auto-negotiation
should behave with fewer working pairs of wires on a cable. Add support
code for retrieving and setting such downshift counters using the
recently added ethtool downshift tunables.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm-phy-lib.c | 86 +++
 drivers/net/phy/bcm-phy-lib.h |  5 +++
 include/linux/brcmphy.h   | 10 +
 3 files changed, 101 insertions(+)

diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index 18e11b3a0f41..d742894816f6 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -225,6 +225,92 @@ int bcm_phy_enable_eee(struct phy_device *phydev)
 }
 EXPORT_SYMBOL_GPL(bcm_phy_enable_eee);
 
+int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count)
+{
+   int val;
+
+   val = bcm54xx_auxctl_read(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   if (val < 0)
+   return val;
+
+   /* Check if wirespeed is enabled or not */
+   if (!(val & MII_BCM54XX_AUXCTL_SHDWSEL_MISC_WIRESPEED_EN)) {
+   *count = DOWNSHIFT_DEV_DISABLE;
+   return 0;
+   }
+
+   val = bcm_phy_read_shadow(phydev, BCM54XX_SHD_SCR2);
+   if (val < 0)
+   return val;
+
+   /* Downgrade after one link attempt */
+   if (val & BCM54XX_SHD_SCR2_WSPD_RTRY_DIS) {
+   *count = 1;
+   } else {
+   /* Downgrade after configured retry count */
+   val >>= BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_SHIFT;
+   val &= BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_MASK;
+   *count = val + BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_OFFSET;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(bcm_phy_downshift_get);
+
+int bcm_phy_downshift_set(struct phy_device *phydev, u8 count)
+{
+   int val = 0, ret = 0;
+
+   /* Range check the number given */
+   if (count - BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_OFFSET >
+   BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_MASK &&
+   count != DOWNSHIFT_DEV_DEFAULT_COUNT) {
+   return -ERANGE;
+   }
+
+   val = bcm54xx_auxctl_read(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   if (val < 0)
+   return val;
+
+   /* Set the write enable bit */
+   val |= MII_BCM54XX_AUXCTL_MISC_WREN;
+
+   if (count == DOWNSHIFT_DEV_DISABLE) {
+   val &= ~MII_BCM54XX_AUXCTL_SHDWSEL_MISC_WIRESPEED_EN;
+   return bcm54xx_auxctl_write(phydev,
+   MII_BCM54XX_AUXCTL_SHDWSEL_MISC,
+   val);
+   } else {
+   val |= MII_BCM54XX_AUXCTL_SHDWSEL_MISC_WIRESPEED_EN;
+   ret = bcm54xx_auxctl_write(phydev,
+  MII_BCM54XX_AUXCTL_SHDWSEL_MISC,
+  val);
+   if (ret < 0)
+   return ret;
+   }
+
+   val = bcm_phy_read_shadow(phydev, BCM54XX_SHD_SCR2);
+   val &= ~(BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_MASK <<
+BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_SHIFT |
+BCM54XX_SHD_SCR2_WSPD_RTRY_DIS);
+
+   switch (count) {
+   case 1:
+   val |= BCM54XX_SHD_SCR2_WSPD_RTRY_DIS;
+   break;
+   case DOWNSHIFT_DEV_DEFAULT_COUNT:
+   val |= 1 << BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_SHIFT;
+   break;
+   default:
+   val |= (count - BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_OFFSET) <<
+   BCM54XX_SHD_SCR2_WSPD_RTRY_LMT_SHIFT;
+   break;
+   }
+
+   return bcm_phy_write_shadow(phydev, BCM54XX_SHD_SCR2, val);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_downshift_set);
+
 MODULE_DESCRIPTION("Broadcom PHY Library");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Broadcom Corporation");
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index 31cb4fdf5d5a..3f492e629094 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -37,4 +37,9 @@ int bcm_phy_config_intr(struct phy_device *phydev);
 int bcm_phy_enable_apd(struct phy_device *phydev, bool dll_pwr_down);
 
 int bcm_phy_enable_eee(struct phy_device *phydev);
+
+int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count);
+
+int bcm_phy_downshift_set(struct phy_device *phydev, u8 count);
+
 #endif /* _LINUX_BCM_PHY_LIB_H */
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 848dc508ef57..f9f8aaf9c943 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -114,6 +114,7 @@
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MISC0x0007
 #define MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT  12
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MISC_RGMII_SKEW_EN  (1 << 8)
+#define MII_BCM54XX_AUXCTL_SHDWSEL_MISC_WIRESPEED_EN   (1 << 4)
 
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MASK0x0007
 
@@ -130,6 +131,7 @@
 #define BCM_LED_SRC_INTR   0x6
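
The retry-count encoding used by bcm_phy_downshift_get() and bcm_phy_downshift_set() can be sketched in isolation. The header hunk that defines the BCM54XX_SHD_SCR2_WSPD_* constants is cut off in this copy of the patch, so the DEMO_* values below are assumptions, picked only so that the valid range comes out as 1 to 9 as the 4/5 description states. The demo_* functions are hypothetical and leave out the DOWNSHIFT_DEV_DISABLE and DOWNSHIFT_DEV_DEFAULT_COUNT special cases.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real definitions are not
 * visible in this excerpt.
 */
#define DEMO_RTRY_DIS		0x100	/* "downgrade after one attempt" flag */
#define DEMO_RTRY_LMT_SHIFT	2
#define DEMO_RTRY_LMT_MASK	0x7
#define DEMO_RTRY_LMT_OFFSET	2

/* Encode a retry count into the register value reg.
 * A count of 1 is expressed through the flag bit; larger counts are
 * stored as (count - offset) in the limit field.
 * Returns 0 on success, -1 if count is out of range.
 */
int demo_downshift_encode(uint16_t reg, uint8_t count, uint16_t *out)
{
	assert(out != 0);

	if (count == 0)
		return -1;
	if (count != 1 &&
	    (count < DEMO_RTRY_LMT_OFFSET ||
	     count - DEMO_RTRY_LMT_OFFSET > DEMO_RTRY_LMT_MASK))
		return -1;

	/* Clear the limit field and the flag before writing new values. */
	reg &= (uint16_t)~((DEMO_RTRY_LMT_MASK << DEMO_RTRY_LMT_SHIFT) |
			   DEMO_RTRY_DIS);

	if (count == 1)
		reg |= DEMO_RTRY_DIS;
	else
		reg |= (uint16_t)((count - DEMO_RTRY_LMT_OFFSET) <<
				  DEMO_RTRY_LMT_SHIFT);

	*out = reg;
	return 0;
}

/* Decode the retry count from a register value. */
uint8_t demo_downshift_decode(uint16_t reg)
{
	if (reg & DEMO_RTRY_DIS)
		return 1;

	return (uint8_t)(((reg >> DEMO_RTRY_LMT_SHIFT) & DEMO_RTRY_LMT_MASK) +
			 DEMO_RTRY_LMT_OFFSET);
}
```

Under these assumed values a 3-bit field with an offset of 2 covers counts 2 through 9, and the separate flag bit covers a count of 1.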
 

Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it

2016-11-22 Thread Cong Wang
On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko  wrote:
> Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote:
>>Hmm, I don't think we want to have such an additional test in fast
>>path for each and every classifier. Can we think of ways to avoid that?
>>
>>My question is, since we unlink individual instances from such tp-internal
>>lists through RCU and release the instance through call_rcu() as well as
>>the head (tp->root) via kfree_rcu() eventually, against what are we protecting
>>setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? Something
>>not respecting grace period?
>
> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root
> to null.

We do need to respect the grace period if we touch the globally visible
data structure tp in tcf_destroy(). Therefore Roi's patch is not fixing the
right place.

Also I don't know why you blame my commit, this problem should already
exist prior to my commit, probably date back to John's RCU patches.

I am working on a patch.


[PATCH net] flow_dissect: call init_default_flow_dissectors() earlier

2016-11-22 Thread Eric Dumazet
From: Eric Dumazet 

Andre Noll reported panics after my recent fix (commit 34fad54c2537
"net: __skb_flow_dissect() must cap its return value")

After some more headaches, Alexander root caused the problem to
init_default_flow_dissectors() being called too late, in case
a network driver like IGB is not a module and receives DHCP message
very early.

Fix is to call init_default_flow_dissectors() much earlier,
as it is a core infrastructure and does not depend on another
kernel service.

Fixes: 06635a35d13d4 ("flow_dissect: use programable dissector in 
skb_flow_dissect and friends")
Signed-off-by: Eric Dumazet 
Reported-by: Andre Noll 
Diagnosed-by: Alexander Duyck 
---
 net/core/flow_dissector.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 69e4463a4b1b..c6d8207ffa7e 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1013,4 +1013,4 @@ static int __init init_default_flow_dissectors(void)
return 0;
 }
 
-late_initcall_sync(init_default_flow_dissectors);
+core_initcall(init_default_flow_dissectors);
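
Why the initcall level matters can be shown with a toy userspace model, assuming only that initcalls run level by level and that a built-in driver initialises at the device level. The demo_* names and the three-level enum are hypothetical simplifications; the kernel has more levels and real packet reception is of course not a function call.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified subset of initcall levels, in execution order. */
enum demo_level {
	DEMO_LEVEL_CORE,
	DEMO_LEVEL_DEVICE,
	DEMO_LEVEL_LATE_SYNC,
};

struct demo_initcall {
	enum demo_level level;
	void (*fn)(void);
};

static int demo_dissector_ready;
static int demo_saw_uninitialised;

static void demo_init_dissector(void)
{
	demo_dissector_ready = 1;
}

/* A built-in driver may receive a packet (e.g. DHCP) as soon as it is up. */
static void demo_driver_init(void)
{
	if (!demo_dissector_ready)
		demo_saw_uninitialised = 1;
}

/* Run all initcalls level by level, with the dissector init registered
 * at dissector_level. Returns 1 if the driver ran before the dissector
 * was initialised, 0 otherwise.
 */
int demo_boot(enum demo_level dissector_level)
{
	struct demo_initcall calls[] = {
		{ DEMO_LEVEL_DEVICE, demo_driver_init },
		{ dissector_level, demo_init_dissector },
	};
	int level;
	size_t i;

	assert(dissector_level <= DEMO_LEVEL_LATE_SYNC);

	demo_dissector_ready = 0;
	demo_saw_uninitialised = 0;

	for (level = DEMO_LEVEL_CORE; level <= DEMO_LEVEL_LATE_SYNC; level++)
		for (i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if ((int)calls[i].level == level)
				calls[i].fn();

	return demo_saw_uninitialised;
}
```

In this model demo_boot(DEMO_LEVEL_LATE_SYNC) reports the driver running against an uninitialised dissector, while demo_boot(DEMO_LEVEL_CORE) does not, which is the ordering the one-line change above relies on.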




Re: [PATCH net-next] tcp: enhance tcp_collapse_retrans() with skb_shift()

2016-11-22 Thread Eric Dumazet
On Tue, 2016-11-15 at 12:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> In commit 2331ccc5b323 ("tcp: enhance tcp collapsing"),
> we made a first step allowing copying right skb to left skb head.
> 
> Since all skbs in socket write queue are headless (but possibly the very
> first one), this strategy often does not work.
> 
> This patch extends tcp_collapse_retrans() to perform frag shifting,
> thanks to skb_shift() helper.
> 
> This helper needs to not BUG on non headless skbs, as callers are ok
> with that.
> 
> Tested:
> 
> Following packetdrill test now passes :
> 
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
>+0 < S 0:0(0) win 32792 
>+0 > S. 0:0(0) ack 1 
> +.100 < . 1:1(0) ack 1 win 257
>+0 accept(3, ..., ...) = 4
> 
>+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>+0 write(4, ..., 200) = 200
>+0 > P. 1:201(200) ack 1
> +.001 write(4, ..., 200) = 200
>+0 > P. 201:401(200) ack 1
> +.001 write(4, ..., 200) = 200
>+0 > P. 401:601(200) ack 1
> +.001 write(4, ..., 200) = 200
>+0 > P. 601:801(200) ack 1
> +.001 write(4, ..., 200) = 200
>+0 > P. 801:1001(200) ack 1
> +.001 write(4, ..., 100) = 100
>+0 > P. 1001:1101(100) ack 1
> +.001 write(4, ..., 100) = 100
>+0 > P. 1101:1201(100) ack 1
> +.001 write(4, ..., 100) = 100
>+0 > P. 1201:1301(100) ack 1
> +.001 write(4, ..., 100) = 100
>+0 > P. 1301:1401(100) ack 1
> 
> +.099 < . 1:1(0) ack 201 win 257
> +.001 < . 1:1(0) ack 201 win 257 
>+0 > P. 201:1001(800) ack 1
> 
> Signed-off-by: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
>  net/core/skbuff.c |4 +++-
>  net/ipv4/tcp_output.c |   22 +++---
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 
> 0b2a6e94af2de73ed638634c47a0fb71e2cbc1cb..a9cb81a10c4ba895587727aa4cf098e9a38424ea
>  100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2656,7 +2656,9 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, 
> int shiftlen)
>   struct skb_frag_struct *fragfrom, *fragto;
>  
>   BUG_ON(shiftlen > skb->len);
> - BUG_ON(skb_headlen(skb));   /* Would corrupt stream */
> +
> + if (skb_headlen(skb))
> + return 0;
>  
>   todo = shiftlen;
>   from = 0;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> f57b5aa51b59cf0a58975fe34a7dcdb886ea8c50..19105b46a30436ebb85fe97ee43089e77aa028bb
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2514,7 +2514,7 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
>  }
>  
>  /* Collapses two adjacent SKB's during retransmission. */
> -static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
> +static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
>   struct tcp_sock *tp = tcp_sk(sk);
>   struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
> @@ -2525,14 +2525,17 @@ static void tcp_collapse_retrans(struct sock *sk, 
> struct sk_buff *skb)
>  
>   BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
>  
> + if (next_skb_size) {
> + if (next_skb_size <= skb_availroom(skb))
> + skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
> +   next_skb_size);
> + else if (!skb_shift(skb, next_skb, next_skb_size))
> + return false;
> + }
>   tcp_highest_sack_combine(sk, next_skb, skb);
>  
>   tcp_unlink_write_queue(next_skb, sk);
>  
> - if (next_skb_size)
> - skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
> -   next_skb_size);
> -
>   if (next_skb->ip_summed == CHECKSUM_PARTIAL)
>   skb->ip_summed = CHECKSUM_PARTIAL;
>  
> @@ -2561,6 +2564,7 @@ static void tcp_collapse_retrans(struct sock *sk, 
> struct sk_buff *skb)
>   tcp_skb_collapse_tstamp(skb, next_skb);
>  
>   sk_wmem_free_skb(sk, next_skb);
> + return true;
>  }
>  
>  /* Check if coalescing SKBs is legal. */
> @@ -2610,16 +2614,12 @@ static void tcp_retrans_try_collapse(struct sock *sk, 
> struct sk_buff *to,
>  
>   if (space < 0)
>   break;
> - /* Punt if not enough space exists in the first SKB for
> -  * the data in the second
> -  */
> - if (skb->len > skb_availroom(to))
> - break;
>  
>   if (after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp)))
>   break;
>  
> - tcp_collapse_retrans(sk, to);
> + if (!tcp_collapse_retrans(sk, to))
> + break;
>   }
>  }
>  


David, patch is marked 'Superseded' 

Re: net/icmp: null-ptr-deref in icmp6_send

2016-11-22 Thread David Ahern



> On Nov 22, 2016, at 1:11 PM, Cong Wang  wrote:
> 
>> On Tue, Nov 22, 2016 at 2:23 AM, Andrey Konovalov  
>> wrote:
>> Hi,
>> 
>> I've got the following error report while fuzzing the kernel with syzkaller.
>> 
>> It seems that skb_dst(skb) may end up being NULL.
>> 
>> As far as I can see the bug was introduced in commit 5d41ce29e ("net:
>> icmp6_send should use dst dev to determine L3 domain").
>> ICMP v4 probably has a similar issue due to 9d1a6c4ea ("net:
>> icmp_route_lookup should use rt dev to determine L3 domain").
> 
> 
> ipv6_parse_hopopts() is called before NF_INET_PRE_ROUTING,
> so the skb_dst could be NULL.
> 
> I have no idea what commit 5d41ce29e tried to fix, but we already
> use skb->dev a few lines before l3mdev_master_ifindex(), so I don't
> understand why skb->dev could be NULL, maybe just for vrf dev?

On PTO this week and currently at the beach. Will take a look tonight. Thanks 
for the report. 

Re: [PATCH net] bnxt: do not busy-poll when link is down

2016-11-22 Thread Eric Dumazet
On Tue, 2016-11-22 at 10:55 -0800, Michael Chan wrote:
> On Tue, Nov 22, 2016 at 10:38 AM, Eric Dumazet  wrote:

> >
> > Any plans removing this busy polling stuff, now it is done in core
> > networking stack ?
> >
> > This would remove bnxt_lock_napi() extra overhead in normal path ( napi
> > poll )
> >
> > I could do this but I do not have the hardware to do the tests.
> >
> It's on my list of many TODO things.  Probably in the next few weeks.

Awesome, thanks !





Re: [PATCH] net: dsa: mv88e6xxx: egress all frames

2016-11-22 Thread Andrew Lunn
On Tue, Nov 22, 2016 at 07:37:33PM +0100, Stefan Eichenberger wrote:
> Hi Andrew
> 
> On Tue, Nov 22, 2016 at 04:03:30PM +0100, Andrew Lunn wrote:
> > On Tue, Nov 22, 2016 at 11:39:44AM +0100, Stefan Eichenberger wrote:
> > > Egress multicast and egress unicast is only enabled for CPU/DSA ports
> > > but for switching operation it seems it should be enabled for all ports.
> > > Do I miss something here?
> > > 
> > > I did the following test:
> > > brctl addbr br0
> > > brctl addif br0 lan0
> > > brctl addif br0 lan1
> > > 
> > > In this scenario the unicast and multicast packets were not forwarded,
> > > therefore ARP requests were not resolved, and no connection could be
> > > established.
> > 
> > Hi Stefan
> > 
> > This is probably specific to the 6097 family. It works fine without
> > this on other devices. Creating a bridge like above and pinging across
> > it is one of my standard tests. But i only test modern devices like
> > the 6165, 6352, 6351, 6390 families.
> 
> Okay perfect, I wasn't 100% sure if I would have to configure something
> additionally.

No. The idea is you treat the interfaces as normal interfaces. You
should not need to do anything additional to what you would do with a
normal interface, when adding it to a bridge.
 
> > In fact, you might need to review all the code and look where
> > mv88e6xxx_6095_family(chip) is used and consider if you need to add
> > mv88e6xxx_6097_family(chip). e.g.
> > 
> > if (mv88e6xxx_6095_family(chip) || mv88e6xxx_6185_family(chip)) {
> > /* Set the upstream port this port should use */
> > reg |= dsa_upstream_port(ds);
> > /* enable forwarding of unknown multicast addresses to
> >  * the upstream port
> >  */
> > if (port == dsa_upstream_port(ds))
> > reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
> > }
> > 
> > Maybe this is your problem?
> 
> I think I still don't understand exactly how the driver works.
> 
> My problem is that the multicast and broadcast frames are filtered and
> the following counter is increasing in ethtool:
> sw_in_filtered: 596

This is not what is supposed to happen. Broadcast and multicast frames
should go to all ports in the bridge. There are two different ways
this can happen:

1) The mv88e6xxx driver started out with the host doing all bridge
operations. The switch forwards all frames to the software bridge, and
the software bridge then sends them out another port if needed.

2) We later added support for hardware bridging. That is, the switch
itself bridges frames between ports. It will only pass frames to the
software bridge if it does not know what to do with a frame itself.

Now, the different families are not 100% compatible with each
other. We never had access to a 6097, so it has not been tested
recently, and we have probably broken it... My guess would be,
anywhere mv88e6xxx_6095_family(chip) is used, there also needs to be
an mv88e6xxx_6097_family(chip). But i could be wrong.

What you might find useful is

https://github.com/vivien/linux.git 161b96bd7d16d21b0f046c935b70c3b2d277ccc2

although it might need some changes for recent commits.

With that, you can see deeper into the switches registers.

 Andrew


Re: [PATCH net] bnxt: do not busy-poll when link is down

2016-11-22 Thread Michael Chan
On Tue, Nov 22, 2016 at 10:38 AM, Eric Dumazet  wrote:
> On Tue, 2016-11-22 at 13:14 -0500, Andy Gospodarek wrote:
>> When busy polling while a link is down (during a link-flap test), TX
>> timeouts were observed as well as the following messages in the ring
>> buffer:
>>
>> bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
>> bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free tx failed. rc:-1
>> bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
>> bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free rx failed. rc:-1
>>
>> These were resolved by checking for link status and returning if link
>> was not up.
>>
>> Signed-off-by: Andy Gospodarek 
>> Signed-off-by: Michael Chan 
>> Tested-by: Rob Miller 
>> ---
>>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
>> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> index e18635b..013e373 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> @@ -1811,6 +1811,9 @@ static int bnxt_busy_poll(struct napi_struct *napi)
>>   if (atomic_read(&bp->intr_sem) != 0)
>>   return LL_FLUSH_FAILED;
>>
>> + if (!bp->link_info.link_up)
>> + return LL_FLUSH_FAILED;
>> +
>>   if (!bnxt_lock_poll(bnapi))
>>   return LL_FLUSH_BUSY;
>>
>
>
> Any plans removing this busy polling stuff, now it is done in core
> networking stack ?
>
> This would remove bnxt_lock_napi() extra overhead in normal path ( napi
> poll )
>
> I could do this but I do not have the hardware to do the tests.
>
It's on my list of many TODO things.  Probably in the next few weeks.


Re: [PATCH] net: dsa: mv88e6xxx: egress all frames

2016-11-22 Thread Stefan Eichenberger
Hi Andrew

On Tue, Nov 22, 2016 at 04:03:30PM +0100, Andrew Lunn wrote:
> On Tue, Nov 22, 2016 at 11:39:44AM +0100, Stefan Eichenberger wrote:
> > Egress multicast and egress unicast is only enabled for CPU/DSA ports
> > but for switching operation it seems it should be enabled for all ports.
> > Do I miss something here?
> > 
> > I did the following test:
> > brctl addbr br0
> > brctl addif br0 lan0
> > brctl addif br0 lan1
> > 
> > In this scenario the unicast and multicast packets were not forwarded,
> > therefore ARP requests were not resolved, and no connection could be
> > established.
> 
> Hi Stefan
> 
> This is probably specific to the 6097 family. It works fine without
> this on other devices. Creating a bridge like above and pinging across
> it is one of my standard tests. But i only test modern devices like
> the 6165, 6352, 6351, 6390 families.

Okay perfect, I wasn't 100% sure if I would have to configure something
additionally.

> 
> In fact, you might need to review all the code and look where
> mv88e6xxx_6095_family(chip) is used and consider if you need to add
> mv88e6xxx_6097_family(chip). e.g.
> 
> 	if (mv88e6xxx_6095_family(chip) || mv88e6xxx_6185_family(chip)) {
> 		/* Set the upstream port this port should use */
> 		reg |= dsa_upstream_port(ds);
> 		/* enable forwarding of unknown multicast addresses to
> 		 * the upstream port
> 		 */
> 		if (port == dsa_upstream_port(ds))
> 			reg |= PORT_CONTROL_2_FORWARD_UNKNOWN;
> 	}
> 
> Maybe this is your problem?

I think I still don't understand exactly how the driver works.

My problem is that the multicast and broadcast frames are filtered and
the following counter is increasing in ethtool:
sw_in_filtered: 596

This makes sense because "Egress Floods" in the Port Control Register is
set to 0. What kind of mechanism should make sure that, for example, ARP
packets are sent through all ports anyway?

Unfortunately I don't have any more modern devices available,
so I can't double check the registers.

Regards,
Stefan


Re: [PATCH net] bnxt: do not busy-poll when link is down

2016-11-22 Thread Eric Dumazet
On Tue, 2016-11-22 at 13:14 -0500, Andy Gospodarek wrote:
> When busy polling while a link is down (during a link-flap test), TX
> timeouts were observed as well as the following messages in the ring
> buffer:
> 
> bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
> bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free tx failed. rc:-1
> bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
> bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free rx failed. rc:-1
> 
> These were resolved by checking for link status and returning if link
> was not up.
> 
> Signed-off-by: Andy Gospodarek 
> Signed-off-by: Michael Chan 
> Tested-by: Rob Miller 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index e18635b..013e373 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -1811,6 +1811,9 @@ static int bnxt_busy_poll(struct napi_struct *napi)
>   if (atomic_read(&bp->intr_sem) != 0)
>   return LL_FLUSH_FAILED;
>  
> + if (!bp->link_info.link_up)
> + return LL_FLUSH_FAILED;
> +
>   if (!bnxt_lock_poll(bnapi))
>   return LL_FLUSH_BUSY;
>  


Any plans to remove this busy polling stuff, now that it is done in the
core networking stack?

This would remove the extra bnxt_lock_napi() overhead in the normal path
(napi poll).

I could do this but I do not have the hardware to do the tests.





Letter for you

2016-11-22 Thread Paní KLeung



Hello.

Good morning, and how are you doing? Just one quick thing: there is an
official opportunity I would like to discuss with you privately.


I would appreciate your prompt response at my personal private e-mail
below for further communication.


Kind regards,
Mrs. Ko May Leung
email: lngkoma...@gmail.com
Vice Chairman, Managing Director
and Chief Executive of Chong Hing Bank Limited

[PATCH net] bnxt: do not busy-poll when link is down

2016-11-22 Thread Andy Gospodarek
When busy polling while a link is down (during a link-flap test), TX
timeouts were observed as well as the following messages in the ring
buffer:

bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free tx failed. rc:-1
bnxt_en 0008:01:00.2 enP8p1s0f2d2: Resp cmpl intr err msg: 0x51
bnxt_en 0008:01:00.2 enP8p1s0f2d2: hwrm_ring_free rx failed. rc:-1

These were resolved by checking for link status and returning if link
was not up.

Signed-off-by: Andy Gospodarek 
Signed-off-by: Michael Chan 
Tested-by: Rob Miller 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index e18635b..013e373 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1811,6 +1811,9 @@ static int bnxt_busy_poll(struct napi_struct *napi)
 	if (atomic_read(&bp->intr_sem) != 0)
 		return LL_FLUSH_FAILED;
 
+	if (!bp->link_info.link_up)
+		return LL_FLUSH_FAILED;
+
 	if (!bnxt_lock_poll(bnapi))
 		return LL_FLUSH_BUSY;
 
-- 
2.1.0
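
The order of the early exits that this patch gives bnxt_busy_poll() can be
modelled in userspace. The sketch below is not the driver code: struct
poll_model, busy_poll_model() and the POLL_* constants are hypothetical
names, and the poll lock is reduced to a flag. It only illustrates that the
link check sits before the poll lock is taken, so nothing is polled while
the link is down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LL_FLUSH_FAILED / LL_FLUSH_BUSY. */
enum { POLL_FAILED = -1, POLL_BUSY = -2 };

/* Hypothetical model of the state bnxt_busy_poll() looks at. */
struct poll_model {
	int  intr_sem;    /* non-zero while interrupts are being quiesced */
	bool link_up;     /* mirrors bp->link_info.link_up */
	bool poll_locked; /* true if another context owns the poll lock */
	int  rx_pending;  /* packets the model hands back when polled */
};

static int busy_poll_model(struct poll_model *m)
{
	int rx_work;

	if (m->intr_sem != 0)
		return POLL_FAILED;

	/* The check added by the patch: bail out before touching the rings. */
	if (!m->link_up)
		return POLL_FAILED;

	if (m->poll_locked)
		return POLL_BUSY;

	m->poll_locked = true;
	rx_work = m->rx_pending;
	m->rx_pending = 0;
	m->poll_locked = false;

	return rx_work;
}
```

With the link down the model returns POLL_FAILED and leaves rx_pending
untouched; with the link up and the lock free it returns the pending count.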



[PATCH/RFC -next] net: phy: Fix double free in phy_detach()

2016-11-22 Thread Geert Uytterhoeven
During "poweroff" on sh73a0/kzm9g:

WARNING: CPU: 0 PID: 1271 at drivers/base/devres.c:889 phy_detach+0x44/0x60
Modules linked in:
CPU: 0 PID: 1271 Comm: halt Not tainted 4.9.0-rc6-kzm9g-05637-gb090128865050239 #823
Hardware name: Generic SH73A0 (Flattened Device Tree)
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (dump_stack+0xa4/0xdc)
[] (dump_stack) from [] (__warn+0xcc/0xfc)
[] (__warn) from [] (warn_slowpath_null+0x1c/0x24)
[] (warn_slowpath_null) from [] (phy_detach+0x44/0x60)
[] (phy_detach) from [] (smsc911x_stop+0xf4/0x10c)
[] (smsc911x_stop) from [] (__dev_close_many+0x94/0xb8)
[] (__dev_close_many) from [] (__dev_close+0x20/0x34)
[] (__dev_close) from [] (__dev_change_flags+0x8c/0x130)
[] (__dev_change_flags) from [] (dev_change_flags+0x18/0x48)
[] (dev_change_flags) from [] (devinet_ioctl+0x33c/0x708)
[] (devinet_ioctl) from [] (sock_ioctl+0x29c/0x2f8)
[] (sock_ioctl) from [] (vfs_ioctl+0x20/0x34)
[] (vfs_ioctl) from [] (do_vfs_ioctl+0x870/0x9c4)
[] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
[] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x1c)
---[ end trace 4555b9be7369b463 ]---

If device_release_driver(&phydev->mdio.dev) was called, it has already
released all resources belonging to the PHY device. Hence the subsequent
call to phy_led_triggers_unregister() may cause a double free, leading
to the warning.

Move the call to phy_led_triggers_unregister() before the possible call
to device_release_driver() to fix this.

Fixes: 2e0bc452f4721520 ("net: phy: leds: add support for led triggers on phy link state change")
Signed-off-by: Geert Uytterhoeven 
---
Is this the right fix?
---
 drivers/net/phy/phy_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 9e8f048891bd192f..b32457660db66de4 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -981,6 +981,8 @@ void phy_detach(struct phy_device *phydev)
phydev->attached_dev = NULL;
phy_suspend(phydev);
 
+   phy_led_triggers_unregister(phydev);
+
/* If the device had no specific driver before (i.e. - it
 * was using the generic driver), we unbind the device
 * from the generic driver so that there's a chance a
@@ -994,8 +996,6 @@ void phy_detach(struct phy_device *phydev)
}
}
 
-   phy_led_triggers_unregister(phydev);
-
/*
 * The phydev might go away on the put_device() below, so avoid
 * a use-after-free bug by reading the underlying bus first.
-- 
1.9.1
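
The ordering problem described in this patch can be illustrated with a small
userspace model. It assumes that the LED trigger storage is device-managed,
which the warning location in devres.c suggests but the message does not
spell out. struct dev_model, managed_alloc(), managed_free() and
release_driver() are hypothetical names for this sketch, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_RES 8

/* Hypothetical model of a device and its list of managed allocations. */
struct dev_model {
	void  *res[MAX_RES];
	size_t nres;
};

/* Allocate memory that is tracked by the device. */
static void *managed_alloc(struct dev_model *d, size_t size)
{
	void *p;

	if (d->nres == MAX_RES)
		return NULL;
	p = malloc(size);
	if (p)
		d->res[d->nres++] = p;
	return p;
}

/* Free one tracked allocation; returns -1 if it is no longer tracked. */
static int managed_free(struct dev_model *d, void *p)
{
	for (size_t i = 0; i < d->nres; i++) {
		if (d->res[i] == p) {
			free(p);
			d->res[i] = d->res[--d->nres];
			return 0;
		}
	}
	return -1; /* the kernel would warn at this point */
}

/* Unbinding the driver releases everything the device still tracks. */
static void release_driver(struct dev_model *d)
{
	while (d->nres > 0)
		free(d->res[--d->nres]);
}
```

In this model, calling release_driver() first and managed_free() afterwards
returns -1 (the old order in phy_detach()), while freeing first and
releasing afterwards succeeds (the order the patch proposes).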



Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration

2016-11-22 Thread Florian Fainelli
On 11/22/2016 09:41 AM, Ido Schimmel wrote:
> Hi Florian,
> 
> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote:
>> Hi all,
>>
>> This patch series allows using the bridge master interface to configure
>> an Ethernet switch port's CPU/management port with different VLAN attributes 
>> than
>> those of the bridge downstream ports/members.
>>
>> Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I
>> tested this with b53 and a mockup DSA driver.
> 
> We'll need to add a check in mlxsw and ignore any VLAN configuration for
> the bridge device itself. Otherwise, any configuration done on br0 will
> be propagated to all of its slaves, which is incorrect.
> 
>>
>> Open questions:
>>
>> - if we have more than one bridge on top of a physical switch, the driver
>>   should keep track of that and verify that we are not going to change
>>   the CPU port VLAN attributes in a way that results in incompatible settings
>>   to be applied
>>
>> - if the default behavior is to have all VLANs associated with the CPU port
>>   be ingressing/egressing tagged to the CPU, is this really useful?
> 
> First of all, I want to be sure that when we say "CPU port", we're
> talking about the same thing. In mlxsw, the CPU port is a pipe between
> the device and the host, through which all packets trapped to the host
> go through. So, when a packet is trapped, the driver reads its Rx
> descriptor, checks through which port it ingressed, resolves its netdev,
> sets skb->dev accordingly and injects it to the Rx path via
> netif_receive_skb(). The CPU port itself isn't represented using a
> netdev.

In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in
principle, this driver plus the DSA tag protocol hook do exactly the same
things as you just described.

> 
> Given the above, having VLAN filters (or STP) on the CPU port itself
> isn't really helpful (we do have them for physical ports of course...).
> So, mlxsw will not benefit from this patchset and if we've the same
> concept of "CPU port", then I'm not sure why you don't just enable all
> the VLANs on it?

We do enable all VLANs on the CPU port (at least with b53, but I think
mv88e6xxx does it too), but compared to e.g: mlxsw, we trap all traffic
by default, and actually, quite often (always actually, until we add IP
routing offloads) the CPU is involved in the LAN/WAN routing, so it is
not infrequent to have the following packet flow:

LAN port -> VLAN 1 -> eth0.1 -> NAT/routing -> eth0.2 -> VLAN 2 -> WAN port

In that case, having the ability to define the per-port membership for
VLANs, including the CPU, kind of helps, especially if there are
private/guest VLANs on either the LAN or WAN segments that the CPU does
not necessarily need to play a role in.

NB: this scheme works because in most configurations that we support
today, the CPU port's speed is greater than or equal to the speed of the
downstream/front panel ports.

> 
> Also, how are you going to set the VLAN filters for the CPU port when
> you don't offload a bridge, but instead vlan devices between which you
> route packets? You lose your abstraction of CPU port...

As far as I can tell today, this is not particularly helpful with DSA,
where we start with all traffic going to the CPU (each DSA created
network device is segregated from the other) and only then we require
having bridge VLAN filtering enabled in the kernel, and configuring
bridge VLAN membership to have a proper VLAN-based scheme.

If you did configure VLAN membership with e.g: port0. we could
support that just fine, but that programming interface does not allow
configuring the default VLAN, and in our case, it matters a bit to
support the LAN/WAN routing scenario described. We could agree that all
untagged traffic should go to VLAN 0 or 1 for instance, but that could
then, vary on a per-driver/HW basis.

Hope this clarifies things a bit!
-- 
Florian


Re: net/icmp: null-ptr-deref in icmp6_send

2016-11-22 Thread Cong Wang
On Tue, Nov 22, 2016 at 2:23 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> It seems that skb_dst(skb) may end up being NULL.
>
> As far as I can see the bug was introduced in commit 5d41ce29e ("net:
> icmp6_send should use dst dev to determine L3 domain").
> ICMP v4 probably has a similar issue due to 9d1a6c4ea ("net:
> icmp_route_lookup should use rt dev to determine L3 domain").


ipv6_parse_hopopts() is called before NF_INET_PRE_ROUTING,
so the skb_dst could be NULL.

I have no idea what commit 5d41ce29e tried to fix, but we already
use skb->dev a few lines before l3mdev_master_ifindex(), so I don't
understand why skb->dev could be NULL, maybe just for vrf dev?


Re: wl1251 & mac address & calibration data

2016-11-22 Thread Pali Rohár
On Tuesday 22 November 2016 17:14:28 Michal Kazior wrote:
> On 22 November 2016 at 16:31, Pali Rohár  wrote:
> > On Tuesday 22 November 2016 16:22:57 Michal Kazior wrote:
> >> On 21 November 2016 at 16:51, Pali Rohár 
> >> wrote:
> >> > On Friday 11 November 2016 18:20:50 Pali Rohár wrote:
> >> >> Hi! I will open discussion about mac address and calibration
> >> >> data for wl1251 wireless chip again...
> >> >> 
> >> >> Problem: Mac address & calibration data for wl1251 chip on
> >> >> Nokia N900 are stored on second nand partition (mtd1) in
> >> >> special proprietary format which is used only for Nokia N900
> >> >> (probably on N8x0 and N9 too). Wireless driver wl1251.ko
> >> >> cannot work without mac address and calibration data.
> >> 
> >> Same problem applies to some ath9k/ath10k supported routers. Some
> >> even carry mac address as implicit offset from ethernet mac
> >> address. As far as I understand OpenWRT cooks cal blobs on first
> >> boot prior to loading modules.
> > 
> > So... wl1251 on Nokia N900 is not alone and this problem is there
> > for more drivers and devices. Which means we should come up with
> > some generic solution.
> 
> This isn't particularly a problem for ath9k/ath10k.
> 
> Let me give you more background on ath10k.
> 
> ath10k devices can come with caldata and macaddr stored in their
> OTP/EEPROM. In that case a generic "template" board file is used.
> Userspace doesn't need to do anything special.
> 
> Some vendors however decide to use flash partition to store caldata.
> In that case ath10k expects userspace to prepare
> cal-$bus-$devname.bin files, each for a different radio (you can
> have multiple radios on a system).
> 
> Now translating this for wl1251 I would expect it should also use
> something like wl1251-nvs-sdio-0x0001.bin for devices like N900 that
> have caldata on flash partition (instead of the generic
> wl1251-nvs.bin). I'm not sure if wl1251-nvs.bin is something
> comparable to (the generic) board.bin ath10k has though. Maybe the
> entire idea behind wl1251-nvs.bin is flawed as it's supposed to be
> device specific and is oblivious to possibility of having multiple
> wl1251 radios on one system (probably sane assumption from practical
> standpoint but still).

Basically, nvs data are device-specific; in the ideal case they should be 
generated in the factory by some calibration process (or so).

> >> >> Without a mac address, the driver generates a random mac
> >> >> address at every kernel boot, which causes a couple of problems
> >> >> (unstable identifier of wireless device due to udev permanent
> >> >> storage rules; unpredictable behaviour for dhcp mac address
> >> >> assignment, mac address filtering, ...).
> >> >> 
> >> >> Currently there is no way to set (permanent) mac address for
> >> >> network interface from userspace. And it does not make sense
> >> >> to implement in linux kernel large parser for proprietary
> >> >> format of second nand partition where is mac address stored
> >> >> only for one device -- Nokia N900.
> >> >> 
> >> >> Driver wl1251.ko loads calibration data via request_firmware()
> >> >> for file wl1251-nvs.bin. There are some "example" calibration
> >> >> file in linux- firmware repository, but it is not suitable for
> >> >> normal usage as real calibration data are per-device specific.
> >> 
> >> You could hook up a script that cooks up the cal/mac file via
> >> modprobe's install hook, no?
> > 
> > Via modprobe hook I can either pass custom module parameter or call
> > any other system (shell) commands.
> > 
> > As wl1251.ko does not accept mac_address as module parameter, such
> > modprobe hook does not help -- as there is absolutely no way from
> > userspace to set or change (permanent) mac address.
> 
> Quoting modprobe.d manual:
> >   install modulename command...
> >   
> >   This command instructs modprobe to run your
> >   command instead of inserting the module in the
> >   kernel as normal. The command can be any shell
> >   command: this allows you to do any kind of
> >   complex processing you might wish. [...]

I know. But this does not allow me to send the mac address to the kernel, as 
the kernel does not support such a command yet (the reason for my first question).

> You can hook up a script that cooks up wl1251-nvs.bin (caldata,
> macaddr) and then insmod the actual wl1251.ko module. Or you can just
> cook up the nvs on first device boot and store it in /lib/firmware
> (possibly overwriting the "generic" wl1251 from linux-firmware).

This is what I would like to prevent: overwriting (possibly read-only) 
system files with device-specific ones. It is a really bad idea!

-- 
Pali Rohár
pali.ro...@gmail.com
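
The per-device naming scheme that Michal describes for ath10k
(cal-$bus-$devname.bin) can be sketched as a small helper. This is only an
illustration of the naming pattern quoted above; cal_file_name() is a
hypothetical function, and the bus and device strings used with it are
made-up examples, not values taken from any driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "cal-<bus>-<devname>.bin" into buf.
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int cal_file_name(char *buf, size_t len,
			 const char *bus, const char *devname)
{
	int n = snprintf(buf, len, "cal-%s-%s.bin", bus, devname);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

A per-radio name of this kind is what lets several radios on one system each
get their own calibration file, which a single fixed name such as
wl1251-nvs.bin cannot express.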




Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration

2016-11-22 Thread Ido Schimmel
Hi Florian,

On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote:
> Hi all,
> 
> This patch series allows using the bridge master interface to configure
> an Ethernet switch port's CPU/management port with different VLAN attributes 
> than
> those of the bridge downstream ports/members.
> 
> Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I
> tested this with b53 and a mockup DSA driver.

We'll need to add a check in mlxsw and ignore any VLAN configuration for
the bridge device itself. Otherwise, any configuration done on br0 will
be propagated to all of its slaves, which is incorrect.

> 
> Open questions:
> 
> - if we have more than one bridge on top of a physical switch, the driver
>   should keep track of that and verify that we are not going to change
>   the CPU port VLAN attributes in a way that results in incompatible settings
>   to be applied
> 
> - if the default behavior is to have all VLANs associated with the CPU port
>   be ingressing/egressing tagged to the CPU, is this really useful?

First of all, I want to be sure that when we say "CPU port", we're
talking about the same thing. In mlxsw, the CPU port is a pipe between
the device and the host, through which all packets trapped to the host
go through. So, when a packet is trapped, the driver reads its Rx
descriptor, checks through which port it ingressed, resolves its netdev,
sets skb->dev accordingly and injects it to the Rx path via
netif_receive_skb(). The CPU port itself isn't represented using a
netdev.

Given the above, having VLAN filters (or STP) on the CPU port itself
isn't really helpful (we do have them for physical ports of course...).
So, mlxsw will not benefit from this patchset and if we've the same
concept of "CPU port", then I'm not sure why you don't just enable all
the VLANs on it?

Also, how are you going to set the VLAN filters for the CPU port when
you don't offload a bridge, but instead vlan devices between which you
route packets? You lose your abstraction of CPU port...

Thanks!


Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration

2016-11-22 Thread Andrew Lunn
Hi Ido
 
> First of all, I want to be sure that when we say "CPU port", we're
> talking about the same thing. In mlxsw, the CPU port is a pipe between
> the device and the host, through which all packets trapped to the host
> go through. So, when a packet is trapped, the driver reads its Rx
> descriptor, checks through which port it ingressed, resolves its netdev,
> sets skb->dev accordingly and injects it to the Rx path via
> netif_receive_skb(). The CPU port itself isn't represented using a
> netdev.

With DSA, we have a real physical Ethernet network interface for the
'cpu' port. It connects to one of the ports of the switch. Frames on
this interface have an extra header indicating which switch port they
came from, and we similarly resolve that to a slave netdev, strip
off the header and inject the frame into the receive path via
netif_receive_skb().

Andrew
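
The receive steps Andrew describes (read the source port from the extra
header, resolve it to a slave netdev, strip the header) can be sketched as
follows. The tag layout here is invented for illustration: a 4-byte header
at the very start of the buffer whose first byte is the source port. Real
DSA tag formats differ per switch family and are not described in this
thread; tag_rcv(), struct rx_result and the slave names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAG_LEN   4 /* hypothetical tag size */
#define NUM_PORTS 4

/* Hypothetical mapping from switch port number to slave netdev name. */
static const char *const slave_names[NUM_PORTS] = {
	"lan0", "lan1", "lan2", "wan",
};

struct rx_result {
	const char    *netdev;      /* slave netdev the frame belongs to */
	const uint8_t *payload;     /* frame with the tag stripped */
	size_t         payload_len;
};

/* Returns 0 and fills *out on success, -1 for a short or invalid frame. */
static int tag_rcv(const uint8_t *frame, size_t len, struct rx_result *out)
{
	uint8_t port;

	if (len < TAG_LEN)
		return -1;

	port = frame[0]; /* source port carried in the tag */
	if (port >= NUM_PORTS)
		return -1;

	out->netdev = slave_names[port];
	out->payload = frame + TAG_LEN; /* strip the header */
	out->payload_len = len - TAG_LEN;
	return 0;
}
```

In a driver, the resolved netdev would be stored in skb->dev before the
frame is handed to netif_receive_skb(); the model only returns the name.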


Re: net/can: use-after-free in bcm_rx_thr_flush

2016-11-22 Thread Andrey Konovalov
On Tue, Nov 22, 2016 at 6:29 PM, Oliver Hartkopp  wrote:
> Hi Andrey,
>
> thanks for the report.
>
> Although I can't see the issue in the code ...
>
> On 11/22/2016 10:22 AM, Andrey Konovalov wrote:
>
>> ==
>> BUG: KASAN: use-after-free in bcm_rx_thr_flush+0x284/0x2b0
>> Read of size 1 at addr 88006c1faae5 by task a.out/3874
>>
>> page:ea0001b07e80 count:1 mapcount:0 mapping:  (null)
>> index:0x0
>> flags: 0x180(slab)
>> page dumped because: kasan: bad access detected
>
>
> (..)
>
>>
>> The buggy address belongs to the object at 88006c1faae0
>>  which belongs to the cache kmalloc-32 of size 32
>
>
> ???
>
>> The buggy address 88006c1faae5 is located 5 bytes inside
>>  of 32-byte region [88006c1faae0, 88006c1fab00)
>
>
> (..)
>
>> Memory state around the buggy address:
>>  88006c1fa980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
>>  88006c1faa00: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
>>>
>>> 88006c1faa80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
>>
>>^
>>  88006c1fab00: fc fc fb fb fb fb fc fc 00 00 00 00 fc fc 00 00
>>  88006c1fab80: 00 00 fc fc fb fb fb fb fc fc fb fb fb fb fc fc
>> ==
>
>
> (should be some zero initialized memory here)
>
> The relevant code of bcm_rx_do_flush() can be found here:
>
> http://lxr.free-electrons.com/source/net/can/bcm.c#L589
>
> static inline int bcm_rx_do_flush(struct bcm_op *op, int update,
> 				  unsigned int index)
> {
> 	struct canfd_frame *lcf = op->last_frames + op->cfsiz * index;
>
> 	if ((op->last_frames) && (lcf->flags & RX_THR)) {  <<<- !!!
> 		if (update)
> 			bcm_rx_changed(op, lcf);
> 		return 1;
> 	}
> 	return 0;
> }
>
>
> lcf->flags points into an array of struct canfd_frame at offset 5 which is
> allocated here:
>
> http://lxr.free-electrons.com/source/net/can/bcm.c#L1105
>
> /* create and init array for received CAN frames */
> op->last_frames = kzalloc(msg_head->nframes * op->cfsiz,
>   GFP_KERNEL);
>
> So why does KASAN complain about accessing some kind of 32 byte cache when
> it should point into a zero initialized allocated space?

Hi Oliver,

My guess would be that this is an out-of-bounds access which doesn't
hit the redzone.
The free and alloc stack traces also look unrelated to the access.
Besides I have a bunch of related slab-out-of-bounds reports, see below.

Thanks for looking at this!

==
BUG: KASAN: slab-out-of-bounds in bcm_send_to_user+0x330/0x480
Read of size 16 at addr 88006de17338 by task syz-executor/30679

page:ea0001b78580 count:1 mapcount:0 mapping:  (null)
index:0x88006de16760 compound_mapcount: 0
flags: 0x5004080(slab|head)
page dumped because: kasan: bad access detected

CPU: 2 PID: 30679 Comm: syz-executor Not tainted 4.9.0-rc6+ #429
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88003cd277b0 81b472e4 88003cd27840 88006de17338
 00fb 00fc 88003cd27830 8150ad42
  81509f65 88006aef9830 0282
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [< inline >] describe_address mm/kasan/report.c:259
 [] kasan_report_error+0x122/0x560 mm/kasan/report.c:365
 [] kasan_report+0x36/0x40 mm/kasan/report.c:387
 [< inline >] check_memory_region_inline mm/kasan/kasan.c:308
 [] check_memory_region+0x13e/0x1a0 mm/kasan/kasan.c:315
 [] memcpy+0x23/0x50 mm/kasan/kasan.c:350
 [] bcm_send_to_user+0x330/0x480 net/can/bcm.c:325
 [] bcm_rx_changed+0x22e/0x2a0 net/can/bcm.c:443
 [< inline >] bcm_rx_do_flush net/can/bcm.c:591
 [] bcm_rx_thr_flush+0x19e/0x2b0 net/can/bcm.c:612
 [< inline >] bcm_rx_setup net/can/bcm.c:1199
 [] bcm_sendmsg+0xbb6/0x30e0 net/can/bcm.c:1351
 [< inline >] sock_sendmsg_nosec net/socket.c:621
 [] sock_sendmsg+0xcc/0x110 net/socket.c:631
 [] ___sys_sendmsg+0x771/0x8b0 net/socket.c:1954
 [] __sys_sendmsg+0xce/0x170 net/socket.c:1988
 [< inline >] SYSC_sendmsg net/socket.c:1999
 [] SyS_sendmsg+0x2d/0x50 net/socket.c:1995
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2

The buggy address belongs to the object at 88006de17320
 which belongs to the cache kmalloc-32 of size 32
The buggy address 88006de17338 is located 24 bytes inside
 of 32-byte region [88006de17320, 88006de17340)

Freed by task 0:
 [] save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 [] save_stack+0x46/0xd0 mm/kasan/kasan.c:495
 [< inline >] set_track mm/kasan/kasan.c:507
 [] 

RE: [PATCH v9 0/8] thunderbolt: Introducing Thunderbolt(TM) Networking

2016-11-22 Thread Mario.Limonciello
> Here are a couple of additional questions:
> 
> - When the network interface is created, there is no IP address
>   assigned (or negotiated ?) on the Linux side. But it is done on the
>   MacOS side. And in the Linux kernel logs I can also read the message:
>   "ready for ThunderboltIP negotiation". Is there something missing or
>   not working on the Linux side ? What is the correct way to configure
>   or negotiate the IP address. For my tests I did it manually...
> 
> - When the Linux machine is started with the Thunderbolt wire already
>   connected to a MacBook Pro, sometimes (but not every time) the
>   network interface is not created. The Thunderbolt wire needs to be
>   replugged.
> 
> FWIW you get my
> 
> Tested-by: Simon Guinot 
> 
> Simon

Simon,

Since I also performed testing on the previous patchset, I'll share what I did.

I configured Network Manager to use the TBT interface to share an internet
connection to another box.  This configures a static IP address on the local
Linux side and sets up routing.

Network Manager remembers this setup in a configuration database.
When the interface goes up it will then set up a DHCP server to hand
out an IP address to the other side.




Re: net/can: use-after-free in bcm_rx_thr_flush

2016-11-22 Thread Oliver Hartkopp

Hi Andrey,

thanks for the report.

Although I can't see the issue in the code ...

On 11/22/2016 10:22 AM, Andrey Konovalov wrote:


==
BUG: KASAN: use-after-free in bcm_rx_thr_flush+0x284/0x2b0
Read of size 1 at addr 88006c1faae5 by task a.out/3874

page:ea0001b07e80 count:1 mapcount:0 mapping:  (null) index:0x0
flags: 0x180(slab)
page dumped because: kasan: bad access detected


(..)



The buggy address belongs to the object at 88006c1faae0
 which belongs to the cache kmalloc-32 of size 32


???


The buggy address 88006c1faae5 is located 5 bytes inside
 of 32-byte region [88006c1faae0, 88006c1fab00)


(..)


Memory state around the buggy address:
 88006c1fa980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
 88006c1faa00: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc

88006c1faa80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb

   ^
 88006c1fab00: fc fc fb fb fb fb fc fc 00 00 00 00 fc fc 00 00
 88006c1fab80: 00 00 fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==


(should be some zero initialized memory here)

The relevant code of bcm_rx_do_flush() can be found here:

http://lxr.free-electrons.com/source/net/can/bcm.c#L589

static inline int bcm_rx_do_flush(struct bcm_op *op, int update,
				  unsigned int index)
{
	struct canfd_frame *lcf = op->last_frames + op->cfsiz * index;

	if ((op->last_frames) && (lcf->flags & RX_THR)) {  <<<- !!!
		if (update)
			bcm_rx_changed(op, lcf);
		return 1;
	}
	return 0;
}


lcf->flags points into an array of struct canfd_frame at offset 5 which 
is allocated here:


http://lxr.free-electrons.com/source/net/can/bcm.c#L1105

/* create and init array for received CAN frames */
op->last_frames = kzalloc(msg_head->nframes * op->cfsiz,
  GFP_KERNEL);

So why does KASAN complain about accessing some kind of 32 byte cache 
when it should point into a zero initialized allocated space?


I will write some other test cases with a similar setting of options to 
check if I can trigger the instability too.


Tnx & regards,
Oliver
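
One possible reading of the report, not confirmed anywhere in this thread,
is sketched below. It assumes that op->last_frames is declared as
struct canfd_frame * (the excerpt does not show the declaration) and that
the op was set up with classic CAN frames, so that op->cfsiz is 16. Under
those assumptions the expression op->last_frames + op->cfsiz * index is
scaled by the size of the pointed-to type, not by bytes. The structs are
userspace mirrors of the CAN frame layouts, and all names ending in _model,
as well as typed_offset() and byte_offset(), are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed mirror of the classic CAN frame layout (16 bytes). */
struct can_frame_model {
	uint32_t can_id;
	uint8_t  can_dlc;
	uint8_t  pad;
	uint8_t  res0;
	uint8_t  res1;
	uint8_t  data[8] __attribute__((aligned(8)));
};

/* Assumed mirror of the CAN FD frame layout (72 bytes). */
struct canfd_frame_model {
	uint32_t can_id;
	uint8_t  len;
	uint8_t  flags; /* byte offset 5, as in the KASAN report */
	uint8_t  res0;
	uint8_t  res1;
	uint8_t  data[64] __attribute__((aligned(8)));
};

/* Backing storage so the pointer arithmetic below stays in bounds. */
static struct canfd_frame_model arena[32];

/* Byte distance produced by typed arithmetic; needs cfsiz * index <= 32. */
static size_t typed_offset(size_t cfsiz, size_t index)
{
	struct canfd_frame_model *base = arena;
	struct canfd_frame_model *lcf = base + cfsiz * index;

	return (size_t)((char *)lcf - (char *)base);
}

/* Byte distance when the same expression is applied to a byte pointer. */
static size_t byte_offset(size_t cfsiz, size_t index)
{
	char *base = (char *)arena;
	char *lcf = base + cfsiz * index;

	return (size_t)(lcf - base);
}
```

With cfsiz = 16 and index = 1, byte_offset() gives 16 while typed_offset()
gives 16 * 72 = 1152 bytes. Two classic frames need 2 * 16 = 32 bytes, which
would come from kmalloc-32, and 1152 is a multiple of 32, so a read of
lcf->flags would land 5 bytes inside some other kmalloc-32 object. That
would match an out-of-bounds access that misses the redzone and gets
reported against an unrelated object.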


Re: [PATCH v9 0/8] thunderbolt: Introducing Thunderbolt(TM) Networking

2016-11-22 Thread Simon Guinot
On Fri, Nov 18, 2016 at 12:20:07PM +0100, Simon Guinot wrote:
> On Fri, Nov 18, 2016 at 08:48:36AM +, Levy, Amir (Jer) wrote:
> > On Tue, Nov 15 2016, 12:59 PM, Simon Guinot wrote:
> > > On Wed, Nov 09, 2016 at 03:42:53PM +, Levy, Amir (Jer) wrote:
> > > > On Wed, Nov 9 2016, 04:36 PM, Simon Guinot wrote:
> > > > > Hi Amir,
> > > > >
> > > > > I have an ASUS "All Series/Z87-DELUXE/QUAD" motherboard with a 
> > > > > Thunderbolt 2 "Falcon Ridge" chipset (device ID 156d).
> > > > >
> > > > > Is the thunderbolt-icm driver supposed to work with this chipset ?
> > > > >
> > > >
> > > > Yes, the thunderbolt-icm supports Falcon Ridge, device ID 156c.
> > > > 156d is the bridge -
> > > > http://lxr.free-electrons.com/source/include/linux/pci_ids.h#L2619
> > > >
> > > > > I have installed both a 4.8.6 Linux kernel (patched with your v9
> > > > > series) and the thunderbolt-software-daemon (27 october release) 
> > > > > inside a Debian system (Jessie).
> > > > >
> > > > > If I connect the ASUS motherboard with a MacBook Pro (Thunderbolt 
> > > > > 2, device ID 156c), I can see that the thunderbolt-icm driver is 
> > > > > loaded and that the thunderbolt-software-daemon is well started. 
> > > > > But the Ethernet interface is not created.
> > > > >
> > > > > I have attached to this email the syslog file. There is the logs 
> > > > > from both the kernel and the daemon inside. Note that the daemon 
> > > > > logs are everything but clear about what could be the issue. Maybe 
> > > > > I missed some kind of configuration ? But I failed to find any 
> > > > > valuable information about configuring the driver and/or the 
> > > > > daemon in
> > > the various documentation files.
> > > > >
> > > > > Please, can you provide some guidance ? I'd really like to test 
> > > > > your patch series.
> > > >
> > > > First, thank you very much for being willing to test it.
> > > > Thunderbolt Networking support was added during Falcon Ridge, in the
> > > latest FR images.
> > > > Do you know which Thunderbolt image version you have on your system?
> > > > Currently I submitted only Thunderbolt Networking feature in Linux, 
> > > > and we plan to add more features like reading the image version and
> > > updating the image.
> > > > If you don't know the image version, the only thing I can suggest is 
> > > > to load windows, install thunderbolt SW and check in the Thunderbolt
> > > application the image version.
> > > > To know if image update is needed, you can check - 
> > > > https://thunderbolttechnology.net/updates
> > > 
> > > Hi Amir,
> > > 
> > > From the Windows Thunderbolt software, I can read 13.00 for the 
> > > firmware version. And from https://thunderbolttechnology.net/updates, 
> > > I can see that there is no update available for my ASUS motherboard.
> > > 
> > > Am I good to go ?
> > > 
> > 
> > Thunderbolt Networking is supported on both Thunderbolt(tm) 2 and 
> > Thunderbolt(tm) 3 systems.  
> > Thunderbolt 2 systems must have updated NVM (version 25 or later) in order 
> > for the functionality to work properly.  
> > If the system does not have the update, please contact the OEM directly for 
> > an updated NVM.  
> > For best functionality and support, Intel recommends using Thunderbolt 3 
> > systems for all validation and testing.
> 
> Maybe it is worth mentioning in the documentation and/or in the Kconfig
> help message that a minimal firmware version is needed for Thunderbolt 2
> controllers.
> 
> It would have saved some time for me :)
> 
> > 
> > > BTW, it is quite a shame that the Thunderbolt firmware version can't 
> > > be read from Linux.
> > > 
> > 
> > This is WIP; once this patch is upstream, we will be able to focus more
> > on aligning Linux with the Thunderbolt features that we have for windows.
> 
> Well, I rather see the firmware identification and update as basic
> features on the top of which ones you can build a driver. For example in
> this case this would allow the ICM driver and/or the userland daemon to
> exit with a useful error message rather than just not working without any
> explanation.
> 
> Next week I'll try the driver with a Thunderbolt 3 controller.

Hi Amir,

I tested the thunderbolt-icm driver (v9 series) on a Gigabyte
motherboard (Z170X-UD5 TH-CF) with a Thunderbolt 3 controller (Alpine
Ridge 4C).

I can see that the network interface is well created when the
motherboard is connected to a MacBook Pro (Thunderbolt 2 or 3).

And here are the TCP bandwidths measured using the iperf3 benchmark:

- MacBook Pro Thunderbolt 2: 8.46Gbits/sec
- MacBook Pro Thunderbolt 3: 11.8Gbits/sec

Are these results consistent with your expectations ?

From the macOS system interface on the MacBook Pro Thunderbolt 3,
I noticed that the interface appears as dual lane (2x 20Gb/sec). But
when two MacBook Pros are connected together, the interface appears as
single lane (1x 40Gb/sec). Is some lane bonding support missing in the
Linux implementation?

Here are a couple of additional 

Re: [PATCHv2 net-next 00/11] Start adding support for mv88e6390

2016-11-22 Thread Vivien Didelot
Hi,

Andrew Lunn  writes:

> This is the first patchset implementing support for the mv88e6390
> family.  This is a new generation of switch devices and has numerous
> incompatible changes to the registers. These patches allow the switch
> to be detected during probe, and make the statistics unit work.
>
> These patches are insufficient to make the mv88e6390 functional. More
> patches will follow.
>
> v2:
>   Move stats code into global1
>   Change DT compatible string to mv88e6190
>   Fixed mv88e6351 stats which v1 had broken

Thanks Andrew!

For what it's worth:

Reviewed-by: Vivien Didelot 


Vivien


[PATCH net] udplite: call proper backlog handlers

2016-11-22 Thread Eric Dumazet
From: Eric Dumazet 

In commits 93821778def10 ("udp: Fix rcv socket locking") and
f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
__udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
was forgotten.

This leads to crashes when the UDPlite header is pulled twice, which happens
starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets
before queueing").

Bug found by syzkaller team, thanks a lot guys !

Note that backlog use in UDP/UDPlite is scheduled to be removed starting
from linux-4.10, so this patch is only needed up to linux-4.9

Fixes: 93821778def1 ("udp: Fix rcv socket locking")
Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet 
Reported-by: Andrey Konovalov 
Cc: Benjamin LaHaise 
Cc: Herbert Xu 
---
 net/ipv4/udp.c  |2 +-
 net/ipv4/udp_impl.h |2 +-
 net/ipv4/udplite.c  |2 +-
 net/ipv6/udp.c  |2 +-
 net/ipv6/udp_impl.h |2 +-
 net/ipv6/udplite.c  |2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0de9d5d2b9ae..5bab6c3f7a2f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1455,7 +1455,7 @@ static void udp_v4_rehash(struct sock *sk)
udp_lib_rehash(sk, new_hash);
 }
 
-static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
int rc;
 
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 7e0fe4bdd967..feb50a16398d 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -25,7 +25,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
int flags, int *addr_len);
 int udp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
 int flags);
-int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 void udp_destroy_sock(struct sock *sk);
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index af817158d830..ff450c2aad9b 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -50,7 +50,7 @@ struct proto  udplite_prot = {
.sendmsg   = udp_sendmsg,
.recvmsg   = udp_recvmsg,
.sendpage  = udp_sendpage,
-   .backlog_rcv   = udp_queue_rcv_skb,
+   .backlog_rcv   = __udp_queue_rcv_skb,
.hash  = udp_lib_hash,
.unhash= udp_lib_unhash,
.get_port  = udp_v4_get_port,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5056d4873d1..e4a8000d59ad 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -514,7 +514,7 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
return;
 }
 
-static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
int rc;
 
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index f6eb1ab34f4b..e78bdc76dcc3 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -26,7 +26,7 @@ int compat_udpv6_getsockopt(struct sock *sk, int level, int optname,
 int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
  int flags, int *addr_len);
-int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 void udpv6_destroy_sock(struct sock *sk);
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index 47d0d2b87106..2f5101a12283 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -45,7 +45,7 @@ struct proto udplitev6_prot = {
.getsockopt= udpv6_getsockopt,
.sendmsg   = udpv6_sendmsg,
.recvmsg   = udpv6_recvmsg,
-   .backlog_rcv   = udpv6_queue_rcv_skb,
+   .backlog_rcv   = __udpv6_queue_rcv_skb,
.hash  = udp_lib_hash,
.unhash= udp_lib_unhash,
.get_port  = udp_v6_get_port,
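
The double header pull described in the commit message can be illustrated
with a toy model. This is only a sketch, not kernel code: struct toy_skb and
every toy_* function are hypothetical names invented here, and the 8-byte
header length is the only detail taken from the real protocols. The receive
path strips the header before queueing; if the backlog handler is the outer
function, it strips a second "header" out of the payload.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HDR_LEN 8 /* a UDP or UDP-Lite header is 8 bytes long */

/* Hypothetical stand-in for struct sk_buff: a cursor into packet bytes. */
struct toy_skb {
	const unsigned char *data;
	size_t len;
};

/* Advance past n bytes; returns 0 on success, -1 if the buffer is too short. */
int toy_pull(struct toy_skb *skb, size_t n)
{
	assert(skb != NULL);
	if (n > skb->len)
		return -1;
	skb->data += n;
	skb->len -= n;
	return 0;
}

/* Inner handler: expects the header to be gone; returns the payload length. */
int toy_inner_queue_rcv(struct toy_skb *skb)
{
	return (int)skb->len;
}

/* Outer handler: strips the header itself, then calls the inner handler. */
int toy_outer_queue_rcv(struct toy_skb *skb)
{
	if (toy_pull(skb, TOY_HDR_LEN) < 0)
		return -1;
	return toy_inner_queue_rcv(skb);
}

typedef int (*toy_backlog_fn)(struct toy_skb *skb);

/*
 * Model a packet that arrives while the socket is busy: the header is
 * pulled before queueing, and the backlog handler runs later.
 */
int toy_receive_via_backlog(const unsigned char *pkt, size_t len,
			    toy_backlog_fn backlog_rcv)
{
	struct toy_skb skb = { pkt, len };

	if (toy_pull(&skb, TOY_HDR_LEN) < 0)
		return -1;
	return backlog_rcv(&skb);
}
```

In this model, a 24-byte packet (8 bytes of header, 16 of payload) yields 16
when toy_inner_queue_rcv is the backlog handler, and 8 when
toy_outer_queue_rcv is used, because eight payload bytes are consumed as if
they were a header. A packet with less than 8 bytes of payload fails outright
in the second case.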




Re: [RFC net-next 2/3] net: dsa: Propagate VLAN add/del to CPU port(s)

2016-11-22 Thread Vivien Didelot
Hi Florian,

Open question: will we need to do the same for FDB and MDB objects?

Florian Fainelli  writes:

> Now that the bridge layer can call into switchdev to signal programming
> requests targeting the bridge master device itself, allow the switch
> drivers to implement separate programming of downstream and
> upstream/management ports.
>
> Signed-off-by: Vivien Didelot 
> Signed-off-by: Florian Fainelli 
> ---
>  net/dsa/slave.c | 45 +
>  1 file changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index d0c7bce88743..18288261b964 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -223,35 +223,30 @@ static int dsa_slave_set_mac_address(struct net_device *dev, void *a)
>   return 0;
>  }
>  
> -static int dsa_slave_port_vlan_add(struct net_device *dev,
> +static int dsa_slave_port_vlan_add(struct dsa_switch *ds, int port,
>  const struct switchdev_obj_port_vlan *vlan,
>  struct switchdev_trans *trans)
>  {
> - struct dsa_slave_priv *p = netdev_priv(dev);
> - struct dsa_switch *ds = p->parent;
>  

Extra newline ^.

>   if (switchdev_trans_ph_prepare(trans)) {
>   if (!ds->ops->port_vlan_prepare || !ds->ops->port_vlan_add)
>   return -EOPNOTSUPP;
>  
> - return ds->ops->port_vlan_prepare(ds, p->port, vlan, trans);
> + return ds->ops->port_vlan_prepare(ds, port, vlan, trans);
>   }
>  
> - ds->ops->port_vlan_add(ds, p->port, vlan, trans);
> + ds->ops->port_vlan_add(ds, port, vlan, trans);
>  
>   return 0;
>  }
>  
> -static int dsa_slave_port_vlan_del(struct net_device *dev,
> +static int dsa_slave_port_vlan_del(struct dsa_switch *ds, int port,
>  const struct switchdev_obj_port_vlan *vlan)
>  {
> - struct dsa_slave_priv *p = netdev_priv(dev);
> - struct dsa_switch *ds = p->parent;
> -
>   if (!ds->ops->port_vlan_del)
>   return -EOPNOTSUPP;
>  
> - return ds->ops->port_vlan_del(ds, p->port, vlan);
> + return ds->ops->port_vlan_del(ds, port, vlan);
>  }
>  
>  static int dsa_slave_port_vlan_dump(struct net_device *dev,
> @@ -465,8 +460,21 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
> const struct switchdev_obj *obj,
> struct switchdev_trans *trans)
>  {
> + struct dsa_slave_priv *p = netdev_priv(dev);
> + struct dsa_switch *ds = p->parent;
> + int port = p->port;
>   int err;
>  
> + /* Here we may be called with an orig_dev which is different from dev,
> +  * on purpose, to receive requests coming from e.g. the bridge master
> +  * device. Although there is no network device associated with CPU/DSA
> +  * ports, we may still have programming operations for these ports.
> +  */
> + if (obj->orig_dev == p->bridge_dev) {
> + ds = ds->dst->ds[0];
> + port = ds->dst->cpu_port;
> + }
> +
>   /* For the prepare phase, ensure the full set of changes is feasable in
>* one go in order to signal a failure properly. If an operation is not
>* supported, return -EOPNOTSUPP.
> @@ -483,7 +491,7 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
>trans);
>   break;
>   case SWITCHDEV_OBJ_ID_PORT_VLAN:
> - err = dsa_slave_port_vlan_add(dev,
> + err = dsa_slave_port_vlan_add(ds, port,
> SWITCHDEV_OBJ_PORT_VLAN(obj),
> trans);

Note that dsa_slave_port_vlan_add() will be called N times, N being the
number of bridge ports. This is not an issue for the moment though.
Programming it only once requires caching, so leave it for a possible
future patch.

When issuing the following command (lan0 being a member of br0):

# bridge vlan add vid 42 dev lan0

the CPU port is also programmed as tagged in VLAN 42. Is that expected?

Thanks,

Vivien
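
One way the caching mentioned above could look is a per-VID reference count
for the CPU port, so that only the first add and the last delete touch the
hardware. This is a rough sketch under that assumption, not something
proposed in the thread: struct toy_cpu_vlan_cache and the toy_* functions are
hypothetical, and the hw_writes counter merely stands in for a real
port_vlan_add/port_vlan_del call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_N_VID 4096 /* 12-bit VLAN ID space */

/* Hypothetical cache of VLANs programmed on the CPU port. */
struct toy_cpu_vlan_cache {
	uint16_t refcnt[TOY_VLAN_N_VID]; /* bridge ports using each VID */
	unsigned int hw_writes;          /* simulated hardware operations */
};

/* Returns true only when this call had to program the (simulated) hardware. */
bool toy_cpu_vlan_add(struct toy_cpu_vlan_cache *c, uint16_t vid)
{
	assert(vid < TOY_VLAN_N_VID);
	if (c->refcnt[vid]++ > 0)
		return false; /* already programmed by an earlier port */
	c->hw_writes++;
	return true;
}

/* Returns true only when the last user went away and hardware was updated. */
bool toy_cpu_vlan_del(struct toy_cpu_vlan_cache *c, uint16_t vid)
{
	assert(vid < TOY_VLAN_N_VID);
	if (c->refcnt[vid] == 0)
		return false; /* nothing to remove */
	if (--c->refcnt[vid] > 0)
		return false; /* other ports still need this VID */
	c->hw_writes++;
	return true;
}
```

With three bridge ports adding VID 42, only the first toy_cpu_vlan_add()
reports a hardware write; the CPU port entry is removed when the third
matching toy_cpu_vlan_del() runs.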


Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-22 Thread Jason Gunthorpe
On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> There are many example drivers in kernel which are using bus_register() in
> an initcall.

There really are not, certainly not in major subsystems.

> We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> without involving a bus.

hfi is already registering on the infiniband class, just use that.

> But using the existing bus model gave a lot of in-built flexibility in
> decoupling devices from the drivers.

If you want to have your own bus then you need your own hfi
subsystem. drivers/infiniband is not a dumping ground.

Jason


[PATCH v2] net: dsa: mv88e6xxx: add MV88E6097 switch

2016-11-22 Thread Stefan Eichenberger
Add support for the MV88E6097 switch. The change was tested on an
Armada-based platform with a MV88E6097 switch.

Signed-off-by: Stefan Eichenberger 
---
 drivers/net/dsa/mv88e6xxx/chip.c  | 26 ++
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 48b58c7..2d5941c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3208,6 +3208,19 @@ static const struct mv88e6xxx_ops mv88e6095_ops = {
.stats_get_stats = mv88e6095_stats_get_stats,
 };
 
+static const struct mv88e6xxx_ops mv88e6097_ops = {
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+   .port_set_link = mv88e6xxx_port_set_link,
+   .port_set_duplex = mv88e6xxx_port_set_duplex,
+   .port_set_speed = mv88e6185_port_set_speed,
+   .stats_snapshot = mv88e6xxx_g1_stats_snapshot,
+   .stats_get_sset_count = mv88e6095_stats_get_sset_count,
+   .stats_get_strings = mv88e6095_stats_get_strings,
+   .stats_get_stats = mv88e6095_stats_get_stats,
+};
+
 static const struct mv88e6xxx_ops mv88e6123_ops = {
/* MV88E6XXX_FAMILY_6165 */
.set_switch_mac = mv88e6xxx_g2_set_switch_mac,
@@ -3579,6 +3592,19 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.ops = _ops,
},
 
+   [MV88E6097] = {
+   .prod_num = PORT_SWITCH_ID_PROD_NUM_6097,
+   .family = MV88E6XXX_FAMILY_6097,
+   .name = "Marvell 88E6097/88E6097F",
+   .num_databases = 4096,
+   .num_ports = 11,
+   .port_base_addr = 0x10,
+   .global1_addr = 0x1b,
+   .age_time_coeff = 15000,
+   .flags = MV88E6XXX_FLAGS_FAMILY_6097,
.ops = &mv88e6097_ops,
+   },
+
[MV88E6123] = {
.prod_num = PORT_SWITCH_ID_PROD_NUM_6123,
.family = MV88E6XXX_FAMILY_6165,
diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 9298faa..ab52c37 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -81,6 +81,7 @@
 #define PORT_SWITCH_ID 0x03
 #define PORT_SWITCH_ID_PROD_NUM_6085   0x04a
 #define PORT_SWITCH_ID_PROD_NUM_6095   0x095
+#define PORT_SWITCH_ID_PROD_NUM_6097   0x099
 #define PORT_SWITCH_ID_PROD_NUM_6131   0x106
 #define PORT_SWITCH_ID_PROD_NUM_6320   0x115
 #define PORT_SWITCH_ID_PROD_NUM_6123   0x121
@@ -378,6 +379,7 @@
 enum mv88e6xxx_model {
MV88E6085,
MV88E6095,
+   MV88E6097,
MV88E6123,
MV88E6131,
MV88E6161,
-- 
2.9.3
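
The patch above adds one entry to a table indexed by model, keyed in practice
by the product number found at probe time. The sketch below shows that
pattern in isolation. It is an illustration only: the toy_* names, the
reduced struct and the label strings are hypothetical, and only the four
product numbers are taken from the header hunk in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model list; mirrors the enum ordering in the patch. */
enum toy_model {
	TOY_MV88E6085,
	TOY_MV88E6095,
	TOY_MV88E6097,
	TOY_MV88E6123,
	TOY_NUM_MODELS
};

/* Hypothetical, reduced chip description. */
struct toy_chip_info {
	unsigned int prod_num; /* product number, as in the patch's #defines */
	const char *label;     /* illustrative label, not the driver's name */
};

const struct toy_chip_info toy_table[TOY_NUM_MODELS] = {
	[TOY_MV88E6085] = { .prod_num = 0x04a, .label = "6085" },
	[TOY_MV88E6095] = { .prod_num = 0x095, .label = "6095" },
	[TOY_MV88E6097] = { .prod_num = 0x099, .label = "6097" },
	[TOY_MV88E6123] = { .prod_num = 0x121, .label = "6123" },
};

/* Linear search by product number; returns NULL for an unknown chip. */
const struct toy_chip_info *toy_lookup(unsigned int prod_num)
{
	for (size_t i = 0; i < TOY_NUM_MODELS; i++)
		if (toy_table[i].prod_num == prod_num)
			return &toy_table[i];
	return NULL;
}

/* Sanity check: a duplicate product number would make the lookup ambiguous. */
void toy_check_table(void)
{
	for (size_t i = 0; i < TOY_NUM_MODELS; i++)
		for (size_t j = i + 1; j < TOY_NUM_MODELS; j++)
			assert(toy_table[i].prod_num != toy_table[j].prod_num);
}
```

Designated initializers keep each entry tied to its enum value, so inserting
a new model in the middle of the enum, as the patch does, does not shift the
other entries.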



[PATCH net-next 4/4] ARM64: dts: marvell: Add network support for Armada 3700

2016-11-22 Thread Gregory CLEMENT
Add neta nodes for network support in the device trees for both the SoC
and the board.

Signed-off-by: Gregory CLEMENT 
---
 arch/arm64/boot/dts/marvell/armada-3720-db.dts | 23 +++
 arch/arm64/boot/dts/marvell/armada-37xx.dtsi   | 23 +++
 2 files changed, 46 insertions(+)

diff --git a/arch/arm64/boot/dts/marvell/armada-3720-db.dts b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
index 1372e9a6aaa4..c8b82e4145de 100644
--- a/arch/arm64/boot/dts/marvell/armada-3720-db.dts
+++ b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
@@ -81,3 +81,26 @@
  {
status = "okay";
 };
+
+ {
+   status = "okay";
+   phy0: ethernet-phy@0 {
+   reg = <0>;
+   };
+
+   phy1: ethernet-phy@1 {
+   reg = <1>;
+   };
+};
+
+ {
+   phy-mode = "rgmii-id";
+   phy = <>;
+   status = "okay";
+};
+
+ {
+   phy-mode = "rgmii-id";
+   phy = <>;
+   status = "okay";
+};
diff --git a/arch/arm64/boot/dts/marvell/armada-37xx.dtsi b/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
index c4762538ec01..a7278ce9e523 100644
--- a/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
+++ b/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
@@ -140,6 +140,29 @@
};
};
 
+   eth0: ethernet@3 {
+  compatible = "marvell,armada-3700-neta";
+  reg = <0x3 0x4000>;
+  interrupts = ;
+  clocks = <_periph_clk 8>;
+  status = "disabled";
+   };
+
+   mdio: mdio@32004 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "marvell,orion-mdio";
+   reg = <0x32004 0x4>;
+   };
+
+   eth1: ethernet@4 {
+   compatible = "marvell,armada-3700-neta";
+   reg = <0x4 0x4000>;
+   interrupts = ;
+   clocks = <_periph_clk 7>;
+   status = "disabled";
+   };
+
usb3: usb@58000 {
compatible = "marvell,armada3700-xhci",
"generic-xhci";
-- 
2.10.2


