Re: [PATCH (net.git) 0/3] stmmac: fix PTP support

2016-11-02 Thread Rayagond Kokatanur
On Wed, Oct 26, 2016 at 2:28 PM, Richard Cochran
 wrote:
> On Wed, Oct 26, 2016 at 08:56:01AM +0200, Giuseppe Cavallaro wrote:
>> This subset of patches aims to fix the PTP support
>> for the stmmac, especially for the 4.x chip series.
>> While setting up PTP on an ST box with a 4.00a Ethernet
>> core, the kernel panics due to broken settings
>> of the descriptors. The patches revise the
>> register configuration, the algorithm used for configuring
>> the protocol, the way the timestamp is fetched from
>> the RX/TX descriptors and, finally, the statistics
>> displayed by ethtool.
>>
>> Giuseppe Cavallaro (3):
>>   stmmac: update the PTP header file
>>   stmmac: fix PTP support for GMAC4
>>   stmmac: fix PTP type ethtool stats
>
> Acked-by: Richard Cochran 

Acked-by: Rayagond Kokatanur 



-- 
wwr
Rayagond


Re: [PATCH (net.git) 2/3] stmmac: fix PTP support for GMAC4

2016-11-02 Thread Rayagond Kokatanur
On Wed, Nov 2, 2016 at 12:04 PM, Giuseppe CAVALLARO
 wrote:
> Hello Rayagond
>
> if patches are ok, can we consider you Acked-by ?
Yes.

>
> Thx
> Peppe
>
>
> On 10/27/2016 12:51 PM, Rayagond Kokatanur wrote:
>>
>> On Thu, Oct 27, 2016 at 4:02 PM, Giuseppe CAVALLARO
>>  wrote:
>>>
>>> Hello Rayagond !
>>>
>>> On 10/27/2016 12:25 PM, Rayagond Kokatanur wrote:
>
>
> +static int dwmac4_wrback_get_rx_timestamp_status(void *desc, u32 ats)
>>
>>  {
>> struct dma_desc *p = (struct dma_desc *)desc;
>> +   int ret = -EINVAL;
>> +
>> +   /* Get the status from normal w/b descriptor */
>> +   if (likely(p->des3 & TDES3_RS1V)) {
>> +   if (likely(p->des1 & RDES1_TIMESTAMP_AVAILABLE)) {
>> +   int i = 0;
>> +
>> +   /* Check if timestamp is OK from context
>> descriptor */
>> +   do {
>> +   ret = dwmac4_rx_check_timestamp(desc);


 Here, "desc" is not pointing to the next descriptor (i.e. the context
 descriptor). The driver should check the context descriptor.
>>>
>>>
>>>
>>> you are right and this is done by the caller:  stmmac_get_rx_hwtstamp
>>
>>
>> Yes.
>>
>>>
>>> Cheers
>>> peppe
>>>
>>
>>
>>
>



-- 
wwr
Rayagond


[PATCH] igb/e1000: correct register comments

2016-11-02 Thread Cao jin
Signed-off-by: Cao jin 
---
 drivers/net/ethernet/intel/igb/e1000_regs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h b/drivers/net/ethernet/intel/igb/e1000_regs.h
index d84afdd..58adbf2 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -320,7 +320,7 @@
 #define E1000_VT_CTL   0x0581C  /* VMDq Control - RW */
 #define E1000_WUC  0x05800  /* Wakeup Control - RW */
 #define E1000_WUFC 0x05808  /* Wakeup Filter Control - RW */
-#define E1000_WUS  0x05810  /* Wakeup Status - RO */
+#define E1000_WUS  0x05810  /* Wakeup Status - R/W1C */
 #define E1000_MANC 0x05820  /* Management Control - RW */
 #define E1000_IPAV 0x05838  /* IP Address Valid - RW */
 #define E1000_WUPL 0x05900  /* Wakeup Packet Length - RW */
-- 
2.1.0





Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>On 11/1/16, 10:03 AM, Ido Schimmel wrote:
>> Hi Roopa,
>>
>> On Tue, Nov 01, 2016 at 08:14:14AM -0700, Roopa Prabhu wrote:
>>>
>[snip]
>>> I have the same concern as Eric here.
>>>
>>> I understand why you need it, but can the driver request an initial
>>> dump, and can that dump be made more efficient somehow, i.e. not hold
>>> rtnl for the whole dump, instead of having the fib notifier
>>> registration code do it?
>> We can do what we suggested in the last bi-weekly meeting, which is
>> still holding rtnl, but moving the hardware operation to delayed work.
>> This is possible because upper layers always assume operation was
>> successful and driver is responsible for invoking its abort mechanism in
>> case of failure.
>>
>>> these routing table sizes can be huge, and an analogy for this in user-space:
>>> we do request a netlink dump of routing tables at initialization (on
>>> driver starts or resets)...
>>> but existing netlink routing table dumps at that scale don't hold rtnl
>>> for the whole dump.
>>> The dump is split into multiple responses to the user and hence it does not
>>> starve other rtnl users.
>> In my reply to Eric I mentioned that when we register and unregister
>> from this chain the tables aren't really huge, but instead quite small.
>> I understand your concerns, but I don't wish to make things more
>> complicated than they should be only to address concerns that aren't
>> really realistic.
>
>I understand... but if you are adding some core infrastructure for
>switchdev, it cannot be
>based on the number of simple use-cases or the data you have today.
>
>I won't be surprised if tomorrow other switch drivers have a case where
>they need to
>reset the hw routing table state and reprogram all routes again.
>Re-registering the notifier just to
>get the routing state of the kernel will not scale. For the long term,
>since the driver does not maintain a cache,

The drivers (mlxsw, rocker) maintain a cache. So I'm not sure why you say
otherwise.


>a pull api with efficient use of rtnl will be useful for other such cases as 
>well.

How do you imagine this "pull API" would look?


>
>
>If you don't want to get to the complexity of a new api right away because of 
>the
>simple case of management interface routes you have, Can your driver register 
>the notifier early  ?
>(I am sure you have probably already thought about this)

Register early? What would that resolve? I must be missing something. We
register as early as possible. But the thing is, we cannot register
in the past. And that is what this patch resolves.


>
>>
>> I believe current patch is quite simple and also consistent with other
>> notification chains in the kernel, such as the netdevice, where rtnl is
>> held during replay of events.
>> http://lxr.free-electrons.com/source/net/core/dev.c#L1535
>as you know, netdev and routing scale are not the same thing.
>Looking at the current code for netdevices (replay and rollback on failure),
>a pull api (equivalent to the netlink dump api) may end up being less
>complex... with an
>ability to batch in the future.
>


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Tue, Nov 01, 2016 at 04:36:50PM CET, da...@davemloft.net wrote:
>From: Roopa Prabhu 
>Date: Tue, 01 Nov 2016 08:14:14 -0700
>
>> On 11/1/16, 7:19 AM, Eric Dumazet wrote:
>>> On Tue, 2016-11-01 at 00:57 +0200, Ido Schimmel wrote:
 On Mon, Oct 31, 2016 at 02:24:06PM -0700, Eric Dumazet wrote:
> How well will this work for large FIB tables?
>
> Holding rtnl while sending thousands of skb will prevent consumers from
> making progress?
 Can you please clarify what you mean by "while sending thousands of
 skb"? This patch doesn't generate notifications to user space, but
 instead invokes notification routines inside the kernel. I probably
 misunderstood you.

 Are you suggesting this be done using RCU instead? Well, there are a
 couple of reasons why I took RTNL here:

>>> No, I do not believe RCU is wanted here, in control path where we might
>>> sleep anyway.
>>>
 1) The FIB notification chain is blocking, so listeners are expected to
 be able to sleep. This isn't possible if we use RCU. Note that this
 chain is mainly useful for drivers that reflect the FIB table into a
 capable device and hardware operations usually involve sleeping.

 2) The insertion of a single route is done with RTNL held. I didn't want
 to differentiate between both cases. This property is really useful for
 listeners, as they don't need to worry about locking in writer-side.
 Access to data structs is serialized by RTNL.
>>> My concern was that for large iterations, you might hold RTNL and/or
>>> the current cpu for hundreds of ms or even seconds...
>>>
>> I have the same concern as Eric here.
>> 
>> I understand why you need it, but can the driver request an initial
>> dump, and can that dump be made more efficient somehow, i.e. not hold
>> rtnl for the whole dump, instead of having the fib notifier
>> registration code do it?
>> 
>> these routing table sizes can be huge, and an analogy for this in user-space:
>> we do request a netlink dump of routing tables at initialization (on driver
>> starts or resets)...
>> but existing netlink routing table dumps at that scale don't hold rtnl for
>> the whole dump.
>> The dump is split into multiple responses to the user and hence it does not
>> starve other rtnl users.
>> 
>> In fact, I don't think netlink routing table dumps from user-space hold
>> rtnl_lock for the whole dump.
>> IIRC this was done to allow route adds/dels to proceed in parallel for
>> performance reasons.
>> (I will need to double check to confirm this.)
>
>I've always had some reservations about using notifiers for getting
>the FIB entries down to the offloaded device.

Yeah, me too. But there is really no other way to do it. I thought about
it for quite a long time. But maybe I missed something.


>
>And this problem is just another symptom that it is the wrong
>mechanism for propagating this information.
>
>As suggested by Roopa here, perhaps we're looking at the problem from
>the wrong direction.  We tend to design NDO ops and notifiers, to
>"push" things to the driver, but maybe something more like a push+pull
>model is better.

How do you imagine this model would look? Could you sketch an
example for me?

Thanks!


Re: Let's do P4

2016-11-02 Thread Jiri Pirko
Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastab...@gmail.com wrote:
>[...]
>
 P4 is meant to program programmable hw, not a fixed pipeline.

>>>
>>> I'm guessing there are no upstream drivers at the moment that support
>>> this though right? The rocker universe bits though could leverage this.
>> 
>> mlxsw. But this is naturally not implemented yet, as there is no
>> infrastructure.
>
>Really? What is re-programmable?
>
>Can the parser support an arbitrary parse graph?
>Can the table topology be reconfigured?
>Can new tables be created?
>What about "new" actions being defined at configuration time?
>
>Or is this just the normal TCAM configuration of defining key widths and
>fields.

At this point TCAM configuration.


>
>> 
>> 
>>>

>
>>
>>> since I cannot see how one can put the whole p4 language compiler
>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>> done by firmware, which will be running full compiler in an embedded cpu
>>
>> In case of mlxsw, that compiler would be in driver.
>>
>>
>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>> have a clue what HW capabilities actually are and won't be able to fine
>>> grain control them.
>>> Please correct me if I'm wrong.
>>
>> You are wrong. By your definition, everything has to be figured out in
>> driver and FW does nothing. Otherwise it could do "something else" and
>> that would be a bypass? Does not make any sense to me whatsoever.
>>
>>
>>>
 Plus the thing I cannot imagine in the model you propose is table 
 fillup.
 For ebpf, you use maps. For p4 you would have to have a separate 
 HW-only
 API. This is very similar to the original John's Flow-API. And 
 therefore
 a kernel bypass.
>>>
>>> I think John's flow api is a better way to expose mellanox switch 
>>> capabilities.
>>
>> We are under impression that p4 suits us nicely. But it is not about
>> us, it is about finding the common way to do this.
>>
>
> I'll just poke at my FlowAPI question again. For fixed ASICS what is
> the Flow-API missing. We have a few proof points that show it is both
> sufficient and usable for the handful of use cases we care about.

 Yeah, it is most probably fine. Even for flex ASICs to some point. The
 question is how it stands in comparison to other alternatives, like p4

>>>
>>> Just to be clear the Flow-API _was_ generated from the initial P4 spec.
>>> The header files and tools used with it were autogenerated ("compiled"
>>> in a loose sense) from the P4 program. The piece I never exposed
>>> was the set_* operations to reconfigure running systems. I'm not sure
>>> how valuable this is in practice though.
>>>
>>> Also there is a P4-16 spec that will be released shortly that is more
>>> flexible and also more complex.
>> 
>> Would it be able to easily extend the Flow-API to include the changes?
>> 
>
>P4-16 will allow externs, "functions" to execute in the control flow and
>possibly inside the parse graph. None of this was considered in the
>Flow-API. So none of this is supported.
>
>I still have the question: are you trying to push the "programming" of
>the device via 'tc', or just the runtime configuration of tables? If it
>is just runtime, the Flow-API is sufficient IMO. If it's programming the
>device using the complete P4-16 spec then no, it's not sufficient. But

Sure we need both.


>I don't believe vendors will expose the complete programmability of the
>device in the driver, this is going to look more like a fw update than
>a runtime change at least on the devices I'm aware of.

Depends on the driver. I think it is fine if the driver processes it into
some hw configuration sequence or simply pushes the program down to fw.
Both use cases are legit.


>
>> 
>>>

>
>>
>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>> that justifies such a 'swear word' ;)
>>
>> John's Flow-API was a kernel bypass. Why? It was an API specifically
>> designed to work directly with HW tables, without the kernel being involved.
>
> I don't think that is a fair definition of HW bypass. The SKIP_SW flag
> does exactly that for 'tc' based offloads and it was not rejected.

 No, no, no. You still have possibility to do the same thing in kernel,
 same functionality, with the same API. That is a big difference.


>
> The _real_ reason that seems to have fallen out of this and other
> discussions is that the Flow-API didn't provide an in-kernel translation
> into an emulated path. Note we always had a usermode translation to eBPF.
> A secondary reason appears to be overhead of adding yet another netlink
> family.

 Yeah. Maybe you remember, back when the Flow-API was being discussed,
 I suggested wrapping it under TC as cls_xflows and cls_xflo

Re: Let's do P4

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 03:29:23AM CET, dan...@iogearbox.net wrote:
>On 10/31/2016 10:39 AM, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoi...@gmail.com wrote:
>> > On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>> > > Sun, Oct 30, 2016 at 11:26:49AM CET, tg...@suug.ch wrote:
>> > > > On 10/30/16 at 08:44am, Jiri Pirko wrote:
>> > > > > Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastab...@gmail.com wrote:
>> > > > > > On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> > > > > > > On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> > > > > > > > Hi all.
>> > 
>> > sorry for delay. travelling to KS, so probably missed something in
>> > this thread and comments can be totally off...
>> > 
>> > the subject "let's do P4" is imo misleading, since it reads like
>> > we don't do P4 at the moment, whereas the opposite is true.
>> > Several p4->bpf compilers are proof of that.
>> 
>> We don't do p4 in kernel now, we don't do p4 offloading now. That is
>> the reason I started this discussion.
>> 
>> > > The network world is divided into 2 general types of hw:
>> > > 1) network ASICs - network specific silicon, containing things like TCAM
>> > > These ASICs are suitable to be programmed by P4.
>> > 
>> > i think the opposite is the case in case of P4.
>> > when hw asic has tcam it's still far far away from being usable with P4
>> > which requires fully programmable protocol parser, arbitrary tables and so 
>> > on.
>> > P4 doesn't even define TCAM as a table type. The p4 program can declare
>> > a desired algorithm of search in the table and compiler has to figure out
>> > what HW resources to use to satisfy such p4 program.
>> > 
>> > > 2) network processors - basically a general purpose CPUs
>> > > These processors are suitable to be programmed by eBPF.
>> > 
>> > I think this statement is also misleading, since it positions
>> > p4 and bpf as competitors whereas that's not the case.
>> > p4 is the language. bpf is an instruction set.
>> 
>> I wanted to say that we have 2 approaches in silicon, 2 different
>> paradigms. Sure you can do p4->bpf. But it is hard to do it the opposite way.
>> 
>> > > Exactly. The following drawing shows the p4 pipeline setup for SW and HW:
>> > > 
>> > >   |
>> > >   |   +--> ebpf engine
>> > >   |   |
>> > >   |   |
>> > >   |   compilerB
>> > >   |   ^
>> > >   |   |
>> > > p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>> > >   |
>> > > userspace | kernel
>> > >   |
>
>Sorry for jumping into the middle and the delay (plumbers this week). My
>question would be, if the main target is for p4 *offloading* anyway, who
>would use this sw fallback path? Mostly for testing purposes?

Development and testing purposes, yes.


>
>I'm not sure about compilerB here and the complexity that needs to be
>pushed into the kernel along with it. I would assume this would result
>in slower code than what the existing P4 -> eBPF front ends for LLVM
>would generate since it could perform all kind of optimizations there,

The complexity would be similar to compilerC's. For compilerB,
optimizations do not really matter, as it is mainly for testing.


>that might not be feasible for doing inside the kernel. Thus, if I'd want
>to do that in sw, I'd just use the existing LLVM facilities instead and
>go via cls_bpf in that case.
>
>What is your compilerA? Is that part of tc in user space? Maybe linked

It is something that transforms the original p4 source into some
intermediate form that is easy for the in-kernel compilers to process.


>against LLVM lib, for example? If you really want some sw path, can't tc
>do this transparently from user space instead when it gets a netlink error
>that it cannot get offloaded (and thus switch internally to f_bpf's loader)?

In real life, the user will most probably use p4 for hw programming, but
the sw fallback will be done in bpf directly. In that case, he would use
cls_bpf SKIP_HW
cls_p4 SKIP_SW

But in order to allow cls_p4 offloading to hw, we need an in-kernel
interpreter. The purpose of compilerB is to take advantage of bpf for
that, but the in-kernel interpreter could be implemented differently.


Re: [PATCH net-next V2 3/3] net/mlx4_en: Add ethtool statistics for XDP cases

2016-11-02 Thread Tariq Toukan

Hi Brenden,

On 01/11/2016 11:06 PM, Brenden Blanco wrote:

> On Tue, Nov 01, 2016 at 01:36:26PM +0200, Tariq Toukan wrote:
>
>> XDP statistics are reported in ethtool as follows:
>> - xdp_drop: the number of packets dropped by xdp.
>> - xdp_tx: the number of packets forwarded by xdp.
>> - xdp_tx_full: the number of times an xdp forward failed
>>   due to a full tx xdp ring.
>>
>> In addition, all packets that are dropped/forwarded by XDP
>> are no longer accounted in rx_packets/rx_bytes of the ring,
>> so that those counters only reflect traffic that is passed to the stack.
>
> This seems like a step backwards, in that I now no longer have any
> statistic whatsoever that can count xdp packets per-ring. For instance,
> how would I validate that my flow-hash rules are operating correctly? I
> would suggest to restore the rxN_packet/bytes stat increment.

The per-ring counters are there, and I meant to expose them. Somehow
they were missed.

I'll add them now.
They're going to be like this:
rx0_xdp_drop
rx0_xdp_tx
rx0_xdp_tx_full


>> Signed-off-by: Tariq Toukan 
>> ---
>>   drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 14 ++
>>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  4 
>>   drivers/net/ethernet/mellanox/mlx4/en_port.c|  6 ++
>>   drivers/net/ethernet/mellanox/mlx4/en_rx.c  | 12 +++-
>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c  |  8 
>>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  7 ++-
>>   drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h | 10 +-
>>   7 files changed, 50 insertions(+), 11 deletions(-)
>>
>> [...]

Thanks for your comment.

Regards,
Tariq


Re: [PATCH net] net: Check for fullsock in sock_i_uid()

2016-11-02 Thread Eric Dumazet
On Tue, 2016-11-01 at 23:27 -0600, Subash Abhinov Kasiviswanathan wrote:
> sock_i_uid() acquires the sk_callback_lock which does not exist
> for sockets in TCP_NEW_SYN_RECV state. This results in errors
> showing up as spinlock bad magic.
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> Cc: Eric Dumazet 
> ---
>  net/core/sock.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c73e28f..af15ef0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1727,7 +1727,10 @@ void sock_efree(struct sk_buff *skb)
>  
>  kuid_t sock_i_uid(struct sock *sk)
>  {
> - kuid_t uid;
> + kuid_t uid = GLOBAL_ROOT_UID;
> +
> + if (!sk_fullsock(sk))
> + return uid;
>  
>   read_lock_bh(&sk->sk_callback_lock);
>   uid = sk->sk_socket ? SOCK_INODE(sk->sk_socket)->i_uid : 
> GLOBAL_ROOT_UID;


This would be a bug in the caller.

Can you give us the complete stack trace leading to the problem you
had?

Thanks !




[PATCH V2] mac80211: Ignore VHT IE from peer with wrong rx_mcs_map

2016-11-02 Thread Filip Matusiak
This is a workaround for VHT-enabled STAs which break the spec
and have the VHT-MCS Rx map filled in with value 3 for all eight
spatial streams; an example is AR9462 in AP mode.

As per the spec, in section 22.1.1 Introduction to the VHT PHY:
A VHT STA shall support at least single spatial stream VHT-MCSs
0 to 7 (transmit and receive) in all supported channel widths.

Some devices in STA mode will hit a firmware assert when trying to
associate; examples are QCA9377 & QCA6174.

Packet example of broken VHT Cap IE of AR9462:

Tag: VHT Capabilities (IEEE Std 802.11ac/D3.1)
Tag Number: VHT Capabilities (IEEE Std 802.11ac/D3.1) (191)
Tag length: 12
VHT Capabilities Info: 0x
VHT Supported MCS Set
Rx MCS Map: 0x
   ..11 = Rx 1 SS: Not Supported (0x0003)
   11.. = Rx 2 SS: Not Supported (0x0003)
  ..11  = Rx 3 SS: Not Supported (0x0003)
  11..  = Rx 4 SS: Not Supported (0x0003)
 ..11   = Rx 5 SS: Not Supported (0x0003)
 11..   = Rx 6 SS: Not Supported (0x0003)
..11    = Rx 7 SS: Not Supported (0x0003)
11..    = Rx 8 SS: Not Supported (0x0003)
...0    = Rx Highest Long GI Data Rate (in Mb/s, 0 = 
subfield not in use): 0x
Tx MCS Map: 0x
...0    = Tx Highest Long GI Data Rate  (in Mb/s, 0 = 
subfield not in use): 0x

Signed-off-by: Filip Matusiak 
---
 net/mac80211/vht.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/net/mac80211/vht.c b/net/mac80211/vht.c
index ee71576..6832bf6 100644
--- a/net/mac80211/vht.c
+++ b/net/mac80211/vht.c
@@ -270,6 +270,22 @@ ieee80211_vht_cap_ie_to_sta_vht_cap(struct ieee80211_sub_if_data *sdata,
vht_cap->vht_mcs.tx_mcs_map |= cpu_to_le16(peer_tx << i * 2);
}
 
+   /*
+* This is a workaround for VHT-enabled STAs which break the spec
+* and have the VHT-MCS Rx map filled in with value 3 for all eight
+* spatial streams, an example is AR9462.
+*
+* As per spec, in section 22.1.1 Introduction to the VHT PHY
+* A VHT STA shall support at least single spatial stream VHT-MCSs
+* 0 to 7 (transmit and receive) in all supported channel widths.
+*/
+   if (vht_cap->vht_mcs.rx_mcs_map == cpu_to_le16(0x)) {
+   vht_cap->vht_supported = false;
+   sdata_info(sdata, "Ignoring VHT IE from %pM due to invalid rx_mcs_map\n",
+  sta->addr);
+   return;
+   }
+
/* finally set up the bandwidth */
switch (vht_cap->cap & IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_MASK) {
case IEEE80211_VHT_CAP_SUPP_CHAN_WIDTH_160MHZ:
-- 
2.7.4



Re: [PATCH net-next 07/11] net: dsa: mv88e6xxx: add port link setter

2016-11-02 Thread Andrew Lunn
On Wed, Nov 02, 2016 at 02:07:09AM +0100, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn  writes:
> 
> >> +#define LINK_UNKNOWN  -1
> >> +
> >> +  /* Port's MAC link state
> >> +   * LINK_UNKNOWN for normal link detection, 0 to force link down,
> >> +   * otherwise force link up.
> >> +   */
> >> +  int (*port_set_link)(struct mv88e6xxx_chip *chip, int port, int link);
> >
> > Maybe LINK_AUTO would be better than UNKNOWN? Or LINK_UNFORCED.
> 
> I used LINK_UNKNOWN to be consistent with the supported SPEED_UNKNOWN
> and DUPLEX_UNKNOWN values of PHY devices.

Hi Vivien

These are, I think, for reporting back to user space what duplex or link
is currently being used. But here you are setting, not
reporting. Setting something to an unknown state is rather odd, and in
fact, it is not unknown, it is unforced.

  Andrew


[PATCH net 1/1] driver: veth: Return the actual value instead return NETDEV_TX_OK always

2016-11-02 Thread fgao
From: Gao Feng 

Currently veth_xmit always returns NETDEV_TX_OK, whether or not the
packet was really sent successfully. Return the actual value instead
of always returning NETDEV_TX_OK.

Signed-off-by: Gao Feng 
---
 drivers/net/veth.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index fbc853e..769a3bd 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -111,15 +111,18 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
struct veth_priv *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   int ret = NETDEV_TX_OK;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
if (unlikely(!rcv)) {
kfree_skb(skb);
+   ret = NET_RX_DROP;
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   ret = dev_forward_skb(rcv, skb);
+   if (likely(ret == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -131,7 +134,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
atomic64_inc(&priv->dropped);
}
rcu_read_unlock();
-   return NETDEV_TX_OK;
+   return ret;
 }
 
 /*
-- 
1.9.1




Re: [RFC PATCH v2 2/5] net: phy: Add Meson GXL Internal PHY driver

2016-11-02 Thread Neil Armstrong
On 10/31/2016 08:05 PM, Andrew Lunn wrote:
> On Mon, Oct 31, 2016 at 05:56:24PM +0100, Neil Armstrong wrote:
>> Add driver for the Internal RMII PHY found in the Amlogic Meson GXL SoCs.
>>
>> This PHY seems to only implement some standard registers and needs some
>> workarounds to provide autoneg values from vendor registers.
>>
>> Some magic values are currently used to configure the PHY; this is a
>> temporary setup until clarification about these register names and
>> register fields is provided by Amlogic.
>>
>> Signed-off-by: Neil Armstrong 
>> ---
>>  drivers/net/phy/Kconfig |  5 +++
>>  drivers/net/phy/Makefile|  1 +
>>  drivers/net/phy/meson-gxl.c | 81 
>> +
>>  3 files changed, 87 insertions(+)
>>  create mode 100644 drivers/net/phy/meson-gxl.c
>>
>> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
>> index 2651c8d..09342b6 100644
>> --- a/drivers/net/phy/Kconfig
>> +++ b/drivers/net/phy/Kconfig
>> @@ -226,6 +226,11 @@ config DP83867_PHY
>>  ---help---
>>Currently supports the DP83867 PHY.
>>  
>> +config MESON_GXL_PHY
>> +tristate "Amlogic Meson GXL Internal PHY"
>> +---help---
>> +  Currently supports the Amlogic Meson GXL Internal PHY.
>> +
> 
> Hi Neil
> 
> Please keep them in alphabetic order. This goes after Marvell.
> 
>>  config FIXED_PHY
>>  tristate "MDIO Bus/PHY emulation with fixed speed/link PHYs"
>>  depends on PHYLIB
>> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
>> index e58667d..1511b3e 100644
>> --- a/drivers/net/phy/Makefile
>> +++ b/drivers/net/phy/Makefile
>> @@ -44,6 +44,7 @@ obj-$(CONFIG_MARVELL_PHY)  += marvell.o
>>  obj-$(CONFIG_MICREL_KS8995MA)   += spi_ks8995.o
>>  obj-$(CONFIG_MICREL_PHY)+= micrel.o
>>  obj-$(CONFIG_MICROCHIP_PHY) += microchip.o
>> +obj-$(CONFIG_MESON_GXL_PHY) += meson-gxl.o
>>  obj-$(CONFIG_MICROSEMI_PHY) += mscc.o
> 
> Again, alphabetic order.
> 
>Andrew
> 

Sorry, rebase issue.

Neil


Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Hannes Frederic Sowa
Hi Tom,

On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa
>  wrote:
> > On 01.11.2016 21:59, Thomas Graf wrote:
> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa
> >>  wrote:
> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>  If I understand you correctly then a single BPF program would be
>  loaded which then applies to all dst_output() calls? This has a huge
>  drawback, instead of multiple small BPF programs which do exactly what
>  is required per dst, a large BPF program is needed which matches on
>  metadata. That's way slower and renders one of the biggest advantages
>  of BPF invalid, the ability to generate a small program tailored to
>  a particular use. See Cilium.
> >>>
> >>> I thought more of hooks in the actual output/input functions specific to
> >>> the protocol type (unfortunately again) protected by jump labels? Those
> >>> hook get part of the dst_entry mapped so they can act on them.
> >>
> >> This has no advantage over installing a BPF program at tc egress and
> >> enabling to store/access metadata per dst. The whole point is to
> >> execute bpf for a specific route.
> >
> > The advantage I saw here was that in your proposal the tc egress path
> > would have to be chosen by a route. Otherwise I would already have
> > proposed it. :)
> >
> >>> Another idea would be to put the eBPF hooks into the fib rules
> >>> infrastructure. But I fear this wouldn't get you the hooks you were
> >>> looking for? There they would only end up in the runtime path if
> >>> actually activated.
> >>
> >> Use of fib rules kills performance so it's not an option. I'm not even
> >> sure that would be any simpler.
> >
> > It very much depends on the number of rules installed. If there are just
> > several very few rules, it shouldn't hurt performance that much (but
> > haven't verified).
> >
> Hannes,
> 
> I can say that the primary value we get out of using ILA+LWT is that
> we can essentially cache a policy decision in connected sockets. That
> is we are able to create a host route for each destination (thousands
> of them) that describes how to do the translation for each one. There
> is no route lookup per packet, and actually no extra lookup otherwise.

Exactly, that is why I like LWT, and the dst_entry socket caching
shows its benefits here. Also, the dst_entries communicate enough vital
information up the stack so that allocation of sk_buffs is done
according to the headers that might need to be inserted later on.

(On the other hand, the looked up BPF program can also be cached. This
becomes more difficult if we can't share the socket structs between
namespaces though.)

> The translation code doesn't do much at all, basically just copies in
> new destination to the packet. We need a route lookup for the
> rewritten destination, but that is easily cached in the LWT structure.
> The net result is that the transmit path for ILA is _really_ fast. I'm
> not sure how we can match this same performance tc egress, it seems
> like we would want to cache the matching rules in the socket to avoid
> rule lookups.
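Tom's per-destination host routes can be illustrated with iproute2's ILA encap support; the addresses, locators, and exact option set below are placeholders for illustration, not a working deployment:

```shell
# One /128 host route per destination caches the translation decision;
# connected sockets then hold the dst entry, so there is no per-packet
# policy lookup on transmit.
ip -6 route add 3333:0:0:1:5555:0:1:0/128 \
        encap ila 2001:0:0:1 via 2401:db00:20:911a::27:0

# Thousands of such routes can coexist, each an independently cached
# rewrite to the locator of the machine hosting that identifier.
ip -6 route add 3333:0:0:1:5555:0:2:0/128 \
        encap ila 2001:0:0:2 via 2401:db00:20:911a::27:0
```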

In case of namespaces, do you allocate the host routes in the parent or
child (net-)namespaces? Or are we not talking about namespaces at all here?

Why do we want to do the packet manipulation in tc egress rather than using
LWT + interfaces? The dst_entries should be able to express all possible
allocation strategies etc. so that we don't need to shift/reallocate
packets around when inserting an additional header. We can't express
those semantics with tc egress.

> On the other hand, I'm not really sure how to implement this level
> of performance in LWT+BPF either. It seems like one way to do
> that would be to create a program for each destination and set it on each
> host. As you point out, that would create a million different programs, which
> doesn't seem manageable. I don't think the BPF map works either, since
> that implies we need a lookup (?). It seems like what we need is one
> program, but allow it to be parameterized with per-destination
> information saved in the route (LWT structure).

Yes, that is my proposal. Just using the dst entry as meta-data (which
can actually also be an ID for the network namespace the packet is
coming from).

My concern with using BPF is that the rest of the kernel doesn't really
see the semantics and can't optimize or cache at specific points,
because the kernel cannot introspect what the BPF program does (for
metadata manipulation, one can e.g. specify that the program is "pure"
and always provides the same output for a given input, thus
things can be cached and memoized, but that framework seems very hard
to build).

That's why I am in favor of splitting this patchset up and letting the
policies that should be expressed by BPF programs be applied to the
specific subsystems (I am not totally against a generic BPF hook in
input or output of the protocol

[PATCH net-next v2] mlxsw: Remove unused including

2016-11-02 Thread Wei Yongjun
From: Wei Yongjun 

Remove including  that don't need it.

Signed-off-by: Wei Yongjun 
---
v1 -> v2: remove from spectrum.c and switchx2.c
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 1 -
 drivers/net/ethernet/mellanox/mlxsw/switchib.c | 1 -
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 1 -
 3 files changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 8bca020..a5433e4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -54,7 +54,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchib.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchib.c
index ec0b27e..1552594 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchib.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchib.c
@@ -43,7 +43,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "pci.h"
 #include "core.h"
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c 
b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index 5208764..60f19fb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -45,7 +45,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "pci.h"
 #include "core.h"



[patch net-next 2/2] [PATCH] net: ip, raw_diag -- Use jump for exiting from nested loop

2016-11-02 Thread Cyrill Gorcunov
I managed to miss that sk_for_each is called inside a "for" loop, so we
need to use a goto here to return the matching socket.

CC: David S. Miller 
CC: Eric Dumazet 
CC: David Ahern 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---
 net/ipv4/raw_diag.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-ml.git/net/ipv4/raw_diag.c
===
--- linux-ml.git.orig/net/ipv4/raw_diag.c
+++ linux-ml.git/net/ipv4/raw_diag.c
@@ -79,10 +79,11 @@ static struct sock *raw_sock_get(struct
 * hashinfo->lock here.
 */
sock_hold(sk);
-   break;
+   goto out_unlock;
}
}
}
+out_unlock:
read_unlock(&hashinfo->lock);
 
return sk ? sk : ERR_PTR(-ENOENT);



[patch net-next 0/2] Fixes for raw diag sockets handling

2016-11-02 Thread Cyrill Gorcunov
Hi! Here are a few fixes for raw-diag socket handling: a missing
sock_put call and a jump for exiting from a nested loop. I made
patches for iproute2 as well, so I will send them out soon.

Also I have a question about socket lookup, not only for raw diag
(though I didn't modify the lookup procedure) but in general: the structure
inet_diag_req_v2 has an inet_diag_sockid::idiag_if member which is supposed
to carry an interface index from the userspace request.

Then, for example in INET_MATCH (include/net/inet_hashtables.h),
the __dif parameter (which is @idiag_if) is compared with @sk_bound_dev_if
*iff* sk_bound_dev_if has ever been set. Thus if, say, someone
looks up a particular device with a specified index, and the
rest of the parameters match but SO_BINDTODEVICE has never been called
for this socket, we return the socket even if idiag_if is not zero.
Is it supposed to be so? Or am I missing something obvious?

I mean this snippet


 (!(__sk)->sk_bound_dev_if  ||  \
   ((__sk)->sk_bound_dev_if == (__dif)))&&  \

when someone calls for destroying sockets on a particular interface and
@__dif != 0, the match may return a socket where sk_bound_dev_if == 0
instead of a completely matching one. Isn't that so?

Cyrill


[patch net-next 1/2] [PATCH] net: ip, raw_diag -- Fix socket leaking for destroy request

2016-11-02 Thread Cyrill Gorcunov
In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.

CC: David S. Miller 
CC: Eric Dumazet 
CC: David Ahern 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---
 net/ipv4/raw_diag.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-ml.git/net/ipv4/raw_diag.c
===
--- linux-ml.git.orig/net/ipv4/raw_diag.c
+++ linux-ml.git/net/ipv4/raw_diag.c
@@ -205,11 +205,14 @@ static int raw_diag_destroy(struct sk_bu
 {
struct net *net = sock_net(in_skb->sk);
struct sock *sk;
+   int err;
 
sk = raw_sock_get(net, r);
if (IS_ERR(sk))
return PTR_ERR(sk);
-   return sock_diag_destroy(sk, ECONNABORTED);
+   err = sock_diag_destroy(sk, ECONNABORTED);
+   sock_put(sk);
+   return err;
 }
 #endif
 



Re: [PATCH net-next v2] mlxsw: Remove unused including

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 01:49:57PM CET, weiyj...@gmail.com wrote:
>From: Wei Yongjun 
>
>Remove including  that don't need it.
>
>Signed-off-by: Wei Yongjun 

Acked-by: Jiri Pirko 


[PATCH net-next iproute2 1/2 v2] libnetlink: Add test for error code returned from netlink reply

2016-11-02 Thread Cyrill Gorcunov
In case some diag module is not present in the system
(say the kernel is not modern enough), we simply skip the
reported error code. Instead we should check the data
length in NLMSG_DONE and handle the unsupported case.

Signed-off-by: Cyrill Gorcunov 
---
 lib/libnetlink.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 2279935..232daee 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -312,6 +312,22 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
dump_intr = 1;
 
		if (h->nlmsg_type == NLMSG_DONE) {
+			if (rth->proto == NETLINK_SOCK_DIAG) {
+				if (h->nlmsg_len < NLMSG_LENGTH(sizeof(int))) {
+					fprintf(stderr, "DONE truncated\n");
+					return -1;
+				} else {
+					int len = *(int *)NLMSG_DATA(h);
+					if (len < 0) {
+						errno = -len;
+						if (errno == ENOENT ||
+						    errno == EOPNOTSUPP)
+							return -1;
+						perror("RTNETLINK answers");
+						return len;
+					}
+				}
+			}
			found_done = 1;
			break; /* process next filter */
		}
-- 
2.7.4



[PATCH net-next iproute2 0/2 v2] Add support for operating raw sockets via diag interface

2016-11-02 Thread Cyrill Gorcunov
The diag interface for raw sockets is now in linux-net-next
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=432490f9d455fb842d70219f22d9d2c812371676
so here are early patches for misc/ss.

I've had to update the libnetlink code to keep backward compatibility and
switch to parsing procfs output if no raw_diag module is present in the
system. Note that this error reporting has been sitting in the kernel since
2006, so it's not something new; I guess this hasn't been done for other diag
modules because it was assumed that they are always there, which doesn't
apply to the fresh raw-diag module.

Cyrill
-- 
2.7.4



[PATCH net-next iproute2 PATCH 2/2 v2] ss: Add inet raw sockets information gathering via netlink diag interface

2016-11-02 Thread Cyrill Gorcunov
unix, tcp, udp[lite], packet, netlink sockets already support diag
interface for their collection and killing. Implement support
for raw sockets.

Signed-off-by: Cyrill Gorcunov 
---
 include/linux/inet_diag.h | 15 +++
 misc/ss.c | 20 ++--
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index f5f5c1b..ac66148 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -43,6 +43,21 @@ struct inet_diag_req_v2 {
struct inet_diag_sockid id;
 };
 
+/*
+ * An alias for struct inet_diag_req_v2,
+ * @sdiag_raw_protocol member shadows
+ * @pad explicitly, it is done this way
+ * for backward compatibility sake.
+ */
+struct inet_diag_req_raw {
+   __u8sdiag_family;
+   __u8sdiag_protocol;
+   __u8idiag_ext;
+   __u8sdiag_raw_protocol;
+   __u32   idiag_states;
+   struct inet_diag_sockid id;
+};
+
 enum {
INET_DIAG_REQ_NONE,
INET_DIAG_REQ_BYTECODE,
diff --git a/misc/ss.c b/misc/ss.c
index dd77b81..e8c4010 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -724,6 +724,7 @@ struct sockstat {
struct sockstat*next;
unsigned inttype;
uint16_tprot;
+   uint16_traw_prot;
inet_prefix local;
inet_prefix remote;
int lport;
@@ -2190,6 +2191,10 @@ static void parse_diag_msg(struct nlmsghdr *nlh, struct 
sockstat *s)
s->mark = 0;
if (tb[INET_DIAG_MARK])
s->mark = *(__u32 *) RTA_DATA(tb[INET_DIAG_MARK]);
+   if (tb[INET_DIAG_PROTOCOL])
+   s->raw_prot = *(__u8 *)RTA_DATA(tb[INET_DIAG_PROTOCOL]);
+   else
+   s->raw_prot = 0;
 
if (s->local.family == AF_INET)
s->local.bytelen = s->remote.bytelen = 4;
@@ -2384,7 +2389,7 @@ struct inet_diag_arg {
struct rtnl_handle *rth;
 };
 
-static int kill_inet_sock(struct nlmsghdr *h, void *arg)
+static int kill_inet_sock(struct nlmsghdr *h, void *arg, struct sockstat *s)
 {
struct inet_diag_msg *d = NLMSG_DATA(h);
struct inet_diag_arg *diag_arg = arg;
@@ -2399,6 +2404,13 @@ static int kill_inet_sock(struct nlmsghdr *h, void *arg)
req.r.sdiag_protocol = diag_arg->protocol;
req.r.id = d->id;
 
+   if (diag_arg->protocol == IPPROTO_RAW) {
+   struct inet_diag_req_raw *raw = (void *)&req.r;
+
+   BUILD_BUG_ON(sizeof(req.r) != sizeof(*raw));
+   raw->sdiag_raw_protocol = s->raw_prot;
+   }
+
return rtnl_talk(rth, &req.nlh, NULL, 0);
 }
 
@@ -2418,7 +2430,7 @@ static int show_one_inet_sock(const struct sockaddr_nl 
*addr,
if (diag_arg->f->f && run_ssfilter(diag_arg->f->f, &s) == 0)
return 0;
 
-   if (diag_arg->f->kill && kill_inet_sock(h, arg) != 0) {
+   if (diag_arg->f->kill && kill_inet_sock(h, arg, &s) != 0) {
if (errno == EOPNOTSUPP || errno == ENOENT) {
/* Socket can't be closed, or is already closed. */
return 0;
@@ -2715,6 +2727,10 @@ static int raw_show(struct filter *f)
 
dg_proto = RAW_PROTO;
 
+   if (!getenv("PROC_NET_RAW") && !getenv("PROC_ROOT") &&
+   inet_show_netlink(f, NULL, IPPROTO_RAW) == 0)
+   return 0;
+
if (f->families&(1<

Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Roopa Prabhu
On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>
[snip]

>>I understand..but, if you are adding some core infrastructure for switchdev 
>>..it cannot be
>>based on the number of simple use-cases or data you have today.
>>
>>I won't be surprised if tomorrow other switch drivers have a case where they 
>>need to
>>reset the hw routing table state and reprogram all routes again. 
>>Re-registering the notifier to just
>>get the routing state of the kernel will not scale. For the long term, since 
>>the driver does not maintain a cache,
>
> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
> otherwise.
>
>
>>a pull api with efficient use of rtnl will be useful for other such cases as 
>>well.
>
> How do you imagine this "pull API" should look like?


Just like you already have added fib notifiers to parallel fib netlink
notifications, the pull API is a parallel to 'netlink dump'.
Is my imagination too wild? :)


>
>
>>
>>
>>If you don't want to get to the complexity of a new api right away because of 
>>the
>>simple case of management interface routes you have, Can your driver register 
>>the notifier early  ?
>>(I am sure you have probably already thought about this)
>
> Register early? What it would resolve? I must be missing something. We
> register as early as possible. But the thing is, we cannot register
> in a past. And that is what this patch resolves.

sure, you must have a valid problem then. I was just curious why
your driver is not up and initialized before any of the addresses or
routes get configured in the system (even on a management port). Ours
is. But I agree there can be races and you cannot always guarantee that
(I was just responding to Ido's comment about adding complexity for a
small problem he has to solve for management routes). Our driver does
a pull before it starts. This helps when we want to reset the hardware
routing table state too.


But my point was: when you are defining an API, you cannot restrict
the 'past' to be just the very recent past, or 'the past is just the
management routes that were added'. Tomorrow the 'past' can be the
full routing table, if you need to reset the hardware state.


mlx5: ifup failure due to huge allocation

2016-11-02 Thread Sebastian Ott
Hi,

Ifup on an interface provided by CX4 (MLX5 driver) on s390 fails with:

[   22.318553] [ cut here ]
[   22.318564] WARNING: CPU: 1 PID: 399 at mm/page_alloc.c:3421 
__alloc_pages_nodemask+0x2ee/0x1298
[   22.318568] Modules linked in: mlx4_ib ib_core mlx5_core mlx4_en mlx4_core 
[...]
[   22.318610] CPU: 1 PID: 399 Comm: NetworkManager Not tainted 4.8.0 #13
[   22.318614] Hardware name: IBM  2964 N96  704
  (LPAR)
[   22.318618] task: dbe1c008 task.stack: dd9e4000
[   22.318622] Krnl PSW : 0704c0018000 002a427e 
(__alloc_pages_nodemask+0x2ee/0x1298)
[   22.318631]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 
RI:0 EA:3
   Krnl GPRS:  00ceb4d4 024080c0 
0001
[   22.318640]002a4204 a410 001f 
0001
[   22.318644]024080c0 0009  

[   22.318648]a400 0088ea30 002a4204 
dd9e7060
[   22.318660] Krnl Code: 002a4272: a7740592brc 7,2a4d96
  002a4276: 92011000mvi 0(%r1),1
 #002a427a: a7f40001brc 
15,2a427c
 >002a427e: a7f4058cbrc 
15,2a4d96
  002a4282: 5830f0b4l   
%r3,180(%r15)
  002a4286: 5030f0ecst  
%r3,236(%r15)
  002a428a: 1823lr  %r2,%r3
  002a428c: a53e0048llilh   %r3,72
[   22.318695] Call Trace:
[   22.318700] ([<002a4204>] __alloc_pages_nodemask+0x274/0x1298)
[   22.318706] ([<0030dac0>] alloc_pages_current+0x1c0/0x268)
[   22.318712] ([<00135aa6>] s390_dma_alloc+0x6e/0x1e0)
[   22.318733] ([<03ff8015474c>] mlx5_dma_zalloc_coherent_node+0xb4/0xf8 
[mlx5_core])
[   22.318748] ([<03ff80154c58>] mlx5_buf_alloc_node+0x70/0x108 [mlx5_core])
[   22.318765] ([<03ff8015fe06>] mlx5_cqwq_create+0xf6/0x180 [mlx5_core])
[   22.318783] ([<03ff8016654c>] mlx5e_open_cq+0xac/0x1e0 [mlx5_core])
[   22.318802] ([<03ff801693e6>] mlx5e_open_channels+0xe66/0xeb8 
[mlx5_core])
[   22.318820] ([<03ff8016982e>] mlx5e_open_locked+0x8e/0x1e0 [mlx5_core])
[   22.318837] ([<03ff801699c6>] mlx5e_open+0x46/0x68 [mlx5_core])
[   22.318844] ([<00748338>] __dev_open+0xa8/0x118)
[   22.318848] ([<0074867a>] __dev_change_flags+0xc2/0x190)
[   22.318853] ([<0074877e>] dev_change_flags+0x36/0x78)
[   22.318858] ([<0075bc8a>] do_setlink+0x332/0xb30)
[   22.318862] ([<0075de3a>] rtnl_newlink+0x3e2/0x820)
[   22.318867] ([<0075e46e>] rtnetlink_rcv_msg+0x1f6/0x248)
[   22.318873] ([<00782202>] netlink_rcv_skb+0x92/0x108)
[   22.318878] ([<0075c668>] rtnetlink_rcv+0x48/0x58)
[   22.318882] ([<00781ace>] netlink_unicast+0x14e/0x1f0)
[   22.318887] ([<00781f82>] netlink_sendmsg+0x32a/0x3b0)
[   22.318892] ([<0071d502>] sock_sendmsg+0x5a/0x80)
[   22.318897] ([<0071ed38>] ___sys_sendmsg+0x270/0x2a8)
[   22.318901] ([<0071fe80>] __sys_sendmsg+0x60/0x90)
[   22.318905] ([<007207c6>] SyS_socketcall+0x2be/0x388)
[   22.318912] ([<0086fcae>] system_call+0xd6/0x270)
[   22.318916] 3 locks held by NetworkManager/399:
[   22.318920]  #0:  (rtnl_mutex){+.+.+.}, at: [<0075c658>] 
rtnetlink_rcv+0x38/0x58
[   22.318935]  #1:  (&priv->state_lock){+.+.+.}, at: [<03ff801699bc>] 
mlx5e_open+0x3c/0x68 [mlx5_core]
[   22.318962]  #2:  (&priv->alloc_mutex){+.+.+.}, at: [<03ff801546e0>] 
mlx5_dma_zalloc_coherent_node+0x48/0xf8 [mlx5_core]
[   22.318987] Last Breaking-Event-Address:
[   22.318992]  [<002a427a>] __alloc_pages_nodemask+0x2ea/0x1298
[   22.318996] ---[ end trace d2b54f5a0cd00b89 ]---
[   22.319001] mlx5_core 0001:00:00.0: 0001:00:00.0:mlx5_cqwq_create:121:(pid 
399): mlx5_buf_alloc_node() failed, -12
[   22.320548] mlx5_core 0001:00:00.0 enP1s171: mlx5e_open_locked: 
mlx5e_open_channels failed, -12



This fails because the largest possible allocation on s390 is currently 1MB
(order 8). Would it be possible to add the __GFP_NOWARN flag and try a
smaller allocation if the big one failed? (The latter change would also make
the device usable when it is added via hotplug and free memory is scattered.)

Regards,
Sebastian



Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Roopa Prabhu
On 10/31/16, 5:37 PM, Thomas Graf wrote:
> Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
> BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
> LWT redirection of type LWTUNNEL_ENCAP_BPF.
>
> The separate program types are required because manipulation of
> packet data is only allowed on the output and transmit path as
> the subsequent dst_input() call path assumes an IP header
> validated by ip_rcv(). The BPF programs will be handed an skb
> with the L3 header attached and may return one of the following
> return codes:
>
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> (Only valid on lwtunnel_xmit() hook)
>
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
>
> A new helper bpf_skb_push() is added which allows to prepend an
> L2 header in front of the skb, extend the existing L3 header, or
> both. This allows to address a wide range of issues:
>  - Optimize L2 header construction when L2 information is always
>static to avoid ARP/NDisc lookup.
>  - Extend IP header to add additional IP options.
>  - Perform simple encapsulation where offload is of no concern.
>(The existing functionality to attach a tunnel key to the skb
> and redirect to a tunnel net_device to allow for offload
> continues to work obviously).
>
> Signed-off-by: Thomas Graf 
> ---
>  
[snip]
> diff --git a/net/Kconfig b/net/Kconfig
> index 7b6cd34..7554f12 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -396,6 +396,7 @@ source "net/nfc/Kconfig"
>  
>  config LWTUNNEL
>   bool "Network light weight tunnels"
> + depends on IPV6 || IPV6=n
>   ---help---
> This feature provides an infrastructure to support light weight
> tunnels like mpls. There is no netdevice associated with a light
> diff --git a/net/core/Makefile b/net/core/Makefile
> index d6508c2..a675fd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
> -obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
> +obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o

Any reason you want to keep lwt bpf under the main CONFIG_LWTUNNEL infra
config? Since it is defined as yet another pluggable encap function, it
seems like it would be better under a separate CONFIG_LWTUNNEL_BPF or
CONFIG_LWT_BPF that depends on CONFIG_LWTUNNEL.




Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Ido Schimmel
On Wed, Nov 02, 2016 at 06:29:40AM -0700, Roopa Prabhu wrote:
> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
> > Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
> >>
> [snip]
> 
> >>I understand..but, if you are adding some core infrastructure for switchdev 
> >>..it cannot be
> >>based on the number of simple use-cases or data you have today.
> >>
> >>I won't be surprised if tomorrow other switch drivers have a case where 
> >>they need to
> >>reset the hw routing table state and reprogram all routes again. 
> >>Re-registering the notifier to just
> >>get the routing state of the kernel will not scale. For the long term, 
> >>since the driver does not maintain a cache,
> >
> > Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
> > otherwise.
> >
> >
> >>a pull api with efficient use of rtnl will be useful for other such cases 
> >>as well.
> >
> > How do you imagine this "pull API" should look like?
> 
> 
> Just like you already have added fib notifiers to parallel fib netlink
> notifications, the pull API is  a parallel to 'netlink dump'.
> Is my imagination too wild  ? :)

The question is more about the mechanics of this pull API, because it's
not very clear to me what it should look like. You want consumers to
dump the tables in batches, so that rtnl is held only during a batch
but not in between them? How are the routes passed down? Does the fib
code fill in some struct, or does it use the fib chain?

> >>If you don't want to get to the complexity of a new api right away because 
> >>of the
> >>simple case of management interface routes you have, Can your driver 
> >>register the notifier early  ?
> >>(I am sure you have probably already thought about this)
> >
> > Register early? What it would resolve? I must be missing something. We
> > register as early as possible. But the thing is, we cannot register
> > in a past. And that is what this patch resolves.
> 
> sure, you must be having a valid problem then. I was just curious why
> your driver is not up and initialized before any of the addresses or
> routes get configured in the system (even on a management port). Ours
> does. But i agree there can be races and you cannot always guarantee
> (I was just responding to ido's comment about adding complexity for a
> small problem he has to solve for management routes). Our driver does
> a pull before it starts. This helps when we want to reset the hardware
> routing table state too.

One can modprobe the module after routes are already present on other
netdevs. That's actually how I tested the patch.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
>> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>>
>[snip]
>
>>>I understand..but, if you are adding some core infrastructure for switchdev 
>>>..it cannot be
>>>based on the number of simple use-cases or data you have today.
>>>
>>>I won't be surprised if tomorrow other switch drivers have a case where they 
>>>need to
>>>reset the hw routing table state and reprogram all routes again. 
>>>Re-registering the notifier to just
>>>get the routing state of the kernel will not scale. For the long term, since 
>>>the driver does not maintain a cache,
>>
>> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
>> otherwise.
>>
>>
>>>a pull api with efficient use of rtnl will be useful for other such cases as 
>>>well.
>>
>> How do you imagine this "pull API" should look like?
>
>
>Just like you already have added fib notifiers to parallel fib netlink
>notifications, the pull API is  a parallel to 'netlink dump'.
>Is my imagination too wild  ? :)

Perhaps I'm slow, but I don't understand what you mean.


>
>
>>
>>
>>>
>>>
>>>If you don't want to get to the complexity of a new api right away because 
>>>of the
>>>simple case of management interface routes you have, Can your driver 
>>>register the notifier early  ?
>>>(I am sure you have probably already thought about this)
>>
>> Register early? What it would resolve? I must be missing something. We
>> register as early as possible. But the thing is, we cannot register
>> in a past. And that is what this patch resolves.
>
>sure, you must be having a valid problem then. I was just curious why
>your driver is not up and initialized before any of the addresses or
>routes get configured in the system (even on a management port). Ours

If you unload the module and load it again, for example. This is a valid
use case.


>does. But i agree there can be races and you cannot always guarantee
>(I was just responding to ido's comment about adding complexity for a
>small problem he has to solve for management routes). Our driver does
>a pull before it starts. This helps when we want to reset the hardware
>routing table state too.

Can you point me to you driver in the tree? I would like to see how you
do "the pull".


>
>
>But, my point was, when you are defining an API, you cannot quantify
>the 'past' to be just the very 'close past' or 'the past is just the
>management routes that were added' . Tomorrow the 'past' can be the
>full routing table if you need to reset the hardware state.

Sure.


Re: [PATCH v2 1/2] net: stmmac: Add OXNAS Glue Driver

2016-11-02 Thread Neil Armstrong
On 10/31/2016 12:12 PM, Joachim Eastwood wrote:
> Hi Neil,
> 
> On 31 October 2016 at 11:54, Neil Armstrong  wrote:
>> Add Synopsys Designware MAC Glue layer for the Oxford Semiconductor OX820.
>>
>> Acked-by: Joachim Eastwood 
>> Signed-off-by: Neil Armstrong 
>> ---
>> +static int oxnas_dwmac_init(struct oxnas_dwmac *dwmac)
>> +{
>> +   unsigned int value;
>> +   int ret;
>> +
>> +   /* Reset HW here before changing the glue configuration */
>> +   ret = device_reset(dwmac->dev);
>> +   if (ret)
>> +   return ret;
>> +
>> +   ret = clk_prepare_enable(dwmac->clk);
>> +   if (ret)
>> +   return ret;
>> +
>> +   ret = regmap_read(dwmac->regmap, OXNAS_DWMAC_CTRL_REGOFFSET, &value);
>> +   if (ret < 0)
>> +   return ret;
> 
> If regmap reading fails here, the clock will be left on as probe fails.
> 

Indeed, thanks.

Neil

[...]
> 
> 
> regards,
> Joachim Eastwood
> 



[PATCH v3 0/2] net: stmmac: Add OXNAS DWMAC Glue

2016-11-02 Thread Neil Armstrong
This patchset adds support for the Synopsys DWMAC Gigabit Ethernet
controller glue layer of the Oxford Semiconductor OX820 SoC.

Changes since v2 at 
http://lkml.kernel.org/r/20161031105345.16711-1-narmstr...@baylibre.com :
 - Disable/Unprepare clock if regmap read fails in oxnas_dwmac_init

Changes since v1 at https://patchwork.kernel.org/patch/9388231/ :
 - Split dt-bindings in a separate patch
 - Add IP version in the dt-bindings compatible
 - Check return of clk_prepare_enable()
 - use get_stmmac_bsp_priv() helper
 - hardwire setup values in oxnas_dwmac_init()

Changes since RFC at https://patchwork.kernel.org/patch/9387257 :
 - Drop init/exit callbacks
 - Implement proper remove and PM callback
 - Call init from probe
 - Disable/Unprepare clock if stmmac probe fails

Neil Armstrong (2):
  net: stmmac: Add OXNAS Glue Driver
  dt-bindings: net: Add OXNAS DWMAC Bindings

 .../devicetree/bindings/net/oxnas-dwmac.txt|  39 
 drivers/net/ethernet/stmicro/stmmac/Kconfig|  11 ++
 drivers/net/ethernet/stmicro/stmmac/Makefile   |   1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c  | 217 +
 4 files changed, 268 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/oxnas-dwmac.txt
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c

-- 
2.7.0



[PATCH v3 1/2] net: stmmac: Add OXNAS Glue Driver

2016-11-02 Thread Neil Armstrong
Add Synopsys Designware MAC Glue layer for the Oxford Semiconductor OX820.

Acked-by: Joachim Eastwood 
Signed-off-by: Neil Armstrong 
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig   |  11 ++
 drivers/net/ethernet/stmicro/stmmac/Makefile  |   1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c | 217 ++
 3 files changed, 229 insertions(+)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig 
b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index 3818c5e..6e9fcc3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -69,6 +69,17 @@ config DWMAC_MESON
  the stmmac device driver. This driver is used for Meson6,
  Meson8, Meson8b and GXBB SoCs.
 
+config DWMAC_OXNAS
+   tristate "Oxford Semiconductor OXNAS dwmac support"
+   default ARCH_OXNAS
+   depends on OF && COMMON_CLK && (ARCH_OXNAS || COMPILE_TEST)
+   select MFD_SYSCON
+   help
+ Support for Ethernet controller on Oxford Semiconductor OXNAS SoCs.
+
+ This selects the Oxford Semiconductor OXNAS SoC glue layer support for
+ the stmmac device driver. This driver is used for OX820.
+
 config DWMAC_ROCKCHIP
tristate "Rockchip dwmac support"
default ARCH_ROCKCHIP
diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile 
b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 5d6ece5..8f83a86 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o
 obj-$(CONFIG_DWMAC_IPQ806X)+= dwmac-ipq806x.o
 obj-$(CONFIG_DWMAC_LPC18XX)+= dwmac-lpc18xx.o
 obj-$(CONFIG_DWMAC_MESON)  += dwmac-meson.o dwmac-meson8b.o
+obj-$(CONFIG_DWMAC_OXNAS)  += dwmac-oxnas.o
 obj-$(CONFIG_DWMAC_ROCKCHIP)   += dwmac-rk.o
 obj-$(CONFIG_DWMAC_SOCFPGA)+= dwmac-altr-socfpga.o
 obj-$(CONFIG_DWMAC_STI)+= dwmac-sti.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c
new file mode 100644
index 000..c355975
--- /dev/null
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-oxnas.c
@@ -0,0 +1,217 @@
+/*
+ * Oxford Semiconductor OXNAS DWMAC glue layer
+ *
+ * Copyright (C) 2016 Neil Armstrong 
+ * Copyright (C) 2014 Daniel Golle 
+ * Copyright (C) 2013 Ma Haijun 
+ * Copyright (C) 2012 John Crispin 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "stmmac_platform.h"
+
+/* System Control regmap offsets */
+#define OXNAS_DWMAC_CTRL_REGOFFSET 0x78
+#define OXNAS_DWMAC_DELAY_REGOFFSET0x100
+
+/* Control Register */
+#define DWMAC_CKEN_RX_IN14
+#define DWMAC_CKEN_RXN_OUT  13
+#define DWMAC_CKEN_RX_OUT   12
+#define DWMAC_CKEN_TX_IN10
+#define DWMAC_CKEN_TXN_OUT  9
+#define DWMAC_CKEN_TX_OUT   8
+#define DWMAC_RX_SOURCE 7
+#define DWMAC_TX_SOURCE 6
+#define DWMAC_LOW_TX_SOURCE 4
+#define DWMAC_AUTO_TX_SOURCE3
+#define DWMAC_RGMII 2
+#define DWMAC_SIMPLE_MUX1
+#define DWMAC_CKEN_GTX  0
+
+/* Delay register */
+#define DWMAC_TX_VARDELAY_SHIFT0
+#define DWMAC_TXN_VARDELAY_SHIFT   8
+#define DWMAC_RX_VARDELAY_SHIFT16
+#define DWMAC_RXN_VARDELAY_SHIFT   24
+#define DWMAC_TX_VARDELAY(d)   ((d) << DWMAC_TX_VARDELAY_SHIFT)
+#define DWMAC_TXN_VARDELAY(d)  ((d) << DWMAC_TXN_VARDELAY_SHIFT)
+#define DWMAC_RX_VARDELAY(d)   ((d) << DWMAC_RX_VARDELAY_SHIFT)
+#define DWMAC_RXN_VARDELAY(d)  ((d) << DWMAC_RXN_VARDELAY_SHIFT)
+
+struct oxnas_dwmac {
+   struct device   *dev;
+   struct clk  *clk;
+   struct regmap   *regmap;
+};
+
+static int oxnas_dwmac_init(struct oxnas_dwmac *dwmac)
+{
+   unsigned int value;
+   int ret;
+
+   /* Reset HW here before changing the glue configuration */
+   ret = device_reset(dwmac->dev);
+   if (ret)
+   return ret;
+
+   ret = clk_prepare_enable(dwmac->clk);
+   if (ret)
+   return ret;
+
+   ret = regmap_read(dwmac->regmap, OXNAS_DWMAC_CTRL_REGOFFSET, &value);
+   if (ret < 0) {
+   clk_disable_unprepare(dwmac->clk);
+   return ret;
+   }
+
+   /* Enable GMII_GTXCLK to follow GMII_REFCLK, required for gigabit PHY */
+   value |= BIT(DWMAC_CKEN_GTX)|
+/* Use simple mux for 25/125 Mhz clock switching */
+BIT(DWMAC_SIMPLE_MUX) 

[PATCH v3 2/2] dt-bindings: net: Add OXNAS DWMAC Bindings

2016-11-02 Thread Neil Armstrong
Signed-off-by: Neil Armstrong 
---
 .../devicetree/bindings/net/oxnas-dwmac.txt| 39 ++
 1 file changed, 39 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/oxnas-dwmac.txt

diff --git a/Documentation/devicetree/bindings/net/oxnas-dwmac.txt 
b/Documentation/devicetree/bindings/net/oxnas-dwmac.txt
new file mode 100644
index 000..df0534e
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/oxnas-dwmac.txt
@@ -0,0 +1,39 @@
+* Oxford Semiconductor OXNAS DWMAC Ethernet controller
+
+The device inherits all the properties of the dwmac/stmmac devices
+described in the file stmmac.txt in the current directory with the
+following changes.
+
+Required properties on all platforms:
+
+- compatible:  For the OX820 SoC, it should be:
+   - "oxsemi,ox820-dwmac" to select glue
+   - "snps,dwmac-3.512" to select IP version.
+
+- clocks: Should contain phandles to the following clocks
+- clock-names: Should contain the following:
+   - "stmmaceth" for the host clock - see stmmac.txt
+   - "gmac" for the peripheral gate clock
+
+- oxsemi,sys-ctrl: a phandle to the system controller syscon node
+
+Example :
+
+etha: ethernet@4040 {
+   compatible = "oxsemi,ox820-dwmac", "snps,dwmac-3.512";
+   reg = <0x4040 0x2000>;
+   interrupts = ,
+;
+   interrupt-names = "macirq", "eth_wake_irq";
+   mac-address = []; /* Filled in by U-Boot */
+   phy-mode = "rgmii";
+
+   clocks = <&stdclk CLK_820_ETHA>, <&gmacclk>;
+   clock-names = "gmac", "stmmaceth";
+   resets = <&reset RESET_MAC>;
+
+   /* Regmap for sys registers */
+   oxsemi,sys-ctrl = <&sys>;
+
+   status = "disabled";
+};
-- 
2.7.0



Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-11-02 Thread Jesper Dangaard Brouer
On Fri, 28 Oct 2016 13:11:01 -0400 (EDT)
David Miller  wrote:

> From: John Fastabend 
> Date: Fri, 28 Oct 2016 08:56:35 -0700
> 
> > On 16-10-27 07:10 PM, David Miller wrote:  
> >> From: Alexander Duyck 
> >> Date: Thu, 27 Oct 2016 18:43:59 -0700
> >>   
> >>> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  
> >>> wrote:  
>  From: "Michael S. Tsirkin" 
>  Date: Fri, 28 Oct 2016 01:25:48 +0300
>   
> > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:  
> >> From: "Michael S. Tsirkin" 
> >> Date: Fri, 28 Oct 2016 00:30:35 +0300
> >>  
> >>> Something I'd like to understand is how does XDP address the
> >>> problem that 100Byte packets are consuming 4K of memory now.  
> >>
> >> Via page pools.  We're going to make a generic one, but right now
> >> each and every driver implements a quick list of pages to allocate
> >> from (and thus avoid the DMA man/unmap overhead, etc.)  
> >
> > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap
> > so there should be no issue with that even when using sub/page
> > regions, assuming DMA APIs support sub-page map/unmap correctly.  
> 
>  That's not what I said.
> 
>  The page pools are meant to address the performance degradation from
>  going to having one packet per page for the sake of XDP's
>  requirements.
> 
>  You still need to have one packet per page for correct XDP operation
>  whether you do page pools or not, and whether you have DMA mapping
>  (or it's equivalent virutalization operation) or not.  
> >>>
> >>> Maybe I am missing something here, but why do you need to limit things
> >>> to one packet per page for correct XDP operation?  Most of the drivers
> >>> out there now are usually storing something closer to at least 2
> >>> packets per page, and with the DMA API fixes I am working on there
> >>> should be no issue with changing the contents inside those pages since
> >>> we won't invalidate or overwrite the data after the DMA buffer has
> >>> been synchronized for use by the CPU.  
> >> 
> >> Because with SKB's you can share the page with other packets.
> >> 
> >> With XDP you simply cannot.
> >> 
> >> It's software semantics that are the issue.  SKB frag list pages
> >> are read only, XDP packets are writable.
> >> 
> >> This has nothing to do with "writability" of the pages wrt. DMA
> >> mapping or cpu mappings.
> >>   
> > 
> > Sorry I'm not seeing it either. The current xdp_buff is defined
> > by,
> > 
> >   struct xdp_buff {
> > void *data;
> > void *data_end;
> >   };
> > 
> > The verifier has an xdp_is_valid_access() check to ensure we don't go
> > past data_end. The page for now at least never leaves the driver. For
> > the work to get xmit to other devices working I'm still not sure I see
> > any issue.  
> 
> I guess I can say that the packets must be "writable" until I'm blue
> in the face but I'll say it again, semantically writable pages are a
> requirement.  And if multiple packets share a page this requirement
> is not satisfied.
> 
> Also, we want to do several things in the future:
> 
> 1) Allow push/pop of headers via eBPF code, which means we need
>headroom.
> 
> 2) Transparently zero-copy pass packets into userspace, basically
>the user will have a semi-permanently mapped ring of all the
>packet pages sitting in the RX queue of the device and the
>page pool associated with it.  This way we avoid all of the
>TLB flush/map overhead for the user's mapping of the packets
>just as we avoid the DMA map/unmap overhead.
> 
> And that's just the beginninng.
> 
> I'm sure others can come up with more reasons why we have this
> requirement.

I've tried to update the XDP documentation about the "Page per packet"
requirement[1]; feel free to correct the text below:

Page per packet
===

On RX, many NIC drivers split up a memory page and share it between
multiple packets, in order to conserve memory.  Doing so complicates
the handling and accounting of these memory pages, which affects
performance.  In particular, the extra atomic refcnt handling needed
for the page can hurt performance.

XDP defines upfront a memory model where there is only one packet per
page.  This simplifies page handling and opens up for future
extensions.

This requirement also (upfront) results in choosing not to support
things like jumbo frames, LRO and, generally, packets split over
multiple pages.

In the future, this strict memory model might be relaxed, but for now
it is a strict requirement.  With a more flexible
:ref:`ref_prog_negotiation` it might be possible to negotiate another
memory model, given that some specific XDP use-cases might not require
this strict one.



Online here:
 [1] 
http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#page-per-packet

Commit:
 
https://github.com/netoptimizer/prototype-kernel/commit/27ece059011e6d5c8a1cb4bdb2ab361cd7faa6dd

-- 
Best regards,
  Jesper Dangaard Brouer

Re: [PATCH net-next 3/3] tools lib bpf: Sync bpf_map_def with tc

2016-11-02 Thread Daniel Borkmann

On 11/02/2016 05:09 AM, Joe Stringer wrote:

On 1 November 2016 at 20:09, Daniel Borkmann  wrote:

On 10/31/2016 07:39 PM, Joe Stringer wrote:


TC uses a slightly different map layout in its ELFs. Update libbpf to
use the same definition so that ELFs may be built using libbpf and
loaded using tc.

Signed-off-by: Joe Stringer 
---
   tools/lib/bpf/libbpf.h | 11 +++
   1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index dd7a513efb10..ea70c2744f8c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -181,10 +181,13 @@ bool bpf_program__is_kprobe(struct bpf_program
*prog);
* and will be treated as an error due to -Werror.
*/
   struct bpf_map_def {
-   unsigned int type;
-   unsigned int key_size;
-   unsigned int value_size;
-   unsigned int max_entries;
+   uint32_t type;
+   uint32_t key_size;
+   uint32_t value_size;
+   uint32_t max_entries;
+   uint32_t flags;
+   uint32_t id;
+   uint32_t pinning;
   };

   /*


I think the problem is that this would break existing obj files that have
been compiled with the current struct bpf_map_def (besides libbpf not having
a use for the last two members right now).


Right, this is a problem. I wasn't sure whether libbpf was yet at a
stage where it tries to retain compatibility with binaries compiled
against older kernels.


For tc, we have a refactoring of the tc_bpf.c bits that generalizes them so
we can move these bits into iproute2 lib part and add new BPF types really
easily. What I did along with that is to implement a map compat mode, where
it detects the size of struct bpf_elf_map (or however you want to name it)
from the obj file and fixes up the missing members with some reasonable
default,
so these programs can still be loaded. Thus, the sample code using the
current
struct bpf_map_def will then work with tc as well. (I'll post the iproute2
patch early next week.)


Are you encoding the number of maps into the start of the maps section
in the ELF then using that to divide out and determine the size?


No. It walks the symbol table and simply counts the number of symbols
present in the maps section with correct st_info attribution. It
works because that section name is fixed ABI for all cases and, by
definition, only map structs are present there. The minimum attributes
required for loading are type, key_size, value_size and max_entries.


I look forward to your patches. Maybe if TC is more tolerant of other
map definition sizes then this patch is less relevant.


[PATCH] igb: Workaround for igb i210 firmware issue.

2016-11-02 Thread Chris J Arges
Sometimes firmware may not properly initialize I347AT4_PAGE_SELECT, causing
the probe of an igb i210 NIC to fail. This patch adds an additional zeroing of
this register during igb_get_phy_id to work around this issue.

Thanks to Jochen Henneberg for the idea and original patch.

Signed-off-by: Chris J Arges 
---
 drivers/net/ethernet/intel/igb/e1000_phy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_phy.c 
b/drivers/net/ethernet/intel/igb/e1000_phy.c
index 5b54254..93ec2d0 100644
--- a/drivers/net/ethernet/intel/igb/e1000_phy.c
+++ b/drivers/net/ethernet/intel/igb/e1000_phy.c
@@ -77,6 +77,10 @@ s32 igb_get_phy_id(struct e1000_hw *hw)
s32 ret_val = 0;
u16 phy_id;
 
+   /* ensure phy page selection to fix misconfigured i210 */
+   if (hw->mac.type == e1000_i210)
+   phy->ops.write_reg(hw, I347AT4_PAGE_SELECT, 0);
+
ret_val = phy->ops.read_reg(hw, PHY_ID1, &phy_id);
if (ret_val)
goto out;
-- 
2.7.4



Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-11-02 Thread Jesper Dangaard Brouer
On Sat, 29 Oct 2016 13:25:14 +0200
Thomas Graf  wrote:

> On 10/28/16 at 08:51pm, Shrijeet Mukherjee wrote:
> > Generally agree, but SRIOV nics with multiple queues can end up in a bad
> > spot if each buffer was 4K right ? I see a specific page pool to be used
> > by queues which are enabled for XDP as the easiest to swing solution that
> > way the memory overhead can be restricted to enabled queues and shared
> > access issues can be restricted to skb's using that pool no ?

Yes, that is why I've been arguing so strongly for having the
flexibility to attach an XDP program per RX queue, as this only changes
the memory model for that one queue.

 
> Isn't this clearly a must anyway? I may be missing something
> fundamental here so please enlighten me :-)
> 
> If we dedicate a page per packet, that could translate to 14M*4K worth
> of memory being mapped per second for just a 10G NIC under DoS attack.
> How can one protect such as system? Is the assumption that we can always
> drop such packets quickly enough before we start dropping randomly due
> to memory pressure? If a handshake is required to determine validity
> of a packet then that is going to be difficult.

Under DoS attacks you don't run out of memory, because a diverse set of
socket memory limits/accounting avoids that situation.  What does
happen is that the maximum achievable PPS rate is directly dependent on
the time you spend on each packet.  This use of CPU resources (and
hitting mem-limit safeguards) pushes back on the driver's speed to
process the RX ring.  In effect, packets are dropped in the NIC HW as
the RX-ring queue is not emptied fast enough.

Given you don't control what HW drops, the attacker will "successfully"
cause your good traffic to be among the dropped packets.

This is where XDP changes the picture. If you can express (by eBPF) a
filter that can separate "bad" vs "good" traffic, then you can take
back control.  It is almost like controlling what traffic the HW should
drop.  As long as the cost of the XDP eBPF filter plus serving regular
traffic does not use all of your CPU resources, you have overcome the
attack.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Roopa Prabhu
On 11/2/16, 6:48 AM, Jiri Pirko wrote:
> Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
>>> Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>> [snip]
>>
 I understand..but, if you are adding some core infrastructure for 
 switchdev ..it cannot be
 based on the number of simple use-cases or data you have today.

 I won't be surprised if tomorrow other switch drivers have a case where 
 they need to
 reset the hw routing table state and reprogram all routes again. 
 Re-registering the notifier to just
 get the routing state of the kernel will not scale. For the long term, 
 since the driver does not maintain a cache,
>>> Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
>>> otherwise.
>>>
>>>
 a pull api with efficient use of rtnl will be useful for other such cases 
 as well.
>>> How do you imagine this "pull API" should look like?
>>
>> Just like you already have added fib notifiers to parallel fib netlink
>> notifications, the pull API is  a parallel to 'netlink dump'.
>> Is my imagination too wild  ? :)
> Perhaps I'm slow, but I don't understand what you mean.
 

 If you don't want to get to the complexity of a new api right away because 
 of the
 simple case of management interface routes you have, Can your driver 
 register the notifier early  ?
 (I am sure you have probably already thought about this)
>>> Register early? What it would resolve? I must be missing something. We
>>> register as early as possible. But the thing is, we cannot register
>>> in a past. And that is what this patch resolves.
>> sure, you must be having a valid problem then. I was just curious why
>> your driver is not up and initialized before any of the addresses or
>> routes get configured in the system (even on a management port). Ours
> If you unload the module and load it again for example. This is a valid
> usecase.

I see, so you are optimizing for this use case. sure it is a valid use-case but 
a narrow one
compared to the rtnl overhead the api may bring
 (note that i am not saying you should not solve it).

>
>
>> does. But i agree there can be races and you cannot always guarantee
>> (I was just responding to ido's comment about adding complexity for a
>> small problem he has to solve for management routes). Our driver does
>> a pull before it starts. This helps when we want to reset the hardware
>> routing table state too.
> Can you point me to you driver in the tree? I would like to see how you
> do "the pull".
:), you know all this... but, if i must explicitly say it, yes,  we don't have 
a driver in the tree and
we don't own the hardware. My analogy here is of a netlink dump that we use 
heavily for the
same scale that you will probably deploy.
i do give you full credit for the hardware and the driver and switchdev support 
and all that!.

>
>>
>> But, my point was, when you are defining an API, you cannot quantify
>> the 'past' to be just the very 'close past' or 'the past is just the
>> management routes that were added' . Tomorrow the 'past' can be the
>> full routing table if you need to reset the hardware state.
> Sure.

This pull api was a suggestion for an efficient use of rtnl ...similar to how 
the netlink
routing dump handles it. If you cannot imagine an api like that..., sure, your 
call.



[PATCH net] qede: Correctly map aggregation replacement pages

2016-11-02 Thread Yuval Mintz
The driver allocates replacement buffers beforehand to make
sure that whenever an aggregation begins there is a replacement
for the Rx buffers, as we can't release a buffer until the
aggregation is terminated and driver logic assumes the Rx rings
are always full.

Every other Rx page being allocated [i.e., regular] is
completely mapped, while for the replacement buffers only the
first portion of the page is mapped.
This means that:
  a. Once a replacement buffer replenishes the regular Rx ring,
assuming there's more than a single packet on the page, we'd post unmapped
memory toward HW [assuming mapping is actually done in granularity
smaller than page].
  b. Unmaps are being done for the entire page, which is incorrect.

Fixes: 55482edc25f06 ("qede: Add slowpath/fastpath support and enable hardware 
GRO")
Signed-off-by: Yuval Mintz 
---
Dave,

Please consider applying this to `net'.

Thanks,
Yuval
---
 drivers/net/ethernet/qlogic/qede/qede_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 1391776..73f2a67 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2918,7 +2918,7 @@ static int qede_alloc_sge_mem(struct qede_dev *edev, 
struct qede_rx_queue *rxq)
}
 
mapping = dma_map_page(&edev->pdev->dev, replace_buf->data, 0,
-  rxq->rx_buf_size, DMA_FROM_DEVICE);
+  PAGE_SIZE, DMA_FROM_DEVICE);
if (unlikely(dma_mapping_error(&edev->pdev->dev, mapping))) {
DP_NOTICE(edev,
  "Failed to map TPA replacement buffer\n");
-- 
1.8.3.1



Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 03:35:03PM CET, ro...@cumulusnetworks.com wrote:
>On 11/2/16, 6:48 AM, Jiri Pirko wrote:
>> Wed, Nov 02, 2016 at 02:29:40PM CET, ro...@cumulusnetworks.com wrote:
>>> On Wed, Nov 2, 2016 at 12:20 AM, Jiri Pirko  wrote:
 Wed, Nov 02, 2016 at 03:13:42AM CET, ro...@cumulusnetworks.com wrote:
>>> [snip]
>>>
> I understand..but, if you are adding some core infrastructure for 
> switchdev ..it cannot be
> based on the number of simple use-cases or data you have today.
>
> I won't be surprised if tomorrow other switch drivers have a case where 
> they need to
> reset the hw routing table state and reprogram all routes again. 
> Re-registering the notifier to just
> get the routing state of the kernel will not scale. For the long term, 
> since the driver does not maintain a cache,
 Driver (mlxsw, rocker) maintain a cache. So I'm not sure why you say
 otherwise.


> a pull api with efficient use of rtnl will be useful for other such cases 
> as well.
 How do you imagine this "pull API" should look like?
>>>
>>> Just like you already have added fib notifiers to parallel fib netlink
>>> notifications, the pull API is  a parallel to 'netlink dump'.
>>> Is my imagination too wild  ? :)
>> Perhaps I'm slow, but I don't understand what you mean.
> 
>
> If you don't want to get to the complexity of a new api right away 
> because of the
> simple case of management interface routes you have, Can your driver 
> register the notifier early  ?
> (I am sure you have probably already thought about this)
 Register early? What it would resolve? I must be missing something. We
 register as early as possible. But the thing is, we cannot register
 in a past. And that is what this patch resolves.
>>> sure, you must be having a valid problem then. I was just curious why
>>> your driver is not up and initialized before any of the addresses or
>>> routes get configured in the system (even on a management port). Ours
>> If you unload the module and load it again for example. This is a valid
>> usecase.
>
>I see, so you are optimizing for this use case. sure it is a valid use-case 
>but a narrow one

It is not an optimization, it's a bug fix.


>compared to the rtnl overhead the api may bring
> (note that i am not saying you should not solve it).
>
>>
>>
>>> does. But i agree there can be races and you cannot always guarantee
>>> (I was just responding to ido's comment about adding complexity for a
>>> small problem he has to solve for management routes). Our driver does
>>> a pull before it starts. This helps when we want to reset the hardware
>>> routing table state too.
>> Can you point me to you driver in the tree? I would like to see how you
>> do "the pull".
>:), you know all this... but, if i must explicitly say it, yes,  we don't have 
>a driver in the tree and
>we don't own the hardware. My analogy here is of a netlink dump that we use 
>heavily for the
>same scale that you will probably deploy.

You are comparing a netlink kernel-user API with an in-kernel API. I
don't think the two are comparable, at all. Therefore I asked how you
imagine "the pull" should look in the kernel. Stating it should look
like some user API does not help me much :(



>i do give you full credit for the hardware and the driver and switchdev 
>support and all that!.
>
>>
>>>
>>> But, my point was, when you are defining an API, you cannot quantify
>>> the 'past' to be just the very 'close past' or 'the past is just the
>>> management routes that were added' . Tomorrow the 'past' can be the
>>> full routing table if you need to reset the hardware state.
>> Sure.
>
>This pull api was a suggestion for an efficient use of rtnl ...similar to how 
>the netlink
>routing dump handles it. If you cannot imagine an api like that..., sure, your 
>call.

No, that's why I'm asking, because I was under the impression you
could imagine that :)


Re: SNMP read-write MIBs

2016-11-02 Thread Murali Karicheri
+ David, Eric,

On 11/01/2016 02:06 PM, Murali Karicheri wrote:
> Hello netdev experts,
> 
> I am investigating the requirements to support hsr/prp SNMP functions in 
> the kernel.
> Based on my investigation so far, the kernel include file include/net/snmp.h
> defines all of the SNMP-MIB-related defines and structures. But those MIBs
> are read-only. Is there any implementation of read-write MIBs in the kernel?
> 
> One of the specs for the MIBs that we are investigating has read-write MIBs,
> and we are wondering if there is any precedent for such MIBs implemented in
> kernel space. If not, what is the suggested way to implement these MIBs in
> kernel space?
> 
> I assume that to implement read-only MIBs for the hsr driver, I need to add
> them to snmp.h and use the standard macros in snmp.h to update them from the
> driver.
> 
> Thanks
>  
> Murali Karicheri
> Linux Kernel, Keystone
> 
I did some more research on this and found that some of the read-write
MIBs (not sure if there are any in the kernel) are implemented in
net-snmp, where it communicates using a raw socket and ioctl calls. Is
this the way to go to implement the read-write MIBs?

Thanks
-- 
Murali Karicheri
Linux Kernel, Keystone


[PATCH net] tcp: fix potential memory corruption

2016-11-02 Thread Eric Dumazet
From: Eric Dumazet 

Imagine the initial value of max_skb_frags is 17, and the last
skb in the write queue has 15 frags.

Then max_skb_frags is lowered to 14 or a smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.

Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet 
Cc: Hans Westgaard Ry 
Cc: Håkon Bugge 
---
 net/ipv4/tcp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3251fe71f39f..18238ef8135a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1241,7 +1241,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
 
if (!skb_can_coalesce(skb, i, pfrag->page,
  pfrag->offset)) {
-   if (i == sysctl_max_skb_frags || !sg) {
+   if (i >= sysctl_max_skb_frags || !sg) {
tcp_mark_push(tp, skb);
goto new_segment;
}




Re: [patch net-next 2/2] [PATCH] net: ip, raw_diag -- Use jump for exiting from nested loop

2016-11-02 Thread David Ahern
On 11/2/16 6:36 AM, Cyrill Gorcunov wrote:
> I managed to miss that sk_for_each is called under "for"
> cycle so need to use goto here to return matching socket.
> 
> CC: David S. Miller 
> CC: Eric Dumazet 
> CC: David Ahern 
> CC: Andrey Vagin 
> CC: Stephen Hemminger 
> Signed-off-by: Cyrill Gorcunov 
> ---
>  net/ipv4/raw_diag.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Acked-by: David Ahern 


Re: [patch net-next 1/2] [PATCH] net: ip, raw_diag -- Fix socket leaking for destroy request

2016-11-02 Thread David Ahern
On 11/2/16 6:36 AM, Cyrill Gorcunov wrote:
> In raw_diag_destroy the helper raw_sock_get returns
> with sock_hold call, so we have to put it then.
> 
> CC: David S. Miller 
> CC: Eric Dumazet 
> CC: David Ahern 
> CC: Andrey Vagin 
> CC: Stephen Hemminger 
> Signed-off-by: Cyrill Gorcunov 
> ---
>  net/ipv4/raw_diag.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)

Acked-by: David Ahern 


[PATCH net-next 0/3] ip: add RECVFRAGSIZE cmsg

2016-11-02 Thread Willem de Bruijn
From: Willem de Bruijn 

On IP datagrams and raw sockets, when packets arrive fragmented,
expose the largest received fragment size through a new cmsg.

Protocols implemented on top of these sockets may use this, for
instance, to inform peers to lower MSS on platforms that silently
allow send calls to exceed PMTU and cause fragmentation.

Willem de Bruijn (3):
  ipv4: add IP_RECVFRAGSIZE cmsg
  ipv6: add IPV6_RECVFRAGSIZE cmsg
  ipv6: on reassembly, record frag_max_size

 include/linux/ipv6.h |  5 +++--
 include/net/inet_sock.h  |  1 +
 include/uapi/linux/in.h  |  1 +
 include/uapi/linux/in6.h |  1 +
 net/ipv4/ip_sockglue.c   | 26 ++
 net/ipv6/datagram.c  |  5 +
 net/ipv6/ipv6_sockglue.c |  8 
 net/ipv6/reassembly.c|  7 ++-
 8 files changed, 51 insertions(+), 3 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 2/3] ipv6: add IPV6_RECVFRAGSIZE cmsg

2016-11-02 Thread Willem de Bruijn
From: Willem de Bruijn 

When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG

Signed-off-by: Willem de Bruijn 
---
 include/linux/ipv6.h | 5 +++--
 include/uapi/linux/in6.h | 1 +
 net/ipv6/datagram.c  | 5 +
 net/ipv6/ipv6_sockglue.c | 8 
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ca1ad9e..1afb6e8 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -229,8 +229,9 @@ struct ipv6_pinfo {
 rxflow:1,
rxtclass:1,
rxpmtu:1,
-   rxorigdstaddr:1;
-   /* 2 bits hole */
+   rxorigdstaddr:1,
+   recvfragsize:1;
+   /* 1 bits hole */
} bits;
__u16   all;
} rxopt;
diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index b39ea4f..46444f8 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -283,6 +283,7 @@ struct in6_flowlabel_req {
 #define IPV6_RECVORIGDSTADDRIPV6_ORIGDSTADDR
 #define IPV6_TRANSPARENT75
 #define IPV6_UNICAST_IF 76
+#define IPV6_RECVFRAGSIZE  77
 
 /*
  * Multicast Routing:
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 37874e2..620c79a 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -715,6 +715,11 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, 
struct msghdr *msg,
put_cmsg(msg, SOL_IPV6, IPV6_ORIGDSTADDR, sizeof(sin6), 
&sin6);
}
}
+   if (np->rxopt.bits.recvfragsize && opt->frag_max_size) {
+   int val = opt->frag_max_size;
+
+   put_cmsg(msg, SOL_IPV6, IPV6_RECVFRAGSIZE, sizeof(val), &val);
+   }
 }
 
 void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 636ec56..6c12678 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -868,6 +868,10 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, 
int optname,
np->autoflowlabel = valbool;
retv = 0;
break;
+   case IPV6_RECVFRAGSIZE:
+   np->rxopt.bits.recvfragsize = valbool;
+   retv = 0;
+   break;
}
 
release_sock(sk);
@@ -1310,6 +1314,10 @@ static int do_ipv6_getsockopt(struct sock *sk, int 
level, int optname,
val = np->autoflowlabel;
break;
 
+   case IPV6_RECVFRAGSIZE:
+   val = np->rxopt.bits.recvfragsize;
+   break;
+
default:
return -ENOPROTOOPT;
}
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 1/3] ipv4: add IP_RECVFRAGSIZE cmsg

2016-11-02 Thread Willem de Bruijn
From: Willem de Bruijn 

The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that cases matches the veth mtu.

ip link add veth0 type veth peer name veth1

ip netns add from
ip netns add to

ip link set dev veth1 netns to
ip netns exec to ip addr add dev veth1 192.168.10.1/24
ip netns exec to ip link set dev veth1 up

ip link set dev veth0 netns from
ip netns exec from ip addr add dev veth0 192.168.10.2/24
ip netns exec from ip link set dev veth0 up
ip netns exec from ip link set dev veth0 mtu 1300
ip netns exec from ethtool -K veth0 ufo off

dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c

Signed-off-by: Willem de Bruijn 
---
 include/net/inet_sock.h |  1 +
 include/uapi/linux/in.h |  1 +
 net/ipv4/ip_sockglue.c  | 26 ++
 3 files changed, 28 insertions(+)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 236a810..c9cff97 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -228,6 +228,7 @@ struct inet_sock {
 #define IP_CMSG_PASSSECBIT(5)
 #define IP_CMSG_ORIGDSTADDRBIT(6)
 #define IP_CMSG_CHECKSUM   BIT(7)
+#define IP_CMSG_RECVFRAGSIZE   BIT(8)
 
 /**
  * sk_to_full_sk - Access to a full socket
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index eaf9491..4e557f4 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -117,6 +117,7 @@ struct in_addr {
 #define IP_NODEFRAG 22
 #define IP_CHECKSUM23
 #define IP_BIND_ADDRESS_NO_PORT24
+#define IP_RECVFRAGSIZE25
 
 /* IP_MTU_DISCOVER values */
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index b8a2d63..ecbaae2 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -97,6 +97,17 @@ static void ip_cmsg_recv_retopts(struct msghdr *msg, struct 
sk_buff *skb)
put_cmsg(msg, SOL_IP, IP_RETOPTS, opt->optlen, opt->__data);
 }
 
+static void ip_cmsg_recv_fragsize(struct msghdr *msg, struct sk_buff *skb)
+{
+   int val;
+
+   if (IPCB(skb)->frag_max_size == 0)
+   return;
+
+   val = IPCB(skb)->frag_max_size;
+   put_cmsg(msg, SOL_IP, IP_RECVFRAGSIZE, sizeof(val), &val);
+}
+
 static void ip_cmsg_recv_checksum(struct msghdr *msg, struct sk_buff *skb,
  int tlen, int offset)
 {
@@ -218,6 +229,9 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
 
if (flags & IP_CMSG_CHECKSUM)
ip_cmsg_recv_checksum(msg, skb, tlen, offset);
+
+   if (flags & IP_CMSG_RECVFRAGSIZE)
+   ip_cmsg_recv_fragsize(msg, skb);
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
@@ -614,6 +628,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
case IP_MULTICAST_LOOP:
case IP_RECVORIGDSTADDR:
case IP_CHECKSUM:
+   case IP_RECVFRAGSIZE:
if (optlen >= sizeof(int)) {
if (get_user(val, (int __user *) optval))
return -EFAULT;
@@ -726,6 +741,14 @@ static int do_ip_setsockopt(struct sock *sk, int level,
}
}
break;
+   case IP_RECVFRAGSIZE:
+   if (sk->sk_type != SOCK_RAW && sk->sk_type != SOCK_DGRAM)
+   goto e_inval;
+   if (val)
+   inet->cmsg_flags |= IP_CMSG_RECVFRAGSIZE;
+   else
+   inet->cmsg_flags &= ~IP_CMSG_RECVFRAGSIZE;
+   break;
case IP_TOS:/* This sets both TOS and Precedence */
if (sk->sk_type == SOCK_STREAM) {
val &= ~INET_ECN_MASK;
@@ -1357,6 +1380,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
case IP_CHECKSUM:
val = (inet->cmsg_flags & IP_CMSG_CHECKSUM) != 0;
break;
+   case IP_RECVFRAGSIZE:
+   val = (inet->cmsg_flags & IP_CMSG_RECVFRAGSIZE) != 0;
+   break;
case IP_TOS:
val = inet->tos;
break;
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 3/3] ipv6: on reassembly, record frag_max_size

2016-11-02 Thread Willem de Bruijn
From: Willem de Bruijn 

IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill it in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.

Signed-off-by: Willem de Bruijn 
---
 net/ipv6/reassembly.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 3815e85..e1da5b8 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -211,7 +211,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 {
struct sk_buff *prev, *next;
struct net_device *dev;
-   int offset, end;
+   int offset, end, fragsize;
struct net *net = dev_net(skb_dst(skb)->dev);
u8 ecn;
 
@@ -336,6 +336,10 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->ecn |= ecn;
add_frag_mem_limit(fq->q.net, skb->truesize);
 
+   fragsize = -skb_network_offset(skb) + skb->len;
+   if (fragsize > fq->q.max_size)
+   fq->q.max_size = fragsize;
+
/* The first fragment.
 * nhoffset is obtained from the first fragment, of course.
 */
@@ -495,6 +499,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
ipv6_change_dsfield(ipv6_hdr(head), 0xff, ecn);
IP6CB(head)->nhoff = nhoff;
IP6CB(head)->flags |= IP6SKB_FRAGMENTED;
+   IP6CB(head)->frag_max_size = fq->q.max_size;
 
/* Yes, and fold redundant checksum back. 8) */
skb_postpush_rcsum(head, skb_network_header(head),
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH net-next 3/3] tools lib bpf: Sync bpf_map_def with tc

2016-11-02 Thread Daniel Borkmann

On 11/02/2016 03:12 PM, Daniel Borkmann wrote:

On 11/02/2016 05:09 AM, Joe Stringer wrote:

On 1 November 2016 at 20:09, Daniel Borkmann  wrote:

On 10/31/2016 07:39 PM, Joe Stringer wrote:


TC uses a slightly different map layout in its ELFs. Update libbpf to
use the same definition so that ELFs may be built using libbpf and
loaded using tc.

Signed-off-by: Joe Stringer 
---
   tools/lib/bpf/libbpf.h | 11 +++
   1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index dd7a513efb10..ea70c2744f8c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -181,10 +181,13 @@ bool bpf_program__is_kprobe(struct bpf_program
*prog);
* and will be treated as an error due to -Werror.
*/
   struct bpf_map_def {
-   unsigned int type;
-   unsigned int key_size;
-   unsigned int value_size;
-   unsigned int max_entries;
+   uint32_t type;
+   uint32_t key_size;
+   uint32_t value_size;
+   uint32_t max_entries;
+   uint32_t flags;
+   uint32_t id;
+   uint32_t pinning;
   };

   /*


I think the problem is that this would break existing obj files that have
been compiled with the current struct bpf_map_def (besides libbpf not having
a use for the last two members right now).


Right, this is a problem. I wasn't sure whether libbpf was yet at a
stage where it tries to retain compatibility with binaries compiled
against older kernels.


For tc, we have a refactoring of the tc_bpf.c bits that generalizes them so
we can move these bits into iproute2 lib part and add new BPF types really
easily. What I did along with that is to implement a map compat mode, where
it detects the size of struct bpf_elf_map (or however you want to name it)
from the obj file and fixes up the missing members with some reasonable
default, so these programs can still be loaded. Thus, the sample code using
the current struct bpf_map_def will then work with tc as well. (I'll post
the iproute2 patch early next week.)


Are you encoding the number of maps into the start of the maps section
in the ELF then using that to divide out and determine the size?


No. It walks the symbol table and simply counts the number of symbols
that are present in the maps section with correct st_info attribution. It
works because that section name is fixed ABI for all cases and really per
definition only map structs are present there. The minimum attributes which
are allowed to be loaded are type, key_size, value_size and max_entries.


I look forward to your patches. Maybe if TC is more tolerant of other
map definition sizes then this patch is less relevant.


Just a thought (not really related to your set though): perhaps it makes sense
for the bpf lib to also add an option where the user passes a callback to parse
the map section itself if something other than struct bpf_map_def is expected
in the obj, and then have a way to access the metadata along with the fd via
the lib api later on. Perhaps that would also include another callback passed
to the lib that would take care of invoking bpf(2) for map creation, which
could be useful if some custom interaction with bpffs is desired.


[PATCH iproute2] tc: flower: Fix usage message

2016-11-02 Thread Paul Blakey
Remove leftover usage text from the removal of the eth_type argument.

Fixes: 488b41d020fb ('tc: flower no need to specify the ethertype')
Signed-off-by: Paul Blakey 
---
 man/man8/tc-flower.8 | 9 -
 tc/f_flower.c| 3 +--
 2 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 74f7664..16ef261 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -23,8 +23,6 @@ flower \- flow based traffic control filter
 .R " | { "
 .BR dst_mac " | " src_mac " } "
 .IR mac_address " | "
-.BR eth_type " { " ipv4 " | " ipv6 " | " 802.1Q " | "
-.IR ETH_TYPE " } | "
 .B vlan_id
 .IR VID " | "
 .B vlan_prio
@@ -75,13 +73,6 @@ Do not process filter by hardware.
 .BI src_mac " mac_address"
 Match on source or destination MAC address.
 .TP
-.BI eth_type " ETH_TYPE"
-Match on the next protocol.
-.I ETH_TYPE
-may be either
-.BR ipv4 , ipv6 , 802.1Q ,
-or an unsigned 16bit value in hexadecimal format.
-.TP
 .BI vlan_id " VID"
 Match on vlan tag id.
 .I VID
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 2d31d1a..f39b1f7 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -36,7 +36,6 @@ static void explain(void)
	fprintf(stderr, "   vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] |\n");
fprintf(stderr, "   dst_mac MAC-ADDR |\n");
fprintf(stderr, "   src_mac MAC-ADDR |\n");
-   fprintf(stderr, "   [ipv4 | ipv6 ] |\n");
	fprintf(stderr, "   ip_proto [tcp | udp | IP-PROTO ] |\n");
	fprintf(stderr, "   dst_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
	fprintf(stderr, "   src_ip [ IPV4-ADDR | IPV6-ADDR ] |\n");
@@ -45,7 +44,7 @@ static void explain(void)
fprintf(stderr, "   FILTERID := X:Y:Z\n");
	fprintf(stderr, "   ACTION-SPEC := ... look at individual actions\n");
fprintf(stderr, "\n");
-   fprintf(stderr, "NOTE: CLASSID, ETH-TYPE, IP-PROTO are parsed as hexadecimal input.\n");
+   fprintf(stderr, "NOTE: CLASSID, IP-PROTO are parsed as hexadecimal input.\n");
	fprintf(stderr, "NOTE: There can be only used one mask per one prio. If user needs\n");
	fprintf(stderr, "  to specify different mask, he has to use different prio.\n");
 }
-- 
1.8.3.1



Re: [patch net-next 0/2] Fixes for raw diag sockets handling

2016-11-02 Thread David Ahern
On 11/2/16 6:36 AM, Cyrill Gorcunov wrote:
> Also I have a question about socket lookup, not for raw diag only
> (though I didn't modify the lookup procedure) but in general: the structure
> inet_diag_req_v2 has the inet_diag_sockid::idiag_if member, which is
> supposed to carry the interface index from the userspace request.
> 
> Then, for example in INET_MATCH (include/net/inet_hashtables.h),
> the __dif parameter (which is @idiag_if) is compared with @sk_bound_dev_if
> *iff* sk_bound_dev_if has ever been set. Thus if, say, someone looks up a
> particular device with a specified index, and the rest of the parameters
> match but SO_BINDTODEVICE has never been called for this device, we return
> the socket even if idiag_if is not zero.
> Is it supposed to be so? Or am I missing something obvious?
> 
> I mean this snippet
> 
> 
>(!(__sk)->sk_bound_dev_if  ||  \
>  ((__sk)->sk_bound_dev_if == (__dif)))&&  \
> 
> when someone calls to destroy sockets on a particular interface and
> @__dif != 0, the match may return a socket where sk_bound_dev_if == 0
> instead of a completely matching one. Isn't that so?

yes. I recently added an exact_dif to the lookup for listener sockets (see compute_score). Something like that could be added to INET_MATCH.


[PATCH net-next V3 2/3] net/mlx4_en: Refactor the XDP forwarding rings scheme

2016-11-02 Thread Tariq Toukan
Separately manage the two types of TX rings: regular ones, and XDP.
Upon an XDP set, do not borrow regular TX rings and convert them
into XDP ones, but allocate new ones, unless we hit the max number
of rings.
This means that on systems with fewer cores we will not consume
the existing TX rings for XDP, as long as we stay within the max TX limit.

XDP TX ring counters are not shown in ethtool statistics.
Instead, XDP counters will be added to the respective RX rings
in a downstream patch.

This has no performance implications.

Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c  |  10 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  76 +++--
 drivers/net/ethernet/mellanox/mlx4/en_main.c|   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 378 ++--
 drivers/net/ethernet/mellanox/mlx4/en_port.c|   4 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  |   8 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |   9 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  11 +-
 8 files changed, 293 insertions(+), 205 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 1427311a9640..03f05c4d1f98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -127,15 +127,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
/* For TX we use the same irq per
ring we assigned for the RX*/
struct mlx4_en_cq *rx_cq;
-   int xdp_index;
-
-   /* The xdp tx irq must align with the rx ring that forwards to
-* it, so reindex these from 0. This should only happen when
-* tx_ring_num is not a multiple of rx_ring_num.
-*/
-   xdp_index = (priv->xdp_ring_num - priv->tx_ring_num) + cq_idx;
-   if (xdp_index >= 0)
-   cq_idx = xdp_index;
+
cq_idx = cq_idx % priv->rx_ring_num;
rx_cq = priv->rx_cq[cq_idx];
cq->vector = rx_cq->vector;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index bdda17d2ea0f..e8ccb95680bc 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -49,16 +49,19 @@
 
 static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
 {
-   int i;
+   int i, t;
int err = 0;
 
-   for (i = 0; i < priv->tx_ring_num; i++) {
-   priv->tx_cq[i]->moder_cnt = priv->tx_frames;
-   priv->tx_cq[i]->moder_time = priv->tx_usecs;
-   if (priv->port_up) {
-   err = mlx4_en_set_cq_moder(priv, priv->tx_cq[i]);
-   if (err)
-   return err;
+   for (t = 0 ; t < MLX4_EN_NUM_TX_TYPES; t++) {
+   for (i = 0; i < priv->tx_ring_num[t]; i++) {
+   priv->tx_cq[t][i]->moder_cnt = priv->tx_frames;
+   priv->tx_cq[t][i]->moder_time = priv->tx_usecs;
+   if (priv->port_up) {
+   err = mlx4_en_set_cq_moder(priv,
+  priv->tx_cq[t][i]);
+   if (err)
+   return err;
+   }
}
}
 
@@ -336,7 +339,7 @@ static int mlx4_en_get_sset_count(struct net_device *dev, int sset)
switch (sset) {
case ETH_SS_STATS:
return bitmap_iterator_count(&it) +
-   (priv->tx_ring_num * 2) +
+   (priv->tx_ring_num[TX] * 2) +
(priv->rx_ring_num * 3);
case ETH_SS_TEST:
return MLX4_EN_NUM_SELF_TEST - !(priv->mdev->dev->caps.flags
@@ -397,9 +400,9 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
if (bitmap_iterator_test(&it))
data[index++] = ((unsigned long *)&priv->pkstats)[i];
 
-   for (i = 0; i < priv->tx_ring_num; i++) {
-   data[index++] = priv->tx_ring[i]->packets;
-   data[index++] = priv->tx_ring[i]->bytes;
+   for (i = 0; i < priv->tx_ring_num[TX]; i++) {
+   data[index++] = priv->tx_ring[TX][i]->packets;
+   data[index++] = priv->tx_ring[TX][i]->bytes;
}
for (i = 0; i < priv->rx_ring_num; i++) {
data[index++] = priv->rx_ring[i]->packets;
@@ -467,7 +470,7 @@ static void mlx4_en_get_strings(struct net_device *dev,
strcpy(data + (index++) * ETH_GSTRING_LEN,
   main_strings[strings]);
 
-   for (i = 0; i < priv->tx_ring_num; i++) {
+   for (i = 0; i < priv->tx_ring_num[TX]; i++) {
 

[PATCH net-next V3 3/3] net/mlx4_en: Add ethtool statistics for XDP cases

2016-11-02 Thread Tariq Toukan
XDP statistics are reported in ethtool, in total and per ring,
as follows:
- xdp_drop: the number of packets dropped by xdp.
- xdp_tx: the number of packets forwarded by xdp.
- xdp_tx_full: the number of times an xdp forward failed
due to a full tx xdp ring.

In addition, all packets that are dropped/forwarded by XDP
are no longer accounted in the ring's rx_packets/rx_bytes,
so that those counters reflect only traffic passed to the stack.

Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 25 -
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  4 
 drivers/net/ethernet/mellanox/mlx4/en_port.c|  6 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  | 12 +++-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |  8 
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  7 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h | 10 +-
 7 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index e8ccb95680bc..487a58f9c192 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -195,6 +195,10 @@ static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
"tx_prio_7_packets", "tx_prio_7_bytes",
"tx_novlan_packets", "tx_novlan_bytes",
 
+   /* xdp statistics */
+   "rx_xdp_drop",
+   "rx_xdp_tx",
+   "rx_xdp_tx_full",
 };
 
 static const char mlx4_en_test_names[][ETH_GSTRING_LEN]= {
@@ -340,7 +344,7 @@ static int mlx4_en_get_sset_count(struct net_device *dev, int sset)
case ETH_SS_STATS:
return bitmap_iterator_count(&it) +
(priv->tx_ring_num[TX] * 2) +
-   (priv->rx_ring_num * 3);
+   (priv->rx_ring_num * (3 + NUM_XDP_STATS));
case ETH_SS_TEST:
return MLX4_EN_NUM_SELF_TEST - !(priv->mdev->dev->caps.flags
& MLX4_DEV_CAP_FLAG_UC_LOOPBACK) * 2;
@@ -400,6 +404,10 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
if (bitmap_iterator_test(&it))
data[index++] = ((unsigned long *)&priv->pkstats)[i];
 
+   for (i = 0; i < NUM_XDP_STATS; i++, bitmap_iterator_inc(&it))
+   if (bitmap_iterator_test(&it))
+   data[index++] = ((unsigned long *)&priv->xdp_stats)[i];
+
for (i = 0; i < priv->tx_ring_num[TX]; i++) {
data[index++] = priv->tx_ring[TX][i]->packets;
data[index++] = priv->tx_ring[TX][i]->bytes;
@@ -408,6 +416,9 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
data[index++] = priv->rx_ring[i]->packets;
data[index++] = priv->rx_ring[i]->bytes;
data[index++] = priv->rx_ring[i]->dropped;
+   data[index++] = priv->rx_ring[i]->xdp_drop;
+   data[index++] = priv->rx_ring[i]->xdp_tx;
+   data[index++] = priv->rx_ring[i]->xdp_tx_full;
}
spin_unlock_bh(&priv->stats_lock);
 
@@ -470,6 +481,12 @@ static void mlx4_en_get_strings(struct net_device *dev,
strcpy(data + (index++) * ETH_GSTRING_LEN,
   main_strings[strings]);
 
+   for (i = 0; i < NUM_XDP_STATS; i++, strings++,
+bitmap_iterator_inc(&it))
+   if (bitmap_iterator_test(&it))
+   strcpy(data + (index++) * ETH_GSTRING_LEN,
+  main_strings[strings]);
+
for (i = 0; i < priv->tx_ring_num[TX]; i++) {
sprintf(data + (index++) * ETH_GSTRING_LEN,
"tx%d_packets", i);
@@ -483,6 +500,12 @@ static void mlx4_en_get_strings(struct net_device *dev,
"rx%d_bytes", i);
sprintf(data + (index++) * ETH_GSTRING_LEN,
"rx%d_dropped", i);
+   sprintf(data + (index++) * ETH_GSTRING_LEN,
+   "rx%d_xdp_drop", i);
+   sprintf(data + (index++) * ETH_GSTRING_LEN,
+   "rx%d_xdp_tx", i);
+   sprintf(data + (index++) * ETH_GSTRING_LEN,
+   "rx%d_xdp_tx_full", i);
}
break;
case ETH_SS_PRIV_FLAGS:
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index edf0a99177e1..0f6225c042be 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3125,6 +3125,10 @@ void mlx4_en_set_stats_bitmap(struct mlx4_dev *dev,
 
if (!mlx4_is_slave(dev))
bitmap_set(stats_bitmap->bitmap,

[PATCH net-next V3 0/3] mlx4 XDP TX refactor

2016-11-02 Thread Tariq Toukan
Hi Dave,

This patchset refactors the XDP forwarding case, so that
its dedicated transmit queues are managed in a complete
separation from the other regular ones.

It also adds ethtool counters for XDP cases.

Series generated against net-next commit:
22ca904ad70a genetlink: fix error return code in genl_register_family()

Thanks,
Tariq.

v3:
* Exposed per ring counters.

v2:
* Added ethtool counters.
* Rebased, now patch 2 reverts Brenden's fix, as the bug no longer exists:
  958b3d396d7f ("net/mlx4_en: fixup xdp tx irq to match rx")
* Updated commit message of patch 2.

Tariq Toukan (3):
  net/mlx4_en: Add TX_XDP for CQ types
  net/mlx4_en: Refactor the XDP forwarding rings scheme
  net/mlx4_en: Add ethtool statistics for XDP cases

 drivers/net/ethernet/mellanox/mlx4/en_cq.c  |  28 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 101 +--
 drivers/net/ethernet/mellanox/mlx4/en_main.c|   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 382 +++-
 drivers/net/ethernet/mellanox/mlx4/en_port.c|  10 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  |  20 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |  17 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  25 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h |  10 +-
 9 files changed, 366 insertions(+), 229 deletions(-)

-- 
1.8.3.1



[PATCH net-next V3 1/3] net/mlx4_en: Add TX_XDP for CQ types

2016-11-02 Thread Tariq Toukan
Support XDP CQ type, and refactor the CQ type enum.
Rename the is_tx field to match the change.

Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c   | 18 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  7 ---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index e3be7e44ff51..1427311a9640 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -65,7 +65,7 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv,
cq->buf_size = cq->size * mdev->dev->caps.cqe_size;
 
cq->ring = ring;
-   cq->is_tx = mode;
+   cq->type = mode;
cq->vector = mdev->dev->caps.num_comp_vectors;
 
/* Allocate HW buffers on provided NUMA node.
@@ -104,7 +104,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
*cq->mcq.arm_db= 0;
memset(cq->buf, 0, cq->buf_size);
 
-   if (cq->is_tx == RX) {
+   if (cq->type == RX) {
if (!mlx4_is_eq_vector_valid(mdev->dev, priv->port,
 cq->vector)) {
			cq->vector = cpumask_first(priv->rx_ring[cq->ring]->affinity_mask);
@@ -141,11 +141,11 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
cq->vector = rx_cq->vector;
}
 
-   if (!cq->is_tx)
+   if (cq->type == RX)
cq->size = priv->rx_ring[cq->ring]->actual_size;
 
-   if ((cq->is_tx && priv->hwtstamp_config.tx_type) ||
-   (!cq->is_tx && priv->hwtstamp_config.rx_filter))
+   if ((cq->type != RX && priv->hwtstamp_config.tx_type) ||
+   (cq->type == RX && priv->hwtstamp_config.rx_filter))
timestamp_en = 1;
 
err = mlx4_cq_alloc(mdev->dev, cq->size, &cq->wqres.mtt,
@@ -154,10 +154,10 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
if (err)
goto free_eq;
 
-   cq->mcq.comp  = cq->is_tx ? mlx4_en_tx_irq : mlx4_en_rx_irq;
+   cq->mcq.comp  = cq->type != RX ? mlx4_en_tx_irq : mlx4_en_rx_irq;
cq->mcq.event = mlx4_en_cq_event;
 
-   if (cq->is_tx)
+   if (cq->type != RX)
netif_tx_napi_add(cq->dev, &cq->napi, mlx4_en_poll_tx_cq,
  NAPI_POLL_WEIGHT);
else
@@ -181,7 +181,7 @@ void mlx4_en_destroy_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq)
 
mlx4_free_hwq_res(mdev->dev, &cq->wqres, cq->buf_size);
if (mlx4_is_eq_vector_valid(mdev->dev, priv->port, cq->vector) &&
-   cq->is_tx == RX)
+   cq->type == RX)
mlx4_release_eq(priv->mdev->dev, cq->vector);
cq->vector = 0;
cq->buf_size = 0;
@@ -193,7 +193,7 @@ void mlx4_en_destroy_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq)
 void mlx4_en_deactivate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq)
 {
napi_disable(&cq->napi);
-   if (!cq->is_tx) {
+   if (cq->type == RX) {
napi_hash_del(&cq->napi);
synchronize_rcu();
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a3528dd1e72e..83c914a79f14 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -207,8 +207,9 @@ enum {
  */
 
 enum cq_type {
-   RX = 0,
-   TX = 1,
+   TX,
+   TX_XDP,
+   RX,
 };
 
 
@@ -361,7 +362,7 @@ struct mlx4_en_cq {
int size;
int buf_size;
int vector;
-   enum cq_type is_tx;
+   enum cq_type type;
u16 moder_time;
u16 moder_cnt;
struct mlx4_cqe *buf;
-- 
1.8.3.1



Re: Let's do P4

2016-11-02 Thread John Fastabend
On 16-11-02 01:07 AM, Jiri Pirko wrote:
> Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastab...@gmail.com wrote:
>> [...]
>>
> P4 is meant to program programmable hw, not a fixed pipeline.
>

 I'm guessing there are no upstream drivers at the moment that support
 this though right? The rocker universe bits though could leverage this.
>>>
>>> mlxsw. But this is naturally not implemented yet, as there is no
>>> infrastructure.
>>
>> Really? What is re-programmable?
>>
>> Can the parse graph support arbitrary parse graph?
>> Can the table topology be reconfigured?
>> Can new tables be created?
>> What about "new" actions being defined at configuration time?
>>
>> Or is this just the normal TCAM configuration of defining key widths and
>> fields.
> 
> At this point TCAM configuration.
> 

OK so before we go down the path to enable a full-fledged P4 interface
we need a consumer for sure. We shouldn't add all this complexity until
someone steps up to use it. A runtime API is sufficient for TCAM config.

[...]

>>
>> P4-16 will allow externs, "functions" to execute in the control flow and
>> possibly inside the parse graph. None of this was considered in the
>> Flow-API. So none of this is supported.
>>
>> I still have the question are you trying to push the "programming" of
>> the device via 'tc' or just the runtime configuration of tables? If it
>> is just runtime Flow-API is sufficient IMO. If its programming the
>> device using the complete P4-16 spec than no its not sufficient. But
> 
> Sure we need both.
> 

See above.

> 
>> I don't believe vendors will expose the complete programmability of the
>> device in the driver, this is going to look more like a fw update than
>> a runtime change at least on the devices I'm aware of.
> 
> Depends on the driver. I think it is fine if the driver processes it into some
> hw configuration sequence or simply pushes the program down to fw.
> Both usecases are legit.
> 

At this point I don't think the entire P4 capabilities will be exposed
as an API but more along the lines of an FPGA bitstream or firmware
update.


[...]

>>
>> Same question as above are we _really_ talking about pushing the entire
>> programmability of the device via 'tc'. If so we need to have a vendor
>> say they will support and implement this?
> 
> We need some API, and I believe that TC is perfectly suitable for that.
> Why do you think it's a problem?
> 

For runtime configuration completely agree. For device configuration
I don't see the advantage of adding an entire device specific compiler
in the driver. The device is a set of CAMs, TCAMs, ALUs, instruction
caches, etc. its not like a typical NIC/switch where you just bang
some registers. Unless its all done in firmware but that creates an
entirely different set of problems like how to update your compiler.

Bottom line we need to have a proof point of a driver in kernel
to see exactly how a P4 configuration would work. Again runtime config
and device topology/capabilities discovery I'm completely on board.

Thanks,
John


Re: Let's do P4

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 04:18:06PM CET, john.fastab...@gmail.com wrote:
>On 16-11-02 01:07 AM, Jiri Pirko wrote:
>> Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastab...@gmail.com wrote:

[...]


>[...]>
>>>
>>> Same question as above are we _really_ talking about pushing the entire
>>> programmability of the device via 'tc'. If so we need to have a vendor
>>> say they will support and implement this?
>> 
>> We need some API, and I believe that TC is perfectly suitable for that.
>> Why do you think it's a problem?
>> 
>
>For runtime configuration completely agree. For device configuration
>I don't see the advantage of adding an entire device specific compiler
>in the driver. The device is a set of CAMs, TCAMs, ALUs, instruction
>caches, etc. its not like a typical NIC/switch where you just bang
>some registers. Unless its all done in firmware but that creates an
>entirely different set of problems like how to update your compiler.
>
>Bottom line we need to have a proof point of a driver in kernel
>to see exactly how a P4 configuration would work. Again runtime config
>and device topology/capabilities discovery I'm completely on board.

I think we need to implement P4 world in rocker. Any volunteer? :)



Re: Let's do P4

2016-11-02 Thread John Fastabend
[...]

> Exactly. The following drawing shows the p4 pipeline setup for SW and HW:
>
>   |
>   |   +--> ebpf engine
>   |   |
>   |   |
>   |   compilerB
>   |   ^
>   |   |
> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>   |
> userspace | kernel
>   |
>>
>> Sorry for jumping into the middle and the delay (plumbers this week). My
>> question would be, if the main target is for p4 *offloading* anyway, who
>> would use this sw fallback path? Mostly for testing purposes?
> 
> Development and testing purposes, yes.
> 
> 
>>
>> I'm not sure about compilerB here and the complexity that needs to be
>> pushed into the kernel along with it. I would assume this would result
>> in slower code than what the existing P4 -> eBPF front ends for LLVM
>> would generate, since it could perform all kinds of optimizations there,
> 
> The complexity would be similar to compilerC. For compilerB,
> optimizations do not really matter, as it is for testing mainly.
> 
> 
>> that might not be feasible for doing inside the kernel. Thus, if I'd want
>> to do that in sw, I'd just use the existing LLVM facilities instead and
>> go via cls_bpf in that case.
>>
>> What is your compilerA? Is that part of tc in user space? Maybe linked
> 
> It is something that transforms the original p4 source into some
> intermediate form that is easy for the in-kernel compilers to process.
> 
> 
>> against LLVM lib, for example? If you really want some sw path, can't tc
>> do this transparently from user space instead when it gets a netlink error
>> that it cannot get offloaded (and thus switch internally to f_bpf's loader)?
> 
> In real life, user will most probably use p4 for hw programming, but the
> sw fallback will be done in bpf directly. In that case, he would use
> cls_bpf SKIP_HW
> cls_p4 SKIP_SW
> 
> But in order to allow cls_p4 offloading to hw, we need an in-kernel
> interpreter. That is the purpose of compilerB, to take advantage of bpf, but
> the in-kernel interpreter could be implemented differently.
> 

But this is the issue. We openly acknowledge it won't actually be used.
We have multiple user space compilers that generate at least half way
reasonable ebpf code that is being used in real deployments and
works great for testing. This looks like pure overhead to satisfy this
hw/sw parity checkbox and I can't see why anyone would use it or even
maintain it. Looks like a checkbox and I like to avoid useless work that
is likely to bit rot.


Re: [PATCH net-next v2] ipv4: fib: Replay events when registering FIB notifier

2016-11-02 Thread David Miller
From: Jiri Pirko 
Date: Wed, 2 Nov 2016 08:35:02 +0100

> How do you imagine this mode should looks like? Could you draw me some
> example?

Well, first of all, there is no reason we can't provide a mechanism by
which the driver can request and obtain a FIB dump.

And it can be designed in a way to be preemptible or at least not
require RTNL to be held during the entire operation.  Sequence
counters or similar can be used to make sure that if the table changes
mid-dump due to RTNL being dropped, the dump can be rewound and
restarted.


Re: Let's do P4

2016-11-02 Thread Jiri Pirko
Wed, Nov 02, 2016 at 04:22:50PM CET, john.fastab...@gmail.com wrote:

[...]

>>>
>>> What is your compilerA? Is that part of tc in user space? Maybe linked
>> 
>> It is something that transforms the original p4 source into some
>> intermediate form that is easy for the in-kernel compilers to process.
>> 
>> 
>>> against LLVM lib, for example? If you really want some sw path, can't tc
>>> do this transparently from user space instead when it gets a netlink error
>>> that it cannot get offloaded (and thus switch internally to f_bpf's loader)?
>> 
>> In real life, user will most probably use p4 for hw programming, but the
>> sw fallback will be done in bpf directly. In that case, he would use
>> cls_bpf SKIP_HW
>> cls_p4 SKIP_SW
>> 
>> But in order to allow cls_p4 offloading to hw, we need an in-kernel
>> interpreter. That is the purpose of compilerB, to take advantage of bpf, but
>> the in-kernel interpreter could be implemented differently.
>> 
>
>But this is the issue. We openly acknowledge it won't actually be used.
>We have multiple user space compilers that generate at least half way
>reasonable ebpf code that is being used in real deployments and
>works great for testing. This looks like pure overhead to satisfy this
>hw/sw parity checkbox and I can't see why anyone would use it or even
>maintain it. Looks like a checkbox and I like to avoid useless work that
>is likely to bit rot.

That's how it works I'm afraid, unless something changed since the last
time we discussed this. Without an in-kernel implementation, it's a bypass.

Dave?



Re: [patch net-next 0/2] Fixes for raw diag sockets handling

2016-11-02 Thread Cyrill Gorcunov
On Wed, Nov 02, 2016 at 09:10:32AM -0600, David Ahern wrote:
> > @__dif != 0 the match may return socket where sk_bound_dev_if = 0
> > instead of completely matching one. Isn't it?
> 
> yes. I recently added an exact_dif to the lookup for listener sockets
> (see compute_score). Something like that could be added to INET_MATCH.

Seems so. I need to revisit this. With the current lookup code, the
iproute2 patches I made and have been testing do not kill all sockets bound
to a particular device in one pass (the request from userspace asks
for index 15 in my case, but the kernel returns one with index 0). At first
I thought I had made a mistake in the userspace code, but once I added printk's
into the kernel I found some strange results in the lookup.

Cyrill
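The loose-vs-exact device matching being discussed can be modeled as follows. This is a hypothetical sketch, not the kernel's actual INET_MATCH macro; it only illustrates why a lookup asking for index 15 can come back with a socket whose sk_bound_dev_if is 0:

```python
# Hypothetical model of loose vs. exact device matching during a
# socket lookup.  A bound_dev_if of 0 means "not bound to any device".

def loose_match(sk_bound_dev_if, dif):
    # Behavior under discussion: an unbound socket (0) also matches a
    # lookup that asked for a specific device index.
    return sk_bound_dev_if == 0 or sk_bound_dev_if == dif

def exact_match(sk_bound_dev_if, dif):
    # exact_dif-style behavior (as in compute_score): the socket must
    # be bound to exactly the requested device.
    return sk_bound_dev_if == dif

# A lookup for device index 15 still picks up the unbound socket...
assert loose_match(0, 15) is True
# ...while an exact match would skip it:
assert exact_match(0, 15) is False
assert exact_match(15, 15) is True
```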


[PATCH RFC 1/2] ethtool: Add get actual port speed

2016-11-02 Thread Gal Pressman
Add an additional actual speed field when using ethtool DEVNAME.
The actual speed shows the actual bandwidth exposed to the machine,
which may be different from the HCA operating speed.

Signed-off-by: Gal Pressman 
---
 include/linux/ethtool.h  |  1 +
 include/uapi/linux/ethtool.h |  2 ++
 net/core/ethtool.c   | 20 
 3 files changed, 23 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 9ded8c6..215baa1 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -311,6 +311,7 @@ struct ethtool_ops {
void(*set_msglevel)(struct net_device *, u32);
int (*nway_reset)(struct net_device *);
u32 (*get_link)(struct net_device *);
+   int (*get_actual_speed)(struct net_device *);
int (*get_eeprom_len)(struct net_device *);
int (*get_eeprom)(struct net_device *,
  struct ethtool_eeprom *, u8 *);
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 099a420..63057a2 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1315,6 +1315,8 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_GLINKSETTINGS  0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS  0x004d /* Set ethtool_link_settings */
 
+#define ETHTOOL_GASPD  0x004e /* Get port actual speed */
+
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 9774898..a1921fd 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1516,6 +1516,22 @@ static int ethtool_get_link(struct net_device *dev, char __user *useraddr)
return 0;
 }
 
+static int ethtool_get_actual_speed(struct net_device *dev,
+   char __user *useraddr)
+{
+   struct ethtool_value edata = { .cmd = ETHTOOL_GASPD };
+
+   if (!dev->ethtool_ops->get_actual_speed)
+   return -EOPNOTSUPP;
+
+   edata.data = dev->ethtool_ops->get_actual_speed(dev);
+
+   if (copy_to_user(useraddr, &edata, sizeof(edata)))
+   return -EFAULT;
+
+   return 0;
+}
+
 static int ethtool_get_any_eeprom(struct net_device *dev, void __user *useraddr,
  int (*getter)(struct net_device *,
struct ethtool_eeprom *, u8 *),
@@ -2450,6 +2466,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_GDRVINFO:
case ETHTOOL_GMSGLVL:
case ETHTOOL_GLINK:
+   case ETHTOOL_GASPD:
case ETHTOOL_GCOALESCE:
case ETHTOOL_GRINGPARAM:
case ETHTOOL_GPAUSEPARAM:
@@ -2531,6 +2548,9 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_GLINK:
rc = ethtool_get_link(dev, useraddr);
break;
+   case ETHTOOL_GASPD:
+   rc = ethtool_get_actual_speed(dev, useraddr);
+   break;
case ETHTOOL_GEEPROM:
rc = ethtool_get_eeprom(dev, useraddr);
break;
-- 
2.7.4
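On the userspace side, the proposed command would travel through the standard SIOCETHTOOL path as a `struct ethtool_value` (two u32s: cmd, data). A sketch of the request/reply layout follows; note that the 0x004e command number is only what this RFC proposes, not an established uapi value:

```python
import struct

# struct ethtool_value is { __u32 cmd; __u32 data; }.  The RFC assigns
# ETHTOOL_GASPD the value 0x004e; treat that number as an assumption.
ETHTOOL_GASPD = 0x004e

def build_gaspd_request():
    # cmd is set by userspace; data is filled in by the kernel on reply.
    return struct.pack("II", ETHTOOL_GASPD, 0)

def parse_gaspd_reply(buf):
    cmd, data = struct.unpack("II", buf)
    return data  # actual speed in Mb/s, per the RFC

req = build_gaspd_request()
assert len(req) == 8
# A kernel reply echoing cmd and reporting 25000 Mb/s would parse as:
assert parse_gaspd_reply(struct.pack("II", ETHTOOL_GASPD, 25000)) == 25000
```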



[PATCH RFC 2/2] net/mlx5e: Add support for ethtool get actual speed callback

2016-11-02 Thread Gal Pressman
ethtool DEVNAME will now show actual port speed in addition
to physical port speed.

Signed-off-by: Gal Pressman 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 27ff401..933b687 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1504,9 +1504,16 @@ static int mlx5e_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
return err;
 }
 
+static int mlx5e_get_actual_speed(struct net_device *netdev)
+{
+   /* TODO: FW command to query the actual speed */
+   return SPEED_25000;
+}
+
 const struct ethtool_ops mlx5e_ethtool_ops = {
.get_drvinfo   = mlx5e_get_drvinfo,
.get_link  = ethtool_op_get_link,
+   .get_actual_speed  = mlx5e_get_actual_speed,
.get_strings   = mlx5e_get_strings,
.get_sset_count= mlx5e_get_sset_count,
.get_ethtool_stats = mlx5e_get_ethtool_stats,
-- 
2.7.4



Re: [patch net-next 0/2] Fixes for raw diag sockets handling

2016-11-02 Thread David Ahern
On 11/2/16 9:29 AM, Cyrill Gorcunov wrote:
> On Wed, Nov 02, 2016 at 09:10:32AM -0600, David Ahern wrote:
>>> With @__dif != 0 the match may return a socket where sk_bound_dev_if = 0
>>> instead of a completely matching one, doesn't it?
>>
>> yes. I recently added an exact_dif to the lookup for listener sockets
>> (see compute_score). Something like that could be added to INET_MATCH.
> 
> Seems so. I need to revisit this. With the current lookup code, the
> iproute2 patches I made and have been testing do not kill all sockets bound
> to a particular device in one pass (the request from userspace asks
> for index 15 in my case, but the kernel returns one with index 0). At first
> I thought I had made a mistake in the userspace code, but once I added printk's
> into the kernel I found some strange results in the lookup.

Limited to raw sockets or are you looking at multiple spec options (dev, 
address, port)?

I have not seen issues with tcp or udp. Running:

ss -aK 'dev == red' 

drops all sockets bound to device 'red' (or at least signals the socket
failure for the app to handle):

root@jessie4:~# ss -ap 'dev == red'
Netid  State  Recv-Q Send-Q Local Address:Port  
Peer Address:Port
udpUNCONN 0  0  *%red:12345 
   *:* users:(("vrf-test",pid=765,fd=3))
tcpLISTEN 0  1  *%red:12345 
   *:* users:(("vrf-test",pid=766,fd=3))
tcpESTAB  0  0 10.100.1.4%red:ssh   
10.100.1.254:60298 users:(("sshd",pid=738,fd=3))

root@jessie4:~# ss -aKp 'dev == red'
Netid State  Recv-Q Send-Q  Local Address:Port  
 Peer Address:Port
udp   UNCONN 0  0   *%red:12345 
*:* 
users:(("vrf-test",pid=765,fd=3))
tcp   LISTEN 0  1   *%red:12345 
*:* 
users:(("vrf-test",pid=766,fd=3))
tcp   ESTAB  0  0  10.100.1.4%red:ssh   
 10.100.1.254:60298 
users:(("sshd",pid=738,fd=3))

root@jessie4:~# ss -ap 'dev == red'
Netid State  Recv-Q Send-Q  Local Address:Port  
 Peer Address:Port




[PATCH RFC 0/2] ethtool: Add actual port speed reporting

2016-11-02 Thread Gal Pressman
Sending RFC to get feedback for the following ethtool proposal:

In some cases, such as virtual machines and multi-function devices (SR-IOV), the
actual bandwidth exposed to each machine is not accurately shown in ethtool.
Currently ethtool shows only the physical port link speed.
In our case we would like to show the virtual port operational link speed,
which in some cases is less than the physical port speed.

This will give users better visibility into the actual speed running on their
device.

$ ethtool ens6
...
Speed: 5Mb/s
Actual speed: 25000Mb/s

Gal Pressman (2):
  ethtool: Add get actual port speed support
  net/mlx5e: Add support for ethtool get actual speed callback

 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c |  7 +++
 include/linux/ethtool.h  |  1 +
 include/uapi/linux/ethtool.h |  2 ++
 net/core/ethtool.c   | 20 
 4 files changed, 30 insertions(+)

-- 
2.7.4



Re: [patch net-next 0/2] Fixes for raw diag sockets handling

2016-11-02 Thread Cyrill Gorcunov
On Wed, Nov 02, 2016 at 09:36:55AM -0600, David Ahern wrote:
> 
> Limited to raw sockets or are you looking at multiple spec options (dev, 
> address, port)?
> 
> I have not seen issues with tcp or udp. Running:
> 
> ss -aK 'dev == red' 
> 
> drops all sockets bound to device 'red' (or at least signaling the socket 
> failure for the app to handle):

Limited to raw sockets. I didn't modify the kernel lookup code but used
already existing helpers.
The tcp/udp socket lookup uses the port value (iirc, I don't have the code at
hand at the moment); in turn the raw lookup uses only net, raw protocol, src/dst
and device index.
In my test case the sockets were unconnected, so they have no address but are
bound to a device, and I hit the mismatch. Then, looking into the inet matching
code, I found the weird snippet I posted previously.

> 
> root@jessie4:~# ss -ap 'dev == red'
> Netid  State  Recv-Q Send-Q Local Address:Port  
> Peer Address:Port
> udpUNCONN 0  0  *%red:12345   
>  *:* users:(("vrf-test",pid=765,fd=3))
> tcpLISTEN 0  1  *%red:12345   
>  *:* users:(("vrf-test",pid=766,fd=3))
> tcpESTAB  0  0 10.100.1.4%red:ssh   
> 10.100.1.254:60298 users:(("sshd",pid=738,fd=3))
> 
> root@jessie4:~# ss -aKp 'dev == red'
> Netid State  Recv-Q Send-Q  Local Address:Port
>Peer Address:Port
> udp   UNCONN 0  0   *%red:12345   
>   *:* 
> users:(("vrf-test",pid=765,fd=3))
> tcp   LISTEN 0  1   *%red:12345   
>   *:* 
> users:(("vrf-test",pid=766,fd=3))
> tcp   ESTAB  0  0  10.100.1.4%red:ssh 
>10.100.1.254:60298 
> users:(("sshd",pid=738,fd=3))
> 
> root@jessie4:~# ss -ap 'dev == red'
> Netid State  Recv-Q Send-Q  Local Address:Port
>Peer Address:Port

Cyrill


RE: [PATCH RFC 0/2] ethtool: Add actual port speed reporting

2016-11-02 Thread Mintz, Yuval
> Sending RFC to get feedback for the following ethtool proposal:
> 
> In some cases such as virtual machines and multi functions (SR-IOV), the 
> actual
> bandwidth exposed for each machine is not accurately shown in ethtool.
> Currently ethtool shows only physical port link speed.
> In our case we would like to show the virtual port operational link speed 
> which
> in some cases is less than the physical port speed.
> 
> This will give users better visibility for the actual speed running on their 
> device.
> 
> $ ethtool ens6
> ...
> Speed: 5Mb/s
> Actual speed: 25000Mb/s

Not saying this is a bad thing, but where exactly is it listed that ethtool has
to show the physical port speed?
E.g., bnx2x shows the logical speed instead, and has been doing that for years.
[Perhaps that's a past wrongness, but that's how it goes].

And besides, one can argue that in the SR-IOV scenario the VF has no business
knowing the physical port speed.


Re: [PATCH net-next 1/3] ipv4: add IP_RECVFRAGSIZE cmsg

2016-11-02 Thread Eric Dumazet
On Wed, 2016-11-02 at 11:02 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> The IP stack records the largest fragment of a reassembled packet
> in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
> that arrived fragmented, expose the value to allow applications to
> estimate receive path MTU.

Acked-by: Eric Dumazet 
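The cmsg this series exposes can be consumed from userspace roughly as sketched below. This is hypothetical: the numeric value of IP_RECVFRAGSIZE is assumed from the patch under review, and Python's socket module may not define the constant, hence the fallback:

```python
import socket
import struct

# IP_RECVFRAGSIZE as proposed; the fallback value 25 is an assumption
# taken from the uapi numbering around it, not a guaranteed constant.
IP_RECVFRAGSIZE = getattr(socket, "IP_RECVFRAGSIZE", 25)

def frag_size_from_ancdata(ancdata):
    """Pull the largest-fragment size out of recvmsg() ancillary data."""
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if cmsg_level == socket.IPPROTO_IP and cmsg_type == IP_RECVFRAGSIZE:
            return struct.unpack("i", cmsg_data[:4])[0]
    return None  # packet arrived unfragmented, or option not enabled

# Shape of the ancillary entry the kernel would deliver for a datagram
# reassembled from 1480-byte fragments:
fake = [(socket.IPPROTO_IP, IP_RECVFRAGSIZE, struct.pack("i", 1480))]
assert frag_size_from_ancdata(fake) == 1480
assert frag_size_from_ancdata([]) is None
```

In real use the ancdata list would come from `sock.recvmsg(bufsize, socket.CMSG_SPACE(4))` on a socket with the option enabled via `setsockopt`.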




Re: [PATCH net-next 2/3] ipv6: add IPV6_RECVFRAGSIZE cmsg

2016-11-02 Thread Eric Dumazet
On Wed, 2016-11-02 at 11:02 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> When reading a datagram or raw packet that arrived fragmented, expose
> the maximum fragment size if recorded to allow applications to
> estimate receive path MTU.

Acked-by: Eric Dumazet 




Re: [PATCH net-next 3/3] ipv6: on reassembly, record frag_max_size

2016-11-02 Thread Eric Dumazet
On Wed, 2016-11-02 at 11:02 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
> filled in when packets are reassembled by the connection tracking
> code. Also fill in when reassembling in the input path, to expose
> it through cmsg IPV6_RECVFRAGSIZE in all cases.
> 
> Signed-off-by: Willem de Bruijn 
> ---

Acked-by: Eric Dumazet 




Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-11-02 Thread Alexander Duyck
On Wed, Nov 2, 2016 at 7:01 AM, Jesper Dangaard Brouer
 wrote:
> On Fri, 28 Oct 2016 13:11:01 -0400 (EDT)
> David Miller  wrote:
>
>> From: John Fastabend 
>> Date: Fri, 28 Oct 2016 08:56:35 -0700
>>
>> > On 16-10-27 07:10 PM, David Miller wrote:
>> >> From: Alexander Duyck 
>> >> Date: Thu, 27 Oct 2016 18:43:59 -0700
>> >>
>> >>> On Thu, Oct 27, 2016 at 6:35 PM, David Miller  
>> >>> wrote:
>>  From: "Michael S. Tsirkin" 
>>  Date: Fri, 28 Oct 2016 01:25:48 +0300
>> 
>> > On Thu, Oct 27, 2016 at 05:42:18PM -0400, David Miller wrote:
>> >> From: "Michael S. Tsirkin" 
>> >> Date: Fri, 28 Oct 2016 00:30:35 +0300
>> >>
>> >>> Something I'd like to understand is how does XDP address the
>> >>> problem that 100Byte packets are consuming 4K of memory now.
>> >>
>> >> Via page pools.  We're going to make a generic one, but right now
>> >> each and every driver implements a quick list of pages to allocate
>> >> from (and thus avoid the DMA man/unmap overhead, etc.)
>> >
>> > So to clarify, ATM virtio doesn't attempt to avoid dma map/unmap,
>> > so there should be no issue with that even when using sub-page
>> > regions, assuming DMA APIs support sub-page map/unmap correctly.
>> 
>>  That's not what I said.
>> 
>>  The page pools are meant to address the performance degradation from
>>  going to having one packet per page for the sake of XDP's
>>  requirements.
>> 
>>  You still need to have one packet per page for correct XDP operation
>>  whether you do page pools or not, and whether you have DMA mapping
>>  (or its equivalent virtualization operation) or not.
>> >>>
>> >>> Maybe I am missing something here, but why do you need to limit things
>> >>> to one packet per page for correct XDP operation?  Most of the drivers
>> >>> out there now are usually storing something closer to at least 2
>> >>> packets per page, and with the DMA API fixes I am working on there
>> >>> should be no issue with changing the contents inside those pages since
>> >>> we won't invalidate or overwrite the data after the DMA buffer has
>> >>> been synchronized for use by the CPU.
>> >>
>> >> Because with SKB's you can share the page with other packets.
>> >>
>> >> With XDP you simply cannot.
>> >>
>> >> It's software semantics that are the issue.  SKB frag list pages
>> >> are read only, XDP packets are writable.
>> >>
>> >> This has nothing to do with "writability" of the pages wrt. DMA
>> >> mapping or cpu mappings.
>> >>
>> >
>> > Sorry I'm not seeing it either. The current xdp_buff is defined
>> > by,
>> >
>> >   struct xdp_buff {
>> > void *data;
>> > void *data_end;
>> >   };
>> >
>> > The verifier has an xdp_is_valid_access() check to ensure we don't go
>> > past data_end. The page for now at least never leaves the driver. For
>> > the work to get xmit to other devices working I'm still not sure I see
>> > any issue.
>>
>> I guess I can say that the packets must be "writable" until I'm blue
>> in the face but I'll say it again, semantically writable pages are a
>> requirement.  And if multiple packets share a page this requirement
>> is not satisfied.
>>
>> Also, we want to do several things in the future:
>>
>> 1) Allow push/pop of headers via eBPF code, which means we need
>>headroom.
>>
>> 2) Transparently zero-copy pass packets into userspace, basically
>>the user will have a semi-permanently mapped ring of all the
>>packet pages sitting in the RX queue of the device and the
>>page pool associated with it.  This way we avoid all of the
>>TLB flush/map overhead for the user's mapping of the packets
>>just as we avoid the DMA map/unmap overhead.
>>
>> And that's just the beginninng.
>>
>> I'm sure others can come up with more reasons why we have this
>> requirement.
>
> I've tried to update the XDP documentation about the "Page per packet"
> requirement[1], feel free to correct the text below:
>
> Page per packet
> ===
>
> On RX many NIC drivers split up a memory page to share it among multiple
> packets, in order to conserve memory.  Doing so complicates handling
> and accounting of these memory pages, which affects performance.
> Particularly the extra atomic refcnt handling needed for the page can
> hurt performance.

The atomic refcnt handling isn't a big deal.  It is easily worked
around by doing bulk updates.  The allocating and freeing of pages on
the other hand is quite expensive by comparison.

> XDP defines upfront a memory model where there is only one packet per
> page.  This simplifies page handling and open up for future
> extensions.

This is blatantly wrong.  It complicates page handling, especially
since you are implying you aren't using the page count, which implies a
new memory model like SLUB but not SLUB, and hopefully including DMA
pools of some sort.  The fact is the page count tricks are much
cheaper than having to come up with your own page allocator.  In
additio

[PATCH net] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Lance Richardson
Some configurations (e.g. a geneve interface with the default
MTU of 1500 over an ethernet interface with a 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
has shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka 
Signed-off-by: Lance Richardson 
---
 net/ipv4/ip_output.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..4971401 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -239,11 +239,9 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
struct sk_buff *segs;
int ret = 0;
 
-   /* common case: fragmentation of segments is not allowed,
-* or seglen is <= mtu
+   /* common case: seglen is <= mtu
 */
-   if (((IPCB(skb)->flags & IPSKB_FRAG_SEGS) == 0) ||
- skb_gso_validate_mtu(skb, mtu))
+   if (skb_gso_validate_mtu(skb, mtu))
return ip_finish_output2(net, sk, skb);
 
/* Slowpath -  GSO segment length is exceeding the dst MTU.
-- 
2.5.5
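The decision the patch changes can be modeled as below. This is an illustrative simplification of the check around skb_gso_validate_mtu(): the real code validates each GSO segment's length against the dst MTU, while the model just compares one segment length:

```python
def finish_output_gso_path(gso_seglen, mtu):
    # After the patch: locally originated GSO packets also take the
    # fragmentation slow path when a segment would exceed the MTU,
    # instead of requiring the IPSKB_FRAG_SEGS flag to be set first.
    return "fast" if gso_seglen <= mtu else "fragment"

assert finish_output_gso_path(1450, 1500) == "fast"
# The geneve-over-ethernet case from the commit message: inner
# 1500-byte frames plus encap overhead exceed the underlying 1500 MTU:
assert finish_output_gso_path(1550, 1500) == "fragment"
```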



Re: [PATCH v4 6/7] net: ethernet: bgmac: add NS2 support

2016-11-02 Thread Jon Mason
On Tue, Nov 01, 2016 at 05:05:13PM -0400, Jon Mason wrote:
> On Tue, Nov 01, 2016 at 01:34:30PM -0700, Scott Branden wrote:
> > One change in this patch
> > 
> > On 16-11-01 01:04 PM, Jon Mason wrote:
> > >Add support for the variant of amac hardware present in the Broadcom
> > >Northstar2 based SoCs.  Northstar2 requires an additional register to be
> > >configured with the port speed/duplexity (NICPM).  This can be added to
> > >the link callback to hide it from the instances that do not use this.
> > >Also, clearing of the pending interrupts on init is required due to
> > >observed issues on some platforms.
> > >
> > >Signed-off-by: Jon Mason 
> > >---
> > > drivers/net/ethernet/broadcom/bgmac-platform.c | 56 
> > > +-
> > > drivers/net/ethernet/broadcom/bgmac.c  |  3 ++
> > > drivers/net/ethernet/broadcom/bgmac.h  |  1 +
> > > 3 files changed, 58 insertions(+), 2 deletions(-)
> > >
> > >diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c 
> > >b/drivers/net/ethernet/broadcom/bgmac-platform.c
> > >index aed5dc5..f6d48c7 100644
> > >--- a/drivers/net/ethernet/broadcom/bgmac-platform.c
> > >+++ b/drivers/net/ethernet/broadcom/bgmac-platform.c
> > >@@ -14,12 +14,21 @@
> > > #define pr_fmt(fmt)   KBUILD_MODNAME ": " fmt
> > >
> > > #include 
> > >+#include 
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > > #include "bgmac.h"
> > >
> > >+#define NICPM_IOMUX_CTRL  0x0008
> > >+
> > >+#define NICPM_IOMUX_CTRL_INIT_VAL 0x3196e000
> > >+#define NICPM_IOMUX_CTRL_SPD_SHIFT10
> > >+#define NICPM_IOMUX_CTRL_SPD_10M  0
> > >+#define NICPM_IOMUX_CTRL_SPD_100M 1
> > >+#define NICPM_IOMUX_CTRL_SPD_1000M2
> > >+
> > > static u32 platform_bgmac_read(struct bgmac *bgmac, u16 offset)
> > > {
> > >   return readl(bgmac->plat.base + offset);
> > >@@ -87,12 +96,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac 
> > >*bgmac, u16 offset,
> > >   WARN_ON(1);
> > > }
> > >
> > >+static void bgmac_nicpm_speed_set(struct net_device *net_dev)
> > >+{
> > >+  struct bgmac *bgmac = netdev_priv(net_dev);
> > >+  u32 val;
> > >+
> > >+  if (!bgmac->plat.nicpm_base)
> > >+  return;
> > >+
> > >+  val = NICPM_IOMUX_CTRL_INIT_VAL;
> > >+  switch (bgmac->net_dev->phydev->speed) {
> > >+  default:
> > >+  pr_err("Unsupported speed.  Defaulting to 1000Mb\n");
> > This should be dev_err
> 
> It should probably be netdev_err (and there are a few instances below
> that should probably be changed to netdev_err as well).

Actually, the other instances I referenced above should not be
netdev_err, as they are encountered before the netdev is created.  So,
dev_err is correct for them.

That being said, the original pr_err that Scott referenced should be
netdev_err (as it is encountered after the netdev is created).  v5 will
make that change.

Thanks,
Jon

> 
> Thanks,
> Jon
> 
> > >+  case SPEED_1000:
> > >+  val |= NICPM_IOMUX_CTRL_SPD_1000M << NICPM_IOMUX_CTRL_SPD_SHIFT;
> > >+  break;
> > >+  case SPEED_100:
> > >+  val |= NICPM_IOMUX_CTRL_SPD_100M << NICPM_IOMUX_CTRL_SPD_SHIFT;
> > >+  break;
> > >+  case SPEED_10:
> > >+  val |= NICPM_IOMUX_CTRL_SPD_10M << NICPM_IOMUX_CTRL_SPD_SHIFT;
> > >+  break;
> > >+  }
> > >+
> > >+  writel(val, bgmac->plat.nicpm_base + NICPM_IOMUX_CTRL);
> > >+
> > >+  bgmac_adjust_link(bgmac->net_dev);
> > >+}
> > >+
> > > static int platform_phy_connect(struct bgmac *bgmac)
> > > {
> > >   struct phy_device *phy_dev;
> > >
> > >-  phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node,
> > >-   bgmac_adjust_link);
> > >+  if (bgmac->plat.nicpm_base)
> > >+  phy_dev = of_phy_get_and_connect(bgmac->net_dev,
> > >+   bgmac->dev->of_node,
> > >+   bgmac_nicpm_speed_set);
> > >+  else
> > >+  phy_dev = of_phy_get_and_connect(bgmac->net_dev,
> > >+   bgmac->dev->of_node,
> > >+   bgmac_adjust_link);
> > >   if (!phy_dev) {
> > >   dev_err(bgmac->dev, "Phy connect failed\n");
> > >   return -ENODEV;
> > >@@ -182,6 +225,14 @@ static int bgmac_probe(struct platform_device *pdev)
> > >   if (IS_ERR(bgmac->plat.idm_base))
> > >   return PTR_ERR(bgmac->plat.idm_base);
> > >
> > >+  regs = platform_get_resource_byname(pdev, IORESOURCE_MEM, "nicpm_base");
> > >+  if (regs) {
> > >+  bgmac->plat.nicpm_base = devm_ioremap_resource(&pdev->dev,
> > >+ regs);
> > >+  if (IS_ERR(bgmac->plat.nicpm_base))
> > >+  return PTR_ERR(bgmac->plat.nicpm_base);
> > >+  }
> > >+
> > >   bgmac->read = platform_bgmac_read;
> > >   bgmac->write = platform_bgmac_write;
> > >   bgmac->idm_read = platform_bgmac_idm_read;
> > >@@ -213,6 +264,7 @@ static 

Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation

2016-11-02 Thread Tom Herbert
On Wed, Nov 2, 2016 at 3:48 AM, Hannes Frederic Sowa
 wrote:
> Hi Tom,
>
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa
>>  wrote:
>> > On 01.11.2016 21:59, Thomas Graf wrote:
>> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa
>> >>  wrote:
>> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>>  If I understand you correctly then a single BPF program would be
>>  loaded which then applies to all dst_output() calls? This has a huge
>>  drawback, instead of multiple small BPF programs which do exactly what
>>  is required per dst, a large BPF program is needed which matches on
>>  metadata. That's way slower and renders one of the biggest advantages
>>  of BPF invalid, the ability to generate a a small program tailored to
>>  a particular use. See Cilium.
>> >>>
>> >>> I thought more of hooks in the actual output/input functions specific to
>> >>> the protocol type (unfortunately again) protected by jump labels? Those
>> >>> hook get part of the dst_entry mapped so they can act on them.
>> >>
>> >> This has no advantage over installing a BPF program at tc egress and
>> >> enabling to store/access metadata per dst. The whole point is to
>> >> execute bpf for a specific route.
>> >
>> > The advantage I saw here was that in your proposal the tc egress path
>> > would have to be chosen by a route. Otherwise I would already have
>> > proposed it. :)
>> >
>> >>> Another idea would be to put the eBPF hooks into the fib rules
>> >>> infrastructure. But I fear this wouldn't get you the hooks you were
>> >>> looking for? There they would only end up in the runtime path if
>> >>> actually activated.
>> >>
>> >> Use of fib rules kills performance so it's not an option. I'm not even
>> >> sure that would be any simpler.
>> >
>> > It very much depends on the number of rules installed. If there are just
>> > several very few rules, it shouldn't hurt performance that much (but
>> > haven't verified).
>> >
>> Hannes,
>>
>> I can say that the primary value we get out of using ILA+LWT is that
>> we can essentially cache a policy decision in connected sockets. That
>> is we are able to create a host route for each destination (thousands
>> of them) that describes how to do the translation for each one. There
>> is no route lookup per packet, and actually no extra lookup otherwise.
>
> Exactly, that is why I do like LWT and the dst_entry socket caching
> shows its benefits here. Also the dst_entries communicate enough vital
> information up the stack so that allocation of sk_buffs is done
> accordingly to the headers that might need to be inserted later on.
>
> (On the other hand, the looked up BPF program can also be cached. This
> becomes more difficult if we can't share the socket structs between
> namespaces though.)
>
>> The translation code doesn't do much at all, basically just copies in
>> new destination to the packet. We need a route lookup for the
>> rewritten destination, but that is easily cached in the LWT structure.
>> The net result is that the transmit path for ILA is _really_ fast. I'm
>> not sure how we can match this same performance tc egress, it seems
>> like we would want to cache the matching rules in the socket to avoid
>> rule lookups.
>
> In case of namespaces, do you allocate the host routes in the parent or
> child (net-)namespaces? Or don't we talk about namespaces right now at all?
>
ILA is namespace aware; everything is set per namespace. I don't see
any issue with setting routes per namespace either, nor any namespace-related
issues with this patch, except maybe that I wouldn't know what
the interaction between BPF maps and namespaces is. Do maps belong to
namespaces?

> Why do we want to do the packet manipulation in tc egress and not using
> LWT + interfaces? The dst_entries should be able to express all possible
> allocation strategies etc. so that we don't need to shift/reallocate
> packets around when inserting an additional header. We can't express
> those semantics with tc egress.
>
I don't think we do want to do this sort of packet manipulation (ILA
in particular) in tc egress, that was my point. It's also not
appropriate for netfilter either I think.

>> On the other hand, I'm not really sure how to implement this level
>> of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it on each
>> host. As you point out, that would create a million different programs, which
>> doesn't seem manageable. I don't think the BPF map works either, since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per-destination
>> information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).
>
> My concern with using BPF is that the rest 

Re: [PATCH net] net: Check for fullsock in sock_i_uid()

2016-11-02 Thread subashab

This would be a bug in the caller.

Can you give us the complete stack trace leading to the problem you
had ?

Thanks !


Thanks, Eric, for the clarification. In that case, the bug is in the
IDLETIMER target in the Android kernel.

https://android.googlesource.com/kernel/common/+/android-4.4/net/netfilter/xt_IDLETIMER.c#356

Here is the call stack.

-003|rwlock_bug(?, ?)
-004|arch_read_lock(inline)
-004|do_raw_read_lock(lock = 0xFFC0354E79C8)
-005|raw_read_lock_bh(lock = 0xFFC0354E79C8)
-006|sock_i_uid(sk = 0xFFC0354E77B0)
-007|from_kuid_munged(inline)
-007|reset_timer(info = 0xFFC04D17D218, skb = 0xFFC018AB98C0)
-008|idletimer_tg_target(skb = 0xFFC018AB98C0, ?)
-009|ipt_do_table(skb = 0xFFC018AB98C0, state = 0xFFC0017E6F30, 
?)
-010|iptable_mangle_hook(?, skb = 0xFFC018AB98C0, state = 
0xFFC0017E6F30)
-011|nf_iterate(head = 0xFFC0019D55B8, skb = 0xFFC018AB98C0, 
state = 0xFFC0017E6F30, elemp =

-012|nf_hook_slow(skb = 0xFFC018AB98C0, state = 0xFFC0017E6F30)
-013|NF_HOOK_COND(inline)
-013|ip_output(net = 0xFFC0019D4B00, sk = 0xFFC0354E77B0, skb = 
0xFFC018AB98C0)
-014|ip_local_out(net = 0xFFC0019D4B00, sk = 0xFFC0354E77B0, skb 
= 0xFFC018AB98C0)
-015|ip_build_and_send_pkt(skb = 0xFFC018AB98C0, sk = 
0xFFC023F2E880, saddr = 1688053952, daddr =
-016|tcp_v4_send_synack(sk = 0xFFC023F2E880, ?, ?, req = 
0xFFC0354E77B0, foc = 0xFFC0017E7110

-017|atomic_sub_return(inline)
-017|reqsk_put(inline)
-017|tcp_conn_request(?, af_ops = 0xFFC001080FC8, sk = 
0xFFC023F2E880, ?)

-018|tcp_v4_conn_request(?, ?)
-019|tcp_rcv_state_process(sk = 0xFFC023F2E880, skb = 
0xFFC018ABAD00)

-020|tcp_v4_do_rcv(sk = 0xFFC023F2E880, skb = 0xFFC018ABAD00)
-021|tcp_v4_rcv(skb = 0xFFC018ABAD00)
-022|ip_local_deliver_finish(net = 0xFFC0019D4B00, ?, skb = 
0xFFC018ABAD00)

-023|NF_HOOK_THRESH(inline)
-023|NF_HOOK(inline)
-023|ip_local_deliver(skb = 0xFFC018ABAD00)
-024|ip_rcv_finish(net = 0xFFC0019D4B00, ?, skb = 
0xFFC018ABAD00)

-025|NF_HOOK_THRESH(inline)
-025|NF_HOOK(inline)
-025|ip_rcv(skb = 0xFFC018ABAD00, dev = 0xFFC023474000, ?, ?)
-026|deliver_skb(inline)
-026|deliver_ptype_list_skb(inline)
-026|__netif_receive_skb_core(skb = 0x0A73, pfmemalloc = FALSE)
-027|__netif_receive_skb(skb = 0xFFC0BA455D40)
-028|netif_receive_skb_internal(skb = 0xFFC0BA455D40)
-029|netif_receive_skb(skb = 0xFFC0BA455D40)

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project
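The caller bug above can be guarded as sketched below. This is a hypothetical Python model, not kernel code: callers such as the IDLETIMER target should only call sock_i_uid() on a *full* socket, because request/timewait mini-sockets (the reqsk on this call stack) do not carry the lock and credential fields sock_i_uid() touches. State names mirror the kernel's; the fallback-to-overflow-UID behavior is an illustrative choice:

```python
# States that mark non-full (mini) sockets in the kernel.
TCP_NEW_SYN_RECV = "NEW_SYN_RECV"
TCP_TIME_WAIT = "TIME_WAIT"
OVERFLOW_UID = 65534  # typical overflowuid; an assumption here

def sk_fullsock(sk):
    # Mirrors the kernel's sk_fullsock() idea: a socket is "full"
    # unless it is a request or timewait mini-socket.
    return sk["state"] not in (TCP_NEW_SYN_RECV, TCP_TIME_WAIT)

def safe_sock_uid(sk):
    # Fall back to the overflow UID instead of touching fields a
    # request_sock does not have (which is what crashes above).
    if sk is None or not sk_fullsock(sk):
        return OVERFLOW_UID
    return sk["uid"]

assert safe_sock_uid({"state": "ESTABLISHED", "uid": 1000}) == 1000
assert safe_sock_uid({"state": TCP_NEW_SYN_RECV, "uid": 0}) == OVERFLOW_UID
assert safe_sock_uid(None) == OVERFLOW_UID
```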


[PATCH v5 1/7] net: phy: broadcom: add bcm54xx_auxctl_read

2016-11-02 Thread Jon Mason
Add a helper function to read the AUXCTL register for the BCM54xx.  This
mirrors the bcm54xx_auxctl_write function already present in the code.

Signed-off-by: Jon Mason 
---
 drivers/net/phy/broadcom.c | 10 ++
 include/linux/brcmphy.h|  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 583ef8a..3a64b3d 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -30,6 +30,16 @@ MODULE_DESCRIPTION("Broadcom PHY driver");
 MODULE_AUTHOR("Maciej W. Rozycki");
 MODULE_LICENSE("GPL");
 
+static int bcm54xx_auxctl_read(struct phy_device *phydev, u16 regnum)
+{
+   /* The register must be written to both the Shadow Register Select and
+* the Shadow Read Register Selector
+*/
+   phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum |
+ regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT);
+   return phy_read(phydev, MII_BCM54XX_AUX_CTL);
+}
+
 static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
 {
return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 60def78..0ed6691 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -110,6 +110,7 @@
 #define MII_BCM54XX_AUXCTL_MISC_FORCE_AMDIX0x0200
 #define MII_BCM54XX_AUXCTL_MISC_RDSEL_MISC 0x7000
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MISC0x0007
+#define MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT  12
 
 #define MII_BCM54XX_AUXCTL_SHDWSEL_MASK0x0007
 
-- 
2.7.4
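The helper's read sequence writes the selector into both the Shadow Register Select field and the shifted Shadow Read Register Selector field, which boils down to one value computation. A quick model, with the shift of 12 taken from the patch itself:

```python
MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT = 12  # from the patch

def auxctl_read_select(regnum):
    # The value written to MII_BCM54XX_AUX_CTL before the read: the
    # selector appears both in the low bits and shifted into the
    # read-selector field.
    return regnum | (regnum << MII_BCM54XX_AUXCTL_SHDWSEL_READ_SHIFT)

# The MISC shadow register (selector 0x0007, per brcmphy.h) selects as:
assert auxctl_read_select(0x0007) == 0x7007
assert auxctl_read_select(0x0000) == 0x0000
```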



net/tcp: null-ptr-deref in __inet_lookup_listener/inet_exact_dif_match

2016-11-02 Thread Andrey Konovalov
Hi,

I've got the following error report while running the syzkaller fuzzer:

general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 8800398c4480 task.stack: 88003b468000
RIP: 0010:[]  [< inline >]
inet_exact_dif_match include/net/tcp.h:808
RIP: 0010:[]  []
__inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
RSP: 0018:88003b46f270  EFLAGS: 00010202
RAX: 0004 RBX: 4242 RCX: 0001
RDX:  RSI: c9e3c000 RDI: 0054
RBP: 88003b46f2d8 R08: 4000 R09: 830910e7
R10:  R11: 000a R12: 867fa0c0
R13: 4242 R14: 0003 R15: dc00
FS:  7fb135881700() GS:88003ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20cc3000 CR3: 6d56a000 CR4: 06f0
Stack:
  0601a8c0  4242
 42423b9083c2 88003def4041 84e7e040 0246
 88003a0911c0  88003a091298 88003b9083ae
Call Trace:
 [] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
 [] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
 [] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:232
 [< inline >] NF_HOOK include/linux/netfilter.h:255
 [] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [< inline >] dst_input include/net/dst.h:507
 [] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:232
 [< inline >] NF_HOOK include/linux/netfilter.h:255
 [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
 [] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
 [] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
 [] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
 [] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [< inline >] new_sync_write fs/read_write.c:499
 [] __vfs_write+0x334/0x570 fs/read_write.c:512
 [] vfs_write+0x17b/0x500 fs/read_write.c:560
 [< inline >] SYSC_write fs/read_write.c:607
 [] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
Code: 00 00 45 85 c9 75 46 e8 e9 65 29 fe 4c 8b 55 a8 49 bf 00 00 00
00 00 fc ff df 49 8d 7a 54 49 89 fb 48 89 f8 49 c1 eb 03 83 e0 07 <43>
0f b6 1c 3b 83 c0 01 38 d8 7c 08 84 db 0f 85 a9 03 00 00 48
RIP  [< inline >] inet_exact_dif_match include/net/tcp.h:808
RIP  [] __inet_lookup_listener+0xb6/0x500
net/ipv4/inet_hashtables.c:219
 RSP 
---[ end trace 351d030d30a11e1a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

On commit 0c183d92b20b5c84ca655b45ef57b3318b83eb9e (Oct 31).

Thanks!


[PATCH v5 3/7] net: phy: broadcom: Add BCM54810 PHY entry

2016-11-02 Thread Jon Mason
The BCM54810 PHY requires some semi-unique configuration, which results
in additional settings beyond the standard config.
Also, some users of the BCM54810 require the PHY lanes to be swapped.
Since there is no way to detect this, add a device tree query to see if
it is applicable.

Inspired-by: Vikas Soni 
Signed-off-by: Jon Mason 
---
 drivers/net/phy/Kconfig|  2 +-
 drivers/net/phy/broadcom.c | 58 +-
 include/linux/brcmphy.h|  9 +++
 3 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 45f68ea..31967ca 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -217,7 +217,7 @@ config BROADCOM_PHY
select BCM_NET_PHYLIB
---help---
  Currently supports the BCM5411, BCM5421, BCM5461, BCM54616S, BCM5464,
- BCM5481 and BCM5482 PHYs.
+ BCM5481, BCM54810 and BCM5482 PHYs.
 
 config CICADA_PHY
tristate "Cicada PHYs"
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 3a64b3d..b1e32e9 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 #include 
-
+#include 
 
 #define BRCM_PHY_MODEL(phydev) \
((phydev)->drv->phy_id & (phydev)->drv->phy_id_mask)
@@ -45,6 +45,34 @@ static int bcm54xx_auxctl_write(struct phy_device *phydev, u16 regnum, u16 val)
return phy_write(phydev, MII_BCM54XX_AUX_CTL, regnum | val);
 }
 
+static int bcm54810_config(struct phy_device *phydev)
+{
+   int rc, val;
+
+   val = bcm_phy_read_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL);
+   val &= ~BCM54810_EXP_BROADREACH_LRE_MISC_CTL_EN;
+   rc = bcm_phy_write_exp(phydev, BCM54810_EXP_BROADREACH_LRE_MISC_CTL,
+  val);
+   if (rc < 0)
+   return rc;
+
+   val = bcm54xx_auxctl_read(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC);
+   val &= ~MII_BCM54XX_AUXCTL_SHDWSEL_MISC_RGMII_SKEW_EN;
+   val |= MII_BCM54XX_AUXCTL_MISC_WREN;
+   rc = bcm54xx_auxctl_write(phydev, MII_BCM54XX_AUXCTL_SHDWSEL_MISC,
+ val);
+   if (rc < 0)
+   return rc;
+
+   val = bcm_phy_read_shadow(phydev, BCM54810_SHD_CLK_CTL);
+   val &= ~BCM54810_SHD_CLK_CTL_GTXCLK_EN;
+   rc = bcm_phy_write_shadow(phydev, BCM54810_SHD_CLK_CTL, val);
+   if (rc < 0)
+   return rc;
+
+   return 0;
+}
+
 /* Needs SMDSP clock enabled via bcm54xx_phydsp_config() */
 static int bcm50610_a0_workaround(struct phy_device *phydev)
 {
@@ -217,6 +245,12 @@ static int bcm54xx_config_init(struct phy_device *phydev)
(phydev->dev_flags & PHY_BRCM_AUTO_PWRDWN_ENABLE))
bcm54xx_adjust_rxrefclk(phydev);
 
+   if (BRCM_PHY_MODEL(phydev) == PHY_ID_BCM54810) {
+   err = bcm54810_config(phydev);
+   if (err)
+   return err;
+   }
+
bcm54xx_phydsp_config(phydev);
 
return 0;
@@ -314,6 +348,7 @@ static int bcm5482_read_status(struct phy_device *phydev)
 
 static int bcm5481_config_aneg(struct phy_device *phydev)
 {
+   struct device_node *np = phydev->mdio.dev.of_node;
int ret;
 
/* Aneg firsly. */
@@ -344,6 +379,14 @@ static int bcm5481_config_aneg(struct phy_device *phydev)
phy_write(phydev, 0x18, reg);
}
 
+   if (of_property_read_bool(np, "enet-phy-lane-swap")) {
+   /* Lane Swap - Undocumented register...magic! */
+   ret = bcm_phy_write_exp(phydev, MII_BCM54XX_EXP_SEL_ER + 0x9,
+   0x11B);
+   if (ret < 0)
+   return ret;
+   }
+
return ret;
 }
 
@@ -578,6 +621,18 @@ static struct phy_driver broadcom_drivers[] = {
.ack_interrupt  = bcm_phy_ack_intr,
.config_intr= bcm_phy_config_intr,
 }, {
+   .phy_id = PHY_ID_BCM54810,
+   .phy_id_mask= 0xfff0,
+   .name   = "Broadcom BCM54810",
+   .features   = PHY_GBIT_FEATURES |
+ SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+   .flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
+   .config_init= bcm54xx_config_init,
+   .config_aneg= bcm5481_config_aneg,
+   .read_status= genphy_read_status,
+   .ack_interrupt  = bcm_phy_ack_intr,
+   .config_intr= bcm_phy_config_intr,
+}, {
.phy_id = PHY_ID_BCM5482,
.phy_id_mask= 0xfff0,
.name   = "Broadcom BCM5482",
@@ -661,6 +716,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] = {
{ PHY_ID_BCM54616S, 0xfff0 },
{ PHY_ID_BCM5464, 0xfff0 },
{ PHY_ID_BCM5481, 0xfff0 },
+   { PHY_ID_BCM54810, 0xfff0 },
{ PHY_ID_BCM5482, 0xfff0 },
{ PHY_ID_BCM50610, 0xfff0 },
{ PHY_ID_BCM50610M, 0xfff0

[PATCH v5 2/7] Documentation: devicetree: add PHY lane swap binding

2016-11-02 Thread Jon Mason
Add the documentation for PHY lane swapping.  This is a boolean entry to
notify the phy device drivers that the TX/RX lanes need to be swapped.

Signed-off-by: Jon Mason 
---
 Documentation/devicetree/bindings/net/phy.txt | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index bc1c3c8..4627da3 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -35,6 +35,10 @@ Optional Properties:
 - broken-turn-around: If set, indicates the PHY device does not correctly
   release the turn around line low at the end of a MDIO transaction.
 
+- enet-phy-lane-swap: If set, indicates the PHY will swap the TX/RX lanes to
+  compensate for the board being designed with the lanes swapped.
+
+
 Example:
 
 ethernet-phy@0 {
-- 
2.7.4



[PATCH v5 6/7] net: ethernet: bgmac: add NS2 support

2016-11-02 Thread Jon Mason
Add support for the variant of amac hardware present in the Broadcom
Northstar2 based SoCs.  Northstar2 requires an additional register (NICPM)
to be configured with the port speed/duplex.  This is handled in the link
callback, hiding it from the instances that do not use it.
Also, clearing of the pending interrupts on init is required due to
observed issues on some platforms.

Signed-off-by: Jon Mason 
---
 drivers/net/ethernet/broadcom/bgmac-platform.c | 56 +-
 drivers/net/ethernet/broadcom/bgmac.c  |  3 ++
 drivers/net/ethernet/broadcom/bgmac.h  |  1 +
 3 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c b/drivers/net/ethernet/broadcom/bgmac-platform.c
index aed5dc5..fce63cf 100644
--- a/drivers/net/ethernet/broadcom/bgmac-platform.c
+++ b/drivers/net/ethernet/broadcom/bgmac-platform.c
@@ -14,12 +14,21 @@
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include "bgmac.h"
 
+#define NICPM_IOMUX_CTRL   0x0008
+
+#define NICPM_IOMUX_CTRL_INIT_VAL  0x3196e000
+#define NICPM_IOMUX_CTRL_SPD_SHIFT 10
+#define NICPM_IOMUX_CTRL_SPD_10M   0
+#define NICPM_IOMUX_CTRL_SPD_100M  1
+#define NICPM_IOMUX_CTRL_SPD_1000M 2
+
 static u32 platform_bgmac_read(struct bgmac *bgmac, u16 offset)
 {
return readl(bgmac->plat.base + offset);
@@ -87,12 +96,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset,
WARN_ON(1);
 }
 
+static void bgmac_nicpm_speed_set(struct net_device *net_dev)
+{
+   struct bgmac *bgmac = netdev_priv(net_dev);
+   u32 val;
+
+   if (!bgmac->plat.nicpm_base)
+   return;
+
+   val = NICPM_IOMUX_CTRL_INIT_VAL;
+   switch (bgmac->net_dev->phydev->speed) {
+   default:
+   netdev_err(net_dev, "Unsupported speed. Defaulting to 1000Mb\n");
+   case SPEED_1000:
+   val |= NICPM_IOMUX_CTRL_SPD_1000M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   case SPEED_100:
+   val |= NICPM_IOMUX_CTRL_SPD_100M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   case SPEED_10:
+   val |= NICPM_IOMUX_CTRL_SPD_10M << NICPM_IOMUX_CTRL_SPD_SHIFT;
+   break;
+   }
+
+   writel(val, bgmac->plat.nicpm_base + NICPM_IOMUX_CTRL);
+
+   bgmac_adjust_link(bgmac->net_dev);
+}
+
 static int platform_phy_connect(struct bgmac *bgmac)
 {
struct phy_device *phy_dev;
 
-   phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node,
-bgmac_adjust_link);
+   if (bgmac->plat.nicpm_base)
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev,
+bgmac->dev->of_node,
+bgmac_nicpm_speed_set);
+   else
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev,
+bgmac->dev->of_node,
+bgmac_adjust_link);
if (!phy_dev) {
dev_err(bgmac->dev, "Phy connect failed\n");
return -ENODEV;
@@ -182,6 +225,14 @@ static int bgmac_probe(struct platform_device *pdev)
if (IS_ERR(bgmac->plat.idm_base))
return PTR_ERR(bgmac->plat.idm_base);
 
+   regs = platform_get_resource_byname(pdev, IORESOURCE_MEM, "nicpm_base");
+   if (regs) {
+   bgmac->plat.nicpm_base = devm_ioremap_resource(&pdev->dev,
+  regs);
+   if (IS_ERR(bgmac->plat.nicpm_base))
+   return PTR_ERR(bgmac->plat.nicpm_base);
+   }
+
bgmac->read = platform_bgmac_read;
bgmac->write = platform_bgmac_write;
bgmac->idm_read = platform_bgmac_idm_read;
@@ -213,6 +264,7 @@ static int bgmac_remove(struct platform_device *pdev)
 static const struct of_device_id bgmac_of_enet_match[] = {
{.compatible = "brcm,amac",},
{.compatible = "brcm,nsp-amac",},
+   {.compatible = "brcm,ns2-amac",},
{},
 };
 
diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 4584958..a805cc8 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1082,6 +1082,9 @@ static void bgmac_enable(struct bgmac *bgmac)
 /* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipinit */
 static void bgmac_chip_init(struct bgmac *bgmac)
 {
+   /* Clear any erroneously pending interrupts */
+   bgmac_write(bgmac, BGMAC_INT_STATUS, ~0);
+
/* 1 interrupt per received frame */
bgmac_write(bgmac, BGMAC_INT_RECV_LAZY, 1 << BGMAC_IRL_FC_SHIFT);
 
diff --git a/drivers/net/ethernet/broadcom/bgmac.h b/drivers/net/ethernet/broadcom/bgmac.h
index ea52ac3..b1820ea 100644
--- 

[PATCH v5 5/7] net: ethernet: bgmac: device tree phy enablement

2016-11-02 Thread Jon Mason
Change the bgmac driver to allow for PHYs defined by the device tree

Signed-off-by: Jon Mason 
---
 drivers/net/ethernet/broadcom/bgmac-bcma.c | 48 
 drivers/net/ethernet/broadcom/bgmac-platform.c | 48 +++-
 drivers/net/ethernet/broadcom/bgmac.c  | 52 ++
 drivers/net/ethernet/broadcom/bgmac.h  |  7 
 4 files changed, 105 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac-bcma.c b/drivers/net/ethernet/broadcom/bgmac-bcma.c
index c16ec3a..3e3efde 100644
--- a/drivers/net/ethernet/broadcom/bgmac-bcma.c
+++ b/drivers/net/ethernet/broadcom/bgmac-bcma.c
@@ -80,6 +80,50 @@ static void bcma_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset, u32 mask,
bcma_maskset32(bgmac->bcma.cmn, offset, mask, set);
 }
 
+static int bcma_phy_connect(struct bgmac *bgmac)
+{
+   struct phy_device *phy_dev;
+   char bus_id[MII_BUS_ID_SIZE + 3];
+
+   /* Connect to the PHY */
+   snprintf(bus_id, sizeof(bus_id), PHY_ID_FMT, bgmac->mii_bus->id,
+bgmac->phyaddr);
+   phy_dev = phy_connect(bgmac->net_dev, bus_id, bgmac_adjust_link,
+ PHY_INTERFACE_MODE_MII);
+   if (IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "PHY connection failed\n");
+   return PTR_ERR(phy_dev);
+   }
+
+   return 0;
+}
+
+static int bcma_phy_direct_connect(struct bgmac *bgmac)
+{
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
+   struct phy_device *phy_dev;
+   int err;
+
+   phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL);
+   if (!phy_dev || IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "Failed to register fixed PHY device\n");
+   return -ENODEV;
+   }
+
+   err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link,
+PHY_INTERFACE_MODE_MII);
+   if (err) {
+   dev_err(bgmac->dev, "Connecting PHY failed\n");
+   return err;
+   }
+
+   return err;
+}
+
 static const struct bcma_device_id bgmac_bcma_tbl[] = {
BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_4706_MAC_GBIT,
  BCMA_ANY_REV, BCMA_ANY_CLASS),
@@ -275,6 +319,10 @@ static int bgmac_probe(struct bcma_device *core)
bgmac->cco_ctl_maskset = bcma_bgmac_cco_ctl_maskset;
bgmac->get_bus_clock = bcma_bgmac_get_bus_clock;
bgmac->cmn_maskset32 = bcma_bgmac_cmn_maskset32;
+   if (bgmac->mii_bus)
+   bgmac->phy_connect = bcma_phy_connect;
+   else
+   bgmac->phy_connect = bcma_phy_direct_connect;
 
err = bgmac_enet_probe(bgmac);
if (err)
diff --git a/drivers/net/ethernet/broadcom/bgmac-platform.c b/drivers/net/ethernet/broadcom/bgmac-platform.c
index be52f27..aed5dc5 100644
--- a/drivers/net/ethernet/broadcom/bgmac-platform.c
+++ b/drivers/net/ethernet/broadcom/bgmac-platform.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "bgmac.h"
 
@@ -86,6 +87,46 @@ static void platform_bgmac_cmn_maskset32(struct bgmac *bgmac, u16 offset,
WARN_ON(1);
 }
 
+static int platform_phy_connect(struct bgmac *bgmac)
+{
+   struct phy_device *phy_dev;
+
+   phy_dev = of_phy_get_and_connect(bgmac->net_dev, bgmac->dev->of_node,
+bgmac_adjust_link);
+   if (!phy_dev) {
+   dev_err(bgmac->dev, "Phy connect failed\n");
+   return -ENODEV;
+   }
+
+   return 0;
+}
+
+static int platform_phy_direct_connect(struct bgmac *bgmac)
+{
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
+   struct phy_device *phy_dev;
+   int err;
+
+   phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL);
+   if (!phy_dev || IS_ERR(phy_dev)) {
+   dev_err(bgmac->dev, "Failed to register fixed PHY device\n");
+   return -ENODEV;
+   }
+
+   err = phy_connect_direct(bgmac->net_dev, phy_dev, bgmac_adjust_link,
+PHY_INTERFACE_MODE_MII);
+   if (err) {
+   dev_err(bgmac->dev, "Connecting PHY failed\n");
+   return err;
+   }
+
+   return err;
+}
+
 static int bgmac_probe(struct platform_device *pdev)
 {
struct device_node *np = pdev->dev.of_node;
@@ -102,7 +143,6 @@ static int bgmac_probe(struct platform_device *pdev)
/* Set the features of the 4707 family */
bgmac->feature_flags |= BGMAC_FEAT_CLKCTLST;
bgmac->feature_flags |= BGMAC_FEAT_NO_RESET;
-   bgmac->feature_flags |= BGMAC_FEAT_FORCE_SPEED_2500;
bgmac->feature_flags |= BGMAC_FEAT_CMDCFG_SR_REV4;
bgmac->feature_flags |= BGMAC_FEAT_TX_MASK_SETUP

[PATCH v5 7/7] arm64: dts: NS2: add AMAC ethernet support

2016-11-02 Thread Jon Mason
Add support for the AMAC ethernet to the Broadcom Northstar2 SoC device
tree

Signed-off-by: Jon Mason 
---
 arch/arm64/boot/dts/broadcom/ns2-svk.dts |  5 +
 arch/arm64/boot/dts/broadcom/ns2.dtsi| 12 
 2 files changed, 17 insertions(+)

diff --git a/arch/arm64/boot/dts/broadcom/ns2-svk.dts b/arch/arm64/boot/dts/broadcom/ns2-svk.dts
index 2d7872a..2e4d90d 100644
--- a/arch/arm64/boot/dts/broadcom/ns2-svk.dts
+++ b/arch/arm64/boot/dts/broadcom/ns2-svk.dts
@@ -56,6 +56,10 @@
};
 };
 
+&enet {
+   status = "ok";
+};
+
 &pci_phy0 {
status = "ok";
 };
@@ -172,6 +176,7 @@
 &mdio_mux_iproc {
mdio@10 {
gphy0: eth-phy@10 {
+   enet-phy-lane-swap;
reg = <0x10>;
};
};
diff --git a/arch/arm64/boot/dts/broadcom/ns2.dtsi b/arch/arm64/boot/dts/broadcom/ns2.dtsi
index d95dc40..773ed59 100644
--- a/arch/arm64/boot/dts/broadcom/ns2.dtsi
+++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi
@@ -191,6 +191,18 @@
 
#include "ns2-clock.dtsi"
 
+   enet: ethernet@6100 {
+   compatible = "brcm,ns2-amac";
+   reg = <0x6100 0x1000>,
+ <0x6109 0x1000>,
+ <0x6103 0x100>;
+   reg-names = "amac_base", "idm_base", "nicpm_base";
+   interrupts = ;
+   phy-handle = <&gphy0>;
+   phy-mode = "rgmii";
+   status = "disabled";
+   };
+
dma0: dma@6136 {
compatible = "arm,pl330", "arm,primecell";
reg = <0x6136 0x1000>;
-- 
2.7.4



[PATCH v5 4/7] Documentation: devicetree: net: add NS2 bindings to amac

2016-11-02 Thread Jon Mason
Clean up the documentation for the bgmac-amac driver, per suggestion by
Rob Herring, and add details for NS2 support.

Signed-off-by: Jon Mason 
---
 Documentation/devicetree/bindings/net/brcm,amac.txt | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/brcm,amac.txt b/Documentation/devicetree/bindings/net/brcm,amac.txt
index ba5ecc1..2fefa1a 100644
--- a/Documentation/devicetree/bindings/net/brcm,amac.txt
+++ b/Documentation/devicetree/bindings/net/brcm,amac.txt
@@ -2,11 +2,17 @@ Broadcom AMAC Ethernet Controller Device Tree Bindings
 -
 
 Required properties:
- - compatible: "brcm,amac" or "brcm,nsp-amac"
- - reg:Address and length of the GMAC registers,
-   Address and length of the GMAC IDM registers
- - reg-names:  Names of the registers.  Must have both "amac_base" and
-   "idm_base"
+ - compatible: "brcm,amac"
+   "brcm,nsp-amac"
+   "brcm,ns2-amac"
 - reg:		Address and length of the register set for the device. It
+   contains the information of registers in the same order as
+   described by reg-names
+ - reg-names:  Names of the registers.
+   "amac_base":Address and length of the GMAC registers
+   "idm_base": Address and length of the GMAC IDM registers
+   "nicpm_base":   Address and length of the NIC Port Manager
+   registers (required for Northstar2)
  - interrupts: Interrupt number
 
 Optional properties:
-- 
2.7.4



[PATCH v5 0/7] add NS2 support to bgmac

2016-11-02 Thread Jon Mason
Changes in v5:
* Change a pr_err to netdev_err (per Scott Branden)
* Reword the lane swap binding documentation (per Andrew Lunn)


Changes in v4:
* Actually send out the lane swap binding doc patch (Per Scott Branden)
* Remove unused #define (Per Andrew Lunn)


Changes in v3:
* Clean-up the bgmac DT binding doc (per Rob Herring)
* Document the lane swap binding and make it generic (Per Andrew Lunn)


Changes in v2:
* Remove the PHY power-on (per Andrew Lunn)
* Misc PHY clean-ups regarding comments and #defines (per Andrew Lunn)
  This results on none of the original PHY code from Vikas being
  present.  So, I'm removing him as an author and giving him
  "Inspired-by" credit.
* Move PHY lane swapping to PHY driver (per Andrew Lunn and Florian
  Fainelli)
* Remove bgmac sleep (per Florian Fainelli)
* Re-add bgmac chip reset (per Florian Fainelli and Ray Jui)
* Rebased on latest net-next
* Added patch for bcm54xx_auxctl_read, which is used in the BCM54810


Add support for the amac found in the Broadcom Northstar2 SoC to the
bgmac driver.  This necessitates adding support in the driver for
connecting to an externally defined PHY (as described in the device tree).
These phy changes are in addition to the changes necessary to get NS2
working.


Jon Mason (7):
  net: phy: broadcom: add bcm54xx_auxctl_read
  Documentation: devicetree: add PHY lane swap binding
  net: phy: broadcom: Add BCM54810 PHY entry
  Documentation: devicetree: net: add NS2 bindings to amac
  net: ethernet: bgmac: device tree phy enablement
  net: ethernet: bgmac: add NS2 support
  arm64: dts: NS2: add AMAC ethernet support

 .../devicetree/bindings/net/brcm,amac.txt  |  16 ++--
 Documentation/devicetree/bindings/net/phy.txt  |   4 +
 arch/arm64/boot/dts/broadcom/ns2-svk.dts   |   5 ++
 arch/arm64/boot/dts/broadcom/ns2.dtsi  |  12 +++
 drivers/net/ethernet/broadcom/bgmac-bcma.c |  48 ++
 drivers/net/ethernet/broadcom/bgmac-platform.c | 100 -
 drivers/net/ethernet/broadcom/bgmac.c  |  55 ++--
 drivers/net/ethernet/broadcom/bgmac.h  |   8 ++
 drivers/net/phy/Kconfig|   2 +-
 drivers/net/phy/broadcom.c |  68 +-
 include/linux/brcmphy.h|  10 +++
 11 files changed, 271 insertions(+), 57 deletions(-)

-- 
2.7.4



[mm PATCH v2 07/26] arch/blackfin: Add option to skip sync on DMA map

2016-11-02 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/blackfin folder.  This change is meant to correct that so that
we get consistent behavior.

Cc: Steven Miao 
Signed-off-by: Alexander Duyck 
---
 arch/blackfin/kernel/dma-mapping.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/blackfin/kernel/dma-mapping.c b/arch/blackfin/kernel/dma-mapping.c
index 53fbbb6..a27a74a 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -118,6 +118,10 @@ static int bfin_dma_map_sg(struct device *dev, struct scatterlist *sg_list,
 
for_each_sg(sg_list, sg, nents, i) {
sg->dma_address = (dma_addr_t) sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync(sg_dma_address(sg), sg_dma_len(sg), direction);
}
 
@@ -143,7 +147,9 @@ static dma_addr_t bfin_dma_map_page(struct device *dev, struct page *page,
 {
dma_addr_t handle = (dma_addr_t)(page_address(page) + offset);
 
-   _dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   _dma_sync(handle, size, dir);
+
return handle;
 }
 



Re: [PATCH v5 2/7] Documentation: devicetree: add PHY lane swap binding

2016-11-02 Thread Florian Fainelli
On 11/02/2016 10:08 AM, Jon Mason wrote:
> Add the documentation for PHY lane swapping.  This is a boolean entry to
> notify the phy device drivers that the TX/RX lanes need to be swapped.
> 
> Signed-off-by: Jon Mason 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v5 1/7] net: phy: broadcom: add bcm54xx_auxctl_read

2016-11-02 Thread Florian Fainelli
On 11/02/2016 10:08 AM, Jon Mason wrote:
> Add a helper function to read the AUXCTL register for the BCM54xx.  This
> mirrors the bcm54xx_auxctl_write function already present in the code.
> 
> Signed-off-by: Jon Mason 

Reviewed-by: Florian Fainelli 
-- 
Florian


[mm PATCH v2 01/26] swiotlb: Drop unused functions swiotlb_map_sg and swiotlb_unmap_sg

2016-11-02 Thread Alexander Duyck
There are no users for swiotlb_map_sg or swiotlb_unmap_sg so we might as
well just drop them.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---

v2: Added swiotlb_unmap_sg to functions dropped.

 include/linux/swiotlb.h |8 
 lib/swiotlb.c   |   16 
 2 files changed, 24 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5f81f8a..f0d2589 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,14 +73,6 @@ extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
   unsigned long attrs);
 
 extern int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents,
-  enum dma_data_direction dir);
-
-extern void
-swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
-enum dma_data_direction dir);
-
-extern int
 swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems,
 enum dma_data_direction dir,
 unsigned long attrs);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0..5005316 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -910,14 +910,6 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 }
 EXPORT_SYMBOL(swiotlb_map_sg_attrs);
 
-int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
-  enum dma_data_direction dir)
-{
-   return swiotlb_map_sg_attrs(hwdev, sgl, nelems, dir, 0);
-}
-EXPORT_SYMBOL(swiotlb_map_sg);
-
 /*
  * Unmap a set of streaming mode DMA translations.  Again, cpu read rules
  * concerning calls here are the same as for swiotlb_unmap_page() above.
@@ -938,14 +930,6 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 }
 EXPORT_SYMBOL(swiotlb_unmap_sg_attrs);
 
-void
-swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
-enum dma_data_direction dir)
-{
-   return swiotlb_unmap_sg_attrs(hwdev, sgl, nelems, dir, 0);
-}
-EXPORT_SYMBOL(swiotlb_unmap_sg);
-
 /*
  * Make physical memory consistent for a set of streaming mode DMA translations
  * after a transfer.



[mm PATCH v2 06/26] arch/avr32: Add option to skip sync on DMA map

2016-11-02 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/avr32 folder.  This change is meant to correct that so that
we get consistent behavior.

Acked-by: Hans-Christian Noren Egtvedt 
Signed-off-by: Alexander Duyck 
---
 arch/avr32/mm/dma-coherent.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/avr32/mm/dma-coherent.c b/arch/avr32/mm/dma-coherent.c
index 58610d0..54534e5 100644
--- a/arch/avr32/mm/dma-coherent.c
+++ b/arch/avr32/mm/dma-coherent.c
@@ -146,7 +146,8 @@ static dma_addr_t avr32_dma_map_page(struct device *dev, struct page *page,
 {
void *cpu_addr = page_address(page) + offset;
 
-   dma_cache_sync(dev, cpu_addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, cpu_addr, size, direction);
return virt_to_bus(cpu_addr);
 }
 
@@ -162,6 +163,10 @@ static int avr32_dma_map_sg(struct device *dev, struct scatterlist *sglist,
 
sg->dma_address = page_to_bus(sg_page(sg)) + sg->offset;
virt = sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_cache_sync(dev, virt, sg->length, direction);
}
 



Re: [PATCH net] ipv4: allow local fragmentation in ip_finish_output_gso()

2016-11-02 Thread Hannes Frederic Sowa
On Wed, Nov 2, 2016, at 17:29, Lance Richardson wrote:
> Some configurations (e.g. geneve interface with default
> MTU of 1500 over an ethernet interface with 1500 MTU) result
> in the transmission of packets that exceed the configured MTU.
> While this should be considered to be a "bad" configuration,
> it is still allowed and should not result in the sending
> of packets that exceed the configured MTU.
> 
> Fix by dropping the assumption in ip_finish_output_gso() that
> locally originated gso packets will never need fragmentation.
> Basic testing using iperf (observing CPU usage and bandwidth)
> has shown no measurable performance impact for traffic not
> requiring fragmentation.
> 
> Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
> Reported-by: Jan Tluka 
> Signed-off-by: Lance Richardson 

Acked-by: Hannes Frederic Sowa 


[mm PATCH v2 17/26] arch/parisc: Add option to skip DMA sync as a part of map and unmap

2016-11-02 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/parisc/kernel/pci-dma.c |   20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 02d9ed0..be55ede 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -459,7 +459,9 @@ static dma_addr_t pa11_dma_map_page(struct device *dev, struct page *page,
void *addr = page_address(page) + offset;
BUG_ON(direction == DMA_NONE);
 
-   flush_kernel_dcache_range((unsigned long) addr, size);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_kernel_dcache_range((unsigned long) addr, size);
+
return virt_to_phys(addr);
 }
 
@@ -469,8 +471,11 @@ static void pa11_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 {
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
/*
 * For PCI_DMA_FROMDEVICE this flush is not necessary for the
@@ -479,7 +484,6 @@ static void pa11_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 */
 
	flush_kernel_dcache_range((unsigned long) phys_to_virt(dma_handle), size);
-   return;
 }
 
 static int pa11_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -496,6 +500,10 @@ static int pa11_dma_map_sg(struct device *dev, struct scatterlist *sglist,
 
sg_dma_address(sg) = (dma_addr_t) virt_to_phys(vaddr);
sg_dma_len(sg) = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
flush_kernel_dcache_range(vaddr, sg->length);
}
return nents;
@@ -510,14 +518,16 @@ static void pa11_dma_unmap_sg(struct device *dev, struct scatterlist *sglist,
 
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
	/* once we do combining we'll need to use phys_to_virt(sg_dma_address(sglist)) */
 
for_each_sg(sglist, sg, nents, i)
flush_kernel_vmap_range(sg_virt(sg), sg->length);
-   return;
 }
 
 static void pa11_dma_sync_single_for_cpu(struct device *dev,



[mm PATCH v2 02/26] swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function

2016-11-02 Thread Alexander Duyck
The mapping function should always return DMA_ERROR_CODE when a mapping has
failed as this is what the DMA API expects when a DMA error has occurred.
The current function for mapping a page in Xen was returning either
DMA_ERROR_CODE or 0 depending on where it failed.

On x86 DMA_ERROR_CODE is 0, but on other architectures such as ARM it is
~0. We need to make sure we return the same error value if either the
mapping failed or the device is not capable of accessing the mapping.

If we are returning DMA_ERROR_CODE as our error value we can drop the
function for checking the error code as the default is to compare the
return value against DMA_ERROR_CODE if no function is defined.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---

v1: Added this patch which was part of an earlier patch.

 arch/arm/xen/mm.c  |1 -
 arch/x86/xen/pci-swiotlb-xen.c |1 -
 drivers/xen/swiotlb-xen.c  |   18 ++
 include/xen/swiotlb-xen.h  |3 ---
 4 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index d062f08..bd62d94 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -186,7 +186,6 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
unsigned int order)
 EXPORT_SYMBOL(xen_dma_ops);
 
 static struct dma_map_ops xen_swiotlb_dma_ops = {
-   .mapping_error = xen_swiotlb_dma_mapping_error,
.alloc = xen_swiotlb_alloc_coherent,
.free = xen_swiotlb_free_coherent,
.sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 0e98e5d..a9fafb5 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -19,7 +19,6 @@
 int xen_swiotlb __read_mostly;
 
 static struct dma_map_ops xen_swiotlb_dma_ops = {
-   .mapping_error = xen_swiotlb_dma_mapping_error,
.alloc = xen_swiotlb_alloc_coherent,
.free = xen_swiotlb_free_coherent,
.sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 87e6035..b8014bf 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -416,11 +416,12 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
/*
 * Ensure that the address returned is DMA'ble
 */
-   if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
-   dev_addr = 0;
-   }
-   return dev_addr;
+   if (dma_capable(dev, dev_addr, size))
+   return dev_addr;
+
+   swiotlb_tbl_unmap_single(dev, map, size, dir);
+
+   return DMA_ERROR_CODE;
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
 
@@ -648,13 +649,6 @@ void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device);
 
-int
-xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
-{
-   return !dma_addr;
-}
-EXPORT_SYMBOL_GPL(xen_swiotlb_dma_mapping_error);
-
 /*
  * Return whether the given device DMA address mask can be supported
  * properly.  For example, if your device can only drive the low 24-bits
diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
index 7c35e27..a0083be 100644
--- a/include/xen/swiotlb-xen.h
+++ b/include/xen/swiotlb-xen.h
@@ -51,9 +51,6 @@ extern void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
   int nelems, enum dma_data_direction dir);
 
 extern int
-xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr);
-
-extern int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask);
 
 extern int



[mm PATCH v2 00/26] Add support for DMA writable pages being writable by the network stack

2016-11-02 Thread Alexander Duyck
The first 22 patches in the set add support for the DMA attribute
DMA_ATTR_SKIP_CPU_SYNC on multiple platforms/architectures.  This is needed
so that we can flag the calls to dma_map/unmap_page so that we do not
invalidate cache lines that do not currently belong to the device.  Instead
we have to take care of this in the driver via a call to
sync_single_range_for_cpu prior to freeing the Rx page.

Patch 23 adds support for dma_map_page_attrs and dma_unmap_page_attrs so
that we can unmap and map a page using the DMA_ATTR_SKIP_CPU_SYNC
attribute.

Patch 24 adds support for freeing a page that has multiple references being
held by a single caller.  This way we can free page fragments that were
allocated by a given driver.

The last 2 patches use these updates in the igb driver, and lay the
groundwork to allow us to reimplement the use of build_skb.

v1: Split out changes DMA_ERROR_CODE fix for swiotlb-xen
Minor fixes based on issues found by kernel build bot
Few minor changes for issues found on code review
Added Acked-by for patches that were acked and not changed

v2: Added a few more Acked-by
Added swiotlb_unmap_sg to functions dropped in patch 1, dropped Acked-by
Submitting patches to mm instead of net-next

---

Alexander Duyck (26):
  swiotlb: Drop unused functions swiotlb_map_sg and swiotlb_unmap_sg
  swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function
  swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
  arch/arc: Add option to skip sync on DMA mapping
  arch/arm: Add option to skip sync on DMA map and unmap
  arch/avr32: Add option to skip sync on DMA map
  arch/blackfin: Add option to skip sync on DMA map
  arch/c6x: Add option to skip sync on DMA map and unmap
  arch/frv: Add option to skip sync on DMA map
  arch/hexagon: Add option to skip DMA sync as a part of mapping
  arch/m68k: Add option to skip DMA sync as a part of mapping
  arch/metag: Add option to skip DMA sync as a part of map and unmap
  arch/microblaze: Add option to skip DMA sync as a part of map and unmap
  arch/mips: Add option to skip DMA sync as a part of map and unmap
  arch/nios2: Add option to skip DMA sync as a part of map and unmap
  arch/openrisc: Add option to skip DMA sync as a part of mapping
  arch/parisc: Add option to skip DMA sync as a part of map and unmap
  arch/powerpc: Add option to skip DMA sync as a part of mapping
  arch/sh: Add option to skip DMA sync as a part of mapping
  arch/sparc: Add option to skip DMA sync as a part of map and unmap
  arch/tile: Add option to skip DMA sync as a part of map and unmap
  arch/xtensa: Add option to skip DMA sync as a part of mapping
  dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs
  mm: Add support for releasing multiple instances of a page
  igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC
  igb: Update code to better handle incrementing page count


 arch/arc/mm/dma.c |5 ++
 arch/arm/common/dmabounce.c   |   16 --
 arch/arm/xen/mm.c |1 
 arch/avr32/mm/dma-coherent.c  |7 ++-
 arch/blackfin/kernel/dma-mapping.c|8 +++
 arch/c6x/kernel/dma.c |   14 -
 arch/frv/mb93090-mb00/pci-dma-nommu.c |   14 -
 arch/frv/mb93090-mb00/pci-dma.c   |9 +++
 arch/hexagon/kernel/dma.c |6 ++
 arch/m68k/kernel/dma.c|8 +++
 arch/metag/kernel/dma.c   |   16 +-
 arch/microblaze/kernel/dma.c  |   10 +++-
 arch/mips/loongson64/common/dma-swiotlb.c |2 -
 arch/mips/mm/dma-default.c|8 ++-
 arch/nios2/mm/dma-mapping.c   |   26 +++---
 arch/openrisc/kernel/dma.c|3 +
 arch/parisc/kernel/pci-dma.c  |   20 ++--
 arch/powerpc/kernel/dma.c |9 +++
 arch/sh/kernel/dma-nommu.c|7 ++-
 arch/sparc/kernel/iommu.c |4 +-
 arch/sparc/kernel/ioport.c|4 +-
 arch/tile/kernel/pci-dma.c|   12 -
 arch/x86/xen/pci-swiotlb-xen.c|1 
 arch/xtensa/kernel/pci-dma.c  |7 ++-
 drivers/net/ethernet/intel/igb/igb.h  |7 ++-
 drivers/net/ethernet/intel/igb/igb_main.c |   77 +++--
 drivers/xen/swiotlb-xen.c |   27 +-
 include/linux/dma-mapping.h   |   20 +---
 include/linux/gfp.h   |2 +
 include/linux/swiotlb.h   |   14 ++---
 include/xen/swiotlb-xen.h |3 -
 lib/swiotlb.c |   64 +++-
 mm/page_alloc.c   |   14 +
 33 files changed, 291 insertions(+), 154 deletions(-)

--


[mm PATCH v2 15/26] arch/nios2: Add option to skip DMA sync as a part of map and unmap

2016-11-02 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Ley Foon Tan 
Signed-off-by: Alexander Duyck 
---
 arch/nios2/mm/dma-mapping.c |   26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
index d800fad..f6a5dcf 100644
--- a/arch/nios2/mm/dma-mapping.c
+++ b/arch/nios2/mm/dma-mapping.c
@@ -98,13 +98,17 @@ static int nios2_dma_map_sg(struct device *dev, struct scatterlist *sg,
int i;
 
for_each_sg(sg, sg, nents, i) {
-   void *addr;
+   void *addr = sg_virt(sg);
 
-   addr = sg_virt(sg);
-   if (addr) {
-   __dma_sync_for_device(addr, sg->length, direction);
-   sg->dma_address = sg_phys(sg);
-   }
+   if (!addr)
+   continue;
+
+   sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
+   __dma_sync_for_device(addr, sg->length, direction);
}
 
return nents;
@@ -117,7 +121,9 @@ static dma_addr_t nios2_dma_map_page(struct device *dev, struct page *page,
 {
void *addr = page_address(page) + offset;
 
-   __dma_sync_for_device(addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_device(addr, size, direction);
+
return page_to_phys(page) + offset;
 }
 
@@ -125,7 +131,8 @@ static void nios2_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
 }
 
 static void nios2_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
@@ -138,6 +145,9 @@ static void nios2_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
if (direction == DMA_TO_DEVICE)
return;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
for_each_sg(sg, sg, nhwentries, i) {
addr = sg_virt(sg);
if (addr)



Re: [PATCH net] net: Check for fullsock in sock_i_uid()

2016-11-02 Thread Eric Dumazet
On Wed, 2016-11-02 at 11:05 -0600, subas...@codeaurora.org wrote:
> > This would be a bug in the caller.
> > 
> > Can you give us the complete stack trace leading to the problem you
> > had ?
> > 
> > Thanks !
> 
> Thanks Eric for the clarification. In that case, the bug is in the 
> IDLETIMER target in Android kernel.
> https://android.googlesource.com/kernel/common/+/android-4.4/net/netfilter/xt_IDLETIMER.c#356
> 
> Here is the call stack.

Sure, please fix Android, and do not add an ugly workaround in the Linux kernel.

Lorenzo, haven't you already fixed all these bugs?

if (skb && skb->sk)
timer->uid = from_kuid_munged(current_user_ns(),
 sock_i_uid(skb_to_full_sk(skb)));

Thanks





[mm PATCH v2 24/26] mm: Add support for releasing multiple instances of a page

2016-11-02 Thread Alexander Duyck
This patch adds a function that allows us to batch free a page that has
multiple references outstanding.  Specifically this function can be used to
drop a page being used in the page frag alloc cache.  With this drivers can
make use of functionality similar to the page frag alloc cache without
having to do any workarounds for the fact that there is no function that
frees multiple references.

Signed-off-by: Alexander Duyck 
---
 include/linux/gfp.h |2 ++
 mm/page_alloc.c |   14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de..4175dca 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,6 +506,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
+extern void __page_frag_drain(struct page *page, unsigned int order,
+ unsigned int count);
 extern void *__alloc_page_frag(struct page_frag_cache *nc,
   unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e0b51..bb6d7bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3938,6 +3938,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
return page;
 }
 
+void __page_frag_drain(struct page *page, unsigned int order,
+  unsigned int count)
+{
+   VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+   if (page_ref_sub_and_test(page, count)) {
+   if (order == 0)
+   free_hot_cold_page(page, false);
+   else
+   __free_pages_ok(page, order);
+   }
+}
+EXPORT_SYMBOL(__page_frag_drain);
+
 void *__alloc_page_frag(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
 {



Re: [PATCH v5 4/7] Documentation: devicetree: net: add NS2 bindings to amac

2016-11-02 Thread Sergei Shtylyov

Hello.

On 11/02/2016 08:08 PM, Jon Mason wrote:


Clean up the documentation of the bgmac-amac driver, per suggestion by
Rob Herring, and add details for NS2 support.

Signed-off-by: Jon Mason 
---
 Documentation/devicetree/bindings/net/brcm,amac.txt | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/brcm,amac.txt b/Documentation/devicetree/bindings/net/brcm,amac.txt
index ba5ecc1..2fefa1a 100644
--- a/Documentation/devicetree/bindings/net/brcm,amac.txt
+++ b/Documentation/devicetree/bindings/net/brcm,amac.txt
@@ -2,11 +2,17 @@ Broadcom AMAC Ethernet Controller Device Tree Bindings
 -

 Required properties:
- - compatible: "brcm,amac" or "brcm,nsp-amac"
- - reg:Address and length of the GMAC registers,
-   Address and length of the GMAC IDM registers
- - reg-names:  Names of the registers.  Must have both "amac_base" and
-   "idm_base"
+ - compatible: "brcm,amac"
+   "brcm,nsp-amac"
+   "brcm,ns2-amac"
+ - reg:Address and length of the register set for the device. It
+   contains the information of registers in the same order as
+   described by reg-names
+ - reg-names:  Names of the registers.
+   "amac_base":  Address and length of the GMAC registers
+   "idm_base":   Address and length of the GMAC IDM registers
+   "nicpm_base": Address and length of the NIC Port Manager
+   registers (required for Northstar2)


  Why this "_base" suffix? It looks redundant...

[...]

MBR, Sergei



Re: [PATCH v5 3/7] net: phy: broadcom: Add BCM54810 PHY entry

2016-11-02 Thread Florian Fainelli
On 11/02/2016 10:08 AM, Jon Mason wrote:
> The BCM54810 PHY requires some semi-unique configuration, which results
> in additional setup beyond the standard config.
> Also, some users of the BCM54810 require the PHY lanes to be swapped.
> Since there is no way to detect this, add a device tree query to see if
> it is applicable.
> 
> Inspired-by: Vikas Soni 
> Signed-off-by: Jon Mason 

Reviewed-by: Florian Fainelli 
-- 
Florian


[mm PATCH v2 26/26] igb: Update code to better handle incrementing page count

2016-11-02 Thread Alexander Duyck
This patch updates the driver code so that we do bulk updates of the page
reference count instead of just incrementing it by one reference at a time.
The advantage to doing this is that we cut down on atomic operations and
this in turn should give us a slight improvement in cycles per packet.  In
addition if we eventually move this over to using build_skb the gains will
be more noticeable.

Acked-by: Jeff Kirsher 
Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb.h  |7 ++-
 drivers/net/ethernet/intel/igb/igb_main.c |   24 +---
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index d11093d..acbc3ab 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -210,7 +210,12 @@ struct igb_tx_buffer {
 struct igb_rx_buffer {
dma_addr_t dma;
struct page *page;
-   unsigned int page_offset;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+   __u32 page_offset;
+#else
+   __u16 page_offset;
+#endif
+   __u16 pagecnt_bias;
 };
 
 struct igb_tx_queue_stats {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c8c458c..5e66cde 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,7 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 PAGE_SIZE,
 DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
-   __free_page(buffer_info->page);
+   __page_frag_drain(buffer_info->page, 0,
+ buffer_info->pagecnt_bias);
 
buffer_info->page = NULL;
}
@@ -6830,13 +6831,15 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
  struct page *page,
  unsigned int truesize)
 {
+   unsigned int pagecnt_bias = rx_buffer->pagecnt_bias--;
+
/* avoid re-using remote pages */
if (unlikely(igb_page_is_reserved(page)))
return false;
 
 #if (PAGE_SIZE < 8192)
/* if we are only owner of page we can reuse it */
-   if (unlikely(page_count(page) != 1))
+   if (unlikely(page_ref_count(page) != pagecnt_bias))
return false;
 
/* flip page offset to other buffer */
@@ -6849,10 +6852,14 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
return false;
 #endif
 
-   /* Even if we own the page, we are not allowed to use atomic_set()
-* This would break get_page_unless_zero() users.
+   /* If we have drained the page fragment pool we need to update
+* the pagecnt_bias and page count so that we fully restock the
+* number of references the driver holds.
 */
-   page_ref_inc(page);
+   if (unlikely(pagecnt_bias == 1)) {
+   page_ref_add(page, USHRT_MAX);
+   rx_buffer->pagecnt_bias = USHRT_MAX;
+   }
 
return true;
 }
@@ -6904,7 +6911,6 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
return true;
 
/* this page cannot be reused so discard it */
-   __free_page(page);
return false;
}
 
@@ -6975,10 +6981,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
-   /* we are not reusing the buffer so unmap it */
+   /* We are not reusing the buffer so unmap it and free
+* any references we are holding to it
+*/
dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 PAGE_SIZE, DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
+   __page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
}
 
/* clear contents of rx_buffer */
@@ -7252,6 +7261,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = 0;
+   bi->pagecnt_bias = 1;
 
return true;
 }



[mm PATCH v2 23/26] dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs

2016-11-02 Thread Alexander Duyck
Add support for mapping and unmapping a page with attributes.  The primary
use for this is currently to allow for us to pass the
DMA_ATTR_SKIP_CPU_SYNC attribute when mapping and unmapping a page.  On
some architectures such as ARM the synchronization has significant overhead
and if we are already taking care of the sync_for_cpu and sync_for_device
from the driver there isn't much need to handle this in the map/unmap calls
as well.

Signed-off-by: Alexander Duyck 
---
 include/linux/dma-mapping.h |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 08528af..10c5a17 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -243,29 +243,33 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
-static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
- size_t offset, size_t size,
- enum dma_data_direction dir)
+static inline dma_addr_t dma_map_page_attrs(struct device *dev,
+   struct page *page,
+   size_t offset, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
dma_addr_t addr;
 
kmemcheck_mark_initialized(page_address(page) + offset, size);
BUG_ON(!valid_dma_direction(dir));
-   addr = ops->map_page(dev, page, offset, size, dir, 0);
+   addr = ops->map_page(dev, page, offset, size, dir, attrs);
debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
return addr;
 }
 
-static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir)
+static inline void dma_unmap_page_attrs(struct device *dev,
+   dma_addr_t addr, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
if (ops->unmap_page)
-   ops->unmap_page(dev, addr, size, dir, 0);
+   ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir, false);
 }
 
@@ -385,6 +389,8 @@ static inline void dma_sync_single_range_for_device(struct device *dev,
 #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
 #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
 #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
+#define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
+#define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
 
 extern int dma_common_mmap(struct device *dev, struct vm_area_struct *vma,
   void *cpu_addr, dma_addr_t dma_addr, size_t size);



[mm PATCH v2 25/26] igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC

2016-11-02 Thread Alexander Duyck
The ARM architecture provides a mechanism for deferring cache line
invalidation in the case of map/unmap.  This patch makes use of this
mechanism to avoid unnecessary synchronization.

A secondary effect of this change is that the portion of the page that has
been synchronized for use by the CPU should be writable and could be passed
up the stack (at least on ARM).

The last bit that occurred to me is that on architectures where the
sync_for_cpu call invalidates cache lines we were prefetching and then
invalidating the first 128 bytes of the packet.  To avoid that I have moved
the sync up to before we perform the prefetch and allocate the skbuff so
that we can actually make use of it.

Acked-by: Jeff Kirsher 
Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb_main.c |   53 ++---
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 4feca69..c8c458c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3947,10 +3947,21 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
if (!buffer_info->page)
continue;
 
-   dma_unmap_page(rx_ring->dev,
-  buffer_info->dma,
-  PAGE_SIZE,
-  DMA_FROM_DEVICE);
+   /* Invalidate cache lines that may have been written to by
+* device so that we avoid corrupting memory.
+*/
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ buffer_info->dma,
+ buffer_info->page_offset,
+ IGB_RX_BUFSZ,
+ DMA_FROM_DEVICE);
+
+   /* free resources associated with mapping */
+   dma_unmap_page_attrs(rx_ring->dev,
+buffer_info->dma,
+PAGE_SIZE,
+DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
__free_page(buffer_info->page);
 
buffer_info->page = NULL;
@@ -6808,12 +6819,6 @@ static void igb_reuse_rx_page(struct igb_ring *rx_ring,
 
/* transfer page from old buffer to new buffer */
*new_buff = *old_buff;
-
-   /* sync the buffer for use by the device */
-   dma_sync_single_range_for_device(rx_ring->dev, old_buff->dma,
-old_buff->page_offset,
-IGB_RX_BUFSZ,
-DMA_FROM_DEVICE);
 }
 
 static inline bool igb_page_is_reserved(struct page *page)
@@ -6934,6 +6939,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
page = rx_buffer->page;
prefetchw(page);
 
+   /* we are reusing so sync this buffer for CPU use */
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ rx_buffer->dma,
+ rx_buffer->page_offset,
+ size,
+ DMA_FROM_DEVICE);
+
if (likely(!skb)) {
void *page_addr = page_address(page) +
  rx_buffer->page_offset;
@@ -6958,21 +6970,15 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
prefetchw(skb->data);
}
 
-   /* we are reusing so sync this buffer for CPU use */
-   dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_buffer->dma,
- rx_buffer->page_offset,
- size,
- DMA_FROM_DEVICE);
-
/* pull page into skb */
if (igb_add_rx_frag(rx_ring, rx_buffer, size, rx_desc, skb)) {
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
/* we are not reusing the buffer so unmap it */
-   dma_unmap_page(rx_ring->dev, rx_buffer->dma,
-  PAGE_SIZE, DMA_FROM_DEVICE);
+   dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
+PAGE_SIZE, DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
}
 
/* clear contents of rx_buffer */
@@ -7230,7 +7236,8 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
}
 
/* map page for use */
-   dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+   dma = dma_map_page_attrs(rx_ring->dev, page, 0, PAGE_SIZE,
+DMA_FROM_DEVICE, DMA_ATTR_SKI

Re: [PATCH v5 6/7] net: ethernet: bgmac: add NS2 support

2016-11-02 Thread Florian Fainelli
On 11/02/2016 10:08 AM, Jon Mason wrote:
> Add support for the variant of amac hardware present in the Broadcom
> Northstar2 based SoCs.  Northstar2 requires an additional register to be
> configured with the port speed/duplexity (NICPM).  This can be added to
> the link callback to hide it from the instances that do not use this.
> Also, clearing of the pending interrupts on init is required due to
> observed issues on some platforms.
> 
> Signed-off-by: Jon Mason 

Reviewed-by: Florian Fainelli 
-- 
Florian

