date:20160411

[PATCH net v3] net: sched: do not requeue a NULL skb

2016-04-11 Thread Lars Persson

A failure in validate_xmit_skb_list() triggered an unconditional call
to dev_requeue_skb with skb=NULL. This slowly grows the queue
discipline's qlen count until all traffic through the queue stops.

We take the optimistic approach and continue running the queue after a
failure since it is unknown if later packets also will fail in the
validate path.

Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock")
Signed-off-by: Lars Persson 
---
v3: After a discussion with Eric and Cong I went back to v1 and added the
likely() for the common path.
---
 net/sched/sch_generic.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f18c350..80742ed 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -159,12 +159,15 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
if (validate)
skb = validate_xmit_skb_list(skb, dev);
 
-   if (skb) {
+   if (likely(skb)) {
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_stopped(txq))
skb = dev_hard_start_xmit(skb, dev, txq, &ret);
 
HARD_TX_UNLOCK(dev, txq);
+   } else {
+   spin_lock(root_lock);
+   return qdisc_qlen(q);
}
spin_lock(root_lock);
 
-- 
2.1.4

[PATCH V3] net: mediatek: update the IRQ part of the binding document

2016-04-11 Thread John Crispin

The current binding document only describes a single interrupt. Update the
document by adding the 2 other interrupts.

The driver currently only uses a single interrupt. The HW is however able
to using IRQ grouping to split TX and RX onto separate GIC irqs.

Signed-off-by: John Crispin 
Cc: devicet...@vger.kernel.org
---
This binding doc was merged in 4.6-rc1 and there are no users yet. The
current driver only uses 1 irq but will work fine with all 3 listed in
the devicetree. This patch should be merged before v4.6 is final such
that listing all 3 irqs becomes part of the ABI. I have already posted
a patch that utilizes all 3 irqs for next-next for v4.7 inclusion.

Changes in V3:
* be verbose about the 3 irqs and their ordering

Changes in V2:
* split this patch out of the series that fixes tx stalls in the driver

 Documentation/devicetree/bindings/net/mediatek-net.txt |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt 
b/Documentation/devicetree/bindings/net/mediatek-net.txt
index 5ca7929..32eaaca 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -9,7 +9,8 @@ have dual GMAC each represented by a child node..
 Required properties:
 - compatible: Should be "mediatek,mt7623-eth"
 - reg: Address and length of the register set for the device
-- interrupts: Should contain the frame engines interrupt
+- interrupts: Should contain the three frame engines interrupts in numeric
+   order. These are fe_int0, fe_int1 and fe_int2.
 - clocks: the clock used by the core
 - clock-names: the names of the clock listed in the clocks property. These are
"ethif", "esw", "gp2", "gp1"
@@ -42,7 +43,9 @@ eth: ethernet@1b10 {
 <ðsys CLK_ETHSYS_GP2>,
 <ðsys CLK_ETHSYS_GP1>;
clock-names = "ethif", "esw", "gp2", "gp1";
-   interrupts = ;
+   interrupts = ;
power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>;
resets = <ðsys MT2701_ETHSYS_ETH_RST>;
reset-names = "eth";
-- 
1.7.10.4

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer


On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck  
wrote:

> Have you taken a look at possibly trying to optimize the DMA pool API
> to work with pages?  It sounds like it is supposed to do something
> similar to what you are wanting to do.

Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
coherent memory (see use of dma_alloc_coherent/dma_free_coherent). 

What we are doing is "streaming" DMA memory, when processing the RX
ring.

(NIC are only using DMA coherent memory for the descriptors, which are
allocated on driver init)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 15:21:26 -0700
Alexei Starovoitov  wrote:

> On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
> >   
[...]
> > > 
> > > If we go down this road how about also attaching some driver opaques
> > > to the page sets?  
> > 
> > That was the ultimate plan... to leave some opaques bytes left in the
> > page struct that drivers could use.
> > 
> > In struct page I would need a pointer back to my page_pool struct and a
> > page flag.  Then, I would need room to store the dma_unmap address.
> > (And then some of the usual fields are still needed, like the refcnt,
> > and reusing some of the list constructs).  And a zero-copy cross-domain
> > id.  
> 
> I don't think we need to add anything to struct page.
> This is supposed to be small cache of dma_mapped pages with lockless access.
> It can be implemented as an array or link list where every element
> is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
> send it to back to page allocator.

It sounds like the Intel drivers recycle facility, where they split the
page into two parts, and keep page in RX-ring, by swapping to other
half of page, if page_count(page) is <= 2.  Thus, they use the atomic
page ref count to synchronize on.

Thus, we end-up having two atomic operations per RX packet, on the page
refcnt.  Where DPDK have zero...

By fully taking over the page as an allocator, almost like slab. I can
optimize the common case (of the packet-page getting allocated and
free'ed on the same CPU), and remove these atomic operations.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Leon Romanovsky

On Tue, Apr 12, 2016 at 08:36:21AM +0300, Or Gerlitz wrote:
> Conflicts happens @ all times, life.
> 

...

> 
> I understand your desire to get it down to zero, but it's not gonna
> work, pick another target.

Maybe you are right and the time will show, but now we (Saeed, Matan and me)
are trying hard to achieve this goal.

> 
> For example, the networking community has a fairly large rc activity
> (I would say 10-20x
> vs rdma), so when Dave does his "merge-rebases" for net-next over net
> and linus tree
> (4-5 times in a release), he has to this way or another solve
> conflicts, yes! ditto for
> Linus during merge windows and to some extent in rc times too.

I don't see any harm in our desire to decrease work overhead from these
busy people.

> 
> > It won't help to anyone to split this commit to more than one patch.
> 
> The commit change-log should make it clear what this is about, and it doesn't.
> If you believe in something, state that clear, be precise.

I agree.

> 
> As Saeed admitted the shared code in the commit spans maybe 2% of it.
> 
> The 1st commit deals with a field which is not used in the driver,
> this is a cleanup
> that you can do in rc (net) patch (remove the field all together) and
> overall, w.o seeing

I don't agree with your point that cleanup should go to RC.

> the down-stream patches that depend on the newly introduced fields,
> how do you know there aren't such (unused) bits in the 2nd commit?

No, I don't know in advance, but the truth is that it doesn't bother
anyone, because we are exposing our internal HW to kernel clients and
doing it with minimal impact on the maintainers.


signature.asc
Description: Digital signature

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Or Gerlitz

On Tue, Apr 12, 2016 at 8:15 AM, Leon Romanovsky  wrote:
> On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote:
>> On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
>>  wrote:
>> > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz  wrote:
>>
>> >> feature --> features
>>
>> > Correct, will fix.
>>
>> >>> * Add vport to steering commands for SRIOV ACL support
>> >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
>> >>> * Add support for FCS, baeacon led and disable_link bits to hca caps
>> >>> * Add CQE period mode bit in  CQ context for CQE based CQ
>> >>>   moderation support
>> >>> * Add umr SQ bit for fragmented memory registration
>> >>> * Add needed bits and caps for Striding RQ support
>>
>> >> AFAIK, all the above are features will go through net-next, what made
>> >> you anticipate conflicts with linux-rdma?
>>
>> > FCS bit is needed also for rdma, so we took the liberty of updating
>> > all the needed HW structs, bits, caps, etc ..
>> > at once for all mlx5 features planned for 4.7 regardless of rdma/net 
>> > conflicts.
>>
>> The cover letter states that this series deals with shared code.
>>
>> I guess you might also could extend it a bit to deal also with code
>> that you suspect could lead to conflicts, but I don't see why it
>> evolved to that extent.
>
> Or,
> All these micro-optimizations on this shared file can potentially lead
> to undesired merge conflicts. Subsystem maintainers and Linus don't need to
> deal with these conflicts at all.

Leon,

Conflicts happens @ all times, life.

We (MLNX) didn't do a good job to minimize them as much as
possible on the 4.5 cycle (understatement) and did vast improvement
in the 4.6 cycle (one or two conflicts AFAIK and communicated  to Linus).

It's correct that Linus got really angry on these two conflicts but I have
communicated to him the fact that we did that improvement and we're talking
on two commits for a fairly large volume of patches, the response was, if you
do things right, we will happily keep working with you, people, go look.

I understand your desire to get it down to zero, but it's not gonna
work, pick another target.

For example, the networking community has a fairly large rc activity
(I would say 10-20x
vs rdma), so when Dave does his "merge-rebases" for net-next over net
and linus tree
(4-5 times in a release), he has to this way or another solve
conflicts, yes! ditto for
Linus during merge windows and to some extent in rc times too.

> It won't help to anyone to split this commit to more than one patch.

The commit change-log should make it clear what this is about, and it doesn't.
If you believe in something, state that clear, be precise.

As Saeed admitted the shared code in the commit spans maybe 2% of it.

The 1st commit deals with a field which is not used in the driver,
this is a cleanup
that you can do in rc (net) patch (remove the field all together) and
overall, w.o seeing
the down-stream patches that depend on the newly introduced fields,
how do you know there aren't such (unused) bits in the 2nd commit?

Or.

[REGRESSION, bisect] pci: cxgb4 probe fails after commit 104daa71b3961434 ("PCI: Determine actual VPD size on first access")

2016-04-11 Thread Hariprasad Shenai

Hi All,


The following patch introduced a regression, causing cxgb4 driver to fail in
PCIe probe.

commit 104daa71b39614343929e1982170d5fcb0569bb5
Author: Hannes Reinecke 
Author: Hannes Reinecke 
Date:   Mon Feb 15 09:42:01 2016 +0100

PCI: Determine actual VPD size on first access

PCI-2.2 VPD entries have a maximum size of 32k, but might actually be
smaller than that.  To figure out the actual size one has to read the VPD
area until the 'end marker' is reached.

Per spec, reading outside of the VPD space is "not allowed."  In practice,
it may cause simple read errors or even crash the card.  To make matters
worse not every PCI card implements this properly, leaving us with no 'end'
marker or even completely invalid data.

Try to determine the size of the VPD data when it's first accessed.  If no
valid data can be read an I/O error will be returned when reading or
writing the sysfs attribute.

As the amount of VPD data is unknown initially the size of the sysfs
attribute will always be set to '0'.

[bhelgaas: changelog, use 0/1 (not false/true) for bitfield, tweak
pci_vpd_pci22_read() error checking]
Tested-by: Shane Seymour 
Tested-by: Babu Moger 
Signed-off-by: Hannes Reinecke 
Signed-off-by: Bjorn Helgaas 
Cc: Alexander Duyck 

The problem is stemming from the fact that the Chelsio adapters actually have
two VPD structures stored in the VPD.  An abbreviated on at Offset 0x0 and the
complete VPD at Offset 0x400.  The abbreviated one only contains the PN, SN and
EC Keywords, while the complete VPD contains those plus various adapter
constants contained in V0, V1, etc.  And it also contains the Base Ethernet MAC
Address in the "NA" Keyword which the cxgb4 driver needs when it can't contact
the adapter firmware.  (We don't have the "NA" Keywork in the VPD Structure at
Offset 0x0 because that's not an allowed VPD Keyword in the PCI-E 3.0
specification.)

  With the new code, the computed size of the VPD is 0x200 and so our efforts
to read the VPD at Offset 0x400 silently fails. We check the result of the
read looking for a signature 0x82 byte but we're checking against random stack
garbage.

  The end result is that the cxgb4 driver now fails the PCI-E Probe.

Thanks,
Hari

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Leon Romanovsky

On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote:
> On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
>  wrote:
> > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz  wrote:
> 
> >> feature --> features
> 
> > Correct, will fix.
> 
> >>> * Add vport to steering commands for SRIOV ACL support
> >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
> >>> * Add support for FCS, baeacon led and disable_link bits to hca caps
> >>> * Add CQE period mode bit in  CQ context for CQE based CQ
> >>>   moderation support
> >>> * Add umr SQ bit for fragmented memory registration
> >>> * Add needed bits and caps for Striding RQ support
> 
> >> AFAIK, all the above are features will go through net-next, what made
> >> you anticipate conflicts with linux-rdma?
> 
> > FCS bit is needed also for rdma, so we took the liberty of updating
> > all the needed HW structs, bits, caps, etc ..
> > at once for all mlx5 features planned for 4.7 regardless of rdma/net 
> > conflicts.
> 
> The cover letter states that this series deals with shared code.
> 
> I guess you might also could extend it a bit to deal also with code
> that you suspect could lead to conflicts, but I don't see why it
> evolved to that extent.

Or,
All these micro-optimizations on this shared file can potentially lead
to undesired merge conflicts. Subsystem maintainers and Linus don't need to
deal with these conflicts at all.

It won't help to anyone to split this commit to more than one patch.


signature.asc
Description: Digital signature

Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats

2016-04-11 Thread roopa

On 4/10/16, 11:28 AM, roopa wrote:
> On 4/10/16, 1:16 AM, Thomas Graf wrote:
[snip]
>>
>> This currently ties everything to a net_device with a selector to
>> include certain bits of that net_device. How about we take it half a
>> step further and allow for non net_device stats such as IP, TCP,
>> routing or ipsec stats to be retrieved as well?
> yes, absolutely. and that is also the goal.
>> A simple array of nested attributes replacing IFLA_STATS_* would
>> allow for that, e.g.
>>
>> 1. {.type = ST_IPSTATS, value = { ...} }
>>
>> 2. {.type = ST_LINK, .value = {
>> {.type = ST_LINK_NAME, .value = "eth0"},
>> {.type = ST_LINK_Q, .value = 10}
>>   }}
>>
>> 3. ...
> One thing though,  Its unclear to me if we absolutely need the additional 
> nest.
> Every stats netlink msg has an ifindex in the header (if_stats_msg) if the 
> scope
> of the stats is a netdev. If the msg does not have an ifindex in the 
> if_stats_msg,
> it represents a global stat. ie Cant a dump, include other stats netlink msgs 
> after
> all the netdev msgs are done when the filter has global stat filters ?.
> same will apply to RTM_GETSTATS (without NLM_F_DUMP).
>
> Since the msg may potentially have more nest levels
> in the IFLA_EXT_STATS categories, just trying to see if i can avoid adding 
> another
> top-level nest. We can sure add it if there is no other way to include global
> stats in the same dump.
>

Just wanted to elaborate on what i was trying to say:

Top level stats attributes can be netdev or global attributes: We can include 
string "LINK" in
the names of all stats belonging to a netdev to make it easier to recognize the 
netdev stats (example):
 IFLA_STATS_LINK64,   (netdev)
 IFLA_STATS_LINK_INET6,(netdev)
 IFLA_STATS_TCP,(non-netdev, global tcp stats)

RTM_GETSTATS (NLM_F_DUMP)  with user given filter_mask {
If filter-mask contains any link stats, start with per netdev stats 
messages:
{ if_stats_msg.ifindex = 1,
  if_stats_msg.filter_mask = mask of included link stats,
   }

{ if_stats_msg.ifindex = 2,
  if_stats_msg.filter_mask = mask of included link stats,
   }

global stats (if user given filter mask contains global filters.):
{ if_stats_msg.ifindex = 0,
  if_stats_msg.filter_mask = mask of included global stats,
   }
}

We will need a field in netlink_callback to indicate global or netdev stats 
when the stats
crosses skb boundaries. A single nlmsg cannot have both netdev and global stats.

Non-dump RTM_GETSTATS examples:

RTM_GETSTATS with valid ifindex and filter_mask {
filter_mask cannot have global stats (return -EINVAL)
{ if_stats_msg.ifindex = ,
  if_stats_msg.filter_mask = mask of included link stats,
   }
}

RTM_GETSTATS with ifindex = 0 and filter_mask {
filter_mask cannot have link stats (return -EINVAL)
{ if_stats_msg.ifindex = 0,
  if_stats_msg.filter_mask = mask of included global link stats,
   }
}


Will this not work ?

Thanks,
Roopa

Re: [PATCH net-next 0/7] DSA refactoring: set 1

2016-04-11 Thread Vivien Didelot

Andrew Lunn  writes:

> There has been a long running effort to refractor DSA probing to make
> the switches true linux devices. Here are a small collection of
> patches moving in this direction. Most have been seen before.
>
> We take a little step forward by passing the dsa device point to the
> driver, thus allowing it to perform resource allocations using the
> normal mechanisms. This device structure will later be replaced by the
> devices own device structure.
>
> Future patches will add a true driver probe function, so we rename the
> current probe function, cleaning up the namespace.
>
> phys_port_mask continually confuses me, thinking it is about PHYs. But
> it is actually about ports to the outside world, user ports. So rename
> it.
>
> Lots more patches yet to follow, this is just doing some ground work.
>
> Andrew Lunn (7):
>   net: dsa: Pass the dsa device to the switch drivers
>   net: dsa: Have the switch driver allocate there own private memory
>   net: dsa: Remove allocation of driver private memory
>   net: dsa: Keep the mii bus and address in the private structure
>   net: dsa: Rename DSA probe function.
>   dsa: Rename phys_port_mask to user_port_mask
>   dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
>
>  drivers/net/dsa/bcm_sf2.c   | 24 +---
>  drivers/net/dsa/mv88e6060.c | 47 +++---
>  drivers/net/dsa/mv88e6060.h | 11 +
>  drivers/net/dsa/mv88e6123.c | 14 +++-
>  drivers/net/dsa/mv88e6131.c | 14 +++-
>  drivers/net/dsa/mv88e6171.c | 14 +++-
>  drivers/net/dsa/mv88e6352.c | 14 +++-
>  drivers/net/dsa/mv88e6xxx.c | 55 
> +++--
>  drivers/net/dsa/mv88e6xxx.h | 17 +++---
>  include/net/dsa.h   | 16 -
>  net/dsa/dsa.c   | 19 +---
>  11 files changed, 166 insertions(+), 79 deletions(-)

Tested-by: Vivien Didelot

Re: [PATCH net-next 7/7] dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()

2016-04-11 Thread Vivien Didelot

Andrew Lunn  writes:

> mv88e6xxx_lookup_name() returns the model name of a switch at a given
> address on an MII bus. Using mii_bus to identify the bus rather than
> the host device is more logical, so change the parameter.
>
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot

Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk

2016-04-11 Thread Yang Yingliang




On 2016/4/11 20:13, Eric Dumazet wrote:

On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:


On 2016/4/8 22:44, Eric Dumazet wrote:

On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:


I expand  tcp_adv_win_scale and tcp_rmem. It has no effect.


Try :

echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

And restart your flows.


cat /proc/sys/net/ipv4/tcp_rmem
10240 2097152 10485760


What about leaving the default values ?

I tried, it did not work.



$ cat /proc/sys/net/ipv4/tcp_rmem
409687380   6291456



echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

It seems has not effect.



I have no idea what you did on the sender side to allow it to send more
than 1.5 MB then.


We are doing performance test. The sender send 256KB per-block with 128
threads to one socket. And the receiver uses 10Gb NIC to handle the
data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.

I added some debug messages and found handling backlog packets in 
__release_sock() cost about 11ms at most. This can cause backlog queue

overflow. The sk_data_ready is re-assigned, it may cost time in our
program. I will check it out.

Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY

2016-04-11 Thread Chen-Yu Tsai

On Tue, Apr 12, 2016 at 3:23 AM, Florian Fainelli  wrote:
> On 04/04/16 09:22, Chen-Yu Tsai wrote:
>> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
>> configured through a memory mapped hardware register.
>>
>> This same register also configures the MAC interface mode and TX clock
>> source. Also covered by the register, but not supported in these bindings,
>> are TX/RX clock delay chains and inverters, and an RMII module.
>>
>> Signed-off-by: Chen-Yu Tsai 
>> ---
>>  .../bindings/net/allwinner,sun8i-h3-ephy.txt   | 44 
>> ++
>>  1 file changed, 44 insertions(+)
>>  create mode 100644 
>> Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>>
>> diff --git 
>> a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt 
>> b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> new file mode 100644
>> index ..146f227e6d88
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> @@ -0,0 +1,44 @@
>> +* Allwinner H3 E(thernet) PHY
>> +
>> +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs,
>> +before it can be configured over the MDIO bus and used, certain hardware
>> +features must be configured, such as the PHY address and LED polarity.
>
> Is the internal PHY address really configurable? Not that there is
> anything wrong with it, this is good.

It is. Things that are configured or provided to a discrete PHY are routed
to registers in the SoC, things such as PHY address, clocks, resets.

>> +The PHY must also be powered on and brought out of reset.
>> +
>> +This is accomplished with regulators and pull-up/downs for external PHYs.
>> +For this internal PHY, a hardware register is programmed.
>> +
>> +The same hardware register also contains clock and interface controls
>> +for the MAC. This is also present in earlier SoCs, and is covered by
>> +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly
>> +different due to the inclusion of what appears to be an RMII-MII
>> +bridge.
>> +
>> +Required properties:
>> +- compatible: should be "allwinner,sun8i-h3-ephy"
>> +- reg: address and length of the register set for the device
>> +- clocks: A phandle to the reference clock for this device
>> +- resets: A phandle to the reset control for this device
>> +
>> +Ethernet PHY related properties:
>> +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to.
>> +If this is not set, the external PHY is used, and
>> +everything else in this section is ignored.
>
> So we are going to put the same value here, and in the actual Ethernet
> PHY device tree node that should be hanging off the EMAC/MDIO bus
> controller, this is confusing and error prone.

Yes, that would be an issue when writing the DTS.
>
>> +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are
>> +  active high.
>> +
>> +Ethernet MAC clock related properties:
>> +- #clock-cells: should be 0
>> +- clock-output-names: "mac_tx"
>> +
>> +Example:
>> +
>> +ethernet-phy@01c00030 {
>> + compatible = "allwinner,sun8i-h3-ephy";
>> + reg = <0x01c00030 0x4>;
>
> Looking at this register space this looks more like an internal PHY SHIM
> that is needed to be configured before the internal PHY can be access,
> not a proper Ethernet PHY per-se, see replies in aptch 2.
>
> Should not this block be a second cell associated with the Ethernet MAC
> block? One or the other are not going to be very useful without
> knowledge of each other.

True. However the lower half of the same register also controls the
MAC interface mode and TX clock source and delays. This we had a clock
driver that was used in conjuction with stmmac on earlier SoCs. I was
hoping to keep that model with the newer EMAC. At the time it was
argued that what seemed like a clock should be handled by a clock
driver, instead of just a "syscon". If this is reaching too far to
handle this new use case, I will happily just provide patches to merge
this into the MAC.

I would like to know how to deal with things like a PHY requiring
some sort of shim driver, be it an internal one, or an external mfd
chip that happens to have an Ethernet PHY included? How do we tie
this into the PHY node under the MDIO bus?


Regards
ChenYu

Re: [PATCH] devlink: add missing install of header

2016-04-11 Thread David Miller

From: Stephen Hemminger 
Date: Mon, 11 Apr 2016 13:39:08 -0700

> The new devlink.h in uapi was not being installed by
> make headers_install
> 
> Signed-off-by: Stephen Hemminger 

Applied, thanks Stephen.

I wish we could just have everything under uapi/linux automatically
get this treatment, but I guess there is a very good reason for the
current behavior.

Re: Configuring ethernet link fails with No such device

2016-04-11 Thread David Miller

From: Stefan Agner 
Date: Mon, 11 Apr 2016 15:46:08 -0700

> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?

Only after it is open.  Drivers almost always have the entire chip in
powerdown state when it is not open, so we wouldn't be able to
properly do link settings even if we wanted to when the device is
closed.

Re: [PATCH net-next] net: mdio: Fix lockdep falls positive splat

2016-04-11 Thread David Miller

From: Andrew Lunn 
Date: Mon, 11 Apr 2016 21:40:05 +0200

> MDIO devices can be stacked upon each other. The current code supports
> two levels, which until recently has been enough for a DSA mdio bus on
> top of another bus. Now we have hardware which has an MDIO mux in the
> middle.
> 
> Define an MDIO MUTEX class with three levels.
> 
> Signed-off-by: Andrew Lunn 

Applied, thanks Andrew.

Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1

2016-04-11 Thread David Miller

From: David Howells 
Date: Mon, 11 Apr 2016 23:01:39 +0100

> David Miller  wrote:
> 
>> Series applied, but I had to fix up some trailing whitespace (reported by
>> GIT) by hand.
> 
> Do you perchance have a git hook script for checking for trailing whitespace?

Just try to apply the series yourself with "git am", that's what I do.

You can do "git apply --check --whitespace=error-all" on individual
patches, but that doesn't work for a series.

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 16:19 -0700, Cong Wang wrote:

> My point is, for example, in OOM case, we don't know processin
> more SKB would make it better or worse. Maybe we really need to
> check the error code to decide to continue to exit?

Really, given this bug has been there for a long time (v3.18 ???), I
doubt it matters.

Nothing can tell that following packets in the qdisc need any
transformation, and memory allocations.

So I would just fix the bug in the simplest way.

__qdisc_run() has all the checks needed to yield when needed 
(if (quota <= 0 || need_resched())) , no need to add more complexity
there.

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Cong Wang

On Mon, Apr 11, 2016 at 11:30 AM, Eric Dumazet  wrote:
> On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote:
>> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:
>>
>> > I am fine with either way as long as the loop stops on failure.
>
>
> Note that skb that could not be validated is already freed.
>
> So I do not see any value from stopping the loop, since
> we need to schedule the queue to avoid tx hang.
>
> Just process following skb if there is one, fact that skb is sent or
> dropped does not matter.

My point is, for example, in OOM case, we don't know processing
more SKB would make it better or worse. Maybe we really need to
check the error code to decide to continue to exit?

Re: Configuring ethernet link fails with No such device

2016-04-11 Thread Florian Fainelli

On 11/04/16 15:46, Stefan Agner wrote:
> Hi All,
> 
> I traced an issue in which we tried configuring duplex mode and speed in
> a systemd-udevd .link file failed with the following error:
> "Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"
> 
> The FEC driver (fec_main.c) does not initialize phy_dev until the device
> has been opened, and therefor the callback fec_enet_(get|set)_settings
> returns -19.
> 
> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?

It seems to me like systemd tries to configure the device while we are
still opening it, which is arguably something that can be done, but you
are in a transient state, the interface is not completely ready yet.

> 
> Or in other words: Is this a Kernel or systemd issue? At what point is
> link configuration "guaranteed" to work?

Past ndo_open() had a chance to return, therefore connect to the PHY
device, is when the configuration/information querying is able to succeed.

> 
> It seems that the FEC driver used to probe the PHY earlier, it has been
> moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
> ifconfig eth0 down hang issue")...

Part of the reason for doing that, is that probing for the PHY during
the driver's probe function could leave you with unused HW that is
powered on for no good reason. Until we get into ndo_open, we have no
clue if the interface is going to be used or not, so it is better to
move part of the HW initialization as late as possible in the process.

> 
> There has been a similar report on the systemd-devel mailing list a
> while ago, probably the same underlying issue:
> https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html
> 
> --
> Stefan
> 

-- 
Florian

Configuring ethernet link fails with No such device

2016-04-11 Thread Stefan Agner

Hi All,

I traced an issue in which we tried configuring duplex mode and speed in
a systemd-udevd .link file failed with the following error:
"Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"

The FEC driver (fec_main.c) does not initialize phy_dev until the device
has been opened, and therefor the callback fec_enet_(get|set)_settings
returns -19.

What is the expectation/definition when link configuration should be
possible? Only after the network device got opened or before?

Or in other words: Is this a Kernel or systemd issue? At what point is
link configuration "guaranteed" to work?

It seems that the FEC driver used to probe the PHY earlier, it has been
moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
ifconfig eth0 down hang issue")...

There has been a similar report on the systemd-devel mailing list a
while ago, probably the same underlying issue:
https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html

--
Stefan

Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-04-11 Thread Rob Herring

On Mon, Apr 11, 2016 at 2:28 PM, Sergei Shtylyov
 wrote:
> On 04/11/2016 10:25 PM, Rob Herring wrote:
>
>>> The PHY  devices sometimes do have their reset signal (maybe even power
>>> supply?) tied to some GPIO and sometimes it also does happen that a boot
>>> loader does not leave it deasserted. So far this issue has been attacked
>>> from (as I believe) a wrong angle: by teaching the MAC driver to
>>> manipulate
>>> the GPIO in question;  that solution, when  applied to the device trees,
>>> led to adding the PHY reset GPIO properties to the MAC device node, with
>>> one exception: Cadence MACB driver which could handle the "reset-gpios"
>>> prop  in a PHY device  subnode.  I believe that the correct approach is
>>> to
>>> teach the 'phylib' to get the MDIO device reset GPIO from the device tree
>>> node corresponding to this device -- which this patch is doing...
>>>
>>> Note that I had to modify the  AT803x PHY driver as it would stop working
>>> otherwise as it made use of the reset GPIO for its own purposes...
>>
>>
>> Lots of double spaces in here. Please fix.
>
>
>Oh, it's you again! :-D

Yep, one of those picky kernel maintainers that like a bad rash just
won't go away. :)

>>> Signed-off-by: Sergei Shtylyov 
>>>
>>> ---
>>>   Documentation/devicetree/bindings/net/phy.txt |2 +
>>>   drivers/net/phy/at803x.c  |   19 ++
>>>   drivers/net/phy/mdio_bus.c|4 +++
>>>   drivers/net/phy/mdio_device.c |   27
>>> +++--
>>>   drivers/net/phy/phy_device.c  |   33
>>> --
>>>   drivers/of/of_mdio.c  |   16 
>>>   include/linux/mdio.h  |3 ++
>>>   include/linux/phy.h   |5 +++
>>>   8 files changed, 89 insertions(+), 20 deletions(-)
>>
>>
>> [...]
>>
>>> Index: net-next/drivers/of/of_mdio.c
>>> ===
>>> --- net-next.orig/drivers/of/of_mdio.c
>>> +++ net-next/drivers/of/of_mdio.c
>>> @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
>>>   static int of_mdiobus_register_phy(struct mii_bus *mdio, struct
>>> device_node *child,
>>>u32 addr)
>>>   {
>>> +   struct gpio_desc *gpiod;
>>> struct phy_device *phy;
>>> bool is_c45;
>>> int rc;
>>> @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
>>> is_c45 = of_device_is_compatible(child,
>>>  "ethernet-phy-ieee802.3-c45");
>>>
>>> +   gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
>>
>>
>> Calling fwnode_* functions in a DT specific file/function? That doesn't
>> make sense.
>
>
>Really?! 8-)
>Where is a DT-only analog I wonder...

Ah, you're right. NM.

Rob

[PATCH v2 net 4/4] ipv6: udp: Do a route lookup and update during release_cb

2016-04-11 Thread Martin KaFai Lau

This patch adds a release_cb for UDPv6.  It does a route lookup
and updates sk->sk_dst_cache if it is needed.  It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.

It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Martin KaFai Lau 
Reported-by: Wei Wang 
Cc: Cong Wang 
Cc: Eric Dumazet 
Cc: Wei Wang 
---
 include/net/ipv6.h  |  1 +
 net/ipv6/datagram.c | 20 
 net/ipv6/udp.c  |  1 +
 3 files changed, 22 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fd02e90..1be050a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr 
*addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 int addr_len);
 int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
+void ip6_datagram_release_cb(struct sock *sk);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 59e01f27..9dd3882 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -119,6 +119,26 @@ out:
return err;
 }
 
+void ip6_datagram_release_cb(struct sock *sk)
+{
+   struct dst_entry *dst;
+
+   if (ipv6_addr_v4mapped(&sk->sk_v6_daddr))
+   return;
+
+   rcu_read_lock();
+   dst = __sk_dst_get(sk);
+   if (!dst || !dst->obsolete ||
+   dst->ops->check(dst, inet6_sk(sk)->dst_cookie)) {
+   rcu_read_unlock();
+   return;
+   }
+   rcu_read_unlock();
+
+   ip6_datagram_dst_update(sk, false);
+}
+EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int 
addr_len)
 {
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8125931..6bc5c66 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1539,6 +1539,7 @@ struct proto udpv6_prot = {
.sendmsg   = udpv6_sendmsg,
.recvmsg   = udpv6_recvmsg,
.backlog_rcv   = __udpv6_queue_rcv_skb,
+   .release_cb= ip6_datagram_release_cb,
.hash  = udp_lib_hash,
.unhash= udp_lib_unhash,
.rehash= udp_v6_rehash,
-- 
2.5.1

Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1

2016-04-11 Thread David Howells

David Howells  wrote:

> > Series applied, but I had to fix up some trailing whitespace (reported by
> > GIT) by hand.
> 
> Do you perchance have a git hook script for checking for trailing whitespace?

No matter - there's already such a script installed by git clone - I just have
to enable it.

David

[PATCH v2 net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

2016-04-11 Thread Martin KaFai Lau

There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.

Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph).  It is because a UDP socket can become
connected after sending out some datagrams in un-connected state.  or
It can be connected multiple times to different destinations.  Hence,
iph may not be related to where sk is currently connected to.

It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect()  (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.

For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.

Test:

Server (Connected UDP Socket):
~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0  proto kernel  metric 256  pref medium
2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium

A simple python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()

Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 
~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0  metric 0
cache  expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~
scapy is used to generate the TOOBIG message.  Here is the scapy script I have
used:

>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', 
>>> type=0x86dd)/IPv6(src='2fac::face', 
>>> dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Martin KaFai Lau 
Reported-by: Wei Wang 
Cc: Cong Wang 
Cc: Eric Dumazet 
Cc: Wei Wang 
---
 include/net/ipv6.h  |  1 +
 net/ipv6/datagram.c | 20 +++-
 net/ipv6/route.c| 12 
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..fd02e90 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int 
optname,
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 int addr_len);
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 669585e..59e01f27 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, 
struct sock *sk)
security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
 }
 
-static int ip6_datagram_dst_update(struct sock *sk)
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
 {
struct ip6_flowlabel *flowlabel = NULL;
struct in6_addr *final_p, final;
@@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk)
goto out;
}
 
-   if (ipv6_addr_any(&np->saddr))
-   np->saddr = fl6.saddr;
+   if (fix_sk_saddr) {
+   if (ipv6_addr_any(&np->saddr))
+   np->saddr = fl6.saddr;
 
-   if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-   sk->sk_v6_rcv_saddr = fl6.saddr;
-   inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-   if (sk->sk_prot->rehash)
-   sk->sk_prot->rehash(sk);
+   if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+   sk->sk_v6_rcv_saddr = fl6.saddr;
+   inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+   if (sk->sk_prot->rehash)
+

[PATCH v2 net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function

2016-04-11 Thread Martin KaFai Lau

Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().

Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
   (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()

This new function will be reused during pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau 
Cc: Cong Wang 
Cc: Eric Dumazet 
Cc: Wei Wang 
---
 net/ipv6/datagram.c | 50 ++
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..f07c1dd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a)
return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0);
 }
 
+static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
+{
+   struct inet_sock *inet = inet_sk(sk);
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   memset(fl6, 0, sizeof(*fl6));
+   fl6->flowi6_proto = sk->sk_protocol;
+   fl6->daddr = sk->sk_v6_daddr;
+   fl6->saddr = np->saddr;
+   fl6->flowi6_oif = sk->sk_bound_dev_if;
+   fl6->flowi6_mark = sk->sk_mark;
+   fl6->fl6_dport = inet->inet_dport;
+   fl6->fl6_sport = inet->inet_sport;
+   fl6->flowlabel = np->flow_label;
+
+   if (!fl6->flowi6_oif)
+   fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
+
+   if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr))
+   fl6->flowi6_oif = np->mcast_oif;
+
+   security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+}
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int 
addr_len)
 {
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
@@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct 
sockaddr *uaddr, int a
struct ipv6_txoptions   *opt;
int addr_type;
int err;
+   __be32  fl6_flowlabel = 0;
 
if (usin->sin6_family == AF_INET) {
if (__ipv6_only_sock(sk))
@@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct 
sockaddr *uaddr, int a
if (usin->sin6_family != AF_INET6)
return -EAFNOSUPPORT;
 
-   memset(&fl6, 0, sizeof(fl6));
if (np->sndflow) {
-   fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK;
-   if (fl6.flowlabel&IPV6_FLOWLABEL_MASK) {
-   flowlabel = fl6_sock_lookup(sk, fl6.flowlabel);
+   fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
+   if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
+   flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
if (!flowlabel)
return -EINVAL;
}
@@ -145,7 +169,7 @@ ipv4_connected:
}
 
sk->sk_v6_daddr = *daddr;
-   np->flow_label = fl6.flowlabel;
+   np->flow_label = fl6_flowlabel;
 
inet->inet_dport = usin->sin6_port;
 
@@ -154,21 +178,7 @@ ipv4_connected:
 *  destination cache for it.
 */
 
-   fl6.flowi6_proto = sk->sk_protocol;
-   fl6.daddr = sk->sk_v6_daddr;
-   fl6.saddr = np->saddr;
-   fl6.flowi6_oif = sk->sk_bound_dev_if;
-   fl6.flowi6_mark = sk->sk_mark;
-   fl6.fl6_dport = inet->inet_dport;
-   fl6.fl6_sport = inet->inet_sport;
-
-   if (!fl6.flowi6_oif)
-   fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
-
-   if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST))
-   fl6.flowi6_oif = np->mcast_oif;
-
-   security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
+   ip6_datagram_flow_key_init(&fl6, sk);
 
rcu_read_lock();
opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
-- 
2.5.1

[PATCH v2 net 0/4] ipv6: datagram: Update dst cache of a connected udp sk during pmtu update

2016-04-11 Thread Martin KaFai Lau

v2:
~ Protect __sk_dst_get() operations with rcu_read_lock in
  release_cb() because another thread may do ip6_dst_store()
  for a udp sk without taking the sk lock (e.g. in sendmsg).
~ Do a ipv6_addr_v4mapped(&sk->sk_v6_daddr) check before
  calling ip6_datagram_dst_update() in patch 3 and 4.  It is
  similar to how __ip6_datagram_connect handles it.
~ One fix in ip6_datagram_dst_update() in patch 2.  It needs
  to check (np->flow_label & IPV6_FLOWLABEL_MASK) before
  doing fl6_sock_lookup.  I was confused with the naming
  of IPV6_FLOWLABEL_MASK and IPV6_FLOWINFO_MASK.
~ Check dst->obsolete just on the safe side, although I think it
  should at least have DST_OBSOLETE_FORCE_CHK by now.
~ Add Fixes tag to patch 3 and 4
~ Add some points from the previous discussion about holding
  sk lock to the commit message in patch 3.

v1:
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

Patch 1 and 2 are the prep work.
Patch 3 and 4 are the fixes.

[PATCH v2 net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function

2016-04-11 Thread Martin KaFai Lau

This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()

It will be reused during the pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau 
Cc: Cong Wang 
Cc: Eric Dumazet 
Cc: Wei Wang 
---
 net/ipv6/datagram.c | 103 +---
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index f07c1dd..669585e 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, 
struct sock *sk)
security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
 }
 
+static int ip6_datagram_dst_update(struct sock *sk)
+{
+   struct ip6_flowlabel *flowlabel = NULL;
+   struct in6_addr *final_p, final;
+   struct ipv6_txoptions *opt;
+   struct dst_entry *dst;
+   struct inet_sock *inet = inet_sk(sk);
+   struct ipv6_pinfo *np = inet6_sk(sk);
+   struct flowi6 fl6;
+   int err = 0;
+
+   if (np->sndflow && (np->flow_label & IPV6_FLOWLABEL_MASK)) {
+   flowlabel = fl6_sock_lookup(sk, np->flow_label);
+   if (!flowlabel)
+   return -EINVAL;
+   }
+   ip6_datagram_flow_key_init(&fl6, sk);
+
+   rcu_read_lock();
+   opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
+   final_p = fl6_update_dst(&fl6, opt, &final);
+   rcu_read_unlock();
+
+   dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
+   if (IS_ERR(dst)) {
+   err = PTR_ERR(dst);
+   goto out;
+   }
+
+   if (ipv6_addr_any(&np->saddr))
+   np->saddr = fl6.saddr;
+
+   if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+   sk->sk_v6_rcv_saddr = fl6.saddr;
+   inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+   if (sk->sk_prot->rehash)
+   sk->sk_prot->rehash(sk);
+   }
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
+ &sk->sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
+ &np->saddr :
+#endif
+ NULL);
+
+out:
+   fl6_sock_release(flowlabel);
+   return err;
+}
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int 
addr_len)
 {
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
struct inet_sock*inet = inet_sk(sk);
struct ipv6_pinfo   *np = inet6_sk(sk);
-   struct in6_addr *daddr, *final_p, final;
-   struct dst_entry*dst;
-   struct flowi6   fl6;
-   struct ip6_flowlabel*flowlabel = NULL;
-   struct ipv6_txoptions   *opt;
+   struct in6_addr *daddr;
int addr_type;
int err;
__be32  fl6_flowlabel = 0;
@@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct 
sockaddr *uaddr, int a
if (usin->sin6_family != AF_INET6)
return -EAFNOSUPPORT;
 
-   if (np->sndflow) {
+   if (np->sndflow)
fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
-   if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
-   flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
-   if (!flowlabel)
-   return -EINVAL;
-   }
-   }
 
addr_type = ipv6_addr_type(&usin->sin6_addr);
 
@@ -178,45 +221,13 @@ ipv4_connected:
 *  destination cache for it.
 */
 
-   ip6_datagram_flow_key_init(&fl6, sk);
-
-   rcu_read_lock();
-   opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
-   final_p = fl6_update_dst(&fl6, opt, &final);
-   rcu_read_unlock();
-
-   dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
-   err = 0;
-   if (IS_ERR(dst)) {
-   err = PTR_ERR(dst);
+   err = ip6_datagram_dst_update(sk);
+   if (err)
goto out;
-   }
-
-   /* source address lookup done in ip6_dst_lookup */
-
-   if (ipv6_addr_any(&np->saddr))
-   np->saddr = fl6.saddr;
-
-   if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-   sk->sk_v6_rcv_saddr = fl6.saddr;
-   inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-   if (sk->sk_prot->rehash)
-   sk->sk_prot->rehash(sk);
-   }
-
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
- &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
- &np->saddr :
-#endif
- NULL);
 
sk->sk_state = TCP_ESTABLISHED;
sk_set_txhash(sk);
 out:
-   fl6_so

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Alexei Starovoitov

On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
> 
> > >> This is also very interesting for storage targets, which face the same
> > >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> > >> which is probably very similar to what NICs want to do.  
> > >
> > > I think a cached allocator for page sets + the scatterlists that
> > > describe these page sets would not only be useful for SCSI target
> > > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > > code reserves space in each scsi_cmnd for a scatterlist of
> > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > > sets less memory would be needed per scsi_cmnd.  
> > 
> > If we go down this road how about also attaching some driver opaques
> > to the page sets?
> 
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
> 
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.

I don't think we need to add anything to struct page.
This is supposed to be small cache of dma_mapped pages with lockless access.
It can be implemented as an array or link list where every element
is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
send it to back to page allocator.

Re: [PATCH iproute2 0/4] vxlan: add VXLAN-GPE support

2016-04-11 Thread Stephen Hemminger

On Thu,  7 Apr 2016 14:36:25 +0200
Jiri Benc  wrote:

> This patchset adds a couple of refinements to the 'external' keyword and
> adds support for configuring VXLAN-GPE interfaces.
> 
> The VXLAN-GPE support was recently merged to net-next.
> 
> The first patch is not intended to be applied directly but I still include
> it in order for this patchset to be compilable and testable by those
> interested.
> 
> Jiri Benc (4):
>   if_link.h: rebase to the kernel
>   vxlan: 'external' implies 'nolearning'
>   ip-link.8: document "external" flag for vxlan
>   vxlan: add support for VXLAN-GPE
> 
>  include/linux/if_link.h |  1 +
>  ip/iplink_vxlan.c   | 14 --
>  man/man8/ip-link.8.in   | 17 +
>  3 files changed, 30 insertions(+), 2 deletions(-)
> 

Applied to net-next

Re: [PATCH] iproute2: tc_bpf.c: fix building with musl libc

2016-04-11 Thread Stephen Hemminger

On Fri,  8 Apr 2016 09:59:33 -0300
Gustavo Zacarias  wrote:

> We need limits.h for PATH_MAX, fixes:
> 
> tc_bpf.c: In function ‘bpf_map_selfcheck_pinned’:
> tc_bpf.c:222:12: error: ‘PATH_MAX’ undeclared (first use in this
> function)
>   char file[PATH_MAX], buff[4096];
> 
> Signed-off-by: Gustavo Zacarias 

Applied

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Alexander Duyck

On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer
 wrote:
>
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:
>
>> >> This is also very interesting for storage targets, which face the same
>> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> >> which is probably very similar to what NICs want to do.
>> >
>> > I think a cached allocator for page sets + the scatterlists that
>> > describe these page sets would not only be useful for SCSI target
>> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
>> > code reserves space in each scsi_cmnd for a scatterlist of
>> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
>> > sets less memory would be needed per scsi_cmnd.
>>
>> If we go down this road how about also attaching some driver opaques
>> to the page sets?
>
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.
>
>
> For my packet-page idea, I would need a packet length and an offset
> where data starts (I can derive the "head-room" for encap from these
> two).

Have you taken a look at possibly trying to optimize the DMA pool API
to work with pages?  It sounds like it is supposed to do something
similar to what you are wanting to do.

- Alex

Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1

2016-04-11 Thread David Howells

David Miller  wrote:

> Series applied, but I had to fix up some trailing whitespace (reported by
> GIT) by hand.

Do you perchance have a git hook script for checking for trailing whitespace?

David

Re: [PATCH iproute2 -master 0/3] Minor tc/bpf updates

2016-04-11 Thread Stephen Hemminger

On Sat,  9 Apr 2016 00:32:02 +0200
Daniel Borkmann  wrote:

> Some minor updates to improve error reporting, add signatures
> and recently introduced map flags attribute. Set is against
> master branch.
> 
> Thanks!
> 
> Daniel Borkmann (3):
>   tc, bpf: add new csum and tunnel signatures
>   tc, bpf: further improve error reporting
>   tc, bpf: add support for map pre/allocation
> 
>  examples/bpf/bpf_cyclic.c   |  9 -
>  examples/bpf/bpf_graft.c|  8 +++-
>  examples/bpf/bpf_prog.c |  2 +
>  examples/bpf/bpf_shared.c   |  8 +++-
>  examples/bpf/bpf_tailcall.c | 29 --
>  include/bpf_api.h   | 52 
>  include/bpf_elf.h   |  1 +
>  tc/tc_bpf.c | 98 
> +++--
>  tc/tc_bpf.h |  4 ++
>  9 files changed, 138 insertions(+), 73 deletions(-)
> 

Applied thanks

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg  wrote:

> >> This is also very interesting for storage targets, which face the same
> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> >> which is probably very similar to what NICs want to do.  
> >
> > I think a cached allocator for page sets + the scatterlists that
> > describe these page sets would not only be useful for SCSI target
> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > code reserves space in each scsi_cmnd for a scatterlist of
> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > sets less memory would be needed per scsi_cmnd.  
> 
> If we go down this road how about also attaching some driver opaques
> to the page sets?

That was the ultimate plan... to leave some opaques bytes left in the
page struct that drivers could use.

In struct page I would need a pointer back to my page_pool struct and a
page flag.  Then, I would need room to store the dma_unmap address.
(And then some of the usual fields are still needed, like the refcnt,
and reusing some of the list constructs).  And a zero-copy cross-domain
id.

For my packet-page idea, I would need a packet length and an offset
where data starts (I can derive the "head-room" for encap from these
two).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Or Gerlitz

On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
 wrote:
> On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz  wrote:

>> feature --> features

> Correct, will fix.

>>> * Add vport to steering commands for SRIOV ACL support
>>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
>>> * Add support for FCS, baeacon led and disable_link bits to hca caps
>>> * Add CQE period mode bit in  CQ context for CQE based CQ
>>>   moderation support
>>> * Add umr SQ bit for fragmented memory registration
>>> * Add needed bits and caps for Striding RQ support

>> AFAIK, all the above are features will go through net-next, what made
>> you anticipate conflicts with linux-rdma?

> FCS bit is needed also for rdma, so we took the liberty of updating
> all the needed HW structs, bits, caps, etc ..
> at once for all mlx5 features planned for 4.7 regardless of rdma/net 
> conflicts.

The cover letter states that this series deals with shared code.

I guess you might also could extend it a bit to deal also with code
that you suspect could lead to conflicts, but I don't see why it
evolved to that extent.

Or.

Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 22:23 +0200, Jesper Dangaard Brouer wrote:

> If we have a page-pool recycle facility, then we could use the trick,
> right? (As we know that get_page_unless_zero() cannot happen for pages
> in the pool).

Well, if you disable everything that possibly use
get_page_unless_zero(), I guess this could work.

But then, you'll have to spy lkml traffic forever to make sure no new
feature is added in the kernel, using this get_page_unless_zero() in a
new clever way.

You could use a page flag so that z BUG() triggers if
get_page_unless_zero() is attempted on one of your precious pages ;)\

We had very subtle issues before my fixes (check
35b7a1915aa33da812074744647db0d9262a555c and children), so I would not
waste time on the lock prefix avoidance at this point.

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Saeed Mahameed

On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz  wrote:
>
> feature --> features

Correct, will fix.

>
>> * Add vport to steering commands for SRIOV ACL support
>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
>> * Add support for FCS, baeacon led and disable_link bits to
>>   hca caps
>> * Add CQE period mode bit in  CQ context for CQE based CQ
>>   moderation support
>> * Add umr SQ bit for fragmented memory registration
>> * Add needed bits and caps for Striding RQ support
>
> AFAIK, all the above are features will go through net-next, what made
> you anticipate conflicts with linux-rdma?

FCS bit is needed also for rdma, so we took the liberty of updating
all the needed HW structs, bits, caps, etc ..
at once for all mlx5 features planned for 4.7 regardless of rdma/net conflicts.

Re: Getting at the UDP headers from ->data_ready()

2016-04-11 Thread David Howells

Willem de Bruijn  wrote:

> The network and transport header pointers are still valid. Commit e6afc8ace6dd
> only  changes where skb->data points to. It does not discard the data between
> skb->head and skb->data. This rxrpc follow-up patch fixes some offset
> arithmetic to the payload, which is computed relative to skb->data.
> 
> There are other uses of ip_hdr and udp_hdr in udp_recvmsg and similar
> recvmsg handlers for other protocols. For instance, in the source address
> processing for recvfrom ("if (sin) { .. }")

Excellent, thanks!

David

Re: [PATCHv2 net-next 4/6] sctp: add the sctp_diag.c file

2016-04-11 Thread Marcelo Ricardo Leitner

On Fri, Apr 08, 2016 at 10:51:56PM -0700, Eric Dumazet wrote:
> On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote:
> > This one will implement all the interface of inet_diag, inet_diag_handler.
> > which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.
> 
> 
> > +static int inet_assoc_diag_fill(struct sock *sk,
> > +   struct sctp_association *asoc,
> > +   struct sk_buff *skb,
> > +   const struct inet_diag_req_v2 *req,
> > +   struct user_namespace *user_ns,
> > +   int portid, u32 seq, u16 nlmsg_flags,
> > +   const struct nlmsghdr *unlh)
> > +{
> > +   const struct inet_sock *inet = inet_sk(sk);
> > +   const struct inet_diag_handler *handler;
> > +   int ext = req->idiag_ext;
> > +   struct inet_diag_msg *r;
> > +   struct nlmsghdr  *nlh;
> > +   struct nlattr *attr;
> > +   void *info = NULL;
> > +   union sctp_addr laddr, paddr;
> > +   struct dst_entry *dst;
> > +   struct sctp_infox infox;
> > +
> > +   handler = inet_diag_get_handler(req->sdiag_protocol);
> > +   BUG_ON(!handler);
> > +
> > +   nlh = nlmsg_put(skb, portid, seq, unlh->nlmsg_type, sizeof(*r),
> > +   nlmsg_flags);
> > +   if (!nlh)
> > +   return -EMSGSIZE;
> > +
> > +   r = nlmsg_data(nlh);
> > +   BUG_ON(!sk_fullsock(sk));
> > +
> > +   laddr = list_entry(asoc->base.bind_addr.address_list.next,
> > +  struct sctp_sockaddr_entry, list)->a;
> > +   paddr = asoc->peer.primary_path->ipaddr;
> > +   dst = asoc->peer.primary_path->dst;
> > +
> > +   r->idiag_family = sk->sk_family;
> > +   r->id.idiag_sport = htons(asoc->base.bind_addr.port);
> > +   r->id.idiag_dport = htons(asoc->peer.port);
> > +   r->id.idiag_if = dst ? dst->dev->ifindex : 0;
> > +   sock_diag_save_cookie(sk, r->id.idiag_cookie);
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   if (sk->sk_family == AF_INET6) {
> > +   *(struct in6_addr *)r->id.idiag_src = laddr.v6.sin6_addr;
> > +   *(struct in6_addr *)r->id.idiag_dst = paddr.v6.sin6_addr;
> > +   } else
> > +#endif
> > +   {
> > +   memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
> > +   memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
> > +
> > +   r->id.idiag_src[0] = laddr.v4.sin_addr.s_addr;
> > +   r->id.idiag_dst[0] = paddr.v4.sin_addr.s_addr;
> > +   }
> > +
> > +   r->idiag_state = asoc->state;
> > +   r->idiag_timer = SCTP_EVENT_TIMEOUT_T3_RTX;
> > +   r->idiag_retrans = asoc->rtx_data_chunks;
> > +#define EXPIRES_IN_MS(tmo)  DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
> > +   r->idiag_expires =
> > +   EXPIRES_IN_MS(asoc->timeouts[SCTP_EVENT_TIMEOUT_T3_RTX]);
> > +#undef EXPIRES_IN_MS
> > +

vvv
> > +   if (nla_put_u8(skb, INET_DIAG_SHUTDOWN, sk->sk_shutdown))
> > +   goto errout;
> > +
> > +   /* IPv6 dual-stack sockets use inet->tos for IPv4 connections,
> > +* hence this needs to be included regardless of socket family.
> > +*/
> > +   if (ext & (1 << (INET_DIAG_TOS - 1)))
> > +   if (nla_put_u8(skb, INET_DIAG_TOS, inet->tos) < 0)
> > +   goto errout;
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +   if (r->idiag_family == AF_INET6) {
> > +   if (ext & (1 << (INET_DIAG_TCLASS - 1)))
> > +   if (nla_put_u8(skb, INET_DIAG_TCLASS,
> > +  inet6_sk(sk)->tclass) < 0)
> > +   goto errout;
> > +
> > +   if (((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) &&
> > +   nla_put_u8(skb, INET_DIAG_SKV6ONLY, ipv6_only_sock(sk)))
> > +   goto errout;
> > +   }
> > +#endif
> > +
> > +   r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
> > +   r->idiag_inode = sock_i_ino(sk);
> > +
> > +   if (ext & (1 << (INET_DIAG_MEMINFO - 1))) {
> > +   struct inet_diag_meminfo minfo = {
> > +   .idiag_rmem = sk_rmem_alloc_get(sk),
> > +   .idiag_wmem = sk->sk_wmem_queued,
> > +   .idiag_fmem = sk->sk_forward_alloc,
> > +   .idiag_tmem = sk_wmem_alloc_get(sk),
> > +   };
> > +
> 
> All this code looks familiar.
> 
> Why inet_sk_diag_fill() is not used instead ?

Actually we have an issue here. idiag_tmem is using sk_wmem_alloc_get()   
directly, but it shouldn't. SCTP supports using sndbuf per socket or  
per assoc on that socket. We need an if and report the correct value, 
as it's done in sctp_wspace().

For inet_ep_diag_fill, it's reporting an endpoint, so this is not needed
as we don't have this option in there.

> > +   if (nla_put(skb, INET_DIAG_MEMINFO, sizeof(minfo), &minfo) < 0)
> > +   goto errout;
> > +   }
> > +
> > +   if (ext & (1 << (INET_DIAG_SKMEMINFO - 1)))
> > +   if (sock_diag_put_meminfo(sk

Re: Getting at the UDP headers from ->data_ready()

2016-04-11 Thread Willem de Bruijn

On Mon, Apr 11, 2016 at 5:04 PM, David Howells  wrote:
> Hi Willem,
>
> With regards to:
>
> commit 4d0fc73ebe94ac984a187f21fbf4f3a1ac846f5a
> Author: Willem de Bruijn 
> Date:   Thu Apr 7 11:44:59 2016 -0400
>
> rxrpc: do not pull udp headers on receive
>
> Commit e6afc8ace6dd modified the udp receive path by pulling the udp
> header before queuing an skbuff onto the receive queue.
>
> Does that mean that I can no longer access the UDP header via udp_hdr(skb)
> from with the ->data_ready() handler?
>
> I'm guessing that's not actually the case since ip_hdr(skb) seems to work - or
> is that something I shouldn't be using since the part of the buffer containing
> the IP header might've been discarded?

The network and transport header pointers are still valid. Commit e6afc8ace6dd
only  changes where skb->data points to. It does not discard the data between
skb->head and skb->data. This rxrpc follow-up patch fixes some offset
arithmetic to the payload, which is computed relative to skb->data.

There are other uses of ip_hdr and udp_hdr in udp_recvmsg and similar
recvmsg handlers for other protocols. For instance, in the source address
processing for recvfrom ("if (sin) { .. }")

Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Or Gerlitz

On Mon, Apr 11, 2016 at 11:10 PM, Saeed Mahameed  wrote:
> Adding the needed mlx5_ifc hardware bits and structs
> for the following feature:

feature --> features

> * Add vport to steering commands for SRIOV ACL support
> * Add mlcr, pcmr and mcia registers for dump module EEPROM
> * Add support for FCS, baeacon led and disable_link bits to
>   hca caps
> * Add CQE period mode bit in  CQ context for CQE based CQ
>   moderation support
> * Add umr SQ bit for fragmented memory registration
> * Add needed bits and caps for Striding RQ support

AFAIK, all the above are features will go through net-next, what made
you anticipate conflicts with linux-rdma?

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 21:47 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 11 Apr 2016 09:53:54 -0700
> Eric Dumazet  wrote:
> 
> > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > 
> > > Drivers also do tricks where they fallback to smaller order pages. E.g.
> > > lookup function mlx4_alloc_pages().  I've tried to simulate that
> > > function here:
> > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> > >   
> > 
> > We use order-0 pages on mlx4 at Google, as order-3 pages are very
> > dangerous for some kind of attacks...
> 
> Interesting!
> 
> > An out of order TCP packet can hold an order-3 pages, while claiming to
> > use 1.5 KBvia skb->truesize.
> > 
> > order-0 only pages allow the page recycle trick used by Intel driver,
> > and we hardly see any page allocations in typical workloads.
> 
> Yes, I looked at the Intel ixgbe drivers page recycle trick. 
> 
> It is actually quite cool, but code wise it is a little hard to
> follow.  I started to look at the variant in i40e, specifically
> function i40e_clean_rx_irq_ps() explains it a bit more explicit.
>  
> 
> > While order-3 pages are 'nice' for friendly datacenter kind of
> > traffic, they also are a higher risk on hosts connected to the wild
> > Internet.
> > 
> > Maybe I should upstream this patch ;)
> 
> Definitely!
> 
> Does this patch also include a page recycle trick?  Else how do you get
> around the cost of allocating a single order-0 page?
> 

Yes, we use the page recycle trick.

Obviously not on powerpc (or any arch with PAGE_SIZE >= 8192), but
definitely on x86.

Getting at the UDP headers from ->data_ready()

2016-04-11 Thread David Howells

Hi Willem,

With regards to:

commit 4d0fc73ebe94ac984a187f21fbf4f3a1ac846f5a
Author: Willem de Bruijn 
Date:   Thu Apr 7 11:44:59 2016 -0400

rxrpc: do not pull udp headers on receive

Commit e6afc8ace6dd modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
 
Does that mean that I can no longer access the UDP header via udp_hdr(skb)
from with the ->data_ready() handler?

I'm guessing that's not actually the case since ip_hdr(skb) seems to work - or
is that something I shouldn't be using since the part of the buffer containing
the IP header might've been discarded?

David

[PATCH] devlink: add missing install of header

2016-04-11 Thread Stephen Hemminger

The new devlink.h in uapi was not being installed by
make headers_install

Signed-off-by: Stephen Hemminger 
---
 include/uapi/linux/Kbuild | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index b71fd0b..813ffb2e 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -96,6 +96,7 @@ header-y += cyclades.h
 header-y += cycx_cfm.h
 header-y += dcbnl.h
 header-y += dccp.h
+header-y += devlink.h
 header-y += dlmconstants.h
 header-y += dlm_device.h
 header-y += dlm.h
-- 
2.1.4

Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1

2016-04-11 Thread David Howells

David Miller  wrote:

> Series applied, but I had to fix up some trailing whitespace (reported by
> GIT) by hand.

Okay, thanks.

David

Re: [PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask

2016-04-11 Thread Andrew Lunn

On Mon, Apr 11, 2016 at 01:03:52PM -0700, Florian Fainelli wrote:
> On 11/04/16 12:50, Andrew Lunn wrote:
> > The phys in phys_port_mask suggests this mask is about PHYs. In fact,
> > it means physical ports. Rename to user_port_mask, indicating user
> > ports of the switch, which is hopefully less confusing.
> 
> Even though the change looks fine in principle, I am more worried about
> the difficulty for people to backport fixes because of the renaming
> happening here. How about "enabled_ports_mask" as a name?

I'm fine with that. Anything, so long as it does not contain phys.

> Did not
> Guenter had a helper function at some point which tested for (1 << port
> & ds->phys_port_mask)?

Maybe you are thinking of:

static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p)
{
return ds->phys_port_mask & (1 << p) && ds->ports[p];
}
 
So how about initialized_port_mask, although it is a bit long.

   Andrew

Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer


On Sat, 09 Apr 2016 05:34:38 -0700 Eric Dumazet  wrote:

> On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote:
> 
> > Above code is okay.  But do you think we also can get away with the same
> > trick we do with the SKB refcnf?  Where we avoid an atomic operation if
> > refcnt==1.
> > 
> > void kfree_skb(struct sk_buff *skb)
> > {
> > if (unlikely(!skb))
> > return;
> > if (likely(atomic_read(&skb->users) == 1))
> > smp_rmb();
> > else if (likely(!atomic_dec_and_test(&skb->users)))
> > return;
> > trace_kfree_skb(skb, __builtin_return_address(0));
> > __kfree_skb(skb);
> > }
> > EXPORT_SYMBOL(kfree_skb);  
> 
> No we can not use this trick this for pages :
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f
> 

If we have a page-pool recycle facility, then we could use the trick,
right? (As we know that get_page_unless_zero() cannot happen for pages
in the pool).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features

2016-04-11 Thread Saeed Mahameed

Adding the needed mlx5_ifc hardware bits and structs
for the following feature:

* Add vport to steering commands for SRIOV ACL support
* Add mlcr, pcmr and mcia registers for dump module EEPROM
* Add support for FCS, baeacon led and disable_link bits to
  hca caps
* Add CQE period mode bit in  CQ context for CQE based CQ
  moderation support
* Add umr SQ bit for fragmented memory registration
* Add needed bits and caps for Striding RQ support

Signed-off-by: Saeed Mahameed 
Signed-off-by: Matan Barak 
---
 include/linux/mlx5/mlx5_ifc.h |  146 ++--
 1 files changed, 124 insertions(+), 22 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index c300e74..4ce4ea4 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -513,7 +513,9 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits {
u8 max_lso_cap[0x5];
u8 reserved_at_10[0x4];
u8 rss_ind_tbl_cap[0x4];
-   u8 reserved_at_18[0x3];
+   u8 reg_umr_sq[0x1];
+   u8 scatter_fcs[0x1];
+   u8 reserved_at_1a[0x1];
u8 tunnel_lso_const_out_ip_id[0x1];
u8 reserved_at_1c[0x2];
u8 tunnel_statless_gre[0x1];
@@ -648,7 +650,7 @@ struct mlx5_ifc_vector_calc_cap_bits {
 enum {
MLX5_WQ_TYPE_LINKED_LIST  = 0x0,
MLX5_WQ_TYPE_CYCLIC   = 0x1,
-   MLX5_WQ_TYPE_STRQ = 0x2,
+   MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ = 0x2,
 };
 
 enum {
@@ -753,7 +755,11 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 early_vf_enable[0x1];
u8 reserved_at_1a9[0x2];
u8 local_ca_ack_delay[0x5];
-   u8 reserved_at_1af[0x6];
+   u8 reserved_at_1af[0x2];
+   u8 ports_check[0x1];
+   u8 reserved_at_1b2[0x1];
+   u8 disable_link_up[0x1];
+   u8 beacon_led[0x1];
u8 port_type[0x2];
u8 num_ports[0x8];
 
@@ -778,7 +784,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 cqe_version[0x4];
 
u8 compact_address_vector[0x1];
-   u8 reserved_at_200[0x3];
+   u8 striding_rq[0x1];
+   u8 reserved_at_201[0x2];
u8 ipoib_basic_offloads[0x1];
u8 reserved_at_205[0xa];
u8 drain_sigerr[0x1];
@@ -807,12 +814,12 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 block_lb_mc[0x1];
u8 reserved_at_229[0x1];
u8 scqe_break_moderation[0x1];
-   u8 reserved_at_22a[0x1];
+   u8 cq_period_start_from_cqe[0x1];
u8 cd[0x1];
u8 reserved_at_22d[0x1];
u8 apm[0x1];
u8 vector_calc[0x1];
-   u8 reserved_at_22f[0x1];
+   u8 umr_ptr_rlky[0x1];
u8 imaicl[0x1];
u8 reserved_at_232[0x4];
u8 qkv[0x1];
@@ -913,10 +920,10 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 reserved_at_500[0x80];
 
u8 reserved_at_580[0x3f];
-   u8 cqe_zip[0x1];
+   u8 cqe_compression[0x1];
 
-   u8 cqe_zip_timeout[0x10];
-   u8 cqe_zip_max_num[0x10];
+   u8 cqe_compression_timeout[0x10];
+   u8 cqe_compression_max_num[0x10];
 
u8 reserved_at_5e0[0x220];
 };
@@ -1000,7 +1007,13 @@ struct mlx5_ifc_wq_bits {
u8 reserved_at_118[0x3];
u8 log_wq_sz[0x5];
 
-   u8 reserved_at_120[0x4e0];
+   u8 reserved_at_120[0x15];
+   u8 log_wqe_num_of_strides[0x3];
+   u8 two_byte_shift_en[0x1];
+   u8 reserved_at_139[0x4];
+   u8 log_wqe_stride_size[0x3];
+
+   u8 reserved_at_140[0x4c0];
 
struct mlx5_ifc_cmd_pas_bits pas[0];
 };
@@ -2199,7 +2212,8 @@ struct mlx5_ifc_sqc_bits {
u8 flush_in_error_en[0x1];
u8 reserved_at_4[0x4];
u8 state[0x4];
-   u8 reserved_at_c[0x14];
+   u8 reg_umr[0x1];
+   u8 reserved_at_d[0x13];
 
u8 reserved_at_20[0x8];
u8 user_index[0x18];
@@ -2247,7 +2261,8 @@ enum {
 
 struct mlx5_ifc_rqc_bits {
u8 rlky[0x1];
-   u8 reserved_at_1[0x2];
+   u8 reserved_at_1[0x1];
+   u8 scatter_fcs[0x1];
u8 vsd[0x1];
u8 mem_rq_type[0x4];
u8 state[0x4];
@@ -2604,6 +2619,11 @@ enum {
MLX5_CQC_ST_FIRED = 0xa,
 };
 
+enum {
+   MLX5_CQ_PERIOD_MODE_START_FROM_EQE = 0x0,
+   MLX5_CQ_PERIOD_MODE_START_FROM_CQE = 0x1,
+};
+
 struct mlx5_ifc_cqc_bits {
u8 status[0x4];
u8 reserved_at_4[0x4];
@@ -2612,8 +2632,8 @@ struct mlx5_ifc_cqc_bits {
u8

[PATCH for-next 1/2] net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets

2016-04-11 Thread Saeed Mahameed

From: Tariq Toukan 

All reserved fields after early_vf_enable are off by 1, since
early_vf_enable was not explicitly declared as array of size 1.

Reserved field before cqe_zip had a wrong size, it should
be 0x80 + 0x3f.

Fixes: b084590e ("net/mlx5_core: Introduce access function to read internal 
timer ")
Fixes: b4ff3a36d3e4 ("net/mlx5: Use offset based reserved field names in the 
IFC header file")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
Signed-off-by: Matan Barak 
---
 include/linux/mlx5/mlx5_ifc.h |  107 +
 1 files changed, 55 insertions(+), 52 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index c15b8a8..c300e74 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -750,21 +750,21 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 ets[0x1];
u8 nic_flow_table[0x1];
u8 eswitch_flow_table[0x1];
-   u8 early_vf_enable;
-   u8 reserved_at_1a8[0x2];
+   u8 early_vf_enable[0x1];
+   u8 reserved_at_1a9[0x2];
u8 local_ca_ack_delay[0x5];
u8 reserved_at_1af[0x6];
u8 port_type[0x2];
u8 num_ports[0x8];
 
-   u8 reserved_at_1bf[0x3];
+   u8 reserved_at_1c0[0x3];
u8 log_max_msg[0x5];
-   u8 reserved_at_1c7[0x4];
+   u8 reserved_at_1c8[0x4];
u8 max_tc[0x4];
-   u8 reserved_at_1cf[0x6];
+   u8 reserved_at_1d0[0x6];
u8 rol_s[0x1];
u8 rol_g[0x1];
-   u8 reserved_at_1d7[0x1];
+   u8 reserved_at_1d8[0x1];
u8 wol_s[0x1];
u8 wol_g[0x1];
u8 wol_a[0x1];
@@ -774,47 +774,47 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 wol_p[0x1];
 
u8 stat_rate_support[0x10];
-   u8 reserved_at_1ef[0xc];
+   u8 reserved_at_1f0[0xc];
u8 cqe_version[0x4];
 
u8 compact_address_vector[0x1];
u8 reserved_at_200[0x3];
u8 ipoib_basic_offloads[0x1];
-   u8 reserved_at_204[0xa];
+   u8 reserved_at_205[0xa];
u8 drain_sigerr[0x1];
u8 cmdif_checksum[0x2];
u8 sigerr_cqe[0x1];
-   u8 reserved_at_212[0x1];
+   u8 reserved_at_213[0x1];
u8 wq_signature[0x1];
u8 sctr_data_cqe[0x1];
-   u8 reserved_at_215[0x1];
+   u8 reserved_at_216[0x1];
u8 sho[0x1];
u8 tph[0x1];
u8 rf[0x1];
u8 dct[0x1];
-   u8 reserved_at_21a[0x1];
+   u8 reserved_at_21b[0x1];
u8 eth_net_offloads[0x1];
u8 roce[0x1];
u8 atomic[0x1];
-   u8 reserved_at_21e[0x1];
+   u8 reserved_at_21f[0x1];
 
u8 cq_oi[0x1];
u8 cq_resize[0x1];
u8 cq_moderation[0x1];
-   u8 reserved_at_222[0x3];
+   u8 reserved_at_223[0x3];
u8 cq_eq_remap[0x1];
u8 pg[0x1];
u8 block_lb_mc[0x1];
-   u8 reserved_at_228[0x1];
+   u8 reserved_at_229[0x1];
u8 scqe_break_moderation[0x1];
u8 reserved_at_22a[0x1];
u8 cd[0x1];
-   u8 reserved_at_22c[0x1];
+   u8 reserved_at_22d[0x1];
u8 apm[0x1];
u8 vector_calc[0x1];
u8 reserved_at_22f[0x1];
u8 imaicl[0x1];
-   u8 reserved_at_231[0x4];
+   u8 reserved_at_232[0x4];
u8 qkv[0x1];
u8 pkv[0x1];
u8 set_deth_sqpn[0x1];
@@ -824,98 +824,101 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 uc[0x1];
u8 rc[0x1];
 
-   u8 reserved_at_23f[0xa];
+   u8 reserved_at_240[0xa];
u8 uar_sz[0x6];
-   u8 reserved_at_24f[0x8];
+   u8 reserved_at_250[0x8];
u8 log_pg_sz[0x8];
 
u8 bf[0x1];
-   u8 reserved_at_260[0x1];
+   u8 reserved_at_261[0x1];
u8 pad_tx_eth_packet[0x1];
-   u8 reserved_at_262[0x8];
+   u8 reserved_at_263[0x8];
u8 log_bf_reg_size[0x5];
-   u8 reserved_at_26f[0x10];
+   u8 reserved_at_270[0x10];
 
-   u8 reserved_at_27f[0x10];
+   u8 reserved_at_280[0x10];
u8 max_wqe_sz_sq[0x10];
 
-   u8 reserved_at_29f[0x10];
+   u8 reserved_at_2a0[0x10];
u8 max_wqe_sz_rq[0x10];
 
-   u8 reserved_at_2bf[0x10];
+   u8 reserved_at_2c0[0x10];

[PATCH for-next 0/2] mlx5_core: mlx5_ifc updates

2016-04-11 Thread Saeed Mahameed

Hi Dave and Doug

This series include mlx5_core updates for both net-next and rdma
trees for 4.7 kernel cycle. This is the only shared code planned
for 4.7 between rdma and net trees. Hopefully, this will prevent 
future conflicts when merging between ib-next and net-next once
4.7 cycle is over and merge window is opened.

Both Mellanox rdma and net submissions will proceed once this series
is applied into both trees.

Future shared code will be sent to both maintainers as pull requests
from Mellanox's kernel.org tree.

We have included all the maintainers of respective drivers.
Kindly review the change and let us know in case of any review comments.

Saeed Mahameed (1):
  net/mlx5: Update mlx5_ifc hardware features

Tariq Toukan (1):
  net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets

 include/linux/mlx5/mlx5_ifc.h |  253 +
 1 files changed, 179 insertions(+), 74 deletions(-)

Re: [PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask

2016-04-11 Thread Florian Fainelli

On 11/04/16 12:50, Andrew Lunn wrote:
> The phys in phys_port_mask suggests this mask is about PHYs. In fact,
> it means physical ports. Rename to user_port_mask, indicating user
> ports of the switch, which is hopefully less confusing.

Even though the change looks fine in principle, I am more worried about
the difficulty for people to backport fixes because of the renaming
happening here. How about "enabled_ports_mask" as a name? Did not
Guenter had a helper function at some point which tested for (1 << port
& ds->phys_port_mask)?

> 
> Signed-off-by: Andrew Lunn 
> ---
>  drivers/net/dsa/bcm_sf2.c   | 10 +-
>  drivers/net/dsa/mv88e6060.c |  2 +-
>  include/net/dsa.h   |  4 ++--
>  net/dsa/dsa.c   |  8 
>  4 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
> index 50caa525cda3..e1e8070fb9f9 100644
> --- a/drivers/net/dsa/bcm_sf2.c
> +++ b/drivers/net/dsa/bcm_sf2.c
> @@ -160,7 +160,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, 
> int cpu_port)
>* the same VLAN.
>*/
>   for (i = 0; i < priv->hw_params.num_ports; i++) {
> - if (!((1 << i) & ds->phys_port_mask))
> + if (!((1 << i) & ds->user_port_mask))
>   continue;
>  
>   reg = core_readl(priv, CORE_PORT_VLAN_CTL_PORT(i));
> @@ -1009,7 +1009,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
>   /* Enable all valid ports and disable those unused */
>   for (port = 0; port < priv->hw_params.num_ports; port++) {
>   /* IMP port receives special treatment */
> - if ((1 << port) & ds->phys_port_mask)
> + if ((1 << port) & ds->user_port_mask)
>   bcm_sf2_port_setup(ds, port, NULL);
>   else if (dsa_is_cpu_port(ds, port))
>   bcm_sf2_imp_setup(ds, port);
> @@ -1022,7 +1022,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
>* 7445D0, since 7445E0 disconnects the internal switch pseudo-PHY such
>* that we can use the regular SWITCH_MDIO master controller instead.
>*
> -  * By default, DSA initializes ds->phys_mii_mask to ds->phys_port_mask
> +  * By default, DSA initializes ds->phys_mii_mask to ds->user_port_mask
>* to have a 1:1 mapping between Port address and PHY address in order
>* to utilize the slave_mii_bus instance to read from Port PHYs. This is
>* not what we want here, so we initialize phys_mii_mask 0 to always
> @@ -1284,7 +1284,7 @@ static int bcm_sf2_sw_suspend(struct dsa_switch *ds)
>* bcm_sf2_sw_setup
>*/
>   for (port = 0; port < DSA_MAX_PORTS; port++) {
> - if ((1 << port) & ds->phys_port_mask ||
> + if ((1 << port) & ds->user_port_mask ||
>   dsa_is_cpu_port(ds, port))
>   bcm_sf2_port_disable(ds, port, NULL);
>   }
> @@ -1308,7 +1308,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
>   bcm_sf2_gphy_enable_set(ds, true);
>  
>   for (port = 0; port < DSA_MAX_PORTS; port++) {
> - if ((1 << port) & ds->phys_port_mask)
> + if ((1 << port) & ds->user_port_mask)
>   bcm_sf2_port_setup(ds, port, NULL);
>   else if (dsa_is_cpu_port(ds, port))
>   bcm_sf2_imp_setup(ds, port);
> diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
> index adb608ccd9aa..0ba0d413cf73 100644
> --- a/drivers/net/dsa/mv88e6060.c
> +++ b/drivers/net/dsa/mv88e6060.c
> @@ -170,7 +170,7 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, 
> int p)
>   REG_WRITE(addr, PORT_VLAN_MAP,
> ((p & 0xf) << PORT_VLAN_MAP_DBNUM_SHIFT) |
>  (dsa_is_cpu_port(ds, p) ?
> - ds->phys_port_mask :
> + ds->user_port_mask :
>   BIT(ds->dst->cpu_port)));
>  
>   /* Port Association Vector: when learning source addresses
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 165c2e10615c..b36c2a5c206f 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -167,7 +167,7 @@ struct dsa_switch {
>* Slave mii_bus and devices for the individual ports.
>*/
>   u32 dsa_port_mask;
> - u32 phys_port_mask;
> + u32 user_port_mask;
>   u32 phys_mii_mask;
>   struct mii_bus  *slave_mii_bus;
>   struct net_device   *ports[DSA_MAX_PORTS];
> @@ -185,7 +185,7 @@ static inline bool dsa_is_dsa_port(struct dsa_switch *ds, 
> int p)
>  
>  static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p)
>  {
> - return ds->phys_port_mask & (1 << p) && ds->ports[p];
> + return ds->user_port_mask & (1 << p) && ds->ports[p];
>  }
>  
>  static inline u8 dsa_upstream_port(struct dsa_switch *ds)
> diff --git

[PATCH net-next 7/7] dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()

2016-04-11 Thread Andrew Lunn

mv88e6xxx_lookup_name() returns the model name of a switch at a given
address on an MII bus. Using mii_bus to identify the bus rather than
the host device is more logical, so change the parameter.

Signed-off-by: Andrew Lunn 
---
v2: Check bus is valid before using it.
---
 drivers/net/dsa/mv88e6xxx.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index c242ffd8eb09..9985a0cf31f1 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -3068,11 +3068,10 @@ int mv88e6xxx_get_temp_alarm(struct dsa_switch *ds, 
bool *alarm)
 }
 #endif /* CONFIG_NET_DSA_HWMON */
 
-static char *mv88e6xxx_lookup_name(struct device *host_dev, int sw_addr,
+static char *mv88e6xxx_lookup_name(struct mii_bus *bus, int sw_addr,
   const struct mv88e6xxx_switch_id *table,
   unsigned int num)
 {
-   struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
int i, ret;
 
if (!bus)
@@ -3090,7 +3089,8 @@ static char *mv88e6xxx_lookup_name(struct device 
*host_dev, int sw_addr,
/* Look up only the product number */
for (i = 0; i < num; ++i) {
if (table[i].id == (ret & PORT_SWITCH_ID_PROD_NUM_MASK)) {
-   dev_warn(host_dev, "unknown revision %d, using base 
switch 0x%x\n",
+   dev_warn(&bus->dev,
+"unknown revision %d, using base switch 
0x%x\n",
 ret & PORT_SWITCH_ID_REV_MASK,
 ret & PORT_SWITCH_ID_PROD_NUM_MASK);
return table[i].name;
@@ -3106,9 +3106,13 @@ char *mv88e6xxx_drv_probe(struct device *dsa_dev, struct 
device *host_dev,
  unsigned int num)
 {
struct mv88e6xxx_priv_state *ps;
+   struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
char *name;
 
-   name = mv88e6xxx_lookup_name(host_dev, sw_addr, table, num);
+   if (!bus)
+   return NULL;
+
+   name = mv88e6xxx_lookup_name(bus, sw_addr, table, num);
if (name) {
ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL);
if (!ps)
-- 
2.7.0

[PATCH net-next 5/7] net: dsa: Rename DSA probe function.

2016-04-11 Thread Andrew Lunn

Rename the function called from the DSA to perform a probe for the
switch. This makes the normal _probe() name available for a standard
Linux device driver probe function.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/bcm_sf2.c   | 7 ---
 drivers/net/dsa/mv88e6060.c | 7 ---
 drivers/net/dsa/mv88e6123.c | 7 ---
 drivers/net/dsa/mv88e6131.c | 7 ---
 drivers/net/dsa/mv88e6171.c | 7 ---
 drivers/net/dsa/mv88e6352.c | 7 ---
 6 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 7d62802a66d5..50caa525cda3 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -135,8 +135,9 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds)
return BCM_SF2_STATS_SIZE;
 }
 
-static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev,
- int sw_addr, void **_priv)
+static char *bcm_sf2_sw_drv_probe(struct device *dsa_dev,
+ struct device *host_dev,
+ int sw_addr, void **_priv)
 {
struct bcm_sf2_priv *priv;
 
@@ -1370,7 +1371,7 @@ static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int 
port,
 
 static struct dsa_switch_driver bcm_sf2_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_BRCM,
-   .probe  = bcm_sf2_sw_probe,
+   .probe  = bcm_sf2_sw_drv_probe,
.setup  = bcm_sf2_sw_setup,
.set_addr   = bcm_sf2_sw_set_addr,
.get_phy_flags  = bcm_sf2_sw_get_phy_flags,
diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 46fe5dc65a99..adb608ccd9aa 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -69,8 +69,9 @@ static char *mv88e6060_get_name(struct mii_bus *bus, int 
sw_addr)
return NULL;
 }
 
-static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr, void **_priv)
+static char *mv88e6060_drv_probe(struct device *dsa_dev,
+struct device *host_dev,
+int sw_addr, void **_priv)
 {
struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
struct mv88e6060_priv *priv;
@@ -248,7 +249,7 @@ mv88e6060_phy_write(struct dsa_switch *ds, int port, int 
regnum, u16 val)
 
 static struct dsa_switch_driver mv88e6060_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_TRAILER,
-   .probe  = mv88e6060_probe,
+   .probe  = mv88e6060_drv_probe,
.setup  = mv88e6060_setup,
.set_addr   = mv88e6060_set_addr,
.phy_read   = mv88e6060_phy_read,
diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c
index 8aaac0894752..c34283d929c4 100644
--- a/drivers/net/dsa/mv88e6123.c
+++ b/drivers/net/dsa/mv88e6123.c
@@ -29,8 +29,9 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = {
{ PORT_SWITCH_ID_6165_A2, "Marvell 88e6165 (A2)" },
 };
 
-static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr, void **priv)
+static char *mv88e6123_drv_probe(struct device *dsa_dev,
+struct device *host_dev,
+int sw_addr, void **priv)
 {
return mv88e6xxx_drv_probe(dsa_dev, host_dev, sw_addr, priv,
   mv88e6123_table,
@@ -106,7 +107,7 @@ static int mv88e6123_setup(struct dsa_switch *ds)
 
 struct dsa_switch_driver mv88e6123_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_EDSA,
-   .probe  = mv88e6123_probe,
+   .probe  = mv88e6123_drv_probe,
.setup  = mv88e6123_setup,
.set_addr   = mv88e6xxx_set_addr_indirect,
.phy_read   = mv88e6xxx_phy_read,
diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c
index 9e6edf94a855..f5d75fce1e96 100644
--- a/drivers/net/dsa/mv88e6131.c
+++ b/drivers/net/dsa/mv88e6131.c
@@ -25,8 +25,9 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = {
{ PORT_SWITCH_ID_6185, "Marvell 88E6185" },
 };
 
-static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr, void **priv)
+static char *mv88e6131_drv_probe(struct device *dsa_dev,
+struct device *host_dev,
+int sw_addr, void **priv)
 {
return mv88e6xxx_drv_probe(dsa_dev, host_dev, sw_addr, priv,
   mv88e6131_table,
@@ -163,7 +164,7 @@ mv88e6131_phy_write(struct dsa_switch *ds,
 
 struct dsa_switch_driver mv88e6131_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_DSA,
-   .probe  = mv88e6131_probe,
+   .probe  = mv88e6131_drv_p

[PATCH net-next 3/7] net: dsa: Remove allocation of driver private memory

2016-04-11 Thread Andrew Lunn

The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.

Signed-off-by: Andrew Lunn 
Acked-by: Florian Fainelli 
---
 include/net/dsa.h | 1 -
 net/dsa/dsa.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7bc7bd9b5ded..165c2e10615c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -213,7 +213,6 @@ struct dsa_switch_driver {
struct list_headlist;
 
enum dsa_tag_protocol   tag_protocol;
-   int priv_size;
 
/*
 * Probing and setup.
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 7ef8a92a9e39..14bf12f637d2 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -402,7 +402,7 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
/*
 * Allocate and initialise switch state.
 */
-   ds = devm_kzalloc(parent, sizeof(*ds) + drv->priv_size, GFP_KERNEL);
+   ds = devm_kzalloc(parent, sizeof(*ds), GFP_KERNEL);
if (ds == NULL)
return ERR_PTR(-ENOMEM);
 
-- 
2.7.0

[PATCH net-next 6/7] dsa: Rename phys_port_mask to user_port_mask

2016-04-11 Thread Andrew Lunn

The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to user_port_mask, indicating user
ports of the switch, which is hopefully less confusing.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/bcm_sf2.c   | 10 +-
 drivers/net/dsa/mv88e6060.c |  2 +-
 include/net/dsa.h   |  4 ++--
 net/dsa/dsa.c   |  8 
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 50caa525cda3..e1e8070fb9f9 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -160,7 +160,7 @@ static void bcm_sf2_imp_vlan_setup(struct dsa_switch *ds, 
int cpu_port)
 * the same VLAN.
 */
for (i = 0; i < priv->hw_params.num_ports; i++) {
-   if (!((1 << i) & ds->phys_port_mask))
+   if (!((1 << i) & ds->user_port_mask))
continue;
 
reg = core_readl(priv, CORE_PORT_VLAN_CTL_PORT(i));
@@ -1009,7 +1009,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
/* Enable all valid ports and disable those unused */
for (port = 0; port < priv->hw_params.num_ports; port++) {
/* IMP port receives special treatment */
-   if ((1 << port) & ds->phys_port_mask)
+   if ((1 << port) & ds->user_port_mask)
bcm_sf2_port_setup(ds, port, NULL);
else if (dsa_is_cpu_port(ds, port))
bcm_sf2_imp_setup(ds, port);
@@ -1022,7 +1022,7 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
 * 7445D0, since 7445E0 disconnects the internal switch pseudo-PHY such
 * that we can use the regular SWITCH_MDIO master controller instead.
 *
-* By default, DSA initializes ds->phys_mii_mask to ds->phys_port_mask
+* By default, DSA initializes ds->phys_mii_mask to ds->user_port_mask
 * to have a 1:1 mapping between Port address and PHY address in order
 * to utilize the slave_mii_bus instance to read from Port PHYs. This is
 * not what we want here, so we initialize phys_mii_mask 0 to always
@@ -1284,7 +1284,7 @@ static int bcm_sf2_sw_suspend(struct dsa_switch *ds)
 * bcm_sf2_sw_setup
 */
for (port = 0; port < DSA_MAX_PORTS; port++) {
-   if ((1 << port) & ds->phys_port_mask ||
+   if ((1 << port) & ds->user_port_mask ||
dsa_is_cpu_port(ds, port))
bcm_sf2_port_disable(ds, port, NULL);
}
@@ -1308,7 +1308,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
bcm_sf2_gphy_enable_set(ds, true);
 
for (port = 0; port < DSA_MAX_PORTS; port++) {
-   if ((1 << port) & ds->phys_port_mask)
+   if ((1 << port) & ds->user_port_mask)
bcm_sf2_port_setup(ds, port, NULL);
else if (dsa_is_cpu_port(ds, port))
bcm_sf2_imp_setup(ds, port);
diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index adb608ccd9aa..0ba0d413cf73 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -170,7 +170,7 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int 
p)
REG_WRITE(addr, PORT_VLAN_MAP,
  ((p & 0xf) << PORT_VLAN_MAP_DBNUM_SHIFT) |
   (dsa_is_cpu_port(ds, p) ?
-   ds->phys_port_mask :
+   ds->user_port_mask :
BIT(ds->dst->cpu_port)));
 
/* Port Association Vector: when learning source addresses
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 165c2e10615c..b36c2a5c206f 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -167,7 +167,7 @@ struct dsa_switch {
 * Slave mii_bus and devices for the individual ports.
 */
u32 dsa_port_mask;
-   u32 phys_port_mask;
+   u32 user_port_mask;
u32 phys_mii_mask;
struct mii_bus  *slave_mii_bus;
struct net_device   *ports[DSA_MAX_PORTS];
@@ -185,7 +185,7 @@ static inline bool dsa_is_dsa_port(struct dsa_switch *ds, 
int p)
 
 static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p)
 {
-   return ds->phys_port_mask & (1 << p) && ds->ports[p];
+   return ds->user_port_mask & (1 << p) && ds->ports[p];
 }
 
 static inline u8 dsa_upstream_port(struct dsa_switch *ds)
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 14bf12f637d2..20293d1a5a93 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -246,7 +246,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, 
struct device *parent)
} else if (!strcmp(name, "dsa")) {
ds->dsa_port_mask |= 1 << i;
} else {
-   ds->phys_port_mask |= 1 << i;
+   ds->user_port

[PATCH net-next 4/7] net: dsa: Keep the mii bus and address in the private structure

2016-04-11 Thread Andrew Lunn

Rather than looking up the mii bus and address every time, do it once
at probe, and keep it in the private structure. Centralise this probe
code in mv88e6xxx.

Signed-off-by: Andrew Lunn 
Acked-by: Florian Fainelli 
---
v2: Add check for valid mii_bus.
---
 drivers/net/dsa/mv88e6060.c | 44 ++--
 drivers/net/dsa/mv88e6060.h | 11 +++
 drivers/net/dsa/mv88e6123.c | 15 +++
 drivers/net/dsa/mv88e6131.c | 15 +++
 drivers/net/dsa/mv88e6171.c | 15 +++
 drivers/net/dsa/mv88e6352.c | 15 +++
 drivers/net/dsa/mv88e6xxx.c | 43 +--
 drivers/net/dsa/mv88e6xxx.h | 14 +++---
 8 files changed, 89 insertions(+), 83 deletions(-)

diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 41195f1417f1..46fe5dc65a99 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -19,12 +19,9 @@
 
 static int reg_read(struct dsa_switch *ds, int addr, int reg)
 {
-   struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
+   struct mv88e6060_priv *priv = ds_to_priv(ds);
 
-   if (bus == NULL)
-   return -EINVAL;
-
-   return mdiobus_read_nested(bus, ds->pd->sw_addr + addr, reg);
+   return mdiobus_read_nested(priv->bus, priv->sw_addr + addr, reg);
 }
 
 #define REG_READ(addr, reg)\
@@ -40,12 +37,9 @@ static int reg_read(struct dsa_switch *ds, int addr, int reg)
 
 static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val)
 {
-   struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
-
-   if (bus == NULL)
-   return -EINVAL;
+   struct mv88e6060_priv *priv = ds_to_priv(ds);
 
-   return mdiobus_write_nested(bus, ds->pd->sw_addr + addr, reg, val);
+   return mdiobus_write_nested(priv->bus, priv->sw_addr + addr, reg, val);
 }
 
 #define REG_WRITE(addr, reg, val)  \
@@ -57,16 +51,10 @@ static int reg_write(struct dsa_switch *ds, int addr, int 
reg, u16 val)
return __ret;   \
})
 
-static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr, void **priv)
+static char *mv88e6060_get_name(struct mii_bus *bus, int sw_addr)
 {
-   struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
int ret;
 
-   *priv = NULL;
-   if (bus == NULL)
-   return NULL;
-
ret = mdiobus_read(bus, sw_addr + REG_PORT(0), PORT_SWITCH_ID);
if (ret >= 0) {
if (ret == PORT_SWITCH_ID_6060)
@@ -81,6 +69,26 @@ static char *mv88e6060_probe(struct device *dsa_dev, struct 
device *host_dev,
return NULL;
 }
 
+static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr, void **_priv)
+{
+   struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
+   struct mv88e6060_priv *priv;
+   char *name;
+
+   name = mv88e6060_get_name(bus, sw_addr);
+   if (name) {
+   priv = devm_kzalloc(dsa_dev, sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return NULL;
+   *_priv = priv;
+   priv->bus = bus;
+   priv->sw_addr = sw_addr;
+   }
+
+   return name;
+}
+
 static int mv88e6060_switch_reset(struct dsa_switch *ds)
 {
int i;
@@ -176,8 +184,8 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int 
p)
 
 static int mv88e6060_setup(struct dsa_switch *ds)
 {
-   int i;
int ret;
+   int i;
 
ret = mv88e6060_switch_reset(ds);
if (ret < 0)
diff --git a/drivers/net/dsa/mv88e6060.h b/drivers/net/dsa/mv88e6060.h
index cc9b2ed4aff4..10249bd16292 100644
--- a/drivers/net/dsa/mv88e6060.h
+++ b/drivers/net/dsa/mv88e6060.h
@@ -108,4 +108,15 @@
 #define GLOBAL_ATU_MAC_23  0x0e
 #define GLOBAL_ATU_MAC_45  0x0f
 
+struct mv88e6060_priv {
+   /* MDIO bus and address on bus to use. When in single chip
+* mode, address is 0, and the switch uses multiple addresses
+* on the bus.  When in multi-chip mode, the switch uses a
+* single address which contains two registers used for
+* indirect access to more registers.
+*/
+   struct mii_bus *bus;
+   int sw_addr;
+};
+
 #endif
diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c
index bcab3ef22448..8aaac0894752 100644
--- a/drivers/net/dsa/mv88e6123.c
+++ b/drivers/net/dsa/mv88e6123.c
@@ -32,18 +32,9 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = {
 static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev,
 int sw_addr, void **priv)
 {
-   struct mv88e6xxx_priv_state *ps;
-   char *name;
-
-   name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table,
-

[PATCH net-next 1/7] net: dsa: Pass the dsa device to the switch drivers

2016-04-11 Thread Andrew Lunn

By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.

Signed-off-by: Andrew Lunn 
Acked-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c   | 3 ++-
 drivers/net/dsa/mv88e6060.c | 3 ++-
 drivers/net/dsa/mv88e6123.c | 3 ++-
 drivers/net/dsa/mv88e6131.c | 3 ++-
 drivers/net/dsa/mv88e6171.c | 3 ++-
 drivers/net/dsa/mv88e6352.c | 3 ++-
 include/net/dsa.h   | 3 ++-
 net/dsa/dsa.c   | 7 ---
 8 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 780f22876538..18a79579141f 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -135,7 +135,8 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds)
return BCM_SF2_STATS_SIZE;
 }
 
-static char *bcm_sf2_sw_probe(struct device *host_dev, int sw_addr)
+static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev,
+ int sw_addr)
 {
return "Broadcom Starfighter 2";
 }
diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 0527f485c3dc..34d0f9fe19db 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -57,7 +57,8 @@ static int reg_write(struct dsa_switch *ds, int addr, int 
reg, u16 val)
return __ret;   \
})
 
-static char *mv88e6060_probe(struct device *host_dev, int sw_addr)
+static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr)
 {
struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
int ret;
diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c
index 69a6f79dcb10..ede4c6f0b31a 100644
--- a/drivers/net/dsa/mv88e6123.c
+++ b/drivers/net/dsa/mv88e6123.c
@@ -29,7 +29,8 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = {
{ PORT_SWITCH_ID_6165_A2, "Marvell 88e6165 (A2)" },
 };
 
-static char *mv88e6123_probe(struct device *host_dev, int sw_addr)
+static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr)
 {
return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table,
 ARRAY_SIZE(mv88e6123_table));
diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c
index 24070287c2bc..bfadfd2cbb8d 100644
--- a/drivers/net/dsa/mv88e6131.c
+++ b/drivers/net/dsa/mv88e6131.c
@@ -25,7 +25,8 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = {
{ PORT_SWITCH_ID_6185, "Marvell 88E6185" },
 };
 
-static char *mv88e6131_probe(struct device *host_dev, int sw_addr)
+static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr)
 {
return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table,
 ARRAY_SIZE(mv88e6131_table));
diff --git a/drivers/net/dsa/mv88e6171.c b/drivers/net/dsa/mv88e6171.c
index 0e62f3b5bc81..fb35d3ac1644 100644
--- a/drivers/net/dsa/mv88e6171.c
+++ b/drivers/net/dsa/mv88e6171.c
@@ -24,7 +24,8 @@ static const struct mv88e6xxx_switch_id mv88e6171_table[] = {
{ PORT_SWITCH_ID_6351, "Marvell 88E6351" },
 };
 
-static char *mv88e6171_probe(struct device *host_dev, int sw_addr)
+static char *mv88e6171_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr)
 {
return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6171_table,
 ARRAY_SIZE(mv88e6171_table));
diff --git a/drivers/net/dsa/mv88e6352.c b/drivers/net/dsa/mv88e6352.c
index 7f452e4a04a5..577ab2cfa944 100644
--- a/drivers/net/dsa/mv88e6352.c
+++ b/drivers/net/dsa/mv88e6352.c
@@ -37,7 +37,8 @@ static const struct mv88e6xxx_switch_id mv88e6352_table[] = {
{ PORT_SWITCH_ID_6352_A1, "Marvell 88E6352 (A1)" },
 };
 
-static char *mv88e6352_probe(struct device *host_dev, int sw_addr)
+static char *mv88e6352_probe(struct device *dsa_dev, struct device *host_dev,
+int sw_addr)
 {
return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6352_table,
 ARRAY_SIZE(mv88e6352_table));
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 18d1be3ad62d..0f9f6f38f255 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -212,7 +212,8 @@ struct dsa_switch_driver {
/*
 * Probing and setup.
 */
-   char*(*probe)(struct device *host_dev, int sw_addr);
+   char*(*probe)(struct device *dsa_dev, struct device *host_dev,
+ int sw_addr);
int (*setup)(struct dsa_switch *ds);
int (*set_addr)(struct dsa_switch *ds, u8 *addr);
u32 (*get_phy_flags)(struct dsa_switch *ds, int port);
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index c28c47463b7e..c06275311cb2 100644
--- a/net/dsa/dsa.c
+++ b

[PATCH net-next 0/7] DSA refactoring: set 1

2016-04-11 Thread Andrew Lunn

There has been a long running effort to refractor DSA probing to make
the switches true linux devices. Here are a small collection of
patches moving in this direction. Most have been seen before.

We take a little step forward by passing the dsa device point to the
driver, thus allowing it to perform resource allocations using the
normal mechanisms. This device structure will later be replaced by the
devices own device structure.

Future patches will add a true driver probe function, so we rename the
current probe function, cleaning up the namespace.

phys_port_mask continually confuses me, thinking it is about PHYs. But
it is actually about ports to the outside world, user ports. So rename
it.

Lots more patches yet to follow, this is just doing some ground work.

Andrew Lunn (7):
  net: dsa: Pass the dsa device to the switch drivers
  net: dsa: Have the switch driver allocate there own private memory
  net: dsa: Remove allocation of driver private memory
  net: dsa: Keep the mii bus and address in the private structure
  net: dsa: Rename DSA probe function.
  dsa: Rename phys_port_mask to user_port_mask
  dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()

 drivers/net/dsa/bcm_sf2.c   | 24 +---
 drivers/net/dsa/mv88e6060.c | 47 +++---
 drivers/net/dsa/mv88e6060.h | 11 +
 drivers/net/dsa/mv88e6123.c | 14 +++-
 drivers/net/dsa/mv88e6131.c | 14 +++-
 drivers/net/dsa/mv88e6171.c | 14 +++-
 drivers/net/dsa/mv88e6352.c | 14 +++-
 drivers/net/dsa/mv88e6xxx.c | 55 +++--
 drivers/net/dsa/mv88e6xxx.h | 17 +++---
 include/net/dsa.h   | 16 -
 net/dsa/dsa.c   | 19 +---
 11 files changed, 166 insertions(+), 79 deletions(-)

-- 
2.7.0

[PATCH net-next 2/7] net: dsa: Have the switch driver allocate there own private memory

2016-04-11 Thread Andrew Lunn

Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().

Signed-off-by: Andrew Lunn 
Acked-by: Florian Fainelli 
---
v2: Added missing assignment of priv to ds->priv.
---
 drivers/net/dsa/bcm_sf2.c   | 10 --
 drivers/net/dsa/mv88e6060.c |  3 ++-
 drivers/net/dsa/mv88e6123.c | 17 ++---
 drivers/net/dsa/mv88e6131.c | 17 ++---
 drivers/net/dsa/mv88e6171.c | 17 ++---
 drivers/net/dsa/mv88e6352.c | 17 ++---
 drivers/net/dsa/mv88e6xxx.c |  6 --
 drivers/net/dsa/mv88e6xxx.h |  3 +++
 include/net/dsa.h   | 10 --
 net/dsa/dsa.c   |  8 +---
 10 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 18a79579141f..7d62802a66d5 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -136,8 +136,15 @@ static int bcm_sf2_sw_get_sset_count(struct dsa_switch *ds)
 }
 
 static char *bcm_sf2_sw_probe(struct device *dsa_dev, struct device *host_dev,
- int sw_addr)
+ int sw_addr, void **_priv)
 {
+   struct bcm_sf2_priv *priv;
+
+   priv = devm_kzalloc(dsa_dev, sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return NULL;
+   *_priv = priv;
+
return "Broadcom Starfighter 2";
 }
 
@@ -1363,7 +1370,6 @@ static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int 
port,
 
 static struct dsa_switch_driver bcm_sf2_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_BRCM,
-   .priv_size  = sizeof(struct bcm_sf2_priv),
.probe  = bcm_sf2_sw_probe,
.setup  = bcm_sf2_sw_setup,
.set_addr   = bcm_sf2_sw_set_addr,
diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 34d0f9fe19db..41195f1417f1 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -58,11 +58,12 @@ static int reg_write(struct dsa_switch *ds, int addr, int 
reg, u16 val)
})
 
 static char *mv88e6060_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr)
+int sw_addr, void **priv)
 {
struct mii_bus *bus = dsa_host_dev_to_mii_bus(host_dev);
int ret;
 
+   *priv = NULL;
if (bus == NULL)
return NULL;
 
diff --git a/drivers/net/dsa/mv88e6123.c b/drivers/net/dsa/mv88e6123.c
index ede4c6f0b31a..bcab3ef22448 100644
--- a/drivers/net/dsa/mv88e6123.c
+++ b/drivers/net/dsa/mv88e6123.c
@@ -30,10 +30,20 @@ static const struct mv88e6xxx_switch_id mv88e6123_table[] = 
{
 };
 
 static char *mv88e6123_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr)
+int sw_addr, void **priv)
 {
-   return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table,
+   struct mv88e6xxx_priv_state *ps;
+   char *name;
+
+   name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6123_table,
 ARRAY_SIZE(mv88e6123_table));
+   if (name) {
+   ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL);
+   if (!ps)
+   return NULL;
+   *priv = ps;
+   }
+   return name;
 }
 
 static int mv88e6123_setup_global(struct dsa_switch *ds)
@@ -74,6 +84,8 @@ static int mv88e6123_setup(struct dsa_switch *ds)
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;
 
+   ps->ds = ds;
+
ret = mv88e6xxx_setup_common(ds);
if (ret < 0)
return ret;
@@ -103,7 +115,6 @@ static int mv88e6123_setup(struct dsa_switch *ds)
 
 struct dsa_switch_driver mv88e6123_switch_driver = {
.tag_protocol   = DSA_TAG_PROTO_EDSA,
-   .priv_size  = sizeof(struct mv88e6xxx_priv_state),
.probe  = mv88e6123_probe,
.setup  = mv88e6123_setup,
.set_addr   = mv88e6xxx_set_addr_indirect,
diff --git a/drivers/net/dsa/mv88e6131.c b/drivers/net/dsa/mv88e6131.c
index bfadfd2cbb8d..b9f9b009f65a 100644
--- a/drivers/net/dsa/mv88e6131.c
+++ b/drivers/net/dsa/mv88e6131.c
@@ -26,10 +26,20 @@ static const struct mv88e6xxx_switch_id mv88e6131_table[] = 
{
 };
 
 static char *mv88e6131_probe(struct device *dsa_dev, struct device *host_dev,
-int sw_addr)
+int sw_addr, void **priv)
 {
-   return mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table,
+   struct mv88e6xxx_priv_state *ps;
+   char *name;
+
+   name = mv88e6xxx_lookup_name(host_dev, sw_addr, mv88e6131_table,
 ARRAY_SIZE(mv88e6131_table));
+   if (name) {
+   ps = devm_kzalloc(dsa_dev, sizeof(*ps), GFP_KERNEL);
+   if (!ps)
+

Re: FWD: [PATCH v2] Marvell phy: add fiber status check for some components

2016-04-11 Thread Florian Fainelli

On 04/04/16 06:25, Andrew Lunn wrote:
>> >From 564b767163d19355a3b5efaad195e93796570c71 Mon Sep 17 00:00:00 2001
>> From: Charles-Antoine Couret 
>> Date: Fri, 1 Apr 2016 16:16:35 +0200
>> Subject: [PATCH] Marvell phy: add fiber status check for some components
>>
>> Marvell's phy could have two modes: fiber and copper. Currently, the driver
>> checks only the copper mode registers to get the status link which could be
>> wrong.
>>
>> This commit add a handler to check fiber then copper status link.
>> If the fiber link is activated, the driver would use this information.
>> Else, it would use the copper status.
> 
> Hi Florian
> 
> What do you think about this?
> 
> This works for basic status information. But what about other ethtool
> options? Setting the speed and duplex, turning pause on/off, etc.

Agreed, it seems like a PHY configured for fiber will need to provide
these informations differently, or does the standard BMSR register
provide accurate information already?

> 
> Do we actually need to stay on page 1 if fibre is in use? How do we
> initially change to page 1 when the fibre link is still down?

I also do not feel very comfortable with reading the fiber status first,
and then copper and then combine these two. At the very best, could we
do something like:

- identify if the PHY is configured for fiber in drv->probe or
drv->config_init, retain that information
- have two paths in drv->read_status which take care of reading one
status or the other?

Are there other side effects for other register accesses (say,
statistics, or auto-negotiation) that need to be fiber vs. copper aware?

> 
> Should we be using the old mechanism to swap between TP, BNC and AUI
> to swap between copper and fibre?

Did you mean using ethtool -s  port fibre for instance?
-- 
Florian

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 09:53:54 -0700
Eric Dumazet  wrote:

> On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> 
> > Drivers also do tricks where they fallback to smaller order pages. E.g.
> > lookup function mlx4_alloc_pages().  I've tried to simulate that
> > function here:
> > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> >   
> 
> We use order-0 pages on mlx4 at Google, as order-3 pages are very
> dangerous for some kind of attacks...

Interesting!

> An out of order TCP packet can hold an order-3 pages, while claiming to
> use 1.5 KBvia skb->truesize.
> 
> order-0 only pages allow the page recycle trick used by Intel driver,
> and we hardly see any page allocations in typical workloads.

Yes, I looked at the Intel ixgbe drivers page recycle trick. 

It is actually quite cool, but code wise it is a little hard to
follow.  I started to look at the variant in i40e, specifically
function i40e_clean_rx_irq_ps() explains it a bit more explicit.
 

> While order-3 pages are 'nice' for friendly datacenter kind of
> traffic, they also are a higher risk on hosts connected to the wild
> Internet.
> 
> Maybe I should upstream this patch ;)

Definitely!

Does this patch also include a page recycle trick?  Else how do you get
around the cost of allocating a single order-0 page?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH net-next] net: mdio: Fix lockdep falls positive splat

2016-04-11 Thread Andrew Lunn

MDIO devices can be stacked upon each other. The current code supports
two levels, which until recently has been enough for a DSA mdio bus on
top of another bus. Now we have hardware which has an MDIO mux in the
middle.

Define an MDIO MUTEX class with three levels.

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/mdio-mux.c | 10 ++
 drivers/net/phy/mdio_bus.c |  4 ++--
 include/linux/mdio.h   | 11 +++
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c
index 308ade0eb1b6..5c81d6faf304 100644
--- a/drivers/net/phy/mdio-mux.c
+++ b/drivers/net/phy/mdio-mux.c
@@ -45,13 +45,7 @@ static int mdio_mux_read(struct mii_bus *bus, int phy_id, 
int regnum)
struct mdio_mux_parent_bus *pb = cb->parent;
int r;
 
-   /* In theory multiple mdio_mux could be stacked, thus creating
-* more than a single level of nesting.  But in practice,
-* SINGLE_DEPTH_NESTING will cover the vast majority of use
-* cases.  We use it, instead of trying to handle the general
-* case.
-*/
-   mutex_lock_nested(&pb->mii_bus->mdio_lock, SINGLE_DEPTH_NESTING);
+   mutex_lock_nested(&pb->mii_bus->mdio_lock, MDIO_MUTEX_MUX);
r = pb->switch_fn(pb->current_child, cb->bus_number, pb->switch_data);
if (r)
goto out;
@@ -76,7 +70,7 @@ static int mdio_mux_write(struct mii_bus *bus, int phy_id,
 
int r;
 
-   mutex_lock_nested(&pb->mii_bus->mdio_lock, SINGLE_DEPTH_NESTING);
+   mutex_lock_nested(&pb->mii_bus->mdio_lock, MDIO_MUTEX_MUX);
r = pb->switch_fn(pb->current_child, cb->bus_number, pb->switch_data);
if (r)
goto out;
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 0cba64f1ecf4..751202a285a6 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -457,7 +457,7 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 
regnum)
 
BUG_ON(in_interrupt());
 
-   mutex_lock_nested(&bus->mdio_lock, SINGLE_DEPTH_NESTING);
+   mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
retval = bus->read(bus, addr, regnum);
mutex_unlock(&bus->mdio_lock);
 
@@ -509,7 +509,7 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 
regnum, u16 val)
 
BUG_ON(in_interrupt());
 
-   mutex_lock_nested(&bus->mdio_lock, SINGLE_DEPTH_NESTING);
+   mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
err = bus->write(bus, addr, regnum, val);
mutex_unlock(&bus->mdio_lock);
 
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index 5bfd99d1a40a..bf9d1d750693 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -13,6 +13,17 @@
 
 struct mii_bus;
 
+/* Multiple levels of nesting are possible. However typically this is
+ * limited to nested DSA like layer, a MUX layer, and the normal
+ * user. Instead of trying to handle the general case, just define
+ * these cases.
+ */
+enum mdio_mutex_lock_class {
+   MDIO_MUTEX_NORMAL,
+   MDIO_MUTEX_MUX,
+   MDIO_MUTEX_NESTED,
+};
+
 struct mdio_device {
struct device dev;
 
-- 
2.7.0

Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1

2016-04-11 Thread David Miller

From: David Howells 
Date: Thu, 07 Apr 2016 17:22:56 +0100

> Okay, I'm in the process of rewriting the RxRPC rewrite.  The primary aim of
> this second rewrite is to strictly control the number of active connections we
> know about and to get rid of connections we don't need much more quickly.
> 
> On top of this, there are fixes to the protocol handling which will all occur
> in later parts.

Series applied, but I had to fix up some trailing whitespace (reported by GIT)
by hand.

Thanks.

Re: [PATCH net-next 0/2] fix udp pull header breakage

2016-04-11 Thread David Miller

From: Willem de Bruijn 
Date: Thu,  7 Apr 2016 11:44:57 -0400

> Commit e6afc8ace6dd ("udp: remove headers from UDP packets before
> queueing") modified udp receive processing to pull headers before
> enqueue and to not expect them on dequeue.
> 
> The patch missed protocols on top of udp with in-kernel
> implementations that have their own skb_recv_datagram calls and
> dequeue logic. Modify these datapaths to also no longer expect
> a udp header at skb->data.
> 
> Sunrpc and rxrpc are the only two protocols that call this
> function and contain references to udphr (some others, like tipc,
> are based on encap_rcv, which acts before enqueue, before the
> the header pull).

Series applied, thanks for fixing this regression so quickly.

Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-04-11 Thread Sergei Shtylyov


On 04/11/2016 10:25 PM, Rob Herring wrote:


The PHY  devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question;  that solution, when  applied to the device trees,
led to adding the PHY reset GPIO properties to the MAC device node, with
one exception: Cadence MACB driver which could handle the "reset-gpios"
prop  in a PHY device  subnode.  I believe that the correct approach is to
teach the 'phylib' to get the MDIO device reset GPIO from the device tree
node corresponding to this device -- which this patch is doing...

Note that I had to modify the  AT803x PHY driver as it would stop working
otherwise as it made use of the reset GPIO for its own purposes...


Lots of double spaces in here. Please fix.


   Oh, it's you again! :-D


Signed-off-by: Sergei Shtylyov 

---
  Documentation/devicetree/bindings/net/phy.txt |2 +
  drivers/net/phy/at803x.c  |   19 ++
  drivers/net/phy/mdio_bus.c|4 +++
  drivers/net/phy/mdio_device.c |   27 +++--
  drivers/net/phy/phy_device.c  |   33 
--
  drivers/of/of_mdio.c  |   16 
  include/linux/mdio.h  |3 ++
  include/linux/phy.h   |5 +++
  8 files changed, 89 insertions(+), 20 deletions(-)


[...]


Index: net-next/drivers/of/of_mdio.c
===
--- net-next.orig/drivers/of/of_mdio.c
+++ net-next/drivers/of/of_mdio.c
@@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
  static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node 
*child,
   u32 addr)
  {
+   struct gpio_desc *gpiod;
struct phy_device *phy;
bool is_c45;
int rc;
@@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
is_c45 = of_device_is_compatible(child,
 "ethernet-phy-ieee802.3-c45");

+   gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");


Calling fwnode_* functions in a DT specific file/function? That doesn't
make sense.


   Really?! 8-)
   Where is a DT-only analog I wonder...

MBR, Sergei

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 19:07:03 +0100
Mel Gorman  wrote:

> On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > >   
> > 
> > The cost decreased to: 228 cycles(tsc), but there are some variations,
> > sometimes it increase to 238 cycles(tsc).
> >   
> 
> In the free path, a bulk pcp free adds to the cycles. In the alloc path,
> a refill of the pcp lists costs quite a bit. Either option introduces
> variances. The bulk free path can be optimised a little so I chucked
> some additional patches at it that are not released yet but I suspect the
> benefit will be marginal. The real heavy costs there are splitting/merging
> buddies. Fixing that is much more fundamental but even fronting the allocator
> with a new recycle allocator would not offset that as the refill of the
> page-recycling thing would incur high costs.
>

Yes, re-filling page-pool (in the non-steady state) could be
problematic for performance.  That is why I'm very motivated in helping
out with a bulk alloc/free scheme for the page allocator.

 
> > Nice, but there is still a looong way to my performance target, where I
> > can spend 201 cycles for the entire forwarding path
> >   
> 
> While I accept the cost is still too high, I think the effort should still
> be spent on improving the allocator in general than trying to bypass it.
> 

I do think improving the page allocator is very important work.
I just don't see how we can ever reach my performance target, without a
page-pool recycle facility.

I work in the area, where I think the cost of a single atomic operation
is too high.  I work on amortizing the individual atomic operations.
That is what I did for the SLUB allocator, with the bulk API. see:

Commit d0ecd894e3d5 ("slub: optimize bulk slowpath free by detached freelist")
 https://git.kernel.org/torvalds/c/d0ecd894e3d5

Commit fbd02630c6e3 ("slub: initial bulk free implementation")
 https://git.kernel.org/torvalds/c/fbd02630c6e3
 
This is now also used in the network stack:
 Commit 3134b9f019f2 ("Merge branch 'net-mitigate-kmem_free-slowpath'")
 Commit a3a8749d34d8 ("ixgbe: bulk free SKBs during TX completion cleanup 
cycle")


> > > This is an unreleased series that contains both the page allocator
> > > optimisations and the one-LRU-per-node series which in combination remove 
> > > a
> > > lot of code from the page allocator fast paths. I have no data on how the
> > > combined series behaves but each series individually is known to improve
> > > page allocator performance.
> > >
> > > Once you have that, do a hackjob to remove the debugging checks from both 
> > > the
> > > alloc and free path and see what that leaves. They could be bypassed 
> > > properly
> > > with a __GFP_NOACCT flag used only by drivers that absolutely require 
> > > pages
> > > as quickly as possible and willing to be less safe to get that 
> > > performance.  
> > 
> > I would be interested in testing/benchmarking a patch where you remove
> > the debugging checks...
> >   
> 
> Right now, I'm not proposing to remove the debugging checks despite their
> cost. They catch really difficult problems in the field unfortunately
> including corruption from buggy hardware. A GFP flag that disables them
> for a very specific case would be ok but I expect it to be resisted by
> others if it's done for the general case. Even a static branch for runtime
> debugging checks may be resisted.
> 
> Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
> the grounds it is of questionable value. Applying that would free a flag
> for __GFP_NOACCT that bypasses debugging checks and statistic updates.
> That would work for the allocation side at least but doing the same for
> the free side would be hard (potentially impossible) to do transparently
> for drivers.

Before spending too much work on something, I usually try to determine
what the maximum benefit of something would be.  Thus, I propose you
create a patch that hack remove all the debug checks that you think
could be beneficial to remove.  And then benchmark it yourself or send
it to me for benchmarking... that is the quickest way to determine if
this is worth spending time on.


 
> > You are also welcome to try out my benchmarking modules yourself:
> >  
> > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> >   
> 
> I took a quick look and functionally it's similar to the systemtap-based
> microbenchmark I'm using in mmtests so I don't think we have a problem
> with reproduction at the moment.
> 
> > > Be aware that compound order allocs like this are a double edged sword as
> > > it'll be fast sometimes and other times require reclaim/compaction which
> > > can stall for prolonged periods of time.  
> > 
> > Yes, I've notice that there can be a fairly high variation, when doing
> > compound order allocs, which is not so nice!  I really don't like these

Re: [PATCH RFC 5/5] ARM: dts: sun8i: Enable Ethernet controller on the Orange PI PC

2016-04-11 Thread Florian Fainelli

On 04/04/16 09:22, Chen-Yu Tsai wrote:
> The Orange PI PC uses the H3's internal Ethernet PHY with the EMAC
> Ethernet controller.
> 
> Set a proper address for the PHY and enable the EMAC.
> 
> Signed-off-by: Chen-Yu Tsai 
> ---
> 
> This patch depends on "ARM: dts: sun8i-h3: Add Ethernet controller device",
> which uses an binding still in development.
> 
> Do not merge.
> 
> ---
>  arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts 
> b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
> index daf50b9a6657..f01e10df812a 100644
> --- a/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
> +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts
> @@ -102,6 +102,20 @@
>   status = "okay";
>  };
>  
> +&ephy {
> + allwinner,ephy-addr = <0x1>;
> +};
> +
> +&emac {
> + phy = <&phy1>;
> + phy-mode = "mii";
> + status = "okay";
> +
> + phy1: ethernet-phy@1 {
> + reg = <1>;
> + };
> +};

As commented in patch 1, the fact that you have to put the Ethernet PHY
address twice here is not really a good thing, because they cannot be
dissociated from eath other, I would rather have a standard Ethernet PHY
DT node represent the desired PHY address, and have your glue/SHIM
SUN8I_H3_EMAC driver scan the Device Tree to know what address to
program for the Ethernet PHY.
-- 
Florian

Re: [PATCH RFT 1/2] phylib: add device reset GPIO support

2016-04-11 Thread Rob Herring

On Sat, Apr 09, 2016 at 01:22:47AM +0300, Sergei Shtylyov wrote:
> The PHY  devices sometimes do have their reset signal (maybe even power
> supply?) tied to some GPIO and sometimes it also does happen that a boot
> loader does not leave it deasserted. So far this issue has been attacked
> from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
> the GPIO in question;  that solution, when  applied to the device trees,
> led to adding the PHY reset GPIO properties to the MAC device node, with
> one exception: Cadence MACB driver which could handle the "reset-gpios"
> prop  in a PHY device  subnode.  I believe that the correct approach is to
> teach the 'phylib' to get the MDIO device reset GPIO from the device tree
> node corresponding to this device -- which this patch is doing...
> 
> Note that I had to modify the  AT803x PHY driver as it would stop working
> otherwise as it made use of the reset GPIO for its own purposes...

Lots of double spaces in here. Please fix.

> 
> Signed-off-by: Sergei Shtylyov 
> 
> ---
>  Documentation/devicetree/bindings/net/phy.txt |2 +
>  drivers/net/phy/at803x.c  |   19 ++
>  drivers/net/phy/mdio_bus.c|4 +++
>  drivers/net/phy/mdio_device.c |   27 +++--
>  drivers/net/phy/phy_device.c  |   33 
> --
>  drivers/of/of_mdio.c  |   16 
>  include/linux/mdio.h  |3 ++
>  include/linux/phy.h   |5 +++
>  8 files changed, 89 insertions(+), 20 deletions(-)

[...]

> Index: net-next/drivers/of/of_mdio.c
> ===
> --- net-next.orig/drivers/of/of_mdio.c
> +++ net-next/drivers/of/of_mdio.c
> @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
>  static int of_mdiobus_register_phy(struct mii_bus *mdio, struct device_node 
> *child,
>  u32 addr)
>  {
> + struct gpio_desc *gpiod;
>   struct phy_device *phy;
>   bool is_c45;
>   int rc;
> @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
>   is_c45 = of_device_is_compatible(child,
>"ethernet-phy-ieee802.3-c45");
>  
> + gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");

Calling fwnode_* functions in a DT specific file/function? That doesn't 
make sense.

> + /* Deassert the reset signal */
> + if (!IS_ERR(gpiod))
> + gpiod_direction_output(gpiod, 0);
>   if (!is_c45 && !of_get_phy_id(child, &phy_id))
>   phy = phy_device_create(mdio, addr, phy_id, 0, NULL);
>   else
>   phy = get_phy_device(mdio, addr, is_c45);
> + /* Assert the reset signal again */
> + if (!IS_ERR(gpiod))
> + gpiod_set_value(gpiod, 1);
>   if (IS_ERR_OR_NULL(phy))
>   return 1;
>

Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY

2016-04-11 Thread Florian Fainelli

On 04/04/16 09:22, Chen-Yu Tsai wrote:
> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
> configured through a memory mapped hardware register.
> 
> This same register also configures the MAC interface mode and TX clock
> source. Also covered by the register, but not supported in these bindings,
> are TX/RX clock delay chains and inverters, and an RMII module.
> 
> Signed-off-by: Chen-Yu Tsai 
> ---
>  .../bindings/net/allwinner,sun8i-h3-ephy.txt   | 44 
> ++
>  1 file changed, 44 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
> 
> diff --git 
> a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt 
> b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
> new file mode 100644
> index ..146f227e6d88
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
> @@ -0,0 +1,44 @@
> +* Allwinner H3 E(thernet) PHY
> +
> +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs,
> +before it can be configured over the MDIO bus and used, certain hardware
> +features must be configured, such as the PHY address and LED polarity.

Is the internal PHY address really configurable? Not that there is
anything wrong with it, this is good.

> +The PHY must also be powered on and brought out of reset.
> +
> +This is accomplished with regulators and pull-up/downs for external PHYs.
> +For this internal PHY, a hardware register is programmed.
> +
> +The same hardware register also contains clock and interface controls
> +for the MAC. This is also present in earlier SoCs, and is covered by
> +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly
> +different due to the inclusion of what appears to be an RMII-MII
> +bridge.
> +
> +Required properties:
> +- compatible: should be "allwinner,sun8i-h3-ephy"
> +- reg: address and length of the register set for the device
> +- clocks: A phandle to the reference clock for this device
> +- resets: A phandle to the reset control for this device
> +
> +Ethernet PHY related properties:
> +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to.
> +If this is not set, the external PHY is used, and
> +everything else in this section is ignored.

So we are going to put the same value here, and in the actual Ethernet
PHY device tree node that should be hanging off the EMAC/MDIO bus
controller, this is confusing and error prone.

> +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are
> +  active high.
> +
> +Ethernet MAC clock related properties:
> +- #clock-cells: should be 0
> +- clock-output-names: "mac_tx"
> +
> +Example:
> +
> +ethernet-phy@01c00030 {
> + compatible = "allwinner,sun8i-h3-ephy";
> + reg = <0x01c00030 0x4>;

Looking at this register space this looks more like an internal PHY SHIM
that is needed to be configured before the internal PHY can be access,
not a proper Ethernet PHY per-se, see replies in aptch 2.

Should not this block be a second cell associated with the Ethernet MAC
block? One or the other are not going to be very useful without
knowledge of each other.
-- 
Florian

Re: [PATCH RFC 2/5] net: phy: sun8i-h3-ephy: Add driver for Allwinner H3 Ethernet PHY

2016-04-11 Thread Florian Fainelli

On 04/04/16 09:22, Chen-Yu Tsai wrote:
> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
> configured through a memory mapped hardware register.
> 
> This same register also configures the MAC interface mode and TX clock
> source. Also covered by the register, but not supported in this driver,
> are TX/RX clock delay chains and inverters, and an RMII module.

This is not really a PHY driver, more a driver for a special piece of
hardware responsible for properly configuring a more standard integrated
PHY which is then driven via standard MDIO accesses, right?

The intention to make this driver re-usable is fine, but still makes me
wonder if it should not be put in a file which is linked into the
Ethernet MAC driver, and utilized by this one in a way that may be more
"ad-hoc" than what you are proposing here.

One thing that is not obvious here, is how is the device parenting done?
Are we able to associate a phy_device to this SUN8I_H3_EPHY platform
device here?

Another thing is that the Ethernet MAC driver is fully aware of when
putting an Ethernet PHY into suspend, shutdown, or fully functional
power state should occur, if you have a separate platform driver here
which does not listen for such kinds of events (hint: none are generated
right now), then you cannot implement a working power state interface
between the MAC, SHIM and PHY here, even though you would want that.
-- 
Florian

Re: [PATCH net 0/2] tipc: name distributor pernet queue

2016-04-11 Thread David Miller

From: Jon Maloy 
Date: Thu,  7 Apr 2016 10:40:42 -0400

> Commit #1 fixes a potential issue with deferred binding table 
> updates being pushed to the wrong namespace.
> 
> Commit #2 solves a problem with deferred binding table updates
> remaining in the the defer queue after the issuing node has gone
> down.

Series applied, thanks.

Re: [PATCH v6 net-next] net: ipv4: Consider failed nexthops in multipath routes

2016-04-11 Thread David Miller

From: David Ahern 
Date: Thu,  7 Apr 2016 07:21:00 -0700

> Multipath route lookups should consider knowledge about next hops and not
> select a hop that is known to be failed.
> 
> Example:
 ...
> The failed path can be avoided by considering known neighbor information
> when selecting next hops. If the neighbor lookup fails we have no
> knowledge about the nexthop, so give it a shot. If there is an entry
> then only select the nexthop if the state is sane. This is similar to
> what fib_detect_death does.
> 
> To maintain backward compatibility use of the neighbor information is
> based on a new sysctl, fib_multipath_use_neigh.
> 
> Signed-off-by: David Ahern 

Applied, thanks.

Re: [PATCH 0/2] drivers: net: cpsw: fix ale calls and drop host_port field from cpsw_priv

2016-04-11 Thread David Miller

From: Grygorii Strashko 
Date: Thu, 7 Apr 2016 15:16:42 +0300

> This clean up series intended to:
>  - fix port_mask parameters in ale calls and drop unnecessary shifts
>  - drop host_port field from struct cpsw_priv
> 
> Nothing critical. Tested on am437x-idk-evm in dual mac and switch modes.

Looks good, series applied, thanks.

Re: [PATCH net-next] vxlan: fix incorrect type

2016-04-11 Thread David Miller

From: Jiri Benc 
Date: Mon, 11 Apr 2016 17:06:08 +0200

> The protocol is 16bit, not 32bit.
> 
> Fixes: e1e5314de08ba ("vxlan: implement GPE")
> Reported-by: Dan Carpenter 
> Signed-off-by: Jiri Benc 

Applied.

Re: [PATCH] net: fjes: Use resource_size

2016-04-11 Thread David Miller

From: Vaishali Thakkar 
Date: Mon, 11 Apr 2016 15:58:17 +0530

> Use the function resource_size instead of explicit computation.
> 
> Problem found using Coccinelle.
> 
> Signed-off-by: Vaishali Thakkar 

Applied.

Re: [PATCH RFT 2/2] macb: kill PHY reset code

2016-04-11 Thread Sergei Shtylyov


On 04/11/2016 09:51 PM, Andrew Lunn wrote:


The code you are deleting would of ignored the flags in the gpio
property, i.e. active low.


Hm, you're right -- I forgot about that... :-/


The new code in the previous patch does
however take the flags into account. Did you check if there are any
device trees which have flags, which were never used, but are now
going to be used and thus break...


Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts.
Looks like it needs to be fixed indeed...


And this is where it gets tricky. You are breaking backwards
compatibility by now respecting the flag. An old DT blob is not going
to work.


Do we care that much about the DT blobs that are just *wrong*?


Wrong, but currently works.


   Note that it's not only using GPIO_ACTIVE_HIGH but does that against what 
the MACB binding documents.



You potentially need to add a new property and deprecate the old one.


I would like to avoid that...


You will need the agreement from the at91-vinco maintainer.


  I'll try submitting a formal DT patch...


 Andrew


MBR, Sergei

Re: [PATCH net-next 0/4] bnxt_en: Update for net-next

2016-04-11 Thread David Miller

From: Michael Chan 
Date: Mon, 11 Apr 2016 04:11:10 -0400

> Misc. changes for link speed and VF MAC address change.

Applied, thanks Michael.

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Bart Van Assche


On 04/11/2016 11:37 AM, Jesper Dangaard Brouer wrote:

On Mon, 11 Apr 2016 14:46:25 -0300
Thadeu Lima de Souza Cascardo  wrote:


So, Jesper, please take into consideration that this pool design
would rather be per device. Otherwise, we allow some device to write
into another's device/driver memory.


Yes, that was my intended use.  I want to have a page-pool per device.
I actually, want to go as far as a page-pool per NIC HW RX-ring queue.

Because the other use-case for the page-pool is zero-copy RX.

The NIC HW trick is that we today can create a HW filter in the NIC
(via ethtool) and place that traffic into a separate RX queue in the
NIC.  Lets say matching NFS traffic or guest traffic. Then we can allow
RX zero-copy of these pages, into the application/guest, somehow
binding it to RX queue, e.g. introducing a "cross-domain-id" in the
page-pool page that need to match.


I think it is important to keep in mind that using a page pool for 
zero-copy RX is specific to protocols that are based on TCP/IP. 
Protocols like FC, SRP and iSER have been designed such that the side 
that allocates the buffers also initiates the data transfer (the target 
side). With TCP/IP however transferring data and allocating receive 
buffers happens by opposite sides of the connection.


Bart.

Re: [PATCH RFT 2/2] macb: kill PHY reset code

2016-04-11 Thread Andrew Lunn

On Mon, Apr 11, 2016 at 09:39:02PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 04/11/2016 09:19 PM, Andrew Lunn wrote:
> 
> >>>The code you are deleting would of ignored the flags in the gpio
> >>>property, i.e. active low.
> >>
> >>Hm, you're right -- I forgot about that... :-/
> >>
> >>>The new code in the previous patch does
> >>>however take the flags into account. Did you check if there are any
> >>>device trees which have flags, which were never used, but are now
> >>>going to be used and thus break...
> >>
> >>Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts.
> >>Looks like it needs to be fixed indeed...
> >>
> >And this is where it gets tricky. You are breaking backwards
> >compatibility by now respecting the flag. An old DT blob is not going
> >to work.
> 
>Do we care that much about the DT blobs that are just *wrong*?

Wrong, but currently works.

> >You potentially need to add a new property and deprecate the old one.
> 
>I would like to avoid that...

You will need the agreement from the at91-vinco maintainer.

Andrew

Re: [PATCH RFT 2/2] macb: kill PHY reset code

2016-04-11 Thread Sergei Shtylyov


Hello.

On 04/11/2016 09:19 PM, Andrew Lunn wrote:


The code you are deleting would of ignored the flags in the gpio
property, i.e. active low.


Hm, you're right -- I forgot about that... :-/


The new code in the previous patch does
however take the flags into account. Did you check if there are any
device trees which have flags, which were never used, but are now
going to be used and thus break...


Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts.
Looks like it needs to be fixed indeed...


And this is where it gets tricky. You are breaking backwards
compatibility by now respecting the flag. An old DT blob is not going
to work.


   Do we care that much about the DT blobs that are just *wrong*?


You potentially need to add a new property and deprecate the old one.


   I would like to avoid that...


 Andrew


MBR, Sergei

Re: [PATCH] mwifiex: fix possible NULL dereference

2016-04-11 Thread Christian Daudt

On Mon, Apr 11, 2016 at 8:27 AM, Sudip Mukherjee
 wrote:
>
> From: Sudip Mukherjee 
>
> We have a check for card just after dereferencing it. So if it is NULL
> we have already dereferenced it before its check. Lets dereference it
> after checking card for NULL.
>
> Signed-off-by: Sudip Mukherjee 
> ---
>  drivers/net/wireless/marvell/mwifiex/pcie.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c 
> b/drivers/net/wireless/marvell/mwifiex/pcie.c
> index edf8b07..84562d0 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> @@ -2884,10 +2884,11 @@ static void mwifiex_unregister_dev(struct 
> mwifiex_adapter *adapter)
>  {
> struct pcie_service_card *card = adapter->card;
> const struct mwifiex_pcie_card_reg *reg;
> -   struct pci_dev *pdev = card->dev;
> +   struct pci_dev *pdev;


you might want to move the variable declaration into the if block
below to avoid it being left undefined

> int i;
>
> if (card) {
> +   pdev = card->dev;


to here
  +struct pci_dev pdev = card->dev;

>
> if (card->msix_enable) {
> for (i = 0; i < MWIFIEX_NUM_MSIX_VECTORS; i++)
> synchronize_irq(card->msix_entries[i].vector);
>

cheers,
  csd

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Jesper Dangaard Brouer

On Mon, 11 Apr 2016 14:46:25 -0300
Thadeu Lima de Souza Cascardo  wrote:

> So, Jesper, please take into consideration that this pool design
> would rather be per device. Otherwise, we allow some device to write
> into another's device/driver memory.

Yes, that was my intended use.  I want to have a page-pool per device.
I actually, want to go as far as a page-pool per NIC HW RX-ring queue.

Because the other use-case for the page-pool is zero-copy RX.

The NIC HW trick is that we today can create a HW filter in the NIC
(via ethtool) and place that traffic into a separate RX queue in the
NIC.  Lets say matching NFS traffic or guest traffic. Then we can allow
RX zero-copy of these pages, into the application/guest, somehow
binding it to RX queue, e.g. introducing a "cross-domain-id" in the
page-pool page that need to match.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:
> 
> > I am fine with either way as long as the loop stops on failure.

Note that skb that could not be validated is already freed.

So I do not see any value from stopping the loop, since
we need to schedule the queue to avoid tx hang.

Just process following skb if there is one, fact that skb is sent or
dropped does not matter.

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:

> I am fine with either way as long as the loop stops on failure.
> Folding the test "if (skb)" into one also requires to retake the spinlock.

Adding the likely() in this path would probably help as well.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f18c35024207..07202d9ac4f6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -159,12 +159,14 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
if (validate)
skb = validate_xmit_skb_list(skb, dev);
 
-   if (skb) {
+   if (likely(skb)) {
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_stopped(txq))
skb = dev_hard_start_xmit(skb, dev, txq, &ret);
 
HARD_TX_UNLOCK(dev, txq);
+   } else {
+   ... does all and return...
}
spin_lock(root_lock);

Re: Unhandled fault during system suspend in sky2_shutdown

2016-04-11 Thread Stephen Hemminger

On Mon, 11 Apr 2016 17:24:37 +0100
Sudeep Holla  wrote:

> Hi,
> 
> I am seeing unhandled fault during system suspend in sky2_shutdown.
> I am not sure if it's something missing in the firmware, but just wanted
> to check. I see that networkmanager is invoking calling to 
> netlink_sendmsg which calls sky2_get_stats after the device is shutdown.
> 
> Unhandled fault: synchronous external abort (0x96000210) at 
> 0x091c2918
> Internal error: : 96000210 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126
> Hardware name: ARM Juno development board (r2) (DT)
> task: 80007a673000 ti: 800940b5c000 task.ti: 800940b5c000
> PC is at sky2_get_stats+0x44/0x3b8
> LR is at dev_get_stats+0x58/0xc8
>   sky2_get_stats+0x44/0x3b8
>   rtnl_fill_stats+0x20/0x138
>   rtnl_fill_ifinfo+0x440/0xb38
>   rtnl_getlink+0xe8/0x198
>   rtnetlink_rcv_msg+0xe4/0x220
>   netlink_rcv_skb+0xc4/0xf8
>   rtnetlink_rcv+0x2c/0x40
>   netlink_unicast+0x160/0x238
>   netlink_sendmsg+0x2f0/0x358
>   sock_sendmsg+0x18/0x30
>   ___sys_sendmsg+0x204/0x218
>   __sys_sendmsg+0x44/0x88
>   SyS_sendmsg+0xc/0x18
>   el0_svc_naked+0x24/0x28
> 
> The below patch is the hack I came up to check if the netdev is detached 
> and unregistered, I no longer see the issue.
> 
> Regards,
> Sudeep
> 
> -->8  
> 
> diff --git i/drivers/net/ethernet/marvell/sky2.c 
> w/drivers/net/ethernet/marvell/sky2.c
> index ec0a22119e09..0ff0434e32fc 100644
> --- i/drivers/net/ethernet/marvell/sky2.c
> +++ w/drivers/net/ethernet/marvell/sky2.c
> @@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, 
> sky2_suspend, sky2_resume);
> 
>   static void sky2_shutdown(struct pci_dev *pdev)
>   {
> +   struct sky2_hw *hw = pci_get_drvdata(pdev);
> +   int i;
> +
> +   for (i = hw->ports - 1; i >= 0; --i) {
> +   sky2_detach(hw->dev[i]);
> +   unregister_netdev(hw->dev[i]);
> +   }
>  sky2_suspend(&pdev->dev);
>  pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
>  pci_set_power_state(pdev, PCI_D3hot);

This is not the correct fix, the device is supposed to stay registered.
The correct way to fix this would be to make get_stats ignore requests for 
device
when suspended.

Re: [PATCH RFT 2/2] macb: kill PHY reset code

2016-04-11 Thread Andrew Lunn

> >The code you are deleting would of ignored the flags in the gpio
> >property, i.e. active low.
> 
>Hm, you're right -- I forgot about that... :-/
> 
> >The new code in the previous patch does
> >however take the flags into account. Did you check if there are any
> >device trees which have flags, which were never used, but are now
> >going to be used and thus break...
> 
>Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts.
> Looks like it needs to be fixed indeed...

And this is where it gets tricky. You are breaking backwards
compatibility by now respecting the flag. An old DT blob is not going
to work.

You potentially need to add a new property and deprecate the old one.

Andrew

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Mel Gorman

On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > 
> 
> The cost decreased to: 228 cycles(tsc), but there are some variations,
> sometimes it increase to 238 cycles(tsc).
> 

In the free path, a bulk pcp free adds to the cycles. In the alloc path,
a refill of the pcp lists costs quite a bit. Either option introduces
variances. The bulk free path can be optimised a little so I chucked
some additional patches at it that are not released yet but I suspect the
benefit will be marginal. The real heavy costs there are splitting/merging
buddies. Fixing that is much more fundamental but even fronting the allocator
with a new recycle allocator would not offset that as the refill of the
page-recycling thing would incur high costs.

> Nice, but there is still a looong way to my performance target, where I
> can spend 201 cycles for the entire forwarding path
> 

While I accept the cost is still too high, I think the effort should still
be spent on improving the allocator in general than trying to bypass it.

> 
> > This is an unreleased series that contains both the page allocator
> > optimisations and the one-LRU-per-node series which in combination remove a
> > lot of code from the page allocator fast paths. I have no data on how the
> > combined series behaves but each series individually is known to improve
> > page allocator performance.
> >
> > Once you have that, do a hackjob to remove the debugging checks from both 
> > the
> > alloc and free path and see what that leaves. They could be bypassed 
> > properly
> > with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> > as quickly as possible and willing to be less safe to get that performance.
> 
> I would be interested in testing/benchmarking a patch where you remove
> the debugging checks...
> 

Right now, I'm not proposing to remove the debugging checks despite their
cost. They catch really difficult problems in the field unfortunately
including corruption from buggy hardware. A GFP flag that disables them
for a very specific case would be ok but I expect it to be resisted by
others if it's done for the general case. Even a static branch for runtime
debugging checks may be resisted.

Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
the grounds it is of questionable value. Applying that would free a flag
for __GFP_NOACCT that bypasses debugging checks and statistic updates.
That would work for the allocation side at least but doing the same for
the free side would be hard (potentially impossible) to do transparently
for drivers.

> You are also welcome to try out my benchmarking modules yourself:
>  
> https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> 

I took a quick look and functionally it's similar to the systemtap-based
microbenchmark I'm using in mmtests so I don't think we have a problem
with reproduction at the moment.

> > Be aware that compound order allocs like this are a double edged sword as
> > it'll be fast sometimes and other times require reclaim/compaction which
> > can stall for prolonged periods of time.
> 
> Yes, I've notice that there can be a fairly high variation, when doing
> compound order allocs, which is not so nice!  I really don't like these
> variations
> 

They can cripple you which is why I'm very wary of performance patches that
require compound pages. It tends to look great only on benchmarks and then
the corner cases hit in the real world and the bug reports are unpleasant.

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> 
> It does not seem very optimal. I tried to mem pressure the system a bit
> to cause the alloc_pages() to fail, and then the result were very bad,
> something like 2500 cycles, and it usually got the next order pages.

The options for fallback tend to have one hazard after the next. It's
partially why the last series focused on order-0 pages only.

> > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.
> 
> The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
> seen it jump to 611 cycles.
> 

The corner cases can be minimised to some extent -- lazy buddy merging for
example but it unfortunately has other consequences for users that require
high-order pages for functional reasons. I tried something like that once
(http://thread.gmane.org/gmane.linux.kernel/807683) b

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Cong Wang

On Mon, Apr 11, 2016 at 8:52 AM, Eric Dumazet  wrote:
> On Mon, 2016-04-11 at 17:17 +0200, Lars Persson wrote:
>>
>> On 04/11/2016 04:22 PM, Eric Dumazet wrote:
>> > On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote:
>> >
>> >> I though it would be prudent because the queue can be non-empty even for
>> >> the case of skb=NULL. So should it be there in this patch, another patch
>> >> or not at all ?
>> >
>> > Then maybe change return code ?
>> >
>> > It seems strange that a validate_xmit_skb_list() failure stops the
>> > __qdisc_run() loop but schedules another round.
>> >
>> >
>>
>> It was suggested by Cong Wang to return 0 in order to stop the loop. Do
>> you guys agree that the loop should be stopped for such failures ? Then
>> I will put the schedule call inside the if as you proposed earlier.
>
> What are the causes of validate_xmit_skb_list() failures ?
>
> If gso segmentations fail because of memory pressure, better free more
> skbs right now.
>
> In any case, having a single test " if (skb)  " sounds better to me,
> to have a fast path.
>
> So your first patch was probably a better idea.
>
> v2 has two tests instead of one.

I am fine with either way as long as the loop stops on failure.
Folding the test "if (skb)" into one also requires to retake the spinlock.

Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

2016-04-11 Thread Martin KaFai Lau

On Tue, Apr 05, 2016 at 07:56:54PM -0400, David Miller wrote:
> From: Cong Wang 
> Date: Mon, 4 Apr 2016 13:45:02 -0700
>
> > On Sat, Apr 2, 2016 at 7:33 PM, Martin KaFai Lau  wrote:
> >> One thing to note is that this patch uses the addresses from the sk
> >> instead of iph when updating sk->sk_dst_cache.  It is basically the
> >> same logic that the __ip6_datagram_connect() is doing, so some
> >> refactoring works in the first two patches.
> >>
> >> AFAIK, a UDP socket can become connected after sending out some
> >> datagrams in un-connected state.  or It can be connected
> >> multiple times to different destinations.  I did some quick
> >> tests but I could be wrong.
> >>
> >> I am thinking if there could be a chance that the skb->data, which
> >> has the original outgoing iph, is not related to the current
> >> connected address.  If it is possible, we have to specifically
> >> use the addresses in the sk instead of skb->data (i.e. iph) when
> >> updating the sk->sk_dst_cache.
> >>
> >> If we need to use the sk addresses (and other info) to find out a
> >> new dst for a connected udp socket, it is better not doing it while
> >> the userland is connecting to somewhere else.
> >>
> >> If the above case is impossible, we can keep using the info from iph to
> >> do the dst update for a connected-udp sk without taking the lock.
> >
> > I see your point, but calling __ip6_datagram_connect() seems overkill
> > here, we don't need to update so many things in the pmtu update context,
> > at least IPv4 doesn't do that either. I don't think you have to do that.
> >
> > So why just updating the dst cache (also some addr cache) here is not
> > enough?
>
> I think we are steadily getting closer to a version of this fix that
> we have some agreement on, right?
>
> Martin can you address Cong's feedback and spin another version of this
> series?
Dave, I will spin v2 shortly with some minor changes.

Cong, I believe the loose ends have been tied up after the last email thread
since last Thursday?

Re: [PATCH net v2] net: sched: do not requeue a NULL skb

2016-04-11 Thread Cong Wang

On Mon, Apr 11, 2016 at 8:17 AM, Lars Persson  wrote:
>
>
> On 04/11/2016 04:22 PM, Eric Dumazet wrote:
>>
>> On Mon, 2016-04-11 at 15:38 +0200, Lars Persson wrote:
>>
>>> I though it would be prudent because the queue can be non-empty even for
>>> the case of skb=NULL. So should it be there in this patch, another patch
>>> or not at all ?
>>
>>
>> Then maybe change return code ?
>>
>> It seems strange that a validate_xmit_skb_list() failure stops the
>> __qdisc_run() loop but schedules another round.
>>
>>
>
> It was suggested by Cong Wang to return 0 in order to stop the loop. Do you
> guys agree that the loop should be stopped for such failures ? Then I will
> put the schedule call inside the if as you proposed earlier.

I don't see any reason why we should continue on failure. And, I didn't
suggest you to return reschedule it either. I was suggesting to just return
0 for skb == NULL case.

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Thadeu Lima de Souza Cascardo

On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote:
> On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > > (dma_map compound page, handout page-fragments for RX ring, and later
> > > dma_unmap when last RX page-fragments is seen).
> > 
> > So, IMO only holding onto the DMA pages is all that is justified but not a
> > recycle of order-0 pages built on top of the core allocator. For DMA pages,
> > it would take a bit of legwork but the per-cpu allocator could be split
> > and converted to hold arbitrary sized pages with a constructer/destructor
> > to do the DMA coherency step when pages are taken from or handed back to
> > the core allocator. I'm not volunteering to do that unfortunately but I
> > estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> > aware in which case the memory footprint will be high.
> 
> Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
> could hold onto a cache of recently used mappings and recycle them if that
> still works.  It trades off a bit of security (a device can continue to DMA
> after the memory should no longer be accessible to it) for speed, but then
> so does the per-driver hack of keeping pages around still mapped.
> 

There are two problems on the DMA calls on Power servers. One is scalability. A
new allocation method for the address space would be necessary to fix it.

The other one is the latency or the cost of updating the TCE tables. The only
number I have is that I could push around 1M updates per second. So, we could
guess 1us per operation, which is pretty much a no-no for Jesper use case.

Your solution could address both. But I am concerned about the security problem.
Here is why I think this problem should be ignored if we go this way. IOMMU can
be used for three problems: virtualization, paranoia security and debuggability.

For virtualization, there is a solution already, and it's in place for Power and
x86. Power servers have the ability to enlarge the DMA window, allowing the
entire VM memory to be mapped during PCI driver probe time. After that, dma_map
is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even
before booting the guest. Unless we want to fix this for old Power servers, I
see no point in fixing it.

Now, if you are using IOMMU on the host with no passthrough or linear system
memory mapping, you are paranoid. It's not just a matter of security, in fact.
It's also a matter of stability. Hardware, firmware and drivers can be buggy,
and they are. When I worked with drivers on Power servers, I found and fixed a
lot of driver bugs that caused the device to write to memory it was not supposed
to. Good thing is that IOMMU prevented that memory write to happen and the
driver would be reset by EEH. If we can make this scenario faster, and if we
want it to be the default we need to, then your solution might not be desired.
Otherwise, just turn your IOMMU off or put it into passthrough.

Now, the driver keeps pages mapped, but those pages belong to the driver. They
are not pages we decide to give to a userspace process because it's no longer in
use by the driver. So, I don't quite agree this would be a good tradeoff.
Certainly not if we can do it in a way that does not require this.

So, Jesper, please take into consideration that this pool design would rather be
per device. Otherwise, we allow some device to write into another's
device/driver memory.

Cascardo.

Re: [PATCH RFT 2/2] macb: kill PHY reset code

2016-04-11 Thread Sergei Shtylyov


Hello.

On 04/11/2016 05:28 AM, Andrew Lunn wrote:


With  the 'phylib' now  being aware of  the "reset-gpios" PHY node property,
there should be no need to frob the PHY reset in this  driver anymore...

Signed-off-by: Sergei Shtylyov 

---
  drivers/net/ethernet/cadence/macb.c |   17 -
  drivers/net/ethernet/cadence/macb.h |1 -
  2 files changed, 18 deletions(-)

Index: net-next/drivers/net/ethernet/cadence/macb.c
===
--- net-next.orig/drivers/net/ethernet/cadence/macb.c
+++ net-next/drivers/net/ethernet/cadence/macb.c

[...]

@@ -2977,18 +2976,6 @@ static int macb_probe(struct platform_de
else
macb_get_hwaddr(bp);

-   /* Power up the PHY if there is a GPIO reset */
-   phy_node =  of_get_next_available_child(np, NULL);
-   if (phy_node) {
-   int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
-
-   if (gpio_is_valid(gpio)) {
-   bp->reset_gpio = gpio_to_desc(gpio);
-   gpiod_direction_output(bp->reset_gpio, 1);


Hi Sergei

The code you are deleting would of ignored the flags in the gpio
property, i.e. active low.


   Hm, you're right -- I forgot about that... :-/


The new code in the previous patch does
however take the flags into account. Did you check if there are any
device trees which have flags, which were never used, but are now
going to be used and thus break...


   Checked this now and found out arch/arm/boot/dts/ar91-vinco.dts. Looks 
like it needs to be fixed indeed...



   Andrew


MBR, Sergei

Re: [PATCH] sh_eth: re-enable-E-MAC interrupts in sh_eth_set_ringparam()

2016-04-11 Thread Sergei Shtylyov


Hello.

On 04/10/2016 04:27 AM, David Miller wrote:


The E-MAC interrupts are left disabled when the ring parameters are changed
via 'ethtool'. In order to fix this, it's enough to call sh_eth_dev_init()
with 'true' instead of 'false' for the second argument (which conveniently
allows us to remove the following code re-enabling E-DMAC interrupts and
reception).

Signed-off-by: Sergei Shtylyov 


Applied, thanks.


   Thanks!
   Unfortunately, this patch isn't fit for the stable kernels because there's 
a prerequisite cleanup (most recent patch merged to this driver). I clearly 
hadn't thought this out well -- should have enabled the E-MAC interrupts 
outside sh_eth_dev_init() and then removed the duplicate code like I did here.


MBR, Sergei

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Eric Dumazet

On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

We use order-0 pages on mlx4 at Google, as order-3 pages are very
dangerous for some kind of attacks...

An out of order TCP packet can hold an order-3 pages, while claiming to
use 1.5 KBvia skb->truesize.

order-0 only pages allow the page recycle trick used by Intel driver,
and we hardly see any page allocations in typical workloads.

While order-3 pages are 'nice' for friendly datacenter kind of traffic,
they also are a higher risk on hosts connected to the wild Internet.

Maybe I should upstream this patch ;)

Re: [PATCH net-next 2/2] net: exit busy loop when another process is runnable

2016-04-11 Thread Michael S. Tsirkin

On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote:
> 
> > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > index 1d67fb6..8a33fb2 100644
> > > --- a/include/net/busy_poll.h
> > > +++ b/include/net/busy_poll.h
> > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int 
> > > nonblock)
> > >   cpu_relax();
> > >  
> > >   } while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > -  !need_resched() && !busy_loop_timeout(end_time));
> > > +  !need_resched() && !busy_loop_timeout(end_time) &&
> > > +  nr_running_this_cpu() < 2);
> 
> So it's generally a bad idea to couple to the scheduler through 
> such a low level, implementation dependent value like 
> 'nr_running', causing various problems:
> 
>  - It misses important work that might be pending on this CPU,
>like RCU callbacks.
> 
>  - It will also over-credit task contexts that might be
>runnable, but which are less important than the currently
>running one: such as a SCHED_IDLE task
> 
>  - It will also over-credit even regular SCHED_NORMAL tasks, if
>this current task is more important than them: say
>SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
>with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
>on an otherwise idle system.
> 
> So what you want is a more sophisticated query to the 
> scheduler, a sched_expected_runtime() method that returns the 
> number of nsecs this task is expected to run in the future, 
> which returns 0 if you will be scheduled away on the next 
> schedule(), and returns infinity for a high prio SCHED_FIFO 
> task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
> 
> It will return a regular time slice value in other cases, when 
> there's some load on the CPU.
> 
> The polling logic can then do its decision based on that time 
> value.
> 
> All this can be done reasonably fast and lockless in most 
> cases, so that it can be called from busy-polling code.
>
> An added advantage would be that this approach consolidates the 
> somewhat random need_resched() checks into this method as well.
> 
> In any case I don't agree with the nr_running_this_cpu() 
> method.
> 
> (Please Cc: me and lkml to future iterations of this patchset.)
> 
> Thanks,
> 
>   Ingo

I tried to look into this: it might be even nicer to add
sched_expected_to_run(time) which tells us whether we expect the current
task to keep running for the next XX nsecs.

For the fair scheduler, it seems that it could be as simple as

+static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
+{
+   struct sched_entity *left;
+   struct sched_entity *curr = cfs_rq->curr;
+
+   if (!curr || !curr->on_rq)
+   return false;
+
+   left = __pick_first_entity(cfs_rq);
+   if (!left)
+   return true;
+
+   return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
+left->vruntime) < 0;
+}

The reason it seems easier is because that way we can reuse
calc_delta_fair and don't have to do the reverse translation
from vruntime to nsec.

And I guess if we do this with interrupts disabled, and only poke
at the current CPU's rq, we know first entity
won't go away so we don't need locks? 

Is this close to what you had in mind?

Thanks,

-- 
MST

Unhandled fault during system suspend in sky2_shutdown

2016-04-11 Thread Sudeep Holla


Hi,

I am seeing unhandled fault during system suspend in sky2_shutdown.
I am not sure if it's something missing in the firmware, but just wanted
to check. I see that networkmanager is invoking calling to 
netlink_sendmsg which calls sky2_get_stats after the device is shutdown.


Unhandled fault: synchronous external abort (0x96000210) at 
0x091c2918

Internal error: : 96000210 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 2029 Comm: NetworkManager Not tainted 4.6.0-rc3 #126
Hardware name: ARM Juno development board (r2) (DT)
task: 80007a673000 ti: 800940b5c000 task.ti: 800940b5c000
PC is at sky2_get_stats+0x44/0x3b8
LR is at dev_get_stats+0x58/0xc8
 sky2_get_stats+0x44/0x3b8
 rtnl_fill_stats+0x20/0x138
 rtnl_fill_ifinfo+0x440/0xb38
 rtnl_getlink+0xe8/0x198
 rtnetlink_rcv_msg+0xe4/0x220
 netlink_rcv_skb+0xc4/0xf8
 rtnetlink_rcv+0x2c/0x40
 netlink_unicast+0x160/0x238
 netlink_sendmsg+0x2f0/0x358
 sock_sendmsg+0x18/0x30
 ___sys_sendmsg+0x204/0x218
 __sys_sendmsg+0x44/0x88
 SyS_sendmsg+0xc/0x18
 el0_svc_naked+0x24/0x28

The below patch is the hack I came up to check if the netdev is detached 
and unregistered, I no longer see the issue.


Regards,
Sudeep

-->8

diff --git i/drivers/net/ethernet/marvell/sky2.c 
w/drivers/net/ethernet/marvell/sky2.c

index ec0a22119e09..0ff0434e32fc 100644
--- i/drivers/net/ethernet/marvell/sky2.c
+++ w/drivers/net/ethernet/marvell/sky2.c
@@ -5220,6 +5220,13 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, 
sky2_suspend, sky2_resume);


 static void sky2_shutdown(struct pci_dev *pdev)
 {
+   struct sky2_hw *hw = pci_get_drvdata(pdev);
+   int i;
+
+   for (i = hw->ports - 1; i >= 0; --i) {
+   sky2_detach(hw->dev[i]);
+   unregister_netdev(hw->dev[i]);
+   }
sky2_suspend(&pdev->dev);
pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
pci_set_power_state(pdev, PCI_D3hot);

Re: pull-request: wireless-drivers-next 2016-04-11

2016-04-11 Thread David Miller

From: Kalle Valo 
Date: Mon, 11 Apr 2016 15:48:48 +0300

> here's a pull request for 4.7. More features, but nothing really
> standing out. Please let me know if you have any problems.

Pulled, thanks.

Re: [PATCH net] cxgb4: Stop Rx Queues before freeing it up

2016-04-11 Thread David Miller

From: Hariprasad Shenai 
Date: Mon, 11 Apr 2016 11:07:58 +0530

> Stop all Ethernet RX Queues before freeing up various Ingress/Egress
> Queues, etc. We were seeing cases of Ingress Queues not getting serviced
> during the shutdown process leading to Ingress Paths jamming up through
> the chip and blocking the shutdown effort itself.
> 
> One such case involved the Firmware sending a "Flush Token" through the
> ULP-TX -> ULP-RX path for an Ethernet TX Queue being freed in order to
> make sure there weren't any remaining TX Work Requests in the pipeline.
> But the return path was stalled by Ingress Data unable to be delivered to
> the Host because those Ingress Queues were no longer being serviced.
> 
> Based on original work by Casey Leedom 
> 
> Signed-off-by: Hariprasad Shenai 

Applied, thank you.

Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?

2016-04-11 Thread Matthew Wilcox

On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > (dma_map compound page, handout page-fragments for RX ring, and later
> > dma_unmap when last RX page-fragments is seen).
> 
> So, IMO only holding onto the DMA pages is all that is justified but not a
> recycle of order-0 pages built on top of the core allocator. For DMA pages,
> it would take a bit of legwork but the per-cpu allocator could be split
> and converted to hold arbitrary sized pages with a constructer/destructor
> to do the DMA coherency step when pages are taken from or handed back to
> the core allocator. I'm not volunteering to do that unfortunately but I
> estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> aware in which case the memory footprint will be high.

Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
could hold onto a cache of recently used mappings and recycle them if that
still works.  It trades off a bit of security (a device can continue to DMA
after the memory should no longer be accessible to it) for speed, but then
so does the per-driver hack of keeping pages around still mapped.

1 2 >

1 - 100 of 161 matches

Mail list logo